Commits · 1f60a5815bade268696d57452dfbfcbf0c655a23 · Kirill Smelkov / linux

22 May, 2017 23 commits

nfp: mark port state as stale after reconfig · 1f60a581

Jakub Kicinski authored May 22, 2017

After port configuration is performed mark it as changed. This
will close a window of time between configuration and async
state refresh which runs from a workqueue where old port state
would be reported.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f60a581

nfp: provide linking on port structures · 3eb3b74a

Jakub Kicinski authored May 22, 2017

Add link to nfp_ports to make it possible to iterate over all ports.
This will come in handy when some ports may be representors.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3eb3b74a

nfp: move refresh tracking into the port structure · 6d4f8cba

Jakub Kicinski authored May 22, 2017

Track whether physical port's state have changed since last refresh
inside the nfp_port structure instead of the vNIC structure.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d4f8cba

nfp: update port state in place · 3d4ed1e7

Jakub Kicinski authored May 22, 2017

Always updating port state in place by overriding values in exiting
pf->eth_tbl makes things easier to manage and allows us to have a
common helper for both full and per-port refresh.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d4ed1e7

nfp: introduce nfp_port · eb488c26

Jakub Kicinski authored May 22, 2017

Encapsulate port information into struct nfp_port. nfp_port will
soon be extended to contain devlink_port information. It also makes
it easier to reuse port-related code between vNICs and representors.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb488c26

nfp: disallow mixing vNICs with and without NSP port entry · d88b0a23

Jakub Kicinski authored May 22, 2017

We only support core NIC apps which have vNICs for each physical port/
split and no representors right now.  Enforce that either each vNIC has
a NSP eth_table entry or if NSP port table is not available none do.

One scenario this will prevent from happening is user force-loading
wrong firmware file if FW app requires different firmwares per media
config.

While at it move some code to nfp_net_pf_alloc_vnic() to make it
counter-match nfp_net_pf_free_vnic() better.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d88b0a23

nfp: introduce very minimal nfp_app · 7ac9ebd5

Jakub Kicinski authored May 22, 2017

Introduce a concept of an application.  For now it's just grouping
pointers and serving as a layer of indirection.  It will help us
weaken the dependency on nfp_net in ethtool code.  Later series
will flesh out support for different apps in the driver.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ac9ebd5

nfp: add nfp_net_pf_free_vnic() function · 9140b30d

Jakub Kicinski authored May 22, 2017

Soon a third place will need to free a struct nfp_net.  Add a free
counterpart to nfp_net_pf_alloc_vnic().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9140b30d

nfp: rename netdev/port to vNIC · d4e7f092

Jakub Kicinski authored May 22, 2017

vNIC is a PCIe-side abstraction NFP firmwares supported by this
driver use.  It was initially meant to represent a device port
and therefore a netdev but today should be thought of as a way
of grouping descriptor rings and associated state.  Advanced apps
will have vNICs without netdevs and ports without a vNIC (using
representors instead).

Make sure code refers to vNICs as vNICs and not ports or netdevs.
No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4e7f092

nfp: make nfp_net alloc/init/cleanup/free not depend on netdevs · beba69ca

Jakub Kicinski authored May 22, 2017

struct nfp_net represents a vNIC, we will be moving away from the
requirement for every vNIC to have a netdev associated with it.
Remove "netdev" from some function names and prefer passing
struct nfp_net pointer as argument instead of struct net_device *.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

beba69ca

nfp: add nfp_cppcore_pcie_unit() helper · cd6f8db9

Simon Horman authored May 22, 2017

Add nfp_cppcore_pcie_unit() helper to retrieve the PCIE unit of a CPP
handle and use the new helper as appropriate.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd6f8db9

bridge: fix hello and hold timers starting/stopping · bd080488

Ivan Vecera authored May 19, 2017

Current bridge code incorrectly handles starting/stopping of hello and
hold timers during STP enable/disable.

1. Timers are stopped in br_stp_start() during NO_STP->USER_STP
   transition. The timers are already stopped in NO_STP state so
   this is confusing no-op.

2. During USER_STP->NO_STP transition the timers are started. This
   does not make sense and is confusion because the timer should not be
   active in NO_STP state.

Cc: davem@davemloft.net
Cc: sashok@cumulusnetworks.com
Cc: stephen@networkplumber.org
Cc: bridge@lists.linux-foundation.org
Cc: lucien.xin@gmail.com
Cc: nikolay@cumulusnetworks.com
Signed-off-by: Ivan Vecera <cera@cera.cz>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bd080488

net/wan/fsl_ucc_hdlc: fix muram allocation error · 85deed56

Holger Brunck authored May 22, 2017

sizeof(priv->ucc_pram) is 4 as it is the size of a pointer, but we want
to reserve space for the struct ucc_hdlc_param.
Signed-off-by: Holger Brunck <holger.brunck@keymile.com>
Cc: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

85deed56

net: ipv4: tcp: fixed comment coding style issue · a777f715

Rohit Chavan authored May 22, 2017

Fixed a coding style issue
Signed-off-by: Rohit Chavan <roheetchavan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a777f715

net: socket: fix a typo in sockfd_lookup(). · 241c4667

Rosen, Rami authored May 21, 2017

This patch fixes a typo in sockfd_lookup() in net/socket.c.
Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

241c4667

Merge branch 'netlink-extack-route-add-del' · acb5d48f

David S. Miller authored May 22, 2017

David Ahern says:

====================
net: Add extack for route add/delete failures

Use the extack feature to improve error messages to user on route
add and delete failures.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

acb5d48f

net: ipv6: Add extack messages for route add failures · d5d531cb

David Ahern authored May 21, 2017

Add messages for non-obvious errors (e.g, no need to add text for malloc
failures or ENODEV failures). This mostly covers the annoying EINVAL errors
Some message strings violate the 80-columns but searchable strings need to
trump that rule.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d5d531cb

net: ipv6: Plumb extack through route add functions · 333c4301

David Ahern authored May 21, 2017

Plumb extack argument down to route add functions.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

333c4301

net: ipv4: Add extack messages for route add failures · c3ab2b4e

David Ahern authored May 21, 2017

Add messages for non-obvious errors (e.g, no need to add text for malloc
failures or ENODEV failures). This mostly covers the annoying EINVAL errors
Some message strings violate the 80-columns but searchable strings need to
trump that rule.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3ab2b4e

net: ipv4: Plumb extack through route add functions · 6d8422a1

David Ahern authored May 21, 2017

Plumb extack argument down to route add functions.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d8422a1

macsec: double accounting of dropped rx/tx packets · 863483c9

Girish Moodalbail authored May 19, 2017

The macsec implementation shouldn't account for rx/tx packets that are
dropped in the netdev framework. The netdev framework itself accounts
for such packets by atomically updating struct net_device`rx_dropped and
struct net_device`tx_dropped fields. Later on when the stats for macsec
link is retrieved, the packets dropped in netdev framework will be
included in dev_get_stats() after calling macsec.c`macsec_get_stats64()
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

863483c9

net: Fix parisc SCM_TIMESTAMPING_PKTINFO value. · f5a64d64

David S. Miller authored May 22, 2017

Needs to follow the existing sequence.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

f5a64d64

net: Define SCM_TIMESTAMPING_PKTINFO on all architectures. · 1c4f676a

David S. Miller authored May 21, 2017

A definition was only provided for asm-generic/socket.h
using platforms, define it for the others as well
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c4f676a

21 May, 2017 17 commits

tcp: fix tcp_probe_timer() for TCP_USER_TIMEOUT · 4ab68879

Eric Dumazet authored May 21, 2017

TCP_USER_TIMEOUT is still converted to jiffies value in
icsk_user_timeout

So we need to make a conversion for the cases HZ != 1000

Fixes: 9a568de4 ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4ab68879

ipv6: drop unused variables in seg6_genl_dumphac · 0a9fc39e

stephen hemminger authored May 19, 2017

THe seg6_pernet_data variable was set but never used.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0a9fc39e

fou: make local function static · 9dc621af

stephen hemminger authored May 19, 2017

The build header functions are not used by any other code.

net/ipv6/fou6.c:36:5: warning: no previous prototype for ‘fou6_build_header’ [-Wmissing-prototypes]
net/ipv6/fou6.c:54:5: warning: no previous prototype for ‘gue6_build_header’ [-Wmissing-prototypes]

Need to do some code rearranging to satisfy different Kconfig possiblities.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9dc621af

tcpnv: do not export local function · c718c6d6

stephen hemminger authored May 19, 2017

The TCP New Vegas congestion control was exporting an internal
function tcpnv_get_info which is not used by any other in tree
kernel code. Make it static.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c718c6d6

inet: fix warning about missing prototype · 9691724e

stephen hemminger authored May 19, 2017

The prototype for inet_rcv_saddr_equal was not being included.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9691724e

ila: propagate error code in ila_output · 9e7b19c5

stephen hemminger authored May 19, 2017

This warning:
net/ipv6/ila/ila_lwt.c: In function ‘ila_output’:
net/ipv6/ila/ila_lwt.c:42:6: warning: variable ‘err’ set but not used [-Wunused-but-set-variable]

It looks like the code attempts to set propagate different error
values, but always returned -EINVAL.

Compile tested only. Needs review by original author.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e7b19c5

dcb: enforce minimum length on IEEE_APPS attribute · 332b4fc8

stephen hemminger authored May 19, 2017

Found by reviewing the warning about unused policy table.
The code implies that it meant to check for size, but since
it unrolled the loop for attribute validation that is never used.
Instead do explicit check for attribute.

Compile tested only. Needs review by original author.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

332b4fc8

Merge branch 'net-extend-socket-timestamping-API' · dae37055

David S. Miller authored May 21, 2017

Miroslav Lichvar says:

====================
Extend socket timestamping API

Changes v5->v6:
- fixed skb_is_swtx_tstamp() when OPT_TX_SWHW is disabled and improved
  its description
- improved OPT_PKTINFO documentation
- improved scm_timestamping documentation

Changes v4->v5:
- fixed initialization of reserved fields in struct scm_ts_pktinfo

Changes v3->v4:
- added reserved fields to struct scm_ts_pktinfo
- replaced patch fixing false SW timestamps with a documentation fix
- updated OPT_TX_SWHW patch to handle false SW timestamps

Changes v2->v3:
- modified struct scm_ts_pktinfo to use fixed-width integer types
- added WARN_ON_ONCE for missing RCU lock in dev_get_by_napi_id()
- modified dev_get_by_napi_id() to not return dev in unexpected branch
- modified recv to return SCM_TIMESTAMPING_PKTINFO even if the interface
  index is unknown

Changes v1->v2:
- added separate patch for new NAPI functions
- split code from __sock_recv_timestamp() for better readability
- fixed RCU locking
- fixed compiler warning (missing case in switch in first patch)
- inline sw_tx_timestamp() in its only user

Changes RFC->v1:
- reworked SOF_TIMESTAMPING_OPT_PKTINFO patch to not add new fields to
  skb shared info (net device is now looked up by napi_id), not require
  any changes in drivers, and restrict the cmsg to incoming packets
- renamed SOF_TIMESTAMPING_OPT_MULTIMSG to SOF_TIMESTAMPING_OPT_TX_SWHW
  and fixed its description
- moved struct scm_ts_pktinfo from errqueue.h to net_tstamp.h as it
  can't be received from the error queue anymore
- improved commit descriptions and removed incorrect comment

This patchset adds new options to the timestamping API that will be
useful for NTP implementations and possibly other applications.

The first patch specifies a timestamp filter for NTP packets. The second
patch updates drivers that can timestamp all packets, or need to list
the filter as unsupported. There is no attempt to add the support to the
phyter driver.

The third patch adds two helper functions working with NAPI ID, which is
needed by the next patch. The fourth patch adds a new option to get a
new control message with the L2 length and interface index for incoming
packets with hardware timestamps.

The fifth patch fixes documentation on number of non-zero fields in
scm_timestamping and warns about false software timestamps when
SO_TIMESTAMP(NS) is combined with SCM_TIMESTAMPING.

The sixth patch adds a new option to request both software and hardware
timestamps for outgoing packets. The seventh patch updates drivers that
assumed software timestamping cannot be used together with hardware
timestamping.

The patches have been tested on x86_64 machines with igb and e1000e
drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

dae37055

net: ethernet: update drivers to make both SW and HW TX timestamps · 74abc9b1

Miroslav Lichvar authored May 19, 2017

Some drivers were calling the skb_tx_timestamp() function only when
a hardware timestamp was not requested. Now that applications can use
the SOF_TIMESTAMPING_OPT_TX_SWHW option to request both software and
hardware timestamps, the drivers need to be modified to unconditionally
call skb_tx_timestamp().

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

74abc9b1

net: allow simultaneous SW and HW transmit timestamping · b50a5c70

Miroslav Lichvar authored May 19, 2017

Add SOF_TIMESTAMPING_OPT_TX_SWHW option to allow an outgoing packet to
be looped to the socket's error queue with a software timestamp even
when a hardware transmit timestamp is expected to be provided by the
driver.

Applications using this option will receive two separate messages from
the error queue, one with a software timestamp and the other with a
hardware timestamp. As the hardware timestamp is saved to the shared skb
info, which may happen before the first message with software timestamp
is received by the application, the hardware timestamp is copied to the
SCM_TIMESTAMPING control message only when the skb has no software
timestamp or it is an incoming packet.

While changing sw_tx_timestamp(), inline it in skb_tx_timestamp() as
there are no other users.

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b50a5c70

net: fix documentation of struct scm_timestamping · 67953d47

Miroslav Lichvar authored May 19, 2017

The scm_timestamping struct may return multiple non-zero fields, e.g.
when both software and hardware RX timestamping is enabled, or when the
SO_TIMESTAMP(NS) option is combined with SCM_TIMESTAMPING and a false
software timestamp is generated in the recvmsg() call in order to always
return a SCM_TIMESTAMP(NS) message.

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67953d47

net: add new control message for incoming HW-timestamped packets · aad9c8c4

Miroslav Lichvar authored May 19, 2017

Add SOF_TIMESTAMPING_OPT_PKTINFO option to request a new control message
for incoming packets with hardware timestamps. It contains the index of
the real interface which received the packet and the length of the
packet at layer 2.

The index is useful with bonding, bridges and other interfaces, where
IP_PKTINFO doesn't allow applications to determine which PHC made the
timestamp. With the L2 length (and link speed) it is possible to
transpose preamble timestamps to trailer timestamps, which are used in
the NTP protocol.

While this information could be provided by two new socket options
independently from timestamping, it doesn't look like they would be very
useful. With this option any performance impact is limited to hardware
timestamping.

Use dev_get_by_napi_id() to get the device and its index. On kernels
with disabled CONFIG_NET_RX_BUSY_POLL or drivers not using NAPI, a zero
index will be returned in the control message.

CC: Richard Cochran <richardcochran@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aad9c8c4

net: add function to retrieve original skb device using NAPI ID · 90b602f8

Miroslav Lichvar authored May 19, 2017

Since commit b6858177 ("net: Make skb->skb_iif always track
skb->dev") skbs don't have the original index of the interface which
received the packet. This information is now needed for a new control
message related to hardware timestamping.

Instead of adding a new field to skb, we can find the device by the NAPI
ID if it is available, i.e. CONFIG_NET_RX_BUSY_POLL is enabled and the
driver is using NAPI. Add dev_get_by_napi_id() and also skb_napi_id() to
hide the CONFIG_NET_RX_BUSY_POLL ifdef.

CC: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90b602f8

net: ethernet: update drivers to handle HWTSTAMP_FILTER_NTP_ALL · e3412575

Miroslav Lichvar authored May 19, 2017

Include HWTSTAMP_FILTER_NTP_ALL in net_hwtstamp_validate() as a valid
filter and update drivers which can timestamp all packets, or which
explicitly list unsupported filters instead of using a default case, to
handle the filter.

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e3412575

net: define receive timestamp filter for NTP · b8210a9e

Miroslav Lichvar authored May 19, 2017

Add HWTSTAMP_FILTER_NTP_ALL to the hwtstamp_rx_filters enum for
timestamping of NTP packets. There is currently only one driver
(phyter) that could support it directly.

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b8210a9e

cxgb4 : retrieve port information from firmware · 2061ec3f

Ganesh Goudar authored May 19, 2017

issue get port information command to firmware to retrieve port
information and update if it is different from what was last
recorded and also add indication for supported link modes for
firmware port types FW_PORT_TYPE_SFP28, FW_PORT_TYPE_KR_SFP28,
FW_PORT_TYPE_CR4_QSFP.

Based on the original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2061ec3f

ibmveth: Support to enable LSO/CSO for Trunk VEA. · 66aa0678

Sivakumar Krishnasamy authored May 19, 2017

Current largesend and checksum offload feature in ibmveth driver,
 - Source VM sends the TCP packets with ip_summed field set as
   CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
   checksum field
 - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
   "no checksum" and "checksum good" bits in transmit buffer descriptor
   before the packet is delivered to pseries PowerVM Hypervisor
 - If ibmveth has largesend capability enabled, transmit buffer descriptors
   are market accordingly before packet is delivered to Hypervisor
   (along with mss value for packets with length > MSS)
 - Destination VM's ibmveth driver receives the packet with "checksum good"
   bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
 - If "largesend" bit was on, mss value is copied from receive descriptor
   into SKB's gso_size and other flags are appropriately set for
   packets > MSS size
 - The packet is now successfully delivered up the stack in destination VM

The offloads described above works fine for TCP communication among VMs in
the same pseries server ( VM A <=> PowerVM Hypervisor <=> VM B )

We are now enabling support for OVS in pseries PowerVM environment. One of
our requirements is to have ibmveth driver configured in "Trunk" mode, when
they are used with OVS. This is because, PowerVM Hypervisor will no more
bridge the packets between VMs, instead the packets are delivered to
IO Server which hosts OVS to bridge them between VMs or to external
networks (flow shown below),
  VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor
                                                                   <=> VM B
In "IO server" the packet is received by inbound Trunk ibmveth and then
delivered to OVS, which is then bridged to outbound Trunk ibmveth (shown
below),
        Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth

In this model, we hit the following issues which impacted the VM
communication performance,

 - Issue 1: ibmveth doesn't support largesend and checksum offload features
   when configured as "Trunk". Driver has explicit checks to prevent
   enabling these offloads.

 - Issue 2: SYN packet drops seen at destination VM. When the packet
   originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to
   IO server's inbound Trunk ibmveth, on validating "checksum good" bits
   in ibmveth receive routine, SKB's ip_summed field is set with
   CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
   Bridge) and delivered to outbound Trunk ibmveth. At this point the
   outbound ibmveth transmit routine will not set "no checksum" and
   "checksum good" bits in transmit buffer descriptor, as it does so only
   when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
   delivered to destination VM, TCP layer receives the packet with checksum
   value of 0 and with no checksum related flags in ip_summed field. This
   leads to packet drops. So, TCP connections never goes through fine.

 - Issue 3: First packet of a TCP connection will be dropped, if there is
   no OVS flow cached in datapath. OVS while trying to identify the flow,
   computes the checksum. The computed checksum will be invalid at the
   receiving end, as ibmveth transmit routine zeroes out the pseudo
   checksum value in the packet. This leads to packet drop.

 - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list.
   When Physical NIC has GRO enabled and when OVS bridges these packets,
   OVS vport send code will end up calling dev_queue_xmit, which in turn
   calls validate_xmit_skb.
   In validate_xmit_skb routine, the larger packets will get segmented into
   MSS sized segments, if SKB has a frag_list and if the driver to which
   they are delivered to doesn't support NETIF_F_FRAGLIST feature.

This patch addresses the above four issues, thereby enabling end to end
largesend and checksum offload support for better performance.

 - Fix for Issue 1 : Remove checks which prevent enabling TCP largesend and
   checksum offloads.
 - Fix for Issue 2 : When ibmveth receives a packet with "checksum good"
   bit set and if its configured in Trunk mode, set appropriate SKB fields
   using skb_partial_csum_set (ip_summed field is set with
   CHECKSUM_PARTIAL)
 - Fix for Issue 3: Recompute the pseudo header checksum before sending the
   SKB up the stack.
 - Fix for Issue 4: Linearize the SKBs with frag_list. Though we end up
   allocating buffers and copying data, this fix gives
   upto 4X throughput increase.

Note: All these fixes need to be dropped together as fixing just one of
them will lead to other issues immediately (especially for Issues 1,2 & 3).
Signed-off-by: Sivakumar Krishnasamy <ksiva@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

66aa0678