Commits · 90fdc561b08ce292f1d39a62f70012f150583b98 · nexedi / linux

24 Apr, 2017 40 commits

nfp: remove the refresh of all ports optimization · 90fdc561

Jakub Kicinski authored Apr 22, 2017

The code refreshing the eth port state was trying to update state
of all ports of the card.  Unfortunately to safely walk the port
list we would have to hold the port lock, which we can't due to
lock ordering constraints against rtnl.

Make the per-port sync refresh and async refresh of all ports
completely separate routines.

Fixes: 172f638c ("nfp: add port state refresh")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90fdc561

nfp: fix free list buffer size reporting · ee200a73

Jakub Kicinski authored Apr 22, 2017

XDP headroom should not be included in free list buffer size.

Fixes: 6fe0c3b4 ("nfp: add support for xdp_adjust_head()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee200a73

nfp: add NSP routine to get static information · 010e2f9c

David Brunecz authored Apr 22, 2017

Retrieve identifying information from the NSP.  For now it only
contains versions of firmware subcomponents.
Signed-off-by: David Brunecz <david.brunecz@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

010e2f9c

nfp: parse metadata prepend before XDP runs · e524a6a9

Jakub Kicinski authored Apr 22, 2017

Calling memcpy to shift metadata out of the way for XDP to run
seems like an overkill.  The most common metadata contents are
8 bytes containing type and flow hash.  Simply parse the metadata
before we run XDP.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e524a6a9

nfp: make use of the DMA_ATTR_SKIP_CPU_SYNC attr · 5cd4fbea

Jakub Kicinski authored Apr 22, 2017

DMA unmap may destroy changes CPU made to the buffer.  To make XDP
run correctly on non-x86 platforms we should use the
DMA_ATTR_SKIP_CPU_SYNC attribute.

Thanks to using the attribute we can now push the sync operation to the
common code path from XDP handler.

A little bit of variable name reshuffling is required to bring the
code back to readable state.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5cd4fbea

Merge branch 'cls_flower-MPLS' · 0d688a0d

David S. Miller authored Apr 24, 2017

Benjamin LaHaise says:

====================
flower: add MPLS matching support

This patch series adds support for parsing MPLS flows in the flow dissector
and the flower classifier.  Each of the MPLS TTL, BOS, TC and Label fields
can be used for matching.

v2: incorporate style feedback, move #defines to linux/include/mpls.h
Note: this omits Jiri's request to remove tabs between the type and
field names in struct declarations.  This would be inconsistent with
numerous other struct definitions.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

0d688a0d

cls_flower: add support for matching MPLS fields (v2) · a577d8f7

Benjamin LaHaise authored Apr 22, 2017

Add support to the tc flower classifier to match based on fields in MPLS
labels (TTL, Bottom of Stack, TC field, Label).
Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Gao Feng <fgao@ikuai8.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a577d8f7

flow_dissector: add mpls support (v2) · 029c1ecb

Benjamin LaHaise authored Apr 22, 2017

Add support for parsing MPLS flows to the flow dissector in preparation for
adding MPLS match support to cls_flower.
Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <jhs@mojatatu.com>
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Gao Feng <fgao@ikuai8.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

029c1ecb

Merge branch 'tcp-fastopen-middlebox-fixes' · 3ec21b65

David S. Miller authored Apr 24, 2017

Wei Wang says:

====================
net/tcp_fastopen: Fix for various TFO firewall issues

Currently there are still some firewall issues in the middlebox
which make the middlebox drop packets silently for TFO sockets.
This kind of issue is hard to be detected by the end client.

This patch series tries to detect such issues in the kernel and disable
TFO temporarily.
More details about the issues and the fixes are included in the following
patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3ec21b65

net/tcp_fastopen: Remove mss check in tcp_write_timeout() · 59450f8d

Wei Wang authored Apr 20, 2017

Christoph Paasch from Apple found another firewall issue for TFO:
After successful 3WHS using TFO, server and client starts to exchange
data. Afterwards, a 10s idle time occurs on this connection. After that,
firewall starts to drop every packet on this connection.

The fix for this issue is to extend existing firewall blackhole detection
logic in tcp_write_timeout() by removing the mss check.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

59450f8d

net/tcp_fastopen: Add snmp counter for blackhole detection · 46c2fa39

Wei Wang authored Apr 20, 2017

This counter records the number of times the firewall blackhole issue is
detected and active TFO is disabled.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46c2fa39

net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0

Wei Wang authored Apr 20, 2017

Middlebox firewall issues can potentially cause server's data being
blackholed after a successful 3WHS using TFO. Following are the related
reports from Apple:
https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
Slide 31 identifies an issue where the client ACK to the server's data
sent during a TFO'd handshake is dropped.
C ---> syn-data ---> S
C <--- syn/ack ----- S
C (accept & write)
C <---- data ------- S
C ----- ACK -> X     S
		[retry and timeout]

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C?  X <- data ------ S
		[retry and timeout]

This is the worst failure b/c the client can not detect such behavior to
mitigate the situation (such as disabling TFO). Failing to proceed, the
application (e.g., SSL library) may simply timeout and retry with TFO
again, and the process repeats indefinitely.

The proposed solution is to disable active TFO globally under the
following circumstances:
1. client side TFO socket detects out of order FIN
2. client side TFO socket receives out of order RST

We disable active side TFO globally for 1hr at first. Then if it
happens again, we disable it for 2h, then 4h, 8h, ...
And we reset the timeout to 1hr if a client side TFO sockets not opened
on loopback has successfully received data segs from server.
And we examine this condition during close().

The rational behind it is that when such firewall issue happens,
application running on the client should eventually close the socket as
it is not able to get the data it is expecting. Or application running
on the server should close the socket as it is not able to receive any
response from client.
In both cases, out of order FIN or RST will get received on the client
given that the firewall will not block them as no data are in those
frames.
And we want to disable active TFO globally as it helps if the middle box
is very close to the client and most of the connections are likely to
fail.

Also, add a debug sysctl:
  tcp_fastopen_blackhole_detect_timeout_sec:
    the initial timeout to use when firewall blackhole issue happens.
    This can be set and read.
    When setting it to 0, it means to disable the active disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf1ef3f0

Merge tag 'mlx5-updates-2017-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · bc95cd8e

David S. Miller authored Apr 24, 2017

Saeed Mahameed says:

====================
mlx5-updates-2017-04-22

Sparse and compiler warnings fixes from Stephen Hemminger.

From Roi Dayan and Or Gerlitz, Add devlink and mlx5 support for controlling
E-Switch encapsulation mode, this knob will enable HW support for applying
encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bc95cd8e

net: add rcu locking when changing early demux · 58c4c6a3

David Ahern authored Apr 22, 2017

systemd-sysctl is triggering a suspicious RCU usage message when
net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via
a sysctl config file:

[   33.896184] ===============================
[   33.899558] [ ERR: suspicious RCU usage.  ]
[   33.900624] 4.11.0-rc7+ #104 Not tainted
[   33.901698] -------------------------------
[   33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage!
[   33.905724]
other info that might help us debug this:

[   33.907656]
rcu_scheduler_active = 2, debug_locks = 0
[   33.909288] 1 lock held by systemd-sysctl/143:
[   33.910373]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff8123a370>] file_start_write+0x45/0x48
[   33.912407]
stack backtrace:
[   33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104
[   33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[   33.917870] Call Trace:
[   33.918431]  dump_stack+0x81/0xb6
[   33.919241]  lockdep_rcu_suspicious+0x10f/0x118
[   33.920263]  proc_configure_early_demux+0x65/0x10a
[   33.921391]  proc_udp_early_demux+0x3a/0x41

add rcu locking to proc_configure_early_demux.

Fixes: dddb64bc ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

58c4c6a3

Merge branch 'for-upstream' of... · ad0cb27c

David S. Miller authored Apr 24, 2017

Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2017-04-22

Here are some more Bluetooth patches (and one 802.15.4 patch) in the
bluetooth-next tree targeting the 4.12 kernel. Most of them are pure
fixes.

Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ad0cb27c

net: netcp: fix spelling mistake: "memomry" -> "memory" · 11a9ec43

Colin Ian King authored Apr 22, 2017

Trivial fix to spelling mistake in dev_err message and rejoin
line.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11a9ec43

net: atheros: atl1: use offset_in_page() macro · 4cc17bcf

Geliang Tang authored Apr 22, 2017

Use offset_in_page() macro instead of open-coding.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4cc17bcf

Merge branch 'bnxt_en-misc-next' · 4dd42b94

David S. Miller authored Apr 24, 2017

Michael Chan says:

====================
bnxt_en: Updates for net-next.

Miscellaneous updates include passing DCBX RoCE VLAN priority to firmware,
checking one more new firmware flag before allowing DCBX to run on the host,
adding 100Gbps speed support, adding check to disallow speed settings on
Multi-host NICs, and a minor fix for reporting VF attributes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

4dd42b94

bnxt_en: Restrict a PF in Multi-Host mode from changing port PHY configuration · 9e54e322

Deepak Khungar authored Apr 21, 2017

This change restricts the PF in multi-host mode from setting any port
level PHY configuration.  The settings are controlled by firmware in
Multi-Host mode.
Signed-off-by: Deepak Khungar <deepak.khungar@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e54e322

bnxt_en: Check the FW_LLDP_AGENT flag before allowing DCBX host agent. · 7d63818a

Michael Chan authored Apr 21, 2017

Check the additional flag in bnxt_hwrm_func_qcfg() before allowing
DCBX to be done in host mode.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7d63818a

bnxt_en: Add 100G link speed reporting for BCM57454 ASIC in ethtool · 38a21b34

Deepak Khungar authored Apr 21, 2017

Added support for 100G link speed reporting for Broadcom BCM57454
ASIC in ethtool command.
Signed-off-by: Deepak Khungar <deepak.khungar@broadcom.com>
Signed-off-by: Ray Jui <ray.jui@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38a21b34

bnxt_en: Fix VF attributes reporting. · f0249056

Michael Chan authored Apr 21, 2017

The .ndo_get_vf_config() is returning the wrong qos attribute.  Fix
the code that checks and reports the qos and spoofchk attributes.  The
BNXT_VF_QOS and BNXT_VF_LINK_UP flags should not be set by default
during init. time.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f0249056

bnxt_en: Pass DCB RoCE app priority to firmware. · a82fba8d

Michael Chan authored Apr 21, 2017

When the driver gets the RoCE app priority set/delete call through DCBNL,
the driver will send the information to the firmware to set up the
priority VLAN tag for RDMA traffic.

[ New version using the common ETH_P_IBOE constant in if_ether.h ]
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a82fba8d

openvswitch: Add eventmask support to CT action. · 12064551

Jarno Rajahalme authored Apr 21, 2017

Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
which can be used in conjunction with the commit flag
(OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
conntrack events (IPCT_*) should be delivered via the Netfilter
netlink multicast groups.  Default behavior depends on the system
configuration, but typically a lot of events are delivered.  This can be
very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
types of events are of interest.

Netfilter core init_conntrack() adds the event cache extension, so we
only need to set the ctmask value.  However, if the system is
configured without support for events, the setting will be skipped due
to extension not being found.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

12064551

openvswitch: Typo fix. · abd0a4f2

Jarno Rajahalme authored Apr 21, 2017

Fix typo in a comment.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

abd0a4f2

Merge branch 'ibmvnic-updates-and-fixes' · 6df3d9e2

David S. Miller authored Apr 24, 2017

Nathan Fontenot says:

====================
ibmvnic: Additional updates and bug fixes

This set of patches is an additional set of updates and bug fixes to
the ibmvnic driver which applies on top of the previous set of updates
sent out on 4/19.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6df3d9e2

ibmvnic: Free skb's in cases of failure in transmit · 7f5b0308

Thomas Falcon authored Apr 21, 2017

When an error is encountered during transmit we need to free the
skb instead of returning TX_BUSY.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f5b0308

ibmvnic: Validate napi exist before disabling them · 3ca19932

Nathan Fontenot authored Apr 21, 2017

Validate that the napi structs exist before trying to disable them
at driver close.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ca19932

ibmvnic: Add set_link_state routine for setting adapter link state · 53da09e9

Nathan Fontenot authored Apr 21, 2017

Create a common routine for setting the link state for the vnic adapter.
This update moves the sending of the crq and waiting for the link state
response to a common place. The new routine also adds handling of
resending the crq in cases of getting a partial success response.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

53da09e9

ibmvnic: Move initialization of the stats token to ibmvnic_open · 5d5e84eb

Nathan Fontenot authored Apr 21, 2017

We should be initializing the stats token in the same place we
initialize the other resources for the driver.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d5e84eb

ibmvnic: Only retrieve error info if present · 2f9de9ba

Nathan Fontenot authored Apr 21, 2017

When handling a fatal error in the driver, there can be additional
error information provided by the vios. This information is not
always present, so only retrieve the additional error information
when present.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f9de9ba

ibmvnic: Insert header on VLAN tagged received frame · 6052d5e2

Murilo Fossa Vicentini authored Apr 21, 2017

This patch addresses a modification in the PAPR+ specification which now
defines a previously reserved value for vNIC capabilities. It indicates
whether the system firmware performs a VLAN header stripping on all VLAN
tagged received frames, in case it does, the behavior expected is for
the ibmvnic driver to be responsible for inserting the VLAN header.
Reported-by: Manvanthara B. Puttashankar <mputtash@in.ibm.com>
Signed-off-by: Murilo Fossa Vicentini <muvic@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6052d5e2

ibmvnic: Set real number of rx queues · 7f3c6e6b

Thomas Falcon authored Apr 21, 2017

Along with 5 TX queues, 5 RX queues are allocated at the beginning of
device probe. However, only the real number of TX queues is set. Configure
the real number of RX queues as well.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f3c6e6b

Merge branch 'packet-fanout-unique-id' · 6a32a44d

David S. Miller authored Apr 24, 2017

Mike Maloney says:

====================
packet: Add option to create new fanout group with unique id.

Fanout uses a per net global namespace. A process that intends to create a
new fanout group can accidentally join an existing group. It is
not possible to detect this.

Add a socket option to specify on the first call to
setsockopt(..., PACKET_FANOUT, ...) to ensure that a new group is created.
Also add tests.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6a32a44d

selftests/net: add tests for PACKET_FANOUT_FLAG_UNIQUEID · 28be04f5

Mike Maloney authored Apr 21, 2017

Create two groups with PACKET_FANOUT_FLAG_UNIQUEID, add a socket to one.
Ensure that the groups can only be joined if all options are consistent
with the original except for this flag.
Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28be04f5

packet: add PACKET_FANOUT_FLAG_UNIQUEID to assign new fanout group id. · 4a69a864

Mike Maloney authored Apr 21, 2017

Fanout uses a per net global namespace. A process that intends to create
a new fanout group can accidentally join an existing group. It is not
possible to detect this.

Add socket option PACKET_FANOUT_FLAG_UNIQUEID.  When specified the
supplied fanout group id must be set to 0, and the kernel chooses an id
that is not already in use.  This is an ephemeral flag so that
other sockets can be added to this group using setsockopt, but NOT
specifying this flag.  The current getsockopt(..., PACKET_FANOUT, ...)
can be used to retrieve the new group id.

We assume that there are not a lot of fanout groups and that this is not
a high frequency call.

The method assigns ids starting at zero and increases until it finds an
unused id.  It keeps track of the last assigned id, and uses it as a
starting point to find new ids.
Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a69a864

selftests/net: cleanup unused parameter in psock_fanout · 2e7a7217

Mike Maloney authored Apr 21, 2017

sock_fanout_open no longer sets the size of packet_socket ring, so stop
passing the parameter.
Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2e7a7217

mdio_bus: Issue GPIO RESET to PHYs. · 69226896

Roger Quadros authored Apr 21, 2017

Some boards [1] leave the PHYs at an invalid state
during system power-up or reset thus causing unreliability
issues with the PHY which manifests as PHY not being detected
or link not functional. To fix this, these PHYs need to be RESET
via a GPIO connected to the PHY's RESET pin.

Some boards have a single GPIO controlling the PHY RESET pin of all
PHYs on the bus whereas some others have separate GPIOs controlling
individual PHY RESETs.

In both cases, the RESET de-assertion cannot be done in the PHY driver
as the PHY will not probe till its reset is de-asserted.
So do the RESET de-assertion in the MDIO bus driver.

[1] - am572x-idk, am571x-idk, a437x-idk
Signed-off-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

69226896

Merge branch 'VSOCK-add-vsockmon' · 15769ff8

David S. Miller authored Apr 24, 2017

Stefan Hajnoczi says:

====================
VSOCK: vsockmon virtual device to monitor AF_VSOCK sockets.

v5:
 * Change vsock_deliver_tap() API to avoid unnecessary skb creation
   [Jorgen]
 * Fix skb leak when no taps are registered [Jorgen]
 * s/cpu_to_le16(pkt->hdr.op)/le16_to_cpu(pkt->hdr.op)/ [Michael]
 * Add af_vsock_tap.c and vsockmon.[ch] to MAINTAINERS
 * checkpatch.pl and sparse fixes

v4:
 * Add explicit reserved padding field to struct af_vsockmon_hdr and
   drop __attribute__((packed)) [Michael, DaveM]
 * Call synchronize_net() before module_put() [Michael]

v3:
 * Hook virtio_transport.c (guest driver), not just drivers/vhost/vsock.c (host
   driver)
 * Fix DEFAULT_MTU macro definition [Zhu Yanjun]
 * Rename af_vsockmon_hdr->t field ->transport for clarity
 * Update .ndo_get_stats64() return type since it has changed
 * Include missing <linux/module.h> header in af_vsock_tap.c

This is a continuation of Gerard Garcia's work on the vsockmon packet capture
interface for AF_VSOCK.  Packet capture is an essential feature for network
communication.  Gerard began addressing this feature gap in his Google Summer
of Code 2016 project.  I have cleaned up, rebased, and retested the v2 series
he posted previously.

The design follows the nlmon packet capture interface closely.  This is because
vsock has the same problem as netlink: there is no netdev on which packets can
be captured.  The nlmon driver is a synthetic netdev purely for the purpose of
enabling packet capture.  We follow the same approach here with vsockmon.

See include/uapi/linux/vsockmon.h in this series for details on the packet
layout.

How to try it:

1. Build tcpdump with vsockmon patches:

  $ git clone -b vsock https://github.com/stefanha/libpcap
  $ (cd libcap && ./configure && make)
  $ git clone -b vsock https://github.com/stefanha/tcpdump
  $ (cd tcpdump && ./configure && make)

2. Build nc-vsock (a netcat-like tool):

  $ git clone https://github.com/stefanha/nc-vsock
  $ (cd nc-vsock && make)

3. Launch a virtual machine:

  # modprobe vhost_vsock
  # qemu-system-x86_64 -M accel=kvm -m 1024 -cpu host \
      -drive if=virtio,file=test.img,format=raw \
      -device vhost-vsock-pci,guest-cid=3

  (Assumes guest is running a kernel with this patch)

4. Capture AF_VSOCK traffic in guest and/or host:

  # modprobe vsockmon
  # ip link add type vsockmon
  # ip link set vsockmon0 up
  # tcpdump -i vsockmon0 -vvv

5. Communicate!

  (host)$ nc-vsock -l 1234
  (guest)$ nc-vsock 2 1234
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

15769ff8

VSOCK: Add virtio vsock vsockmon hooks · 82dfb540

Gerard Garcia authored Apr 21, 2017

The virtio drivers deal with struct virtio_vsock_pkt.  Add
virtio_transport_deliver_tap_pkt(pkt) for handing packets to the
vsockmon device.

We call virtio_transport_deliver_tap_pkt(pkt) from
net/vmw_vsock/virtio_transport.c and drivers/vhost/vsock.c instead of
common code.  This is because the drivers may drop packets before
handing them to common code - we still want to capture them.
Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

82dfb540