- 13 May, 2015 40 commits
-
-
David S. Miller authored
John W. Linville says: ==================== add GENEVE netdev tunnel driver This 5-patch kernel series adds a netdev implementation of a GENEVE tunnel driver, and the single iproute2 patch enables creation and such for those netdevs. This makes use of the existing GENEVE infrastructure already used by the OVS code. The net/ipv4/geneve.c file is renamed as net/ipv4/geneve_core.c as part of these changes. drivers/net/Kconfig | 14 + drivers/net/Makefile | 1 drivers/net/geneve.c | 503 +++++++++++++++++++++++++++++++++++++++++ include/net/geneve.h | 5 include/uapi/linux/if_link.h | 9 net/ipv4/Kconfig | 4 net/ipv4/Makefile | 2 net/ipv4/geneve.c | 6 net/ipv4/geneve_core.c | 4 net/openvswitch/Kconfig | 2 net/openvswitch/vport-geneve.c | 5 11 files changed, 538 insertions(+), 17 deletions(-) The overall structure of the GENEVE netdev driver is strongly influenced by the VXLAN netdev driver. This is not surprising, as the two drivers are intended to serve similar purposes. As development of the GENEVE driver continues, it is likely that those similarities will grow stronger. This will include both simple configuration options (e.g. TOS and TTL settings) and new control plane support. The current implementation is very simple, restricting itself to point to point links over IPv4. This is due only to the simplicity of the implementation, and no such limit is inherent to GENEVE in any way. Support for IPv6 links and more sophisticated control plane options are predictable enhancements. Using the included iproute2 patch, a GENEVE tunnel is created thusly: ip link add dev gnv0 type geneve remote 192.168.22.1 vni 1234 ip link set gnv0 up ip addr add 10.1.1.1/24 dev gnv0 After a corresponding tunnel interface is created at the link partner, traffic should proceed as expected. Please let me know if anyone has problems...thanks! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
John W. Linville authored
This is an initial implementation of a netdev driver for GENEVE tunnels. This implementation uses a fixed UDP port, and only supports point-to-point links with specific partner endpoints. Only IPv4 links are supported at this time. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
John W. Linville authored
Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
John W. Linville authored
net/ipv4/geneve.c -> net/ipv4/geneve_core.c This name better reflects the purpose of the module. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
John W. Linville authored
This is a static inline with identical definitions in multiple places... Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
John W. Linville authored
This file is essentially a library for implementing the geneve encapsulation protocol. The file does not register any rtnl_link_ops, so the MODULE_ALIAS_RTNL_LINK macro is inappropriate here. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Pablo Neira authored
This fixes 4577139b ("net: use jump label patching for ingress qdisc in __netif_receive_skb_core"). The only client of this is sch_ingress and it depends on NET_CLS_ACT. So there is no way these definition can be of any help. Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Willem de Bruijn says: ==================== refine packet socket rollover: 1. mitigate a case of lock contention 2. avoid exporting resource exhaustion to other sockets, by migrating only to a victim socket that has ample room 3. avoid reordering of most flows on the socket, by migrating first the flow responsible for load imbalance 4. help processes detect load imbalance, by exporting rollover counters Context: rollover implements flow migration in packet socket fanout groups in case of extreme load imbalance. It is a specific implementation of migration that minimizes reordering by selecting the same victim socket when possible (and by selecting subsequent victims in a round robin fashion, from which its name derives). Changes: v2 -> v3: - statistics: replace unsigned long with __aligned_u64 v1 -> v2: - huge flow detection: run lockless - huge flow detection: replace stored index with random - contention avoidance: test in packet_poll while lock held - contention avoidance: clear pressure sooner packet_poll and packet_recvmsg would clear only if the sock is empty to avoid taking the necessary lock. But, * packet_poll already holds this lock, so a lockless variant __packet_rcv_has_room is cheap. * packet_recvmsg is usually called only for non-ring sockets, which also runs lockless. - preparation: drop "single return" patch packet_rcv_has_room is now a locked wrapper around __packet_rcv_has_room, achieving the same (single footer). The benchmark mentioned in the patches is at https://github.com/wdebruij/kerneltools/blob/master/tests/bench_rollover.c ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Rollover indicates exceptional conditions. Export a counter to inform socket owners of this state. If no socket with sufficient room is found, rollover fails. Also count these events. Finally, also count when flows are rolled over early thanks to huge flow detection, to validate its correctness. Tested: Read counters in bench_rollover on all other tests in the patchset Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Migrate flows from a socket to another socket in the fanout group not only when the socket is full. Start migrating huge flows early, to divert possible 4-tuple attacks without affecting normal traffic. Introduce fanout_flow_is_huge(). This detects huge flows, which are defined as taking up more than half the load. It does so cheaply, by storing the rxhashes of the N most recent packets. If over half of these are the same rxhash as the current packet, then drop it. This only protects against 4-tuple attacks. N is chosen to fit all data in a single cache line. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s cpu rx rx.k drop.k rollover r.huge r.failed 0 14 14 0 0 0 0 1 20 20 0 0 0 0 2 16 16 0 0 0 0 3 6168824 6168824 0 4867721 4867721 0 4 4867741 4867741 0 0 0 0 5 12 12 0 0 0 0 6 15 15 0 0 0 0 7 17 17 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Rollover has to call packet_rcv_has_room on sockets in the fanout group to find a socket to migrate to. This operation is expensive especially if the packet sockets use rings, when a lock has to be acquired. Avoid pounding on the lock by all sockets by temporarily marking a socket as "under memory pressure" when such pressure is detected. While set, only the socket owner may call packet_rcv_has_room on the socket. Once it detects normal conditions, it clears the flag. The socket is not used as a victim by any other socket in the meantime. Under reasonably balanced load, each socket writer frequently calls packet_rcv_has_room and clears its own pressure field. As a backup for when the socket is rarely written to, also clear the flag on reading (packet_recvmsg, packet_poll) if this can be done cheaply (i.e., without calling packet_rcv_has_room). This is only for edge cases. Tested: Ran bench_rollover: a process with 8 sockets in a single fanout group, each pinned to a single cpu that receives one nic recv interrupt. RPS and RFS are disabled. The benchmark uses packet rx_ring, which has to take a lock when determining whether a socket has room. Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread uniformly across the packet sockets (and inserted an iptables rule to drop in PREROUTING to avoid protocol stack processing). Without this patch, all sockets try to migrate traffic to neighbors, causing lock contention when searching for a non- empty neighbor. The lock is the top 9 entries. perf record -a -g sleep 5 - 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock - 99.00% spin_lock + 81.77% packet_rcv_has_room.isra.41 + 18.23% tpacket_rcv + 0.84% packet_rcv_has_room.isra.41 + 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock + 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock + 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock + 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock + 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock + 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock + 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41 On net-next with this patch, this lock contention is no longer a top entry. Most time is spent in the actual read function. Next up are other locks: + 15.52% bench_rollover bench_rollover [.] reader + 4.68% swapper [kernel.kallsyms] [k] memcpy_erms + 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51 + 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms + 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv + 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq Looking closer at the remaining _raw_spin_lock, the cost of probing in rollover is now comparable to the cost of taking the lock later in tpacket_rcv. - 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock + 33.41% packet_rcv_has_room + 28.15% tpacket_rcv + 19.54% enqueue_to_backlog + 6.45% __free_pages_ok + 2.78% packet_rcv_fanout + 2.13% fanout_demux_rollover + 2.01% netif_receive_skb_internal Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Only migrate flows to sockets that have sufficient headroom, where sufficient is defined as having at least 25% empty space. The kernel has three different buffer types: a regular socket, a ring with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The latter two do not expose a read pointer to the kernel, so headroom is not computed easily. All three needs a different implementation to estimate free space. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. bench_rollover has as many sockets as there are NIC receive queues in the system. Each socket is owned by a process that is pinned to one of the receive cpus. RFS is disabled. RPS is enabled with an identity mapping (cpu x -> cpu x), to count drops with softnettop. lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s Press [Enter] to exit cpu rx rx.k drop.k rollover r.huge r.failed 0 16 16 0 0 0 0 1 21 21 0 0 0 0 2 5227502 5227502 0 0 0 0 3 18 18 0 0 0 0 4 6083289 6083289 0 5227496 0 0 5 22 22 0 0 0 0 6 21 21 0 0 0 0 7 9 9 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Replace rollover state per fanout group with state per socket. Future patches will add fields to the new structure. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover logic into the callee to simplify these callsites, especially with upcoming changes. The main differences between the two callsites is that the FLAG variant tests whether the socket previously selected by another mode (RR, RND, HASH, ..) has room before migrating flows, whereas the rollover mode has no original socket to test. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
__ip_local_out_sk() is only used from net/ipv4/ip_output.c net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not declared. Should it be static? Fixes: 7026b1dd ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c Fixes: 789f558c ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jiri Pirko says: ==================== introduce programable flow dissector and cls_flower Per Davem's request, I prepared this patchset which introduces programmable flow dissector. For current users of flow_keys, there is a wrapper skb_flow_dissect_flow_keys which maintains the previous behaviour. For purposes of cls_flower, couple of new dissection keys were introduced. Note that this dissector can be also eventually used by openvswitch code. Also, as a next step, I plan to get rid of *skb_flow_get_ports(export) and *__skb_get_poff as their functionality can be now implemented by skb_flow_dissect as well. v2->v3: - remove TCA_FLOWER_POLICE attr suggested by Jamal v1->v2: - move __skb_tx_hash rather to dev.c as suggested by Alex ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
This patch introduces a flow-based filter. So far, the very essential packet fields are supported. This patch is only the first step. There is a lot of potential performance improvements possible to implement. Also a lot of features are missing now. They will be addressed in follow-up patches. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
So far, only hashes made out of ipv6 addresses could be dissected. This patch introduces support for dissection of full ipv6 addresses. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Introduce dissector infrastructure which allows user to specify which parts of skb he wants to dissect. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
next to its user. No relation to flow_dissector so it makes no sense to have it in flow_dissector.c Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
__skb_tx_hash function has no relation to flow_dissect so just move it to dev.c Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Since the definition of the function is in flow_dissector.c, it makes sense to have the declaration in flow_dissector.h Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Since these functions are defined in flow_dissector.c, move header declarations from skbuff.h into flow_dissector.h Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
commit 56193d1b ("net: Add function for parsing the header length out of linear ethernet frames") added this function declaration but it is defined nowhere. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
add couple of empty lines on the way. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Edward Cree says: ==================== sfc: Bowdlerise PTP MCDI errors When the NIC doesn't support PTP, probe-time MCDI commands fail in predictable ways. Instead of logging cryptic MCDI errors, just log that PTP isn't supported. v2: Hopefully stop Thunderbird mangling the patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Edward Cree authored
Also, remove a needless netif_err() from efx_ptp_update_stats() - if the MCDI fails it'll print its own error message, we don't need another that adds no information. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Edward Cree authored
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
Seems all we want here is to avoid endless 'goto reclassify' loop. tc_classify_compat even resets this counter when something other than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't break hypothetical loops induced by something other than perpetual TC_ACT_RECLASSIFY return values. skb_act_clone is now identical to skb_clone, so just use that. Tested with following (bogus) filter: tc filter add dev eth0 parent ffff: \ protocol ip u32 match u32 0 0 police rate 10Kbit burst \ 64000 mtu 1500 action reclassify Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller authored
Four minor merge conflicts: 1) qca_spi.c renamed the local variable used for the SPI device from spi_device to spi, meanwhile the spi_set_drvdata() call got moved further up in the probe function. 2) Two changes were both adding new members to codel params structure, and thus we had overlapping changes to the initializer function. 3) 'net' was making a fix to sk_release_kernel() which is completely removed in 'net-next'. 4) In net_namespace.c, the rtnl_net_fill() call for GET operations had the command value fixed, meanwhile 'net-next' adjusted the argument signature a bit. This also matches example merge resolutions posted by Stephen Rothwell over the past two days. Signed-off-by: David S. Miller <davem@davemloft.net>
-
Justin Cormack authored
Fix missing copy_from_user in macvtap SIOCSIFHWADDR ioctl. Signed-off-by: Justin Cormack <justin@netbsd.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Feldman authored
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions which was causing build issues on the newly added switchdev attr/obj structs. Fix this by using named union on structs. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Scott Feldman says: ==================== switchdev: more (minor) cleanups Fix some sparse warnings and include some documentation review comments that didn't get picked up in the switchdev Spring Cleanup series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-