Commits · 59af132bb68860de035388dcb531ac4ef46ccfa5 · nexedi / linux

13 May, 2015 40 commits

Merge branch 'geneve_tunnel_driver' · 59af132b

David S. Miller authored May 13, 2015

John W. Linville says:

====================
add GENEVE netdev tunnel driver

This 5-patch kernel series adds a netdev implementation of a GENEVE
tunnel driver, and the single iproute2 patch enables creation and
configuration of those netdevs.  This makes use of the existing GENEVE
infrastructure already used by the OVS code.  The net/ipv4/geneve.c
file is renamed as net/ipv4/geneve_core.c as part of these changes.

 drivers/net/Kconfig            |   14 +
 drivers/net/Makefile           |    1
 drivers/net/geneve.c           |  503 +++++++++++++++++++++++++++++++++++++++++
 include/net/geneve.h           |    5
 include/uapi/linux/if_link.h   |    9
 net/ipv4/Kconfig               |    4
 net/ipv4/Makefile              |    2
 net/ipv4/geneve.c              |    6
 net/ipv4/geneve_core.c         |    4
 net/openvswitch/Kconfig        |    2
 net/openvswitch/vport-geneve.c |    5
 11 files changed, 538 insertions(+), 17 deletions(-)

The overall structure of the GENEVE netdev driver is strongly
influenced by the VXLAN netdev driver.  This is not surprising, as the
two drivers are intended to serve similar purposes.  As development of
the GENEVE driver continues, it is likely that those similarities will
grow stronger.  This will include both simple configuration options
(e.g. TOS and TTL settings) and new control plane support.

The current implementation is very simple, restricting itself to point
to point links over IPv4.  This is due only to the simplicity of the
implementation, and no such limit is inherent to GENEVE in any way.
Support for IPv6 links and more sophisticated control plane options
are predictable enhancements.

Using the included iproute2 patch, a GENEVE tunnel is created thusly:

        ip link add dev gnv0 type geneve remote 192.168.22.1 vni 1234
        ip link set gnv0 up
        ip addr add 10.1.1.1/24 dev gnv0

After a corresponding tunnel interface is created at the link partner,
traffic should proceed as expected.

Please let me know if anyone has problems...thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
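
For readers unfamiliar with the encapsulation, the following is a minimal userspace sketch of the 8-byte GENEVE base header that would carry `vni 1234` from the example above. The field layout follows the GENEVE specification; the names `gnv_build_hdr`, `gnv_parse_vni` and the `GNV_*` constants are hypothetical and are not taken from the kernel driver, and the Ethernet payload type is an assumption about what the tunnel carries.

```c
#include <stdint.h>
#include <string.h>

#define GNV_BASE_HLEN 8        /* fixed base header; options would follow */
#define GNV_PROTO_TEB 0x6558   /* assumed payload: Transparent Ethernet Bridging */

/* Hypothetical helper: write a base header with no options into buf. */
static void gnv_build_hdr(uint8_t buf[GNV_BASE_HLEN], uint32_t vni)
{
    memset(buf, 0, GNV_BASE_HLEN);
    /* byte 0: version (2 bits) = 0, option length (6 bits) = 0 */
    /* byte 1: O and C flags plus reserved bits, all clear */
    buf[2] = GNV_PROTO_TEB >> 8;     /* protocol type, network byte order */
    buf[3] = GNV_PROTO_TEB & 0xff;
    buf[4] = (vni >> 16) & 0xff;     /* 24-bit VNI, network byte order */
    buf[5] = (vni >> 8) & 0xff;
    buf[6] = vni & 0xff;
    /* byte 7: reserved, left as zero */
}

/* Hypothetical helper: extract the 24-bit VNI from a base header. */
static uint32_t gnv_parse_vni(const uint8_t buf[GNV_BASE_HLEN])
{
    return ((uint32_t)buf[4] << 16) | ((uint32_t)buf[5] << 8) | buf[6];
}
```

Both tunnel endpoints must agree on the VNI, which is why the corresponding interface at the link partner is created with the same `vni` value.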


geneve: add initial netdev driver for GENEVE tunnels · 2d07dc79

John W. Linville authored May 13, 2015

This is an initial implementation of a netdev driver for GENEVE
tunnels.  This implementation uses a fixed UDP port, and only supports
point-to-point links with specific partner endpoints.  Only IPv4
links are supported at this time.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


geneve_core: identify as driver library in modules description · d37d29c3

John W. Linville authored May 13, 2015

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


geneve: Rename support library as geneve_core · 11e1fa46

John W. Linville authored May 13, 2015

net/ipv4/geneve.c -> net/ipv4/geneve_core.c

This name better reflects the purpose of the module.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


geneve: move definition of geneve_hdr() to geneve.h · 35d32e8f

John W. Linville authored May 13, 2015

This is a static inline with identical definitions in multiple places...
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


geneve: remove MODULE_ALIAS_RTNL_LINK from net/ipv4/geneve.c · 125907ae

John W. Linville authored May 13, 2015

This file is essentially a library for implementing the geneve
encapsulation protocol.  The file does not register any rtnl_link_ops,
so the MODULE_ALIAS_RTNL_LINK macro is inappropriate here.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset · f0b5e8a4

Pablo Neira authored May 12, 2015

This fixes 4577139b ("net: use jump label patching for ingress qdisc in
__netif_receive_skb_core").

The only client of this is sch_ingress and it depends on NET_CLS_ACT. So
there is no way these definitions can be of any help.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'packet_rollover' · 9f0a74d7

David S. Miller authored May 13, 2015

Willem de Bruijn says:

====================
refine packet socket rollover:

1. mitigate a case of lock contention
2. avoid exporting resource exhaustion to other sockets,
   by migrating only to a victim socket that has ample room
3. avoid reordering of most flows on the socket,
   by migrating first the flow responsible for load imbalance
4. help processes detect load imbalance,
   by exporting rollover counters

Context: rollover implements flow migration in packet socket fanout
groups in case of extreme load imbalance. It is a specific
implementation of migration that minimizes reordering by selecting
the same victim socket when possible (and by selecting subsequent
victims in a round robin fashion, from which its name derives).

Changes:
  v2 -> v3:
    - statistics: replace unsigned long with __aligned_u64
  v1 -> v2:
    - huge flow detection: run lockless
    - huge flow detection: replace stored index with random
    - contention avoidance: test in packet_poll while lock held
    - contention avoidance: clear pressure sooner

          packet_poll and packet_recvmsg would clear only if the sock
          is empty to avoid taking the necessary lock. But,
          * packet_poll already holds this lock, so a lockless variant
            __packet_rcv_has_room is cheap.
          * packet_recvmsg is usually called only for non-ring sockets,
            which also runs lockless.

    - preparation: drop "single return" patch

          packet_rcv_has_room is now a locked wrapper around
          __packet_rcv_has_room, achieving the same (single footer).

The benchmark mentioned in the patches is at
https://github.com/wdebruij/kerneltools/blob/master/tests/bench_rollover.c
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


packet: rollover statistics · a9b63918

Willem de Bruijn authored May 12, 2015

Rollover indicates exceptional conditions. Export a counter to inform
socket owners of this state.

If no socket with sufficient room is found, rollover fails. Also count
these events.

Finally, also count when flows are rolled over early thanks to huge
flow detection, to validate its correctness.

Tested:
  Read counters in bench_rollover on all other tests in the patchset
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


packet: rollover huge flows before small flows · 3b3a5b0a

Willem de Bruijn authored May 12, 2015

Migrate flows from a socket to another socket in the fanout group not
only when the socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.

Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these are the same rxhash as the current packet, then drop it. This
only protects against 4-tuple attacks. N is chosen to fit all data in
a single cache line.

Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
cpu       rx     rx.k  drop.k  rollover   r.huge  r.failed
  0       14       14       0         0        0         0
  1       20       20       0         0        0         0
  2       16       16       0         0        0         0
  3  6168824  6168824       0   4867721  4867721         0
  4  4867741  4867741       0         0        0         0
  5       12       12       0         0        0         0
  6       15       15       0         0        0         0
  7       17       17       0         0        0         0
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
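
The detection scheme described in this commit can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel's fanout_flow_is_huge(): the names `flow_history`, `flow_is_huge` and `HIST_LEN` are hypothetical, N = 16 assumes 4-byte hashes and a 64-byte cache line, and the random replacement slot follows the "replace stored index with random" note in the series changelog.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define HIST_LEN 16   /* assumed N: 16 x 4-byte hashes fill one 64-byte cache line */

struct flow_history {
    uint32_t hash[HIST_LEN];   /* rxhashes of recently seen packets */
};

/*
 * Return true if more than half of the recorded hashes equal rxhash,
 * i.e. the packet's flow accounts for over half of the recent load.
 * The current hash is then recorded in a randomly chosen slot.
 */
static bool flow_is_huge(struct flow_history *h, uint32_t rxhash)
{
    int i, count = 0;

    for (i = 0; i < HIST_LEN; i++)
        if (h->hash[i] == rxhash)
            count++;

    h->hash[rand() % HIST_LEN] = rxhash;
    return count > HIST_LEN / 2;
}
```

The check costs a single pass over one cache line per packet, which is what keeps it cheap enough to run before the socket is actually full.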


packet: rollover lock contention avoidance · 2ccdbaa6

Willem de Bruijn authored May 12, 2015

Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive
especially if the packet sockets use rings, when a lock has to be
acquired.

Avoid pounding on the lock by all sockets by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.

Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.

Tested:
  Ran bench_rollover: a process with 8 sockets in a single fanout
  group, each pinned to a single cpu that receives one nic recv
  interrupt. RPS and RFS are disabled. The benchmark uses packet
  rx_ring, which has to take a lock when determining whether a
  socket has room.

  Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
  uniformly across the packet sockets (and inserted an iptables
  rule to drop in PREROUTING to avoid protocol stack processing).

  Without this patch, all sockets try to migrate traffic to
  neighbors, causing lock contention when searching for a non-
  empty neighbor. The lock is the top 9 entries.

    perf record -a -g sleep 5

    -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
       - _raw_spin_lock
          - 99.00% spin_lock
             + 81.77% packet_rcv_has_room.isra.41
             + 18.23% tpacket_rcv
          + 0.84% packet_rcv_has_room.isra.41
    +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
    +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41

  On net-next with this patch, this lock contention is no longer a
  top entry. Most time is spent in the actual read function. Next up
  are other locks:

    +  15.52%  bench_rollover  bench_rollover     [.] reader
    +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
    +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
    +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
    +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
    +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq

  Looking closer at the remaining _raw_spin_lock, the cost of probing
  in rollover is now comparable to the cost of taking the lock later
  in tpacket_rcv.

    -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
       - _raw_spin_lock
          + 33.41% packet_rcv_has_room
          + 28.15% tpacket_rcv
          + 19.54% enqueue_to_backlog
          + 6.45% __free_pages_ok
          + 2.78% packet_rcv_fanout
          + 2.13% fanout_demux_rollover
          + 2.01% netif_receive_skb_internal
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
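
The "under memory pressure" marking can be illustrated with a small userspace sketch. It assumes a C11 atomic flag and replaces the real ring and its lock with plain counters; `sock_sketch`, `has_room_locked` and `usable_as_victim` are hypothetical names, not kernel symbols.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sock_sketch {
    atomic_bool pressure;   /* set while the socket is short on room */
    int free_slots;         /* stand-in for the lock-protected ring state */
    int lock_taken;         /* counts expensive probes, for illustration only */
};

/*
 * Stand-in for the expensive, locked room check. It updates the
 * pressure flag, so the owner clears it once room is available again.
 */
static bool has_room_locked(struct sock_sketch *s)
{
    bool room;

    s->lock_taken++;                     /* models acquiring the lock */
    room = s->free_slots > 0;
    atomic_store(&s->pressure, !room);
    return room;
}

/*
 * Called by other sockets searching for a rollover victim: a socket
 * under pressure is skipped without touching its lock.
 */
static bool usable_as_victim(struct sock_sketch *s)
{
    if (atomic_load(&s->pressure))
        return false;
    return has_room_locked(s);
}
```

In this sketch only the first failed probe pays for the lock; later probes from neighbors return early until the owner's own call to `has_room_locked` observes free space and clears the flag.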


packet: rollover only to socket with headroom · 9954729b

Willem de Bruijn authored May 12, 2015

Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.

The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom is
not computed easily. All three need a different implementation to
estimate free space.

Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

bench_rollover has as many sockets as there are NIC receive queues
in the system. Each socket is owned by a process that is pinned to
one of the receive cpus. RFS is disabled. RPS is enabled with an
identity mapping (cpu x -> cpu x), to count drops with softnettop.

lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
Press [Enter] to exit

cpu       rx     rx.k  drop.k  rollover   r.huge  r.failed
  0       16       16       0         0        0         0
  1       21       21       0         0        0         0
  2  5227502  5227502       0         0        0         0
  3       18       18       0         0        0         0
  4  6083289  6083289       0   5227496        0         0
  5       22       22       0         0        0         0
  6       21       21       0         0        0         0
  7        9        9       0         0        0         0
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
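
For the simplest of the three buffer types, a regular socket buffer, the headroom test might look like the sketch below. The three-level result, the function name `buf_room` and the exact handling of the 25% boundary are assumptions made for illustration; the ring variants need different estimates and are not shown.

```c
#include <stddef.h>

enum room { ROOM_NONE, ROOM_LOW, ROOM_NORMAL };

/*
 * Classify a regular (non-ring) socket buffer.
 *   rcvbuf: buffer capacity in bytes
 *   used:   bytes currently queued
 *   pkt:    size of the packet about to be queued
 */
static enum room buf_room(size_t rcvbuf, size_t used, size_t pkt)
{
    size_t avail = used < rcvbuf ? rcvbuf - used : 0;

    if (avail > rcvbuf / 4)
        return ROOM_NORMAL;   /* over a quarter free: acceptable rollover target */
    if (avail > pkt)
        return ROOM_LOW;      /* fits this packet, but not a good victim */
    return ROOM_NONE;
}
```

Only a socket reporting `ROOM_NORMAL` would be chosen as a victim, so a nearly full neighbor is not pushed into dropping its own traffic.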


packet: rollover prepare: per-socket state · 0648ab70

Willem de Bruijn authored May 12, 2015

Replace rollover state per fanout group with state per socket. Future
patches will add fields to the new structure.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


packet: rollover prepare: move code out of callsites · ad377cab

Willem de Bruijn authored May 12, 2015

packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.

The main difference between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
rollover mode has no original socket to test.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
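
The difference between the two callsites, together with the "same victim first, then round robin" selection described in the series cover letter, can be sketched as a single callee. This is a simplified userspace model: `group_sketch`, `demux_rollover` and the `try_self` parameter are hypothetical names, and the per-socket room check is reduced to a boolean array.

```c
#include <stdbool.h>

#define GROUP_MAX 8

struct group_sketch {
    bool has_room[GROUP_MAX];  /* stand-in for the per-socket room check */
    int num;                   /* number of sockets in the fanout group */
    int last_victim;           /* socket chosen by the previous rollover */
};

/*
 * idx:      socket picked by the primary mode (ignored as a candidate
 *           filter when try_self is false)
 * try_self: true for the flag variant, which keeps idx if it has room
 */
static int demux_rollover(struct group_sketch *g, int idx, bool try_self)
{
    int skip = try_self ? idx : -1;
    int i, j;

    if (try_self && g->has_room[idx])
        return idx;

    /* Start at the previous victim to keep flows on one socket. */
    i = j = g->last_victim % g->num;
    do {
        if (i != skip && g->has_room[i]) {
            g->last_victim = i;
            return i;
        }
        if (++i == g->num)
            i = 0;             /* wrap around: round robin */
    } while (i != j);

    return idx;                /* no socket has room: rollover fails */
}
```

Keeping both variants in one function leaves each callsite with a single call that differs only in the `try_self` argument.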


ipv4: __ip_local_out_sk() is static · 7d771aaa

Eric Dumazet authored May 12, 2015

__ip_local_out_sk() is only used from net/ipv4/ip_output.c

net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not
declared. Should it be static?

Fixes: 7026b1dd ("netfilter: Pass socket pointer down through okfn().")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


tcp/dccp: tw_timer_handler() is static · 216f8bb9

Eric Dumazet authored May 12, 2015

tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c

Fixes: 789f558c ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'cls_flower' · dd58c635

David S. Miller authored May 13, 2015

Jiri Pirko says:

====================
introduce programmable flow dissector and cls_flower

Per Davem's request, I prepared this patchset which introduces programmable
flow dissector. For current users of flow_keys, there is a wrapper
skb_flow_dissect_flow_keys which maintains the previous behaviour.
For purposes of cls_flower, a couple of new dissection keys were introduced.

Note that this dissector can also eventually be used by openvswitch code.

Also, as a next step, I plan to get rid of *skb_flow_get_ports(export)
and *__skb_get_poff as their functionality can be now implemented by
skb_flow_dissect as well.

v2->v3:
- remove TCA_FLOWER_POLICE attr suggested by Jamal

v1->v2:
- move __skb_tx_hash rather to dev.c as suggested by Alex
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
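
One way to read "programmable" here is that each user tells the dissector which keys it wants and where to store them. The sketch below illustrates that idea in userspace; it is an assumption-laden model, not the kernel API. All names (`dissector_key`, `dissector_init`, `dissect`, `pkt_sketch`, the `KEY_*` ids) are hypothetical, and the packet is a pre-parsed stand-in for an skb.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum key_id { KEY_IPV4_ADDRS, KEY_PORTS, KEY_MAX };

struct dissector_key {
    enum key_id id;
    size_t offset;              /* where the caller wants this key stored */
};

struct dissector {
    int used[KEY_MAX];          /* which keys this user asked for */
    size_t offset[KEY_MAX];     /* destination offset per requested key */
};

/* Pre-parsed stand-in for a packet. */
struct pkt_sketch {
    uint32_t src, dst;
    uint16_t sport, dport;
};

/* Program the dissector with the caller's key list. */
static void dissector_init(struct dissector *d,
                           const struct dissector_key *keys, size_t n)
{
    memset(d, 0, sizeof(*d));
    for (size_t i = 0; i < n; i++) {
        d->used[keys[i].id] = 1;
        d->offset[keys[i].id] = keys[i].offset;
    }
}

/* Copy only the requested keys into the caller's container. */
static void dissect(const struct dissector *d, const struct pkt_sketch *p,
                    void *target)
{
    if (d->used[KEY_IPV4_ADDRS]) {
        uint32_t addrs[2] = { p->src, p->dst };
        memcpy((char *)target + d->offset[KEY_IPV4_ADDRS], addrs, sizeof(addrs));
    }
    if (d->used[KEY_PORTS]) {
        uint16_t ports[2] = { p->sport, p->dport };
        memcpy((char *)target + d->offset[KEY_PORTS], ports, sizeof(ports));
    }
}
```

Under this model, a wrapper that preserves the previous behaviour for existing users would simply be a dissector pre-programmed with the old fixed set of keys.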
