Commits · ee40fb2e1eb5bc0ddd3f2f83c6e39a454ef5a741 · Kirill Smelkov / linux

29 Oct, 2017 9 commits

l2tp: protect sock pointer of struct pppol2tp_session with RCU · ee40fb2e

Guillaume Nault authored Oct 27, 2017

pppol2tp_session_create() registers sessions that can't have their
corresponding socket initialised. This socket has to be created by
userspace, then connected to the session by pppol2tp_connect().
Therefore, we need to protect the pppol2tp socket pointer of L2TP
sessions, so that it can safely be updated when userspace is connecting
or closing the socket. This will eventually allow pppol2tp_connect()
to avoid generating transient states while initialising its parts of the
session.

To this end, this patch protects the pppol2tp socket pointer using RCU.

The pppol2tp socket pointer is still set in pppol2tp_connect(), but
only once we know the function isn't going to fail. It's eventually
reset by pppol2tp_release(), which now has to wait for a grace period
to elapse before it can drop the last reference on the socket. This
ensures that pppol2tp_session_get_sock() can safely grab a reference
on the socket, even after ps->sk is reset to NULL but before this
operation actually gets visible from pppol2tp_session_get_sock().

The rest is standard RCU conversion: pppol2tp_recv(), which already
runs in atomic context, is simply enclosed by rcu_read_lock() and
rcu_read_unlock(), while other functions are converted to use
pppol2tp_session_get_sock() followed by sock_put().
pppol2tp_session_setsockopt() is a special case. It used to retrieve
the pppol2tp socket from the L2TP session, which itself was retrieved
from the pppol2tp socket. Therefore we can just avoid dereferencing
ps->sk and directly use the original socket pointer instead.

With all users of ps->sk now handling NULL and concurrent updates, the
L2TP ->ref() and ->deref() callbacks aren't needed anymore. Therefore,
rather than converting pppol2tp_session_sock_hold() and
pppol2tp_session_sock_put(), we can just drop them.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee40fb2e

l2tp: initialise l2tp_eth sessions before registering them · ee28de6b

Guillaume Nault authored Oct 27, 2017

Sessions must be initialised before being made externally visible by
l2tp_session_register(). Otherwise the session may be concurrently
deleted before being initialised, which can confuse the deletion path
and eventually lead to kernel oops.

Therefore, we need to move l2tp_session_register() down in
l2tp_eth_create(), but also handle the intermediate step where only the
session or the netdevice has been registered.

We can't just call l2tp_session_register() in ->ndo_init() because
we'd have no way to properly undo this operation in ->ndo_uninit().
Instead, let's register the session and the netdevice in two different
steps and protect the session's device pointer with RCU.

And now that we allow the session's .dev field to be NULL, we don't
need to prevent the netdevice from being removed anymore. So we can
drop the dev_hold() and dev_put() calls in l2tp_eth_create() and
l2tp_eth_dev_uninit().

Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee28de6b

l2tp: don't register sessions in l2tp_session_create() · 3953ae7b

Guillaume Nault authored Oct 27, 2017

Sessions created by l2tp_session_create() aren't fully initialised:
some pseudo-wire specific operations need to be done before making the
session usable. Therefore the PPP and Ethernet pseudo-wires continue
working on the returned l2tp session while it's already been exposed to
the rest of the system.
This can lead to various issues. In particular, the session may enter
the deletion process before having been fully initialised, which will
confuse the session removal code.

This patch moves session registration out of l2tp_session_create(), so
that callers can control when the session is exposed to the rest of the
system. This is done by the new l2tp_session_register() function.

Only pppol2tp_session_create() can be easily converted to avoid
modifying its session after registration (the debug message is dropped
in order to avoid the need for holding a reference on the session).

For pppol2tp_connect() and l2tp_eth_create()), more work is needed.
That'll be done in followup patches. For now, let's just register the
session right after its creation, like it was done before. The only
difference is that we can easily take a reference on the session before
registering it, so, at least, we're sure it's not going to be freed
while we're working on it.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

3953ae7b

tcp: Remove "linux/unaligned/access_ok.h" include. · 949cf8b1

David S. Miller authored Oct 29, 2017

This causes build failures:

In file included from net/ipv4/tcp_input.c:79:0:
./include/linux/unaligned/access_ok.h:7:28: error: redefinition of
'get_unaligned_le16'
In file included from ./include/asm-generic/unaligned.h:17:0,
                 from ./arch/arm/include/generated/asm/unaligned.h:1,
                 from net/ipv4/tcp_input.c:76:
./include/linux/unaligned/le_struct.h:6:19: note: previous definition
of 'get_unaligned_le16' was here
In file included from net/ipv4/tcp_input.c:79:0:
./include/linux/unaligned/access_ok.h:12:28: error: redefinition of
'get_unaligned_le32'

Plain "asm/access_ok.h", which is already included, is
sufficient.

Fixes: 60e2a778 ("tcp: TCP experimental option for SMC")
Reported-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

949cf8b1

cxgb3: Check and handle the dma mapping errors · c69fe407

Arjun Vynipadath authored Oct 27, 2017

This patch adds checks at approprate places whether *dma_map*() call has
succeeded or not.

Original Work by: Santosh Rastapur <santosh@chelsio.com>
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c69fe407

r8169: Add support for interrupt coalesce tuning (ethtool -C) · 50970831

Francois Romieu authored Oct 27, 2017

Kirr: In particular with

	ethtool -C <ifname> rx-usecs 0 rx-frames 0

now it is possible to disable RX delays when NIC usage requires low-latency.

See this thread for context:

	https://www.spinics.net/lists/netdev/msg217665.html

My specific case is that:

We have many computers with gigabit Realtek NICs. For 2 such computers
connected to a gigabit store-and-forward switch the minimum round-trip
time for small pings (`ping -i 0 -w 3 -s 56 -q peer`) is ~ 30μs.

However it turned out that when Ethernet frame length transitions 127 ->
128 bytes (`ping -i 0 -w 3 -s {81 -> 82} -q peer`) the lowest RTT
transitions step-wise to ~ 270μs.

As David Light said this is RX interrupt mitigation done by NIC which creates
the latency. For workloads when low-latency is required with e.g. Intel,
BCM etc NIC drivers one just uses `ethtool -C rx-usecs ...` to reduce
the time NIC delays before interrupting CPU, but it turned out
`ethtool -C` is not supported by r8169 driver.

Like Stéphane ANCELOT I've traced the problem down to IntrMitigate being
hardcoded to != 0 for our chips (we have 8168 based NICs):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n5460
static void rtl_hw_start_8169(struct net_device *dev) {
        ...
        /*
         * Undocumented corner. Supposedly:
         * (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets
         */
        RTL_W16(IntrMitigate, 0x0000);

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n6346
static void rtl_hw_start_8168(struct net_device *dev) {
        ...
        RTL_W16(IntrMitigate, 0x5151);

and then I've also found

	https://www.spinics.net/lists/netdev/msg217665.html

and original Francois' patch:

	https://www.spinics.net/lists/netdev/msg217984.html
	https://www.spinics.net/lists/netdev/msg218207.html

So could we please finally get support for tuning r8169 interrupt
coalescing in tree? (so that next poor soul who hits the problem does
not need to go all the way to dig into driver sources and internet
wildly and finally patch locally

        -RTL_W16(IntrMitigate, 0x5151);
        +RTL_W16(IntrMitigate, 0x5100);

guessing whether it is right or not and also having to care to deploy
the patch everywhere it needs to be used, etc...).

To do so I've took original Francois's patch from 2012 and reworked it a bit:

- updated to latest net-next.git;
- adjusted scaling setup based on feedback from Hayes to pick up scaling
  vector depending not only on link speed but also on CPlusCmd[0:1] and to
  adjust CPlusCmd[0:1] correspondingly when setting timings;
- improved a bit (I think so) error handling.

I've tested the patch on "RTL8168d/8111d" (XID 083000c0) and with it and
`ethtool -C rx-usecs 0 rx-frames 0` on both ends it improves:

- minimum RTT latency:

        ~270μs ->  ~30μs (small packet),
        ~330μs -> ~110μs (full 1.5K ethernet frame)

- average RTT latency:

        ~480μs ->  ~50μs (small packet),
        ~560μs -> ~125μs (full 1.5K ethernet frame)

( before:

        root@neo1:# ping -i 0 -w 3 -s 82 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        5906 packets transmitted, 5905 received, 0% packet loss, time 2999ms
        rtt min/avg/max/mdev = 0.274/0.485/0.607/0.026 ms, ipg/ewma 0.508/0.489 ms

        root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        5073 packets transmitted, 5073 received, 0% packet loss, time 2999ms
        rtt min/avg/max/mdev = 0.330/0.566/0.710/0.028 ms, ipg/ewma 0.591/0.544 ms

  after:

        root@neo1# ping -i 0 -w 3 -s 82 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        45815 packets transmitted, 45815 received, 0% packet loss, time 3000ms
        rtt min/avg/max/mdev = 0.036/0.051/0.368/0.010 ms, ipg/ewma 0.065/0.053 ms

        root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        21250 packets transmitted, 21250 received, 0% packet loss, time 3000ms
        rtt min/avg/max/mdev = 0.112/0.125/0.390/0.007 ms, ipg/ewma 0.141/0.125 ms

  the small -> 1.5K latency growth is understandable as it takes ~15μs
  to transmit 1.5K on 1Gbps on the wire and with 2 hosts and 1 switch
  and ICMP ECHO + ECHO reply the packet has to travel 4 ethernet
  segments which is already 60μs;

  probably something a bit else is also there as e.g. on Linux, even
  with `cpupower frequency-set -g performance`, on some computers I've
  noticed the kernel can be spending more time in software-only mode
  when incoming packets go in less frequently. E.g. this program can
  demonstrate the effect for ICMP ECHO processing:

  https://lab.nexedi.com/kirr/bcc/blob/43cfc13b/tools/pinglat.py

  (later this was found to be partly due to C-states exit latencies) )

We have this patch running in our testing setup for 1 months already
without any issues observed.

It remains to be clarified whether RX and TX timers use the same base.
For now I've set them equally, but Francois's original patch version
suggests it could be not the same.

I've got no feedback at all to my original posting of this patch and questions

	https://www.spinics.net/lists/netdev/msg457173.html

neither from Francois, nor from any people from Realtek during one month.

So I suggest we simply apply it to net-next.git now.

Cc: Francois Romieu <romieu@fr.zoreil.com>
Cc: Hayes Wang <hayeswang@realtek.com>
Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Stéphane ANCELOT <sancelot@free.fr>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50970831

Merge branch 'bridge-make-setlink-dellink-notifications-more-accurate' · 8ef2097e

David S. Miller authored Oct 29, 2017

Nikolay Aleksandrov says:

====================
bridge: make setlink/dellink notifications more accurate

Before this set the bridge would generate a notification on vlan add or del
even if they didn't actually do any changes, which confuses listeners and
is generally not preferred. We could also lose notifications on actual
changes if one adds a range of vlans and there's an error in the middle.
The problem with just breaking and returning an error is that we could
break existing user-space scripts which rely on the vlan delete to clear
all existing entries in the specified range and ignore the non-existing
errors (typically used to clear the current vlan config).
So in order to make the notifications more accurate while keeping backwards
compatibility we add a boolean that tracks if anything actually changed
during the config calls.

The vlan add is more difficult to fix because it always returns 0 even if
nothing changed, but we cannot use a specific error because the drivers
can return anything and we may mask it, also we'd need to update all places
that directly return the add result, thus to signal that a vlan was created
or updated and in order not to break overlapping vlan range add we pass
down the new boolean that tracks changes to the add functions to check
if anything was actually updated.

v6: moved "changed" in else branch in br|nbp_vlan_add, thanks to
    Toshiaki Makita and retested everything again
v5: fix br_vlan_add return (v1 leftover) spotted by Toshiaki Makita
v4: set changed always to false in the non-vlan config case and retested
v3: rebased to latest net-next and fixed non-vlan config functions reported
    by kbuild test bot
v2: pass changed down to vlan add instead of masking errors
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8ef2097e

bridge: vlan: signal if anything changed on vlan add · f418af63

Nikolay Aleksandrov authored Oct 27, 2017

Before this patch there was no way to tell if the vlan add operation
actually changed anything, thus we would always generate a notification
on adds. Let's make the notifications more precise and generate them
only if anything changed, so use the new bool parameter to signal that the
vlan was updated. We cannot return an error because there are valid use
cases that will be broken (e.g. overlapping range add) and also we can't
risk masking errors due to calls into drivers for vlan add which can
potentially return anything.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

f418af63

bridge: netlink: make setlink/dellink notifications more accurate · e19b42a1

Nikolay Aleksandrov authored Oct 27, 2017

Before this patch we had cases that either sent notifications when there
were in fact no changes (e.g. non-existent vlan delete) or didn't send
notifications when there were changes (e.g. vlan add range with an error in
the middle, port flags change + vlan update error). This patch sends down
a boolean to the functions setlink/dellink use and if there is even a
single configuration change (port flag, vlan add/del, port state) then
we always send a notification. This is all done to keep backwards
compatibility with the opportunistic vlan delete, where one could
specify a vlan range that has missing vlans inside and still everything
in that range will be cleared, this is mostly used to clear the whole
vlan config with a single call, i.e. range 1-4094.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e19b42a1

28 Oct, 2017 31 commits

tcp: remove unnecessary include · 5b52a4c3

Alexei Starovoitov authored Oct 27, 2017

two extra #include are not necessary in tcp.h
Remove them.

Fixes: 40304b2a ("bpf: BPF support for sock_ops")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5b52a4c3

Merge branch 'tcp-more-perns-sysctls' · 871da0a7

David S. Miller authored Oct 28, 2017

Eric Dumazet says:

====================
tcp: move 12 sysctls to namespaces

Ideally all TCP sysctls should be per netns.
This patch series takes care of 12 sysctls.

Remains the ones that need discussion :

sysctl_tcp_mem, sysctl_tcp_rmem, sysctl_tcp_wmem, and sysctl_tcp_max_orphans
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

871da0a7

tcp: Namespace-ify sysctl_tcp_pacing_ca_ratio · c26e91f8

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c26e91f8

tcp: Namespace-ify sysctl_tcp_pacing_ss_ratio · 23a7102a

Eric Dumazet authored Oct 27, 2017

Also remove an obsolete comment about TCP pacing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23a7102a

tcp: Namespace-ify sysctl_tcp_invalid_ratelimit · 4170ba6b

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4170ba6b

tcp: Namespace-ify sysctl_tcp_autocorking · 790f00e1

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

790f00e1

tcp: Namespace-ify sysctl_tcp_min_rtt_wlen · bd239704

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bd239704

tcp: Namespace-ify sysctl_tcp_min_tso_segs · 26e9596e

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26e9596e

tcp: Namespace-ify sysctl_tcp_challenge_ack_limit · b530b681

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b530b681

tcp: Namespace-ify sysctl_tcp_limit_output_bytes · 9184d8bb

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9184d8bb

tcp: Namespace-ify sysctl_tcp_workaround_signed_windows · ceef9ab6

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ceef9ab6

tcp: Namespace-ify sysctl_tcp_tso_win_divisor · d06a9904

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d06a9904

tcp: Namespace-ify sysctl_tcp_moderate_rcvbuf · 4540c0cf

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4540c0cf

tcp: Namespace-ify sysctl_tcp_nometrics_save · ec36e416

Eric Dumazet authored Oct 27, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec36e416