Commits · 83697fd2b910cf59c65e7ad35bd0da19b79f569e · Kirill Smelkov / linux

14 Mar, 2019 40 commits

ARM: OMAP2+: hwmod: Fix some section annotations · 83697fd2

Nathan Chancellor authored Oct 17, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit c10b26ab ]

When building the kernel with Clang, the following section mismatch
warnings appears:

WARNING: vmlinux.o(.text+0x2d398): Section mismatch in reference from
the function _setup() to the function .init.text:_setup_iclk_autoidle()
The function _setup() references
the function __init _setup_iclk_autoidle().
This is often because _setup lacks a __init
annotation or the annotation of _setup_iclk_autoidle is wrong.

WARNING: vmlinux.o(.text+0x2d3a0): Section mismatch in reference from
the function _setup() to the function .init.text:_setup_reset()
The function _setup() references
the function __init _setup_reset().
This is often because _setup lacks a __init
annotation or the annotation of _setup_reset is wrong.

WARNING: vmlinux.o(.text+0x2d408): Section mismatch in reference from
the function _setup() to the function .init.text:_setup_postsetup()
The function _setup() references
the function __init _setup_postsetup().
This is often because _setup lacks a __init
annotation or the annotation of _setup_postsetup is wrong.

_setup is used in omap_hwmod_allocate_module, which isn't marked __init
and looks like it shouldn't be, meaning to fix these warnings, those
functions must be moved out of the init section, which this patch does.
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

83697fd2

staging: iio: ad7780: update voltage on read · 3183b36a

Renato Lui Geh authored Nov 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 336650c7 ]

The ad7780 driver previously did not read the correct device output, as
it read an outdated value set at initialization. It now updates its
voltage on read.
Signed-off-by: Renato Lui Geh <renatogeh@gmail.com>
Acked-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

3183b36a

staging:iio:ad2s90: Make probe handle spi_setup failure · b5a3391b

Matheus Tavares authored Nov 03, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit b3a3eafe ]

Previously, ad2s90_probe ignored the return code from spi_setup, not
handling its possible failure. This patch makes ad2s90_probe check if
the code is an error code and, if so, do the following:

- Call dev_err with an appropriate error message.
- Return the spi_setup's error code.

Note: The 'return ret' statement could be out of the 'if' block, but
this whole block will be moved up in the function in the patch:
'staging:iio:ad2s90: Move device registration to the end of probe'.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

b5a3391b

ptp: check gettime64 return code in PTP_SYS_OFFSET ioctl · d16f68b2

Miroslav Lichvar authored Nov 09, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 83d0bdc7 ]

If a gettime64 call fails, return the error and avoid copying data back
to user.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

d16f68b2

serial: fsl_lpuart: clear parity enable bit when disable parity · f15bd878

Andy Duan authored Oct 16, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 397bd921 ]

Current driver only enable parity enable bit and never clear it
when user set the termios. The fix clear the parity enable bit when
PARENB flag is not set in termios->c_cflag.

Cc: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Andy Duan <fugang.duan@nxp.com>
Reviewed-by: Fabio Estevam <festevam@gmail.com>
Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

f15bd878

powerpc/pseries: add of_node_put() in dlpar_detach_node() · 9fec272c

Frank Rowand authored Oct 04, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 5b3f5c40 ]

The previous commit, "of: overlay: add missing of_node_get() in
__of_attach_node_sysfs" added a missing of_node_get() to
__of_attach_node_sysfs().  This results in a refcount imbalance
for nodes attached with dlpar_attach_node().  The calling sequence
from dlpar_attach_node() to __of_attach_node_sysfs() is:

   dlpar_attach_node()
      of_attach_node()
         __of_attach_node_sysfs()

For more detailed description of the node refcount, see
commit 68baf692 ("powerpc/pseries: Fix of_node_put() underflow
during DLPAR remove").
Tested-by: Alan Tull <atull@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Frank Rowand <frank.rowand@sony.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

9fec272c

x86/PCI: Fix Broadcom CNB20LE unintended sign extension (redux) · 725b61df

Colin Ian King authored Oct 25, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 53bb565f ]

In the expression "word1 << 16", word1 starts as u16, but is promoted to a
signed int, then sign-extended to resource_size_t, which is probably not
what was intended.  Cast to resource_size_t to avoid the sign extension.

This fixes an identical issue as fixed by commit 0b2d7076 ("x86/PCI:
Fix Broadcom CNB20LE unintended sign extension") back in 2014.

Detected by CoverityScan, CID#138749, 138750 ("Unintended sign extension")

Fixes: 3f6ea84a ("PCI: read memory ranges out of Broadcom CNB20LE host bridge")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

725b61df

dlm: Don't swamp the CPU with callbacks queued during recovery · a29725c4

Bob Peterson authored Nov 08, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 216f0efd ]

Before this patch, recovery would cause all callbacks to be delayed,
put on a queue, and afterward they were all queued to the callback
work queue. This patch does the same thing, but occasionally takes
a break after 25 of them so it won't swamp the CPU at the expense
of other RT processes like corosync.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

a29725c4

ARM: 8808/1: kexec:offline panic_smp_self_stop CPU · 2683de55

Yufen Wang authored Nov 02, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 82c08c3e ]

In case panic() and panic() called at the same time on different CPUS.
For example:
CPU 0:
  panic()
     __crash_kexec
       machine_crash_shutdown
         crash_smp_send_stop
       machine_kexec
         BUG_ON(num_online_cpus() > 1);

CPU 1:
  panic()
    local_irq_disable
    panic_smp_self_stop

If CPU 1 calls panic_smp_self_stop() before crash_smp_send_stop(), kdump
fails. CPU1 can't receive the ipi irq, CPU1 will be always online.
To fix this problem, this patch split out the panic_smp_self_stop()
and add set_cpu_online(smp_processor_id(), false).
Signed-off-by: Yufen Wang <wangyufen@huawei.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

2683de55

scsi: lpfc: Correct LCB RJT handling · d19cc6a6

James Smart authored Oct 23, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit b114d900 ]

When LCB's are rejected, if beaconing was already in progress, the
Reason Code Explanation was not being set. Should have been set to
command in progress.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

d19cc6a6

ASoC: Intel: mrfld: fix uninitialized variable access · 0495cec5

Arnd Bergmann authored Nov 03, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 1539c7f2 ]

Randconfig testing revealed a very old bug, with gcc-8:

sound/soc/intel/atom/sst/sst_loader.c: In function 'sst_load_fw':
sound/soc/intel/atom/sst/sst_loader.c:357:5: error: 'fw' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  if (fw == NULL) {
     ^
sound/soc/intel/atom/sst/sst_loader.c:354:25: note: 'fw' was declared here
  const struct firmware *fw;

We must check the return code of request_firmware() before we look at the
pointer result that may be uninitialized when the function fails.

Fixes: 9012c954 ("ASoC: Intel: mrfld - Add DSP load and management")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

0495cec5

staging: iio: adc: ad7280a: handle error from __ad7280_read32() · 50c39545

Slawomir Stepien authored Oct 20, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit 0559ef7f ]

Inside __ad7280_read32(), the spi_sync_transfer() can fail with negative
error code. This change will ensure that this error is being passed up
in the call stack, so it can be handled.
Signed-off-by: Slawomir Stepien <sst@poczta.fm>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

50c39545

drm/bufs: Fix Spectre v1 vulnerability · 97d0c63e

Gustavo A. R. Silva authored Oct 16, 2018

BugLink: https://bugs.launchpad.net/bugs/1818813

[ Upstream commit a3780509 ]

idx can be indirectly controlled by user-space, hence leading to a
potential exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

drivers/gpu/drm/drm_bufs.c:1420 drm_legacy_freebufs() warn: potential
spectre issue 'dma->buflist' [r] (local cap)

Fix this by sanitizing idx before using it to index dma->buflist

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20181016095549.GA23586@embeddedor.comSigned-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

97d0c63e

Linux 4.4.174 · e8885ad3

Greg Kroah-Hartman authored Feb 08, 2019

BugLink: https://bugs.launchpad.net/bugs/1818806Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

e8885ad3

rcu: Force boolean subscript for expedited stall warnings · b81f029c

Paul E. McKenney authored Jan 11, 2016

BugLink: https://bugs.launchpad.net/bugs/1818806

commit ec3833ed upstream.

The cpu_online() function can return values other than 0 and 1, which
can result in subscript overflow when applied to a two-element array.
This commit allows for this behavior by using "!!" on the return value
from cpu_online() when used as a subscript.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: "Rantala, Tommi" <tommi.t.rantala@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

b81f029c

net: ipv4: do not handle duplicate fragments as overlapping · a3514d79

Michal Kubecek authored Dec 13, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit ade44640 upstream.

Since commit 7969e5c4 ("ip: discard IPv4 datagrams with overlapping
segments.") IPv4 reassembly code drops the whole queue whenever an
overlapping fragment is received. However, the test is written in a way
which detects duplicate fragments as overlapping so that in environments
with many duplicate packets, fragmented packets may be undeliverable.

Add an extra test and for (potentially) duplicate fragment, only drop the
new fragment rather than the whole queue. Only starting offset and length
are checked, not the contents of the fragments as that would be too
expensive. For similar reason, linear list ("run") of a rbtree node is not
iterated, we only check if the new fragment is a subset of the interval
covered by existing consecutive fragments.

v2: instead of an exact check iterating through linear list of an rbtree
node, only check if the new fragment is subset of the "run" (suggested
by Eric Dumazet)

Fixes: 7969e5c4 ("ip: discard IPv4 datagrams with overlapping segments.")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
[bwh: Backported to 4.4:
 - goto discard_qp, not err, in case of overlap
 - Set err earlier variable, as done upstream in commit 0ff89efb
   "ip: fail fast on IP defrag errors"]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

a3514d79

net: fix pskb_trim_rcsum_slow() with odd trim offset · 11b2fe18

Dimitris Michailidis authored Oct 19, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit d55bef50 upstream.

We've been getting checksum errors involving small UDP packets, usually
59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
has been complaining that HW is providing bad checksums. Turns out the
problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98
("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").

The source of the problem is that when the bytes we are trimming start
at an odd address, as in the case of the 1 padding byte above,
skb_checksum() returns a byte-swapped value. We cannot just combine this
with skb->csum using csum_sub(). We need to use csum_block_sub() here
that takes into account the parity of the start address and handles the
swapping.

Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().

Fixes: 88078d98 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Signed-off-by: Dimitris Michailidis <dmichail@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

11b2fe18

inet: frags: better deal with smp races · 12864d50

Eric Dumazet authored Nov 08, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 0d5b9311 upstream.

Multiple cpus might attempt to insert a new fragment in rhashtable,
if for example RPS is buggy, as reported by 배석진 in
https://patchwork.ozlabs.org/patch/994601/

We use rhashtable_lookup_get_insert_key() instead of
rhashtable_insert_fast() to let cpus losing the race
free their own inet_frag_queue and use the one that
was inserted by another cpu.

Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

12864d50

ipv4: frags: precedence bug in ip_expire() · 5a2327ed

Dan Carpenter authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 70837ffe upstream.

We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.

Fixes: fa0f5273 ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

5a2327ed

ip: frags: fix crash in ip_do_fragment() · 9dad9d6a

Taehee Yoo authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 5d407b07 upstream.

A kernel crash occurrs when defragmented packet is fragmented
in ip_do_fragment().
In defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. but skb->sk and
skb->ip_defrag_offset are same union member. so that
frag->sk is not NULL.
Hence crash occurrs in skb->sk check routine in ip_do_fragment() when
defragmented packet is fragmented.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk at reassembly routine.(Eric Dumarzet)

Fixes: fa0f5273 ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

9dad9d6a

ip: process in-order fragments efficiently · 97097bc5

Peter Oskolkov authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit a4fd284a upstream.

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

97097bc5

ip: add helpers to process in-order fragments faster. · f7c81167

Peter Oskolkov authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 353c9cb3 upstream.

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

f7c81167

ip: use rb trees for IP frag queue. · bee79060

Peter Oskolkov authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit fa0f5273 upstream.

Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
    netstat --statistics
    Ip:
        282078937 total packets received
        0 forwarded
        0 incoming packets discarded
        946760 incoming packets delivered
        18743456 requests sent out
        101 fragments dropped after timeout
        282077129 reassemblies required
        944952 packets reassembled ok
        262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
    reassemblies vs a kernel without this patchset. More
    comprehensive performance testing TBD).
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
[bwh: Backported to 4.4:
 - Keep using frag_kfree_skb() in inet_frag_destroy()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

bee79060

net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends · 15f54a59

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 88078d98 upstream.

After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

15f54a59

ipv6: defrag: drop non-last frags smaller than min mtu · be703cb3

Florian Westphal authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 0ed4229b upstream.

don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
    There were concerns that there could be even smaller frags
    generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
[bwh: Backported to 4.4: In nf_ct_frag6_gather() use clone instead of skb,
 and goto ret_orig in case of error]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

be703cb3

net: modify skb_rbtree_purge to return the truesize of all purged skbs. · 6bece1aa

Peter Oskolkov authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 385114de upstream.

Tested: see the next patch is the series.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

6bece1aa

ip: discard IPv4 datagrams with overlapping segments. · d999a205

Peter Oskolkov authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 7969e5c4 upstream.

This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available uptream).
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
[bwh: Backported to 4.4:
 - s/__IP_INC_STATS/IP_INC_STATS_BH/
 - Deleted code is slightly different]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

d999a205

inet: frags: fix ip6frag_low_thresh boundary · 3b350356

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 3d234012 upstream.

Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
since linker might place next to it a non zero value preventing a change
to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematuraly break user scripts wanting to change it.

Since specifying a minimal value of 0 for proc_doulongvec_minmax()
is moot, let's remove these zero values in all defrag units.

Fixes: 6e00f7dd ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

3b350356

inet: frags: get rid of ipfrag_skb_cb/FRAG_CB · 1313c893

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit bf663371 upstream.

ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.

By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca9 ("fragment: add fast path for in-order fragments")
this change wont help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

1313c893

inet: frags: reorganize struct netns_frags · 45ea3cd9

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit c2615cf5 upstream.

Put the read-mostly fields in a separate cache line
at the beginning of struct netns_frags, to reduce
false sharing noticed in inet_frag_kill()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

45ea3cd9

rhashtable: reorganize struct rhashtable layout · 53614870

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit e5d672a0 upstream.

While under frags DDOS I noticed unfortunate false sharing between
@nelems and @params.automatic_shrinking

Move @nelems at the end of struct rhashtable so that first cache line
is shared between all cpus, because almost never dirtied.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

53614870

ipv6: frags: rewrite ip6_expire_frag_queue() · 96aeccfc

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 05c0b86b upstream.

Make it similar to IPv4 ip_expire(), and release the lock
before calling icmp functions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

96aeccfc

inet: frags: do not clone skb in ip_expire() · 575eef76

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 1eec5d56 upstream.

An skb_clone() was added in commit ec4fbd64 ("inet: frag: release
spinlock before calling icmp_send()")

While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)

We can use skb_get(head) here, all we want is to make sure skb wont
be freed by another cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

575eef76

inet: frags: break the 2GB limit for frags storage · e9ca325b

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 3e67f106 upstream.

Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonnably well under pressure.

Current memory tracking is using one atomic_t and integers.

Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.

Note that this patch avoids an overflow error, if high_thresh was set
to ~2GB, since this test in inet_frag_alloc() was never true :

if (... || frag_mem_limit(nf) > nf->high_thresh)

Tested:

$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

<frag DDOS>

$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880

$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds                    3317150            0.0
IpReasmFails                    3317112            0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

e9ca325b

inet: frags: remove inet_frag_maybe_warn_overflow() · 39967bfd

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 2d44ed22 upstream.

This function is obsolete, after rhashtable addition to inet defrag.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

39967bfd

inet: frags: get rif of inet_frag_evicting() · 0368e11e

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 399d1404 upstream.

This refactors ip_expire() since one indentation level is removed.

Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDOS, the ICMP message wont be sent because of rate limits.

Fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 should have the same
issue than the one we fixed in commit ec4fbd64
("inet: frag: release spinlock before calling icmp_send()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

0368e11e

inet: frags: remove some helpers · 3400ccb0

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 6befe4a7 upstream.

Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

Also since we use rhashtable we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d3054 ("inet: frag: don't account number
of fragment queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

3400ccb0

ipfrag: really prevent allocation on netns exit · 219bc1a0

Paolo Abeni authored Jul 06, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit f6f2a4a2 upstream.

Setting the low threshold to 0 has no effect on frags allocation,
we need to clear high_thresh instead.

The code was pre-existent to commit 648700f7 ("inet: frags:
use rhashtables for reassembly units"), but before the above,
such assignment had a different role: prevent concurrent eviction
from the worker and the netns cleanup helper.

Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

219bc1a0

net: ieee802154: 6lowpan: fix frag reassembly · d25680a1

Alexander Aring authored Apr 20, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit f18fa5de upstream.

This patch initialize stack variables which are used in
frag_lowpan_compare_key to zero. In my case there are padding bytes in the
structures ieee802154_addr as well in frag_lowpan_compare_key. Otherwise
the key variable contains random bytes. The result is that a compare of
two keys by memcmp works incorrect.

Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reported-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

d25680a1

inet: frags: use rhashtables for reassembly units · 06aa593d

Eric Dumazet authored Oct 10, 2018

BugLink: https://bugs.launchpad.net/bugs/1818806

commit 648700f7 upstream.

Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.

It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.

This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.

Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.

Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.

Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.

After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)

$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608

A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

06aa593d