1. 13 Aug, 2018 8 commits
    • r8169: don't use MSI-X on RTL8168g · 7c53a722
      Heiner Kallweit authored
      There have been two reports that network doesn't come back on resume
      from suspend when using MSI-X. Both cases affect the same chip version
      (RTL8168g - version 40), on different systems. Falling back to MSI
      fixes the issue.
      Even though we don't really have a proof yet that the network chip
      version is to blame, let's disable MSI-X for this version.
      Reported-by: Steve Dodd <steved424@gmail.com>
      Reported-by: Lou Reed <gogen@disroot.org>
      Tested-by: Steve Dodd <steved424@gmail.com>
      Tested-by: Lou Reed <gogen@disroot.org>
      Fixes: 6c6aa15f ("r8169: improve interrupt handling")
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nixge-Minor-cleanups' · 9ebcc22c
      David S. Miller authored
      Moritz Fischer says:
      
      ====================
      net: nixge: Minor cleanups
      
      In preparation for my 64-bit support series, here is some
      minor cleanup that gets rid of unnecessary
      accesses to the descriptor application fields.
      
      I've confirmed that the hardware does not access the fields
      in all our configurations.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nixge: Don't store skb in app4 field of descriptor · fd5cf434
      Moritz Fischer authored
      Don't store skb in app4 field of descriptor since it is
      not being used anywhere (including hardware).
      Signed-off-by: Moritz Fischer <mdf@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nixge: Do not zero application specific fields in desc · e158770e
      Moritz Fischer authored
      Do not zero application specific fields in DMA descriptors.
      The hardware does ignore them, so should software.
      Signed-off-by: Moritz Fischer <mdf@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • l2tp: use sk_dst_check() to avoid race on sk->sk_dst_cache · 6d37fa49
      Wei Wang authored
      In l2tp code, if it is a L2TP_UDP_ENCAP tunnel, tunnel->sk points to a
      UDP socket. A user could call sendmsg() on both this tunnel and the UDP
      socket itself concurrently. As l2tp_xmit_skb() holds the socket lock and
      calls __sk_dst_check() to refresh sk->sk_dst_cache, while udpv6_sendmsg()
      is lockless and calls sk_dst_check() to refresh sk->sk_dst_cache, there
      could be a race that causes the dst cache to be freed multiple times.
      So we fix the l2tp side code to always call sk_dst_check() to guarantee
      xchg() is called when refreshing sk->sk_dst_cache to avoid race
      conditions.
      
      Syzkaller reported stack trace:
      BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: use-after-free in atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
      BUG: KASAN: use-after-free in atomic_add_unless include/linux/atomic.h:597 [inline]
      BUG: KASAN: use-after-free in dst_hold_safe include/net/dst.h:308 [inline]
      BUG: KASAN: use-after-free in ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
      Read of size 4 at addr ffff8801aea9a880 by task syz-executor129/4829
      
      CPU: 0 PID: 4829 Comm: syz-executor129 Not tainted 4.18.0-rc7-next-20180802+ #30
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       print_address_description+0x6c/0x20b mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
       atomic_add_unless include/linux/atomic.h:597 [inline]
       dst_hold_safe include/net/dst.h:308 [inline]
       ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
       rt6_get_pcpu_route net/ipv6/route.c:1249 [inline]
       ip6_pol_route+0x354/0xd20 net/ipv6/route.c:1922
       ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2098
       fib6_rule_lookup+0x283/0x890 net/ipv6/fib6_rules.c:122
       ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2126
       ip6_dst_lookup_tail+0x1278/0x1da0 net/ipv6/ip6_output.c:978
       ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
       ip6_sk_dst_lookup_flow+0x5ed/0xc50 net/ipv6/ip6_output.c:1117
       udpv6_sendmsg+0x2163/0x36b0 net/ipv6/udp.c:1354
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:622 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:632
       ___sys_sendmsg+0x51d/0x930 net/socket.c:2115
       __sys_sendmmsg+0x240/0x6f0 net/socket.c:2210
       __do_sys_sendmmsg net/socket.c:2239 [inline]
       __se_sys_sendmmsg net/socket.c:2236 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2236
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x446a29
      Code: e8 ac b8 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f4de5532db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000006dcc38 RCX: 0000000000446a29
      RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003
      RBP: 00000000006dcc30 R08: 00007f4de5533700 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dcc3c
      R13: 00007ffe2b830fdf R14: 00007f4de55339c0 R15: 0000000000000001
      
      Fixes: 71b1391a ("l2tp: ensure sk->dst is still valid")
      Reported-by: syzbot+05f840f3b04f211bad55@syzkaller.appspotmail.com
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Guillaume Nault <g.nault@alphalink.fr>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
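The race described in the l2tp commit above comes down to two paths refreshing the same cached pointer, where only one of them detaches it atomically. The following is a minimal userspace sketch of an exchange-based refresh, using C11 atomics in place of the kernel's xchg(). The names `struct dst`, `struct sock_cache`, `cache_check()`, and the `freed` counter are hypothetical and exist only for illustration; real kernel code additionally relies on reference counting and RCU, which this sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached route entry. */
struct dst {
    int obsolete;   /* non-zero once the entry must no longer be used */
    int freed;      /* set on release; a second release is a bug */
};

/* Hypothetical stand-in for the socket field sk->sk_dst_cache. */
struct sock_cache {
    _Atomic(struct dst *) dst;
};

static void dst_release(struct dst *d)
{
    if (!d)
        return;
    assert(d->freed == 0);  /* a double release would trip here */
    d->freed = 1;
}

/*
 * Return the cached entry if it is still valid. Otherwise detach it
 * with an atomic exchange: among concurrent callers, each detached
 * pointer is handed to exactly one of them, so it is released once.
 */
static struct dst *cache_check(struct sock_cache *c)
{
    struct dst *d = atomic_load(&c->dst);

    if (d && d->obsolete) {
        struct dst *old = atomic_exchange(&c->dst, (struct dst *)NULL);

        dst_release(old);
        return NULL;
    }
    return d;
}
```

A plain "read pointer, then store NULL, then release" sequence would let two callers both see the same old pointer and both release it, which is the multiple-free pattern the commit message describes.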
    • ipv6: Add icmp_echo_ignore_all support for ICMPv6 · e6f86b0f
      Virgile Jarry authored
      Preventing the kernel from responding to ICMP Echo Request messages
      can be useful in several ways. The sysctl parameter
      'icmp_echo_ignore_all' can be used to prevent the kernel from
      responding to IPv4 ICMP echo requests. For IPv6 pings, such
      a sysctl kernel parameter did not exist.
      
      Add the ability to prevent the kernel from responding to IPv6
      ICMP echo requests through the use of the following sysctl
      parameter: /proc/sys/net/ipv6/icmp/echo_ignore_all.
      Update the documentation to reflect this change.
      Signed-off-by: Virgile Jarry <virgile@acceis.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-tls-Combined-memory-allocation-for-decryption-request' · 8f780044
      David S. Miller authored
      Vakul Garg says:
      
      ====================
      net/tls: Combined memory allocation for decryption request
      
      This patch does a combined memory allocation from heap for scatterlists,
      aead_request, aad and iv for the tls record decryption path. In present
      code, aead_request is allocated from heap, scatterlists on a conditional
      basis are allocated on heap or on stack. This is inefficient as it may
      require multiple kmalloc/kfree calls.
      
      The initialization vector passed in the crypto request is allocated on
      stack. This is a problem since the stack memory is not dma-able from
      crypto accelerators.
      
      Doing one combined memory allocation for each decryption request fixes
      both the above issues. It also paves a way to be able to submit multiple
      async decryption requests while the previous one is pending i.e. being
      processed or queued.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tls: Combined memory allocation for decryption request · 0b243d00
      Vakul Garg authored
      To prepare a decryption request, several memory chunks are required
      (aead_req, sgin, sgout, iv, aad). To submit the decrypt request to
      an accelerator, the buffers that the accelerator reads must be
      dma-able and must not come from the stack. The buffers for
      aad and iv could each be kmalloced separately, but that is inefficient.
      This patch does a combined allocation for preparing the decryption request
      and then segments it into aead_req || sgin || sgout || iv || aad.
      Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
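The layout named in the net/tls commit above (aead_req || sgin || sgout || iv || aad) can be illustrated with a small userspace sketch that makes one heap allocation and carves it into regions. This is not the kernel code: `struct aead_req`, `struct sg_entry`, `struct dec_request`, and both functions are hypothetical stand-ins with illustrative sizes, and calloc() stands in for kmalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; sizes are illustrative. */
struct aead_req { unsigned char opaque[64]; };
struct sg_entry { void *page; unsigned int offset, length; };

struct dec_request {
    void *mem;               /* the single allocation backing all regions */
    struct aead_req *req;
    struct sg_entry *sgin;
    struct sg_entry *sgout;
    uint8_t *iv;
    uint8_t *aad;
};

/*
 * Allocate one buffer and segment it as
 *   aead_req || sgin || sgout || iv || aad.
 * The pointer-bearing regions come first so they stay suitably aligned;
 * the byte-oriented iv and aad regions follow. Returns 0 on success and
 * -1 if the allocation fails.
 */
static int dec_request_alloc(struct dec_request *r, size_t n_sgin,
                             size_t n_sgout, size_t iv_len, size_t aad_len)
{
    size_t off_sgin = sizeof(struct aead_req);
    size_t off_sgout = off_sgin + n_sgin * sizeof(struct sg_entry);
    size_t off_iv = off_sgout + n_sgout * sizeof(struct sg_entry);
    size_t off_aad = off_iv + iv_len;
    size_t total = off_aad + aad_len;
    unsigned char *mem;

    assert(r != NULL);
    mem = calloc(1, total);
    if (!mem)
        return -1;

    r->mem = mem;
    r->req = (struct aead_req *)mem;
    r->sgin = (struct sg_entry *)(mem + off_sgin);
    r->sgout = (struct sg_entry *)(mem + off_sgout);
    r->iv = mem + off_iv;
    r->aad = mem + off_aad;
    return 0;
}

/* One free releases every region at once. */
static void dec_request_free(struct dec_request *r)
{
    free(r->mem);
    r->mem = NULL;
}
```

Because iv and aad live in the same heap buffer as the request, none of the regions an accelerator would read come from the stack, and setup and teardown each cost a single allocator call.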
  2. 12 Aug, 2018 4 commits
    • Merge branch 'ip-faster-in-order-IP-fragments' · 78cbac64
      David S. Miller authored
      Peter Oskolkov says:
      
      ====================
      ip: faster in-order IP fragments
      
      Added "Signed-off-by" in v2.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: process in-order fragments efficiently · a4fd284a
      Peter Oskolkov authored
      This patch changes the runtime behavior of IP defrag queue:
      incoming in-order fragments are added to the end of the current
      list/"run" of in-order fragments at the tail.
      
      On some workloads, UDP stream performance is substantially improved:
      
      RX: ./udp_stream -F 10 -T 2 -l 60
      TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60
      
      with this patchset applied on a 10Gbps receiver:
      
        throughput=9524.18
        throughput_units=Mbit/s
      
      upstream (net-next):
      
        throughput=4608.93
        throughput_units=Mbit/s
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: add helpers to process in-order fragments faster. · 353c9cb3
      Peter Oskolkov authored
      This patch introduces several helper functions/macros that will be
      used in the follow-up patch. No runtime changes yet.
      
      The new logic (fully implemented in the second patch) is as follows:
      
      * Nodes in the rb-tree will now contain not single fragments, but lists
        of consecutive fragments ("runs").
      
      * At each point in time, the current "active" run at the tail is
        maintained/tracked. Fragments that arrive in-order, adjacent
        to the previous tail fragment, are added to this tail run without
        triggering the re-balancing of the rb-tree.
      
      * If a fragment arrives out of order with the offset _before_ the tail run,
        it is inserted into the rb-tree as a single fragment.
      
      * If a fragment arrives after the current tail fragment (with a gap),
        it starts a new "tail" run, and is inserted into the rb-tree
        at the end as the head of the new run.
      
      skb->cb is used to store additional information
      needed here (suggested by Eric Dumazet).
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
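The placement rules listed in the helper commit above can be sketched as a small decision function. This is a simplified userspace model, not the kernel implementation: it tracks only the end offset of the current tail run, `struct frag_queue_model` and `frag_add()` are hypothetical names, and overlapping fragments are not modeled (any offset before the end of the tail run is treated as out of order, which is an assumption of this sketch).

```c
#include <assert.h>
#include <stdbool.h>

enum frag_action {
    FRAG_APPEND_TO_TAIL_RUN,  /* in order and adjacent: no rb-tree work */
    FRAG_START_NEW_TAIL_RUN,  /* after the tail with a gap: new run */
    FRAG_INSERT_SINGLE,       /* before the tail: inserted on its own */
};

/* Hypothetical model of the defrag queue state relevant to placement. */
struct frag_queue_model {
    bool has_tail;            /* false until the first fragment arrives */
    unsigned int tail_end;    /* offset one past the last byte of the tail run */
};

/* Decide where a fragment [offset, offset + len) goes and update the tail. */
static enum frag_action frag_add(struct frag_queue_model *q,
                                 unsigned int offset, unsigned int len)
{
    assert(len > 0);

    if (q->has_tail && offset == q->tail_end) {
        /* Adjacent to the previous tail fragment: extend the run. */
        q->tail_end = offset + len;
        return FRAG_APPEND_TO_TAIL_RUN;
    }
    if (!q->has_tail || offset > q->tail_end) {
        /* First fragment, or a gap after the tail: start a new run. */
        q->has_tail = true;
        q->tail_end = offset + len;
        return FRAG_START_NEW_TAIL_RUN;
    }
    /* Out of order: goes into the tree as a single fragment. */
    return FRAG_INSERT_SINGLE;
}
```

For a stream that arrives fully in order, every fragment after the first takes the append path, which is the case the series is meant to speed up.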
  3. 11 Aug, 2018 28 commits
    • Merge branch 'Remove-rtnl-lock-dependency-from-all-action-implementations' · 9a95d9c6
      David S. Miller authored
      Vlad Buslov says:
      
      ====================
      Remove rtnl lock dependency from all action implementations
      
      Currently, all netlink protocol handlers for updating rules, actions and
      qdiscs are protected with single global rtnl lock which removes any
      possibility for parallelism. This patch set is a second step to remove
      rtnl lock dependency from TC rules update path.
      
      Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
      Handlers registered with this flag are called without RTNL taken. End
      goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
      etc.) to be registered with UNLOCKED flag to allow parallel execution.
      However, there is no intention to completely remove or split rtnl lock
      itself. This patch set addresses specific problems in implementation of
      tc actions that prevent their control path from being executed
      concurrently. Additional changes are required to refactor classifiers
      API and individual classifiers for parallel execution. This patch set
      lays groundwork to eventually register rule update handlers as
      rtnl-unlocked.
      
      Action API is already prepared for parallel execution with previous
      patch set, which means that action ops that use action API for their
      implementation do not require additional modifications. (delete, search,
      etc.) Action API implements concurrency-safe reference counting and
      guarantees that cleanup/delete is called only once, after last reference
      to action is released.
      
      The goal of this change is to update specific actions APIs that access
      action private state directly, in order to be independent from external
      locking. General approach is to re-use existing tcf_lock spinlock (used
      by some action implementation to synchronize control path with data
      path) to protect action private state from concurrent modification. If
      action has rcu-protected pointer, tcf spinlock is used to protect its
      update code, instead of relying on rtnl lock.
      
      Some actions need to determine rtnl mutex status in order to release it.
      For example, ife action can load additional kernel modules(meta ops) and
      must make sure that no locks are held during module load. In such cases
      'rtnl_held' argument is used to conditionally release rtnl mutex.
      
      Changes from V1 to V2:
      - Patch 12:
        - new patch
      - Patch 14:
        - refactor gen_new_estimator() to reuse stats_lock when re-assigning
          rate estimator statistics pointer
      - Remove mirred and tunnel_key helper function changes. (to be submitted
        as a standalone patch)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_police: remove dependency on rtnl lock · e329bc42
      Vlad Buslov authored
      Use tcf spinlock to protect police action private data from concurrent
      modification during dump. (init already uses tcf spinlock when changing
      police action state)
      
      Pass tcf spinlock as estimator lock argument to gen_replace_estimator()
      during action init.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: protect rate estimator statistics pointer with lock · 51a9f5ae
      Vlad Buslov authored
      Extend gen_new_estimator() to also take stats_lock when re-assigning rate
      estimator statistics pointer. (to be used by unlocked actions)
      
      Rename 'stats_lock' to 'lock' and change argument description to explain
      that it is now also used for control path.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_mirred: remove dependency on rtnl lock · 4e232818
      Vlad Buslov authored
      Re-introduce mirred list spinlock, that was removed some time ago, in order
      to protect it from concurrent modifications, instead of relying on rtnl
      lock.
      
      Use tcf spinlock to protect mirred action private data from concurrent
      modification in init and dump. Rearrange access to mirred data in order to
      be performed only while holding the lock.
      
      Rearrange net dev access to always hold a reference while working with it,
      instead of relying on rtnl lock.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: extend action ops with put_dev callback · 84a75b32
      Vlad Buslov authored
      As a preparation for removing dependency on rtnl lock from rules update
      path, all users of shared objects must take reference while working with
      them.
      
      Extend action ops with put_dev() API to be used on net device returned by
      get_dev().
      
      Modify mirred action (only action that implements get_dev callback):
      - Take reference to net device in get_dev.
      - Implement put_dev API that releases reference to net device.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_vlan: remove dependency on rtnl lock · 764e9a24
      Vlad Buslov authored
      Use tcf spinlock to protect vlan action private data from concurrent
      modification during dump and init. Use rcu swap operation to reassign
      params pointer under protection of tcf lock. (old params value is not used
      by init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl assertion that is no longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
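Several commits in this series, including the act_vlan one above, follow the same pattern: init swaps a params pointer while holding the action's own spinlock, and dump reads the private data under that same lock, so neither depends on an external lock. A minimal userspace sketch of that pattern follows. It uses a C11 `atomic_flag` as a stand-in spinlock; all names (`struct vlan_action`, `vlan_update()`, `vlan_dump()`) are hypothetical, and the deferred free that RCU provides in the kernel is left to the caller here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical parameter block that init replaces as a whole. */
struct vlan_params {
    int action;
    unsigned short vid;
};

struct vlan_action {
    atomic_flag lock;            /* stand-in for the tcf spinlock */
    struct vlan_params *params;  /* rcu-protected pointer in the kernel */
};

static void act_lock(struct vlan_action *a)
{
    while (atomic_flag_test_and_set_explicit(&a->lock, memory_order_acquire))
        ;  /* spin until the holder clears the flag */
}

static void act_unlock(struct vlan_action *a)
{
    atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

/*
 * Init path: install the new params under the action lock and hand the
 * old pointer back. The old value is not inspected, only swapped out;
 * the kernel would free it after an RCU grace period.
 */
static struct vlan_params *vlan_update(struct vlan_action *a,
                                       struct vlan_params *new_p)
{
    struct vlan_params *old;

    assert(new_p != NULL);
    act_lock(a);
    old = a->params;
    a->params = new_p;
    act_unlock(a);
    return old;
}

/* Dump path: copy out a consistent snapshot under the same lock. */
static struct vlan_params vlan_dump(struct vlan_action *a)
{
    struct vlan_params snap = { 0, 0 };

    act_lock(a);
    if (a->params)
        snap = *a->params;
    act_unlock(a);
    return snap;
}
```

Because both paths take the action's own lock, correctness no longer depends on the caller holding a global lock, which is the property the series needs before handlers can run unlocked.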
    • net: sched: act_tunnel_key: remove dependency on rtnl lock · 729e0126
      Vlad Buslov authored
      Use tcf lock to protect tunnel key action struct private data from
      concurrent modification in init and dump. Use rcu swap operation to
      reassign params pointer under protection of tcf lock. (old params value is
      not used by init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl lock assertion that is no longer required.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_skbmod: remove dependency on rtnl lock · c8814552
      Vlad Buslov authored
      Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
      spinlock to protect private skbmod data from concurrent modification during
      dump.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_simple: remove dependency on rtnl lock · 5e48180e
      Vlad Buslov authored
      Use tcf spinlock to protect private simple action data from concurrent
      modification during dump. (simple init already uses tcf spinlock when
      changing action state)
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_sample: remove dependency on rtnl lock · d7728495
      Vlad Buslov authored
      Use tcf spinlock to protect private sample action data from concurrent
      modification during dump and init.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_pedit: remove dependency on rtnl lock · 67b0c1a3
      Vlad Buslov authored
      Rearrange pedit init code to only access pedit action data while holding
      tcf spinlock. Change keys allocation type to atomic to allow it to execute
      while holding tcf spinlock. Take tcf spinlock in dump function when
      accessing pedit action data.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_ipt: remove dependency on rtnl lock · ff25276d
      Vlad Buslov authored
      Use tcf spinlock to protect ipt action private data from concurrent
      modification during dump. Ipt init already takes tcf spinlock when
      modifying ipt state.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_ife: remove dependency on rtnl lock · 54d0d423
      Vlad Buslov authored
      Use tcf spinlock and rcu to protect params pointer from concurrent
      modification during dump and init. Use rcu swap operation to reassign
      params pointer under protection of tcf lock. (old params value is not used
      by init, so there is no need of standalone rcu dereference step)
      
      Ife action has meta-actions that are compiled as standalone modules. Rtnl
      mutex must be released while loading a kernel module. In order to support
      execution without rtnl mutex, propagate 'rtnl_held' argument to meta action
      loading functions. When requesting meta action module, conditionally
      release rtnl lock depending on 'rtnl_held' argument.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_gact: remove dependency on rtnl lock · e8917f43
      Vlad Buslov authored
      Use tcf spinlock to protect gact action private state from concurrent
      modification during dump and init. Remove rtnl assertion that is no longer
      necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_csum: remove dependency on rtnl lock · b6a2b971
      Vlad Buslov authored
      Use tcf lock to protect csum action struct private data from concurrent
      modification in init and dump. Use rcu swap operation to reassign params
      pointer under protection of tcf lock. (old params value is not used by
      init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl assertion that is no longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_bpf: remove dependency on rtnl lock · 2142236b
      Vlad Buslov authored
      Use tcf spinlock to protect bpf action private data from concurrent
      modification during dump and init. Remove rtnl lock assertion that is no
      longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-sctp-Avoid-allocating-high-order-memory-with-kmalloc' · 2b14e1ea
      David S. Miller authored
      Konstantin Khorenko says:
      
      ====================
      net/sctp: Avoid allocating high order memory with kmalloc()
      
      Each SCTP association can have up to 65535 input and output streams.
      For each stream type an array of sctp_stream_in or sctp_stream_out
      structures is allocated using kmalloc_array() function. This function
      allocates physically contiguous memory regions, so this can lead
      to allocation of memory regions of very high order, i.e.:
      
        sizeof(struct sctp_stream_out) == 24,
        ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
        which means 9th memory order.
      
      This can lead to memory allocation failures on systems
      under memory stress.
      
      We actually do not need these arrays of memory to be physically
      contiguous. A possible simple solution would be to use kvmalloc()
      instead of kmalloc(), as kvmalloc() can allocate physically scattered
      pages if contiguous pages are not available. But the problem
      is that the allocation can happen in a softirq context with the
      GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario.
      
      So the other possible solution is to use flexible arrays instead of
      contiguous arrays of memory so that the memory would be allocated
      on a per-page basis.
      
      This patchset replaces kmalloc_array() with flex_array usage.
      It consists of two parts:
      
        * First patch is preparatory - it mechanically wraps all direct
          access to assoc->stream.out[] and assoc->stream.in[] arrays
          with SCTP_SO() and SCTP_SI() wrappers so that later a direct
          array access could be easily changed to an access to a
          flex_array (or any other possible alternative).
        * Second patch replaces kmalloc_array() with flex_array usage.
      
      v2 changes:
       sctp_stream_in() users are updated to provide stream as an argument,
       sctp_stream_{in,out}_ptr() are now just sctp_stream_{in,out}().
      
      v3 changes:
       Move type changes struct sctp_stream_out -> flex_array to next patch.
       Make sctp_stream_{in,out}() static inline and move them to a header.
      
      Performance results (single stream):
      ====================================
        * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread)
        * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
                RAM: 32 Gb
      
        * netperf: taken from https://github.com/HewlettPackard/netperf.git,
      	     compiled from sources with sctp support
        * netperf server and client are run on the same node
        * ip link set lo mtu 1500
      
      The script used to run tests:
       # cat run_tests.sh
       #!/bin/bash
      
      for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do
        echo "TEST: $test";
        for i in `seq 1 3`; do
          echo "Iteration: $i";
          set -x
          netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 \
                  -l 60 -- -m 1452;
          set +x
        done
      done
      ================================================
      
      Results (a bit reformatted to be more readable):
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      				v4.18-rc7	v4.18-rc7 + fixes
      TEST: SCTP_STREAM
      212992 212992   1452    60.21	1125.52		1247.04
      212992 212992   1452    60.20	1376.38		1149.95
      212992 212992   1452    60.20	1131.40		1163.85
      TEST: SCTP_STREAM_MANY
      212992 212992   1452    60.00	1111.00		1310.05
      212992 212992   1452    60.00	1188.55		1130.50
      212992 212992   1452    60.00	1108.06		1162.50
      
      ===========
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      					v4.18-rc7	v4.18-rc7 + fixes
      TEST: SCTP_RR
      212992 212992 1        1       60.00	45486.98	46089.43
      212992 212992 1        1       60.00	45584.18	45994.21
      212992 212992 1        1       60.00	45703.86	45720.84
      TEST: SCTP_RR_MANY
      212992 212992 1        1       60.00	40.75		40.77
      212992 212992 1        1       60.00	40.58		40.08
      212992 212992 1        1       60.00	39.98		39.97
      
      Performance results for many streams:
      =====================================
         * Kernel: v4.18-rc8 - stock and with 2 patches v3
         * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
                 RAM: 32 Gb
      
         * sctp_test: https://github.com/sctp/lksctp-tools
         * both server and client are run on the same node
         * ip link set lo mtu 1500
         * sysctl -w vm.max_map_count=65530000 (need it to make memory fragmented)
      
      The script used to run tests:
      =============================
       # cat run_sctp_test.sh
       #!/bin/bash
      
      set -x
      
      uname -r
      ip link set lo mtu 1500
      swapoff -a
      
      free
      cat /proc/buddyinfo
      
      ./src/apps/sctp_test -H 127.0.0.1 -P 22222 -l -d 0 &
      sleep 3
      
      time ./src/apps/sctp_test -H 127.0.0.1 -P 22221 -h 127.0.0.1 -p 22222 \
               -s -c 1 -M 65535 -T -t 1 -x 100000 -d 0 1>/dev/null
      
      killall -9 lt-sctp_test
      ===============================
      
      Results (a bit reformatted to be more readable):
      
      1) ms stock kernel v4.18-rc8, no memory fragmentation
      	test 1		test 2		test 3
      real    0m14.715s	0m14.593s	0m15.954s
      user    0m0.954s	0m0.955s	0m0.854s
      sys     0m13.388s	0m12.537s	0m13.749s
      
      2) kernel with fixes, no memory fragmentation
      	test 1		test 2		test 3
      real    0m14.959s	0m14.693s	0m14.762s
      user    0m0.948s	0m0.921s	0m0.929s
      sys     0m13.538s	0m13.225s	0m13.217s
      
      3) kernel with fixes, memory fragmented
      'free':
                     total        used        free      shared  buff/cache   available
      Mem:       32906008    30555200      302740         764     2048068      266452
      Mem:       32906008    30379948      541436         764     1984624      442376
      Mem:       32906008    30717312      262380         764     1926316      109908
      
      /proc/buddyinfo:
      Node 0, zone   Normal  40773     37     34     29      0      0      0      0      0      0      0
      Node 0, zone   Normal 100332     68      8      4      2      1      1      0      0      0      0
      Node 0, zone   Normal  31113      7      2      1      0      0      0      0      0      0      0
      
      	test 1		test 2		test 3
      real    0m14.159s	0m15.252s	0m15.826s
      user    0m0.839s	0m1.004s	0m1.048s
      sys     0m11.827s	0m14.240s	0m14.778s
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Konstantin Khorenko's avatar
      net/sctp: Replace in/out stream arrays with flex_array · 0d493b4d
      Konstantin Khorenko authored
      This patch replaces physically contiguous memory arrays
      allocated using kmalloc_array() with flexible arrays.
      This avoids memory allocation failures on systems
      under memory stress.
      Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
      Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d493b4d
    • Konstantin Khorenko's avatar
      net/sctp: Make wrappers for accessing in/out streams · 05364ca0
      Konstantin Khorenko authored
      This patch introduces wrappers for accessing in/out streams indirectly.
      This makes it possible to replace the physically contiguous memory arrays
      of streams with flexible arrays (or any other appropriate
      mechanism) that allocate memory on a per-page basis.
      Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
      Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05364ca0
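The accessor idea described in the two sctp commits above can be illustrated with a small userspace sketch. It is not the kernel code: `CHUNK_BYTES`, `struct stream_state`, `struct chunked_streams` and the three functions are hypothetical names, and `calloc()` stands in for the kernel allocator. The point is that callers only go through `stream_at()`, so the backing store can consist of many small, independently allocated chunks instead of one large contiguous array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for one page of memory; the real page size is platform-specific. */
#define CHUNK_BYTES 4096u

/* Hypothetical per-stream state; not the kernel's actual structure. */
struct stream_state {
    unsigned short next_seq;
};

/* An array stored as independently allocated, page-sized chunks. */
struct chunked_streams {
    size_t count;      /* total number of elements */
    size_t per_chunk;  /* elements that fit in one chunk */
    size_t nchunks;    /* number of chunks allocated */
    struct stream_state **chunks;
};

static void streams_free(struct chunked_streams *cs)
{
    size_t i;

    if (!cs->chunks)
        return;
    for (i = 0; i < cs->nchunks; i++)
        free(cs->chunks[i]);   /* free(NULL) is a no-op */
    free(cs->chunks);
    cs->chunks = NULL;
}

static int streams_alloc(struct chunked_streams *cs, size_t count)
{
    size_t i;

    cs->count = count;
    cs->per_chunk = CHUNK_BYTES / sizeof(struct stream_state);
    cs->nchunks = (count + cs->per_chunk - 1) / cs->per_chunk;
    cs->chunks = calloc(cs->nchunks ? cs->nchunks : 1, sizeof(*cs->chunks));
    if (!cs->chunks)
        return -1;
    /* Each chunk is a separate small allocation, so no large
     * contiguous region is ever requested. */
    for (i = 0; i < cs->nchunks; i++) {
        cs->chunks[i] = calloc(cs->per_chunk, sizeof(struct stream_state));
        if (!cs->chunks[i]) {
            streams_free(cs);
            return -1;
        }
    }
    return 0;
}

/* The wrapper: callers never index the backing storage directly. */
static struct stream_state *stream_at(struct chunked_streams *cs, size_t idx)
{
    assert(idx < cs->count);
    return &cs->chunks[idx / cs->per_chunk][idx % cs->per_chunk];
}
```

With 65535 streams, as in the `-M 65535` test run, this sketch makes 32 chunk allocations of 4096 bytes each rather than a single allocation of roughly 128 KiB.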
    • Keara Leibovitz's avatar
      tc: Update README and add config · b70f1f3a
      Keara Leibovitz authored
      Update the README.
      
      Add a config file that enables the minimum set of features required to
      run the tests currently present in the kernel.
      It must be updated whenever new unit tests are added that require their
      own modules.
      Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b70f1f3a
    • David S. Miller's avatar
      Merge branch 'l2tp-rework-pppol2tp-ioctl-handling' · 3305f9a9
      David S. Miller authored
      Guillaume Nault says:
      
      ====================
      l2tp: rework pppol2tp ioctl handling
      
      The current ioctl() handling code can be simplified. It tests for
      non-relevant conditions and uselessly holds sockets. Once useless
      code is removed, it becomes even simpler to let pppol2tp_ioctl() handle
      commands directly, rather than dispatch them to pppol2tp_tunnel_ioctl()
      or pppol2tp_session_ioctl(). That is the approach taken by this series.
      
      Patch #1 and #2 define helper functions aimed at simplifying the rest
      of the patch set.
      
      Patch #3 drops useless tests in pppol2tp_ioctl() and avoids holding a
      refcount on the socket.
      
      Patches #4, #5 and #6 are the core of the series. They let
      pppol2tp_ioctl() handle all ioctls and drop the tunnel and session
      specific functions.
      
      Then patch #7 brings a little bit of consolidation.
      
      Finally, patch #8 takes advantage of the simplified code to make
      pppol2tp sockets compatible with dev_ioctl(). Certainly not a killer
      feature, but it is trivial and it is always nice to see l2tp getting
      better integration with the rest of the stack.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3305f9a9
    • Guillaume Nault's avatar
      l2tp: let pppol2tp_ioctl() fallback to dev_ioctl() · 4f5f85e9
      Guillaume Nault authored
      Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl()
      handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX.
      PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring
      this mechanism.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f5f85e9
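The fallback mechanism in the commit above can be sketched in userspace C. This is an illustration only: the command numbers, `proto_ioctl()`, `generic_ioctl()` and `sock_ioctl()` are hypothetical, and `SKETCH_ENOIOCTLCMD` is a local sentinel playing the role of the kernel-internal ENOIOCTLCMD. The protocol handler returns the sentinel for commands it does not know, and the caller then tries the generic handler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local sentinel standing in for the kernel-internal ENOIOCTLCMD. */
#define SKETCH_ENOIOCTLCMD 515

/* Hypothetical command numbers, for illustration only. */
enum sketch_cmd {
    CMD_PROTO_GET_MRU = 1,       /* known to the protocol handler */
    CMD_GENERIC_GET_IFINDEX = 2, /* known only to the generic handler */
    CMD_UNKNOWN = 99             /* known to nobody */
};

/* Protocol-specific handler: signals "not mine" instead of failing hard. */
static int proto_ioctl(int cmd, int *out)
{
    assert(out != NULL);
    switch (cmd) {
    case CMD_PROTO_GET_MRU:
        *out = 1500;
        return 0;
    default:
        return -SKETCH_ENOIOCTLCMD;
    }
}

/* Generic handler: last resort for commands common to all sockets. */
static int generic_ioctl(int cmd, int *out)
{
    assert(out != NULL);
    switch (cmd) {
    case CMD_GENERIC_GET_IFINDEX:
        *out = 3;
        return 0;
    default:
        return -ENOTTY; /* error choice is an assumption of this sketch */
    }
}

/* Dispatcher: falls back to the generic handler only on the sentinel. */
static int sock_ioctl(int cmd, int *out)
{
    int err = proto_ioctl(cmd, out);

    if (err == -SKETCH_ENOIOCTLCMD)
        err = generic_ioctl(cmd, out);
    return err;
}
```

If `proto_ioctl()` returned a hard error such as `-ENOTTY` for unknown commands, the generic handler in this sketch would never be reached, which is the behaviour the commit moves away from.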
    • Guillaume Nault's avatar
      l2tp: zero out stats in pppol2tp_copy_stats() · 7390ed8a
      Guillaume Nault authored
      Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it
      manually every time.
      
      While there, constify 'stats'.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7390ed8a
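The pattern in the commit above, zeroing the destination inside the copy helper and taking the source as const, might look roughly like the following. The structures and the name `copy_stats()` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical source counters. */
struct src_stats {
    unsigned long tx_packets;
    unsigned long rx_packets;
};

/* Hypothetical destination; it has fields the helper does not fill in. */
struct dst_stats {
    unsigned int tunnel_id;
    unsigned int session_id;
    unsigned long tx_packets;
    unsigned long rx_packets;
};

/* Zero the destination first, so callers cannot forget to do it and
 * no stale data survives in fields this helper does not set. */
static void copy_stats(struct dst_stats *dest, const struct src_stats *stats)
{
    assert(dest != NULL && stats != NULL);
    memset(dest, 0, sizeof(*dest));
    dest->tx_packets = stats->tx_packets;
    dest->rx_packets = stats->rx_packets;
}
```

Because the helper clears everything, a caller that needs to keep a value from the destination (an identifier, say) has to save it before the call and restore it afterwards.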
    • Guillaume Nault's avatar
      l2tp: remove pppol2tp_session_ioctl() · b0e29063
      Guillaume Nault authored
      pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS
      on session sockets. We just need to copy the stats and set ->session_id.
      
      As a side effect of sharing session and tunnel code, ->using_ipsec is
      properly set even when the request was made using a session socket.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0e29063
    • Guillaume Nault's avatar
      l2tp: remove pppol2tp_tunnel_ioctl() · 528534f0
      Guillaume Nault authored
      Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a
      tunnel. This one is a bit special because the caller may use the tunnel
      socket to retrieve statistics of one of its sessions. If the session_id
      is set, the corresponding session's statistics are returned, instead of
      those of the tunnel. This is handled by the new
      pppol2tp_tunnel_copy_stats() helper function.
      
      Set ->tunnel_id and ->using_ipsec out of the conditional, so
      that they can be used by the 'else' branch in the following patch.
      We cannot do that for ->session_id, because tunnel sockets have to
      report the value that was originally passed in 'stats.session_id',
      while session sockets have to report their own session_id.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      528534f0
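The selection logic described above (tunnel statistics when no session_id is given, otherwise the statistics of that session within the tunnel) can be sketched as follows. All types and the function `tunnel_copy_stats()` are hypothetical simplifications: sessions sit in a plain array, and the `-ENOENT` error for an unknown session is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct counters {
    unsigned long tx_packets;
    unsigned long rx_packets;
};

struct session {
    unsigned int session_id;
    struct counters stats;
};

struct tunnel {
    unsigned int tunnel_id;
    struct counters stats;
    const struct session *sessions; /* sessions owned by this tunnel */
    size_t nsessions;
};

/* Request as filled in by the caller; session_id == 0 means "the tunnel". */
struct stats_request {
    unsigned int tunnel_id;
    unsigned int session_id;
    struct counters stats;
};

static int tunnel_copy_stats(struct stats_request *req, const struct tunnel *t)
{
    size_t i;

    assert(req != NULL && t != NULL);
    req->tunnel_id = t->tunnel_id;

    if (req->session_id == 0) {
        req->stats = t->stats;
        return 0;
    }
    /* session_id is set: search only within this tunnel and leave
     * req->session_id as the caller passed it. */
    for (i = 0; i < t->nsessions; i++) {
        if (t->sessions[i].session_id == req->session_id) {
            req->stats = t->sessions[i].stats;
            return 0;
        }
    }
    return -ENOENT;
}
```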
    • Guillaume Nault's avatar
      l2tp: handle PPPIOC[GS]MRU and PPPIOC[GS]FLAGS in pppol2tp_ioctl() · 79e6760e
      Guillaume Nault authored
      Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on
      pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79e6760e
    • Guillaume Nault's avatar
      l2tp: simplify pppol2tp_ioctl() · bdd0292f
      Guillaume Nault authored
        * Drop test on 'sk': sock->sk cannot be NULL, or pppox_ioctl() could
          not have called us.
      
        * Drop test on 'SOCK_DEAD' state: if this flag was set, the socket
          would be in the process of being released and no ioctl could be
          running anymore.
      
        * Drop test on 'PPPOX_*' state: we depend on ->sk_user_data to get
          the session structure. If it is non-NULL, then the socket is
          connected. Testing for PPPOX_* is redundant.
      
        * Retrieve session using ->sk_user_data directly, instead of going
          through pppol2tp_sock_to_session(). This avoids grabbing a useless
          reference on the socket.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdd0292f
    • Guillaume Nault's avatar
      l2tp: split l2tp_session_get() · 01e28b92
      Guillaume Nault authored
      l2tp_session_get() is used for two different purposes. If 'tunnel' is
      NULL, the session is searched globally in the supplied network
      namespace. Otherwise it is searched exclusively in the tunnel context.
      
      Callers always know the context in which they need to search for the
      session. But some of them provide both a namespace and a tunnel,
      making the semantics of the call unclear.
      
      This patch defines l2tp_tunnel_get_session() for lookups done in a
      tunnel and restricts l2tp_session_get() to namespace searches.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01e28b92
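The split described in the commit above amounts to replacing one lookup with an optional 'tunnel' argument by two lookups with unambiguous scope. A rough userspace sketch follows. The names `lookup_in_tunnel()` and `lookup_in_netns()` are hypothetical and only mirror the roles of l2tp_tunnel_get_session() and l2tp_session_get(). The namespace search walks arrays here purely for simplicity, and reference counting is left out.

```c
#include <assert.h>
#include <stddef.h>

struct session {
    unsigned int session_id;
};

struct tunnel {
    const struct session *sessions;
    size_t nsessions;
};

/* Hypothetical stand-in for a network namespace holding several tunnels. */
struct net_ns {
    const struct tunnel *tunnels;
    size_t ntunnels;
};

/* Search exclusively in the context of one tunnel. */
static const struct session *lookup_in_tunnel(const struct tunnel *t,
                                              unsigned int id)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < t->nsessions; i++)
        if (t->sessions[i].session_id == id)
            return &t->sessions[i];
    return NULL;
}

/* Search globally in the namespace; no tunnel argument is accepted,
 * so the scope of the call is never ambiguous. */
static const struct session *lookup_in_netns(const struct net_ns *net,
                                             unsigned int id)
{
    size_t i;

    assert(net != NULL);
    for (i = 0; i < net->ntunnels; i++) {
        const struct session *s = lookup_in_tunnel(&net->tunnels[i], id);

        if (s)
            return s;
    }
    return NULL;
}
```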