1. 13 Aug, 2018 18 commits
  2. 12 Aug, 2018 4 commits
    • Merge branch 'ip-faster-in-order-IP-fragments' · 78cbac64
      David S. Miller authored
      Peter Oskolkov says:
      
      ====================
      ip: faster in-order IP fragments
      
      Added "Signed-off-by" in v2.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: process in-order fragments efficiently · a4fd284a
      Peter Oskolkov authored
      This patch changes the runtime behavior of IP defrag queue:
      incoming in-order fragments are added to the end of the current
      list/"run" of in-order fragments at the tail.
      
      On some workloads, UDP stream performance is substantially improved:
      
      RX: ./udp_stream -F 10 -T 2 -l 60
      TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60
      
      with this patchset applied on a 10Gbps receiver:
      
        throughput=9524.18
        throughput_units=Mbit/s
      
      upstream (net-next):
      
        throughput=4608.93
        throughput_units=Mbit/s
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: add helpers to process in-order fragments faster. · 353c9cb3
      Peter Oskolkov authored
      This patch introduces several helper functions/macros that will be
      used in the follow-up patch. No runtime changes yet.
      
      The new logic (fully implemented in the second patch) is as follows:
      
      * Nodes in the rb-tree will now contain not single fragments, but lists
        of consecutive fragments ("runs").
      
      * At each point in time, the current "active" run at the tail is
        maintained/tracked. Fragments that arrive in-order, adjacent
        to the previous tail fragment, are added to this tail run without
        triggering the re-balancing of the rb-tree.
      
      * If a fragment arrives out of order with the offset _before_ the tail run,
        it is inserted into the rb-tree as a single fragment.
      
      * If a fragment arrives after the current tail fragment (with a gap),
        it starts a new "tail" run, and is inserted into the rb-tree
        at the end as the head of the new run.
      
      skb->cb is used to store additional information
      needed here (suggested by Eric Dumazet).
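
The placement rules listed above can be sketched in userspace C. This is only an illustration under stated assumptions: `frag_queue_sketch`, `classify_fragment()` and `queue_fragment()` are hypothetical names, offsets are plain byte offsets, the first fragment of an empty queue is assumed to start the tail run, and a fragment that overlaps the tail run is merely reported because the description above does not cover that case. The rb-tree itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; these are not the kernel's types. */
enum frag_action {
    FRAG_APPEND_TO_TAIL_RUN, /* in order, adjacent to the tail: no rb-tree rebalancing */
    FRAG_START_NEW_TAIL_RUN, /* after the tail with a gap: new node at the end of the tree */
    FRAG_INSERT_SINGLE,      /* before the tail run: inserted as a single fragment */
    FRAG_OVERLAPS_TAIL_RUN,  /* not covered by the description; only reported here */
};

struct frag_queue_sketch {
    bool has_tail;               /* false until the first fragment is queued */
    unsigned int tail_run_start; /* offset of the first byte of the tail run */
    unsigned int tail_end;       /* offset just past the last byte of the tail fragment */
};

/* Decide where a fragment starting at 'offset' would go. */
static enum frag_action classify_fragment(const struct frag_queue_sketch *q,
                                          unsigned int offset)
{
    if (!q->has_tail)
        return FRAG_START_NEW_TAIL_RUN; /* assumption: first fragment starts the run */
    if (offset == q->tail_end)
        return FRAG_APPEND_TO_TAIL_RUN;
    if (offset > q->tail_end)
        return FRAG_START_NEW_TAIL_RUN;
    if (offset < q->tail_run_start)
        return FRAG_INSERT_SINGLE;
    return FRAG_OVERLAPS_TAIL_RUN;
}

/* Classify the fragment and update the tracked tail run accordingly. */
static enum frag_action queue_fragment(struct frag_queue_sketch *q,
                                       unsigned int offset, unsigned int len)
{
    enum frag_action act = classify_fragment(q, offset);

    switch (act) {
    case FRAG_APPEND_TO_TAIL_RUN:
        q->tail_end = offset + len;
        break;
    case FRAG_START_NEW_TAIL_RUN:
        q->has_tail = true;
        q->tail_run_start = offset;
        q->tail_end = offset + len;
        break;
    case FRAG_INSERT_SINGLE:
    case FRAG_OVERLAPS_TAIL_RUN:
        break; /* the tail run is left unchanged */
    }
    return act;
}
```

In this model the common in-order case only touches `tail_end`, which mirrors the stated goal of avoiding rb-tree re-balancing for adjacent fragments.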
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 11 Aug, 2018 18 commits
    • Merge branch 'Remove-rtnl-lock-dependency-from-all-action-implementations' · 9a95d9c6
      David S. Miller authored
      Vlad Buslov says:
      
      ====================
      Remove rtnl lock dependency from all action implementations
      
      Currently, all netlink protocol handlers for updating rules, actions and
      qdiscs are protected with single global rtnl lock which removes any
      possibility for parallelism. This patch set is a second step to remove
      rtnl lock dependency from TC rules update path.
      
      Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
      Handlers registered with this flag are called without RTNL taken. End
      goal is to have rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER,
      etc.) registered with UNLOCKED flag to allow parallel execution.
      However, there is no intention to completely remove or split rtnl lock
      itself. This patch set addresses specific problems in implementation of
      tc actions that prevent their control path from being executed
      concurrently. Additional changes are required to refactor classifiers
      API and individual classifiers for parallel execution. This patch set
      lays groundwork to eventually register rule update handlers as
      rtnl-unlocked.
      
      Action API is already prepared for parallel execution with previous
      patch set, which means that action ops that use action API for their
      implementation (delete, search, etc.) do not require additional
      modifications. Action API implements concurrency-safe reference counting
      and guarantees that cleanup/delete is called only once, after last
      reference to action is released.
      
      The goal of this change is to update specific actions APIs that access
      action private state directly, in order to be independent from external
      locking. General approach is to re-use existing tcf_lock spinlock (used
      by some action implementation to synchronize control path with data
      path) to protect action private state from concurrent modification. If
      action has rcu-protected pointer, tcf spinlock is used to protect its
      update code, instead of relying on rtnl lock.
      
      Some actions need to determine rtnl mutex status in order to release it.
      For example, ife action can load additional kernel modules (meta ops) and
      must make sure that no locks are held during module load. In such cases
      'rtnl_held' argument is used to conditionally release rtnl mutex.
      
      Changes from V1 to V2:
      - Patch 12:
        - new patch
      - Patch 14:
        - refactor gen_new_estimator() to reuse stats_lock when re-assigning
          rate estimator statistics pointer
      - Remove mirred and tunnel_key helper function changes. (to be submitted
        as standalone patch)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_police: remove dependency on rtnl lock · e329bc42
      Vlad Buslov authored
      Use tcf spinlock to protect police action private data from concurrent
      modification during dump. (init already uses tcf spinlock when changing
      police action state)
      
      Pass tcf spinlock as estimator lock argument to gen_replace_estimator()
      during action init.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: protect rate estimator statistics pointer with lock · 51a9f5ae
      Vlad Buslov authored
      Extend gen_new_estimator() to also take stats_lock when re-assigning rate
      estimator statistics pointer. (to be used by unlocked actions)
      
      Rename 'stats_lock' to 'lock' and change argument description to explain
      that it is now also used for control path.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_mirred: remove dependency on rtnl lock · 4e232818
      Vlad Buslov authored
      Re-introduce mirred list spinlock, that was removed some time ago, in order
      to protect it from concurrent modifications, instead of relying on rtnl
      lock.
      
      Use tcf spinlock to protect mirred action private data from concurrent
      modification in init and dump. Rearrange access to mirred data in order to
      be performed only while holding the lock.
      
      Rearrange net dev access to always hold reference while working with it,
      instead of relying on rtnl lock.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: extend action ops with put_dev callback · 84a75b32
      Vlad Buslov authored
      As a preparation for removing dependency on rtnl lock from rules update
      path, all users of shared objects must take reference while working with
      them.
      
      Extend action ops with put_dev() API to be used on net device returned by
      get_dev().
      
      Modify mirred action (only action that implements get_dev callback):
      - Take reference to net device in get_dev.
      - Implement put_dev API that releases reference to net device.
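
The get_dev()/put_dev() pairing described above can be sketched with a plain reference counter. This is a userspace illustration, not the kernel code: `net_device_sketch`, `mirred_sketch`, `sketch_get_dev()` and `sketch_put_dev()` are hypothetical names, and C11 atomics stand in for whatever reference-counting helpers the real net device uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a net device with a reference counter. */
struct net_device_sketch {
    atomic_int refcnt;
};

/* Hypothetical stand-in for the mirred action's private data. */
struct mirred_sketch {
    struct net_device_sketch *dev;
};

/* Return the device with an extra reference taken, or NULL if none is set. */
static struct net_device_sketch *sketch_get_dev(struct mirred_sketch *m)
{
    struct net_device_sketch *dev = m->dev;

    if (dev)
        atomic_fetch_add(&dev->refcnt, 1);
    return dev;
}

/* Release the reference obtained from sketch_get_dev(). */
static void sketch_put_dev(struct net_device_sketch *dev)
{
    int old = atomic_fetch_sub(&dev->refcnt, 1);

    assert(old > 0); /* a put without a matching get would be a bug */
}
```

The point of the pairing is that a caller keeps the device alive through its own reference between get and put, rather than through a lock held around the whole operation.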
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_vlan: remove dependency on rtnl lock · 764e9a24
      Vlad Buslov authored
      Use tcf spinlock to protect vlan action private data from concurrent
      modification during dump and init. Use rcu swap operation to reassign
      params pointer under protection of tcf lock. (old params value is not used
      by init, so there is no need of standalone rcu dereference step)
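
The "swap the params pointer while holding the action lock" pattern can be sketched in userspace C. All names here (`vlan_action_sketch`, `sketch_update_params()`, the spinlock helpers) are hypothetical; a C11 `atomic_flag` stands in for the tcf spinlock and an atomic pointer exchange stands in for the rcu swap operation. Deferred freeing of the old value after readers finish is left to the caller and is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical parameter block that readers access through a pointer. */
struct vlan_params_sketch {
    int action;
    unsigned short vid;
};

struct vlan_action_sketch {
    atomic_flag lock;                            /* stand-in for the tcf spinlock */
    _Atomic(struct vlan_params_sketch *) params; /* stand-in for the rcu-protected pointer */
};

static void sketch_action_init(struct vlan_action_sketch *a)
{
    atomic_flag_clear(&a->lock);
    atomic_init(&a->params, NULL);
}

static void sketch_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the flag is released */
}

static void sketch_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Publish new_params and hand back the previous pointer. The old value is
 * never read here, so a single exchange is enough; the caller must only
 * free it once no reader can still be using it.
 */
static struct vlan_params_sketch *
sketch_update_params(struct vlan_action_sketch *a,
                     struct vlan_params_sketch *new_params)
{
    struct vlan_params_sketch *old;

    sketch_lock(&a->lock);
    old = atomic_exchange_explicit(&a->params, new_params,
                                   memory_order_acq_rel);
    sketch_unlock(&a->lock);
    return old;
}
```

Because the writer side is serialized by the action's own lock, nothing in this update path depends on an outer global lock being held.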
      
      Remove rtnl assertion that is no longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_tunnel_key: remove dependency on rtnl lock · 729e0126
      Vlad Buslov authored
      Use tcf lock to protect tunnel key action struct private data from
      concurrent modification in init and dump. Use rcu swap operation to
      reassign params pointer under protection of tcf lock. (old params value is
      not used by init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl lock assertion that is no longer required.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_skbmod: remove dependency on rtnl lock · c8814552
      Vlad Buslov authored
      Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
      spinlock to protect private skbmod data from concurrent modification during
      dump.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_simple: remove dependency on rtnl lock · 5e48180e
      Vlad Buslov authored
      Use tcf spinlock to protect private simple action data from concurrent
      modification during dump. (simple init already uses tcf spinlock when
      changing action state)
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_sample: remove dependency on rtnl lock · d7728495
      Vlad Buslov authored
      Use tcf spinlock to protect private sample action data from concurrent
      modification during dump and init.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_pedit: remove dependency on rtnl lock · 67b0c1a3
      Vlad Buslov authored
      Rearrange pedit init code to only access pedit action data while holding
      tcf spinlock. Change keys allocation type to atomic to allow it to execute
      while holding tcf spinlock. Take tcf spinlock in dump function when
      accessing pedit action data.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_ipt: remove dependency on rtnl lock · ff25276d
      Vlad Buslov authored
      Use tcf spinlock to protect ipt action private data from concurrent
      modification during dump. Ipt init already takes tcf spinlock when
      modifying ipt state.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_ife: remove dependency on rtnl lock · 54d0d423
      Vlad Buslov authored
      Use tcf spinlock and rcu to protect params pointer from concurrent
      modification during dump and init. Use rcu swap operation to reassign
      params pointer under protection of tcf lock. (old params value is not used
      by init, so there is no need of standalone rcu dereference step)
      
      Ife action has meta-actions that are compiled as standalone modules. Rtnl
      mutex must be released while loading a kernel module. In order to support
      execution without rtnl mutex, propagate 'rtnl_held' argument to meta action
      loading functions. When requesting meta action module, conditionally
      release rtnl lock depending on 'rtnl_held' argument.
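
The conditional release driven by 'rtnl_held' can be sketched as follows. This is a userspace mock under stated assumptions: `sketch_rtnl_lock()`, `sketch_rtnl_unlock()`, `sketch_request_module()` and `sketch_load_metaops()` are hypothetical stand-ins, the "lock" is just a depth counter, and the module name is made up. Only the control flow around the module load is meant to match the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock lock state: 1 while the stand-in "rtnl" lock is held, 0 otherwise. */
static int sketch_rtnl_depth;
/* Records whether the mock module load ever ran with the lock held. */
static bool sketch_lock_held_during_load;

static void sketch_rtnl_lock(void)
{
    assert(sketch_rtnl_depth == 0);
    sketch_rtnl_depth = 1;
}

static void sketch_rtnl_unlock(void)
{
    assert(sketch_rtnl_depth == 1);
    sketch_rtnl_depth = 0;
}

/* Stand-in for a module load request; it must not run under the lock. */
static int sketch_request_module(const char *name)
{
    (void)name;
    sketch_lock_held_during_load = (sketch_rtnl_depth != 0);
    return 0;
}

/*
 * Load a meta-action module. If the caller entered with the lock held
 * (rtnl_held == true), drop it around the load and re-take it afterwards;
 * if the caller runs unlocked, leave the lock alone.
 */
static int sketch_load_metaops(const char *name, bool rtnl_held)
{
    int err;

    if (rtnl_held)
        sketch_rtnl_unlock();
    err = sketch_request_module(name);
    if (rtnl_held)
        sketch_rtnl_lock();
    return err;
}
```

Passing the flag down the call chain lets the same loading function serve both locked and unlocked callers, and each caller gets the lock back in the state it expects.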
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_gact: remove dependency on rtnl lock · e8917f43
      Vlad Buslov authored
      Use tcf spinlock to protect gact action private state from concurrent
      modification during dump and init. Remove rtnl assertion that is no longer
      necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_csum: remove dependency on rtnl lock · b6a2b971
      Vlad Buslov authored
      Use tcf lock to protect csum action struct private data from concurrent
      modification in init and dump. Use rcu swap operation to reassign params
      pointer under protection of tcf lock. (old params value is not used by
      init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl assertion that is no longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_bpf: remove dependency on rtnl lock · 2142236b
      Vlad Buslov authored
      Use tcf spinlock to protect bpf action private data from concurrent
      modification during dump and init. Remove rtnl lock assertion that is no
      longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-sctp-Avoid-allocating-high-order-memory-with-kmalloc' · 2b14e1ea
      David S. Miller authored
      Konstantin Khorenko says:
      
      ====================
      net/sctp: Avoid allocating high order memory with kmalloc()
      
      Each SCTP association can have up to 65535 input and output streams.
      For each stream type an array of sctp_stream_in or sctp_stream_out
      structures is allocated using kmalloc_array() function. This function
      allocates physically contiguous memory regions, so this can lead
      to allocation of memory regions of very high order, i.e.:
      
        sizeof(struct sctp_stream_out) == 24,
        ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
        which means 9th memory order.
      
      This can lead to memory allocation failures on systems
      under memory stress.
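
The order arithmetic above can be checked with a few lines of C. This is a minimal sketch; `sketch_alloc_order()` is a hypothetical helper, not a kernel function. It rounds the byte count up to whole pages (384 pages for 65535 * 24 bytes, where the truncating division quoted above gives 383) and then finds the smallest power-of-two page count that covers it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order such that (1 << order) pages of page_size bytes
 * can hold 'bytes' bytes in one physically contiguous block.
 */
static unsigned int sketch_alloc_order(size_t bytes, size_t page_size)
{
    size_t pages = (bytes + page_size - 1) / page_size; /* round up to whole pages */
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

For the numbers quoted above, `sketch_alloc_order((size_t)65535 * 24, 4096)` evaluates to 9, that is, a block of 512 contiguous pages.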
      
      We actually do not need these arrays of memory to be physically
      contiguous. A possible simple solution would be to use kvmalloc()
      instead of kmalloc(), as kvmalloc() can allocate physically scattered
      pages if contiguous pages are not available. But the problem
      is that the allocation can happen in a softirq context with
      GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario.
      
      So the other possible solution is to use flexible arrays instead of
      contiguous arrays of memory so that the memory would be allocated
      on a per-page basis.
      
      This patchset replaces kmalloc_array() with flex_array usage.
      It consists of two parts:
      
        * First patch is preparatory - it mechanically wraps all direct
          access to assoc->stream.out[] and assoc->stream.in[] arrays
          with SCTP_SO() and SCTP_SI() wrappers so that later a direct
          array access could be easily changed to an access to a
          flex_array (or any other possible alternative).
        * Second patch replaces kmalloc_array() with flex_array usage.
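
The two steps above (hide element access behind a wrapper, then change what the wrapper does) can be sketched in userspace C. This is not the kernel's flex_array implementation: `paged_array`, `paged_array_get()` and the `SKETCH_SO()` macro are hypothetical, `calloc()` stands in for page allocation, and a 4096-byte page is assumed. The sketch only shows storage split into page-sized chunks with index-based access.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */

/* Hypothetical per-stream state; the real structure differs. */
struct stream_out_sketch {
    unsigned int ssn;
    unsigned int state;
};

/* Array whose elements live in separately allocated page-sized chunks. */
struct paged_array {
    size_t elem_size;
    size_t elems_per_page;
    size_t nr_elems;
    size_t nr_pages;
    void **pages;
};

static void paged_array_free(struct paged_array *pa)
{
    size_t i;

    if (!pa)
        return;
    for (i = 0; pa->pages && i < pa->nr_pages; i++)
        free(pa->pages[i]);
    free(pa->pages);
    free(pa);
}

static struct paged_array *paged_array_alloc(size_t elem_size, size_t nr_elems)
{
    struct paged_array *pa;
    size_t i;

    if (elem_size == 0 || elem_size > SKETCH_PAGE_SIZE)
        return NULL;
    pa = calloc(1, sizeof(*pa));
    if (!pa)
        return NULL;
    pa->elem_size = elem_size;
    pa->elems_per_page = SKETCH_PAGE_SIZE / elem_size;
    pa->nr_elems = nr_elems;
    pa->nr_pages = (nr_elems + pa->elems_per_page - 1) / pa->elems_per_page;
    pa->pages = calloc(pa->nr_pages ? pa->nr_pages : 1, sizeof(void *));
    if (!pa->pages) {
        free(pa);
        return NULL;
    }
    /* Each chunk is one page, so no high-order contiguous block is needed. */
    for (i = 0; i < pa->nr_pages; i++) {
        pa->pages[i] = calloc(1, SKETCH_PAGE_SIZE);
        if (!pa->pages[i]) {
            paged_array_free(pa);
            return NULL;
        }
    }
    return pa;
}

/* Map an element index to its address; NULL when out of range. */
static void *paged_array_get(struct paged_array *pa, size_t idx)
{
    if (idx >= pa->nr_elems)
        return NULL;
    return (char *)pa->pages[idx / pa->elems_per_page] +
           (idx % pa->elems_per_page) * pa->elem_size;
}

/* Accessor in the spirit of the wrappers described above (hypothetical name). */
#define SKETCH_SO(pa, n) ((struct stream_out_sketch *)paged_array_get((pa), (n)))
```

Once every call site goes through the accessor, swapping the backing store from one contiguous array to per-page chunks changes only the accessor and the allocation code.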
      
      v2 changes:
       sctp_stream_in() users are updated to provide stream as an argument,
       sctp_stream_{in,out}_ptr() are now just sctp_stream_{in,out}().
      
      v3 changes:
       Move type changes struct sctp_stream_out -> flex_array to next patch.
       Make sctp_stream_{in,out}() static inline and move them to a header.
      
      Performance results (single stream):
      ====================================
        * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread)
        * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
                RAM: 32 Gb
      
        * netperf: taken from https://github.com/HewlettPackard/netperf.git,
      	     compiled from sources with sctp support
        * netperf server and client are run on the same node
        * ip link set lo mtu 1500
      
      The script used to run tests:
       # cat run_tests.sh
       #!/bin/bash
      
      for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do
        echo "TEST: $test";
        for i in `seq 1 3`; do
          echo "Iteration: $i";
          set -x
          netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 \
                  -l 60 -- -m 1452;
          set +x
        done
      done
      ================================================
      
      Results (a bit reformatted to be more readable):
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      				v4.18-rc7	v4.18-rc7 + fixes
      TEST: SCTP_STREAM
      212992 212992   1452    60.21	1125.52		1247.04
      212992 212992   1452    60.20	1376.38		1149.95
      212992 212992   1452    60.20	1131.40		1163.85
      TEST: SCTP_STREAM_MANY
      212992 212992   1452    60.00	1111.00		1310.05
      212992 212992   1452    60.00	1188.55		1130.50
      212992 212992   1452    60.00	1108.06		1162.50
      
      ===========
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      					v4.18-rc7	v4.18-rc7 + fixes
      TEST: SCTP_RR
      212992 212992 1        1       60.00	45486.98	46089.43
      212992 212992 1        1       60.00	45584.18	45994.21
      212992 212992 1        1       60.00	45703.86	45720.84
      TEST: SCTP_RR_MANY
      212992 212992 1        1       60.00	40.75		40.77
      212992 212992 1        1       60.00	40.58		40.08
      212992 212992 1        1       60.00	39.98		39.97
      
      Performance results for many streams:
      =====================================
         * Kernel: v4.18-rc8 - stock and with 2 patches v3
         * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
                 RAM: 32 Gb
      
         * sctp_test: https://github.com/sctp/lksctp-tools
         * both server and client are run on the same node
         * ip link set lo mtu 1500
         * sysctl -w vm.max_map_count=65530000 (need it to make memory fragmented)
      
      The script used to run tests:
      =============================
       # cat run_sctp_test.sh
       #!/bin/bash
      
      set -x
      
      uname -r
      ip link set lo mtu 1500
      swapoff -a
      
      free
      cat /proc/buddyinfo
      
      ./src/apps/sctp_test -H 127.0.0.1 -P 22222 -l -d 0 &
      sleep 3
      
      time ./src/apps/sctp_test -H 127.0.0.1 -P 22221 -h 127.0.0.1 -p 22222 \
               -s -c 1 -M 65535 -T -t 1 -x 100000 -d 0 1>/dev/null
      
      killall -9 lt-sctp_test
      ===============================
      
      Results (a bit reformatted to be more readable):
      
      1) ms stock kernel v4.18-rc8, no memory fragmentation
      	test 1		test 2		test 3
      real    0m14.715s	0m14.593s	0m15.954s
      user    0m0.954s	0m0.955s	0m0.854s
      sys     0m13.388s	0m12.537s	0m13.749s
      
      2) kernel with fixes, no memory fragmentation
      	test 1		test 2		test 3
      real    0m14.959s	0m14.693s	0m14.762s
      user    0m0.948s	0m0.921s	0m0.929s
      sys     0m13.538s	0m13.225s	0m13.217s
      
      3) kernel with fixes, memory fragmented
      'free':
                     total        used        free      shared  buff/cache   available
      Mem:       32906008    30555200      302740         764     2048068      266452
      Mem:       32906008    30379948      541436         764     1984624      442376
      Mem:       32906008    30717312      262380         764     1926316      109908
      
      /proc/buddyinfo:
      Node 0, zone   Normal  40773     37     34     29      0      0      0      0      0      0      0
      Node 0, zone   Normal 100332     68      8      4      2      1      1      0      0      0      0
      Node 0, zone   Normal  31113      7      2      1      0      0      0      0      0      0      0
      
      	test 1		test 2		test 3
      real    0m14.159s	0m15.252s	0m15.826s
      user    0m0.839s	0m1.004s	0m1.048s
      sys     0m11.827s	0m14.240s	0m14.778s
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sctp: Replace in/out stream arrays with flex_array · 0d493b4d
      Konstantin Khorenko authored
      This patch replaces physically contiguous memory arrays
      allocated using kmalloc_array() with flexible arrays.
      This helps avoid memory allocation failures on
      systems under memory stress.
      Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
      Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>