1. 11 Aug, 2018 40 commits
    • Merge branch 'Remove-rtnl-lock-dependency-from-all-action-implementations' · 9a95d9c6
      David S. Miller authored
      Vlad Buslov says:
      
      ====================
      Remove rtnl lock dependency from all action implementations
      
      Currently, all netlink protocol handlers for updating rules, actions and
      qdiscs are protected with single global rtnl lock which removes any
      possibility for parallelism. This patch set is a second step to remove
      rtnl lock dependency from TC rules update path.
      
      Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
      Handlers registered with this flag are called without RTNL taken. End
      goal is to have rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER,
      etc.) to be registered with UNLOCKED flag to allow parallel execution.
      However, there is no intention to completely remove or split rtnl lock
      itself. This patch set addresses specific problems in implementation of
      tc actions that prevent their control path from being executed
      concurrently. Additional changes are required to refactor classifiers
      API and individual classifiers for parallel execution. This patch set
      lays groundwork to eventually register rule update handlers as
      rtnl-unlocked.
      
      Action API is already prepared for parallel execution with previous
      patch set, which means that action ops that use action API for their
      implementation (delete, search, etc.) do not require additional
      modifications. Action API implements concurrency-safe reference
      counting and guarantees that cleanup/delete is called only once, after
      last reference to action is released.
      
      The goal of this change is to update specific action APIs that access
      action private state directly, in order to be independent from external
      locking. General approach is to re-use existing tcf_lock spinlock (used
      by some action implementations to synchronize control path with data
      path) to protect action private state from concurrent modification. If
      action has rcu-protected pointer, tcf spinlock is used to protect its
      update code, instead of relying on rtnl lock.
      
      Some actions need to determine rtnl mutex status in order to release it.
      For example, ife action can load additional kernel modules (meta ops) and
      must make sure that no locks are held during module load. In such cases
      'rtnl_held' argument is used to conditionally release rtnl mutex.
      
      Changes from V1 to V2:
      - Patch 12:
        - new patch
      - Patch 14:
        - refactor gen_new_estimator() to reuse stats_lock when re-assigning
          rate estimator statistics pointer
      - Remove mirred and tunnel_key helper function changes. (to be submitted
        as standalone patch)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a95d9c6
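The general approach in the cover letter above, one per-action lock guarding private state on both the init (update) path and the dump path, can be modelled in user space. What follows is a minimal Python sketch and not kernel code: `ActionModel`, its `rate`/`burst` fields, the `stress()` helper, and the use of `threading.Lock` as a stand-in for the tcf_lock spinlock are all illustrative assumptions.

```python
import threading


class ActionModel:
    """User-space stand-in for an action with private state (hypothetical)."""

    def __init__(self, rate, burst):
        self._lock = threading.Lock()  # plays the role of tcf_lock
        self._rate = rate
        self._burst = burst

    def init(self, rate, burst):
        # Control path: update all private fields under the lock so that a
        # concurrent dump never observes a half-updated state.
        with self._lock:
            self._rate = rate
            self._burst = burst

    def dump(self):
        # Dump path: take a consistent snapshot under the same lock.
        with self._lock:
            return {"rate": self._rate, "burst": self._burst}


def stress(action, rounds=1000):
    """Run init and dump concurrently; return any inconsistent snapshots."""
    torn = []

    def writer():
        for i in range(rounds):
            action.init(rate=i, burst=i)

    def reader():
        for _ in range(rounds):
            snap = action.dump()
            if snap["rate"] != snap["burst"]:
                torn.append(snap)

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torn
```

Because both paths take the same lock, `stress(ActionModel(0, 0))` should return an empty list: no external lock such as rtnl is needed for the snapshot to be consistent.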
    • net: sched: act_police: remove dependency on rtnl lock · e329bc42
      Vlad Buslov authored
      Use tcf spinlock to protect police action private data from concurrent
      modification during dump. (init already uses tcf spinlock when changing
      police action state)
      
      Pass tcf spinlock as estimator lock argument to gen_replace_estimator()
      during action init.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e329bc42
    • net: core: protect rate estimator statistics pointer with lock · 51a9f5ae
      Vlad Buslov authored
      Extend gen_new_estimator() to also take stats_lock when re-assigning rate
      estimator statistics pointer. (to be used by unlocked actions)
      
      Rename 'stats_lock' to 'lock' and change argument description to explain
      that it is now also used for control path.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      51a9f5ae
    • net: sched: act_mirred: remove dependency on rtnl lock · 4e232818
      Vlad Buslov authored
      Re-introduce mirred list spinlock, which was removed some time ago, to
      protect the mirred list from concurrent modifications instead of relying
      on rtnl lock.
      
      Use tcf spinlock to protect mirred action private data from concurrent
      modification in init and dump. Rearrange access to mirred data in order to
      be performed only while holding the lock.
      
      Rearrange net dev access to always hold reference while working with it,
      instead of relying on rtnl lock.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e232818
    • net: sched: extend action ops with put_dev callback · 84a75b32
      Vlad Buslov authored
      As a preparation for removing dependency on rtnl lock from rules update
      path, all users of shared objects must take reference while working with
      them.
      
      Extend action ops with put_dev() API to be used on net device returned by
      get_dev().
      
      Modify mirred action (only action that implements get_dev callback):
      - Take reference to net device in get_dev.
      - Implement put_dev API that releases reference to net device.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84a75b32
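The get_dev()/put_dev() pairing described in this commit follows the usual take-a-reference, use, release pattern. The sketch below models it in user-space Python under stated assumptions: `NetDeviceModel`, `MirredModel`, and the explicit counter are hypothetical stand-ins, not the kernel's net device reference counting.

```python
import threading


class NetDeviceModel:
    """Hypothetical stand-in for a reference-counted net device."""

    def __init__(self, name):
        self.name = name
        self._refs = 1  # reference held by the owner
        self._lock = threading.Lock()

    def hold(self):
        with self._lock:
            self._refs += 1

    def put(self):
        with self._lock:
            self._refs -= 1
            return self._refs

    def refs(self):
        with self._lock:
            return self._refs


class MirredModel:
    """Hypothetical action exposing get_dev()/put_dev()."""

    def __init__(self, dev):
        self._dev = dev

    def get_dev(self):
        # Take a reference before handing the device to the caller, so the
        # device stays valid without relying on an external lock.
        self._dev.hold()
        return self._dev

    def put_dev(self, dev):
        # Release the reference taken by get_dev().
        dev.put()
```

A caller would pair the two calls: `d = act.get_dev()`, work with `d`, then `act.put_dev(d)`, after which the count is back at its previous value.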
    • net: sched: act_vlan: remove dependency on rtnl lock · 764e9a24
      Vlad Buslov authored
      Use tcf spinlock to protect vlan action private data from concurrent
      modification during dump and init. Use rcu swap operation to reassign
      params pointer under protection of tcf lock. (old params value is not used
      by init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl assertion that is no longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      764e9a24
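The "swap the params pointer under the lock" step from this commit message can be illustrated with a small user-space model. This is a sketch under assumptions, not the kernel implementation: `VlanModel` and its methods are hypothetical, `threading.Lock` stands in for tcf_lock, and a plain attribute stands in for the RCU-protected pointer (Python has no grace periods, so the old value is simply returned to the caller).

```python
import threading


class VlanModel:
    """Hypothetical action whose parameters live behind one pointer."""

    def __init__(self, params):
        self._lock = threading.Lock()  # plays the role of tcf_lock
        self._params = params          # plays the role of the RCU pointer

    def init(self, new_params):
        # Swap the pointer under the lock. The old value is never read by
        # init, it is only handed back so that it can be disposed of later.
        with self._lock:
            old, self._params = self._params, new_params
        return old  # kernel code would free this after a grace period

    def act(self):
        # Data path: a reader sees either the old or the new params object
        # as a whole, because a published object is never modified in place.
        return self._params
```

Replacing the whole object instead of editing fields in place is what lets readers proceed without taking the lock.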
    • net: sched: act_tunnel_key: remove dependency on rtnl lock · 729e0126
      Vlad Buslov authored
      Use tcf lock to protect tunnel key action struct private data from
      concurrent modification in init and dump. Use rcu swap operation to
      reassign params pointer under protection of tcf lock. (old params value is
      not used by init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl lock assertion that is no longer required.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      729e0126
    • net: sched: act_skbmod: remove dependency on rtnl lock · c8814552
      Vlad Buslov authored
      Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
      spinlock to protect private skbmod data from concurrent modification during
      dump.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8814552
    • net: sched: act_simple: remove dependency on rtnl lock · 5e48180e
      Vlad Buslov authored
      Use tcf spinlock to protect private simple action data from concurrent
      modification during dump. (simple init already uses tcf spinlock when
      changing action state)
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e48180e
    • net: sched: act_sample: remove dependency on rtnl lock · d7728495
      Vlad Buslov authored
      Use tcf spinlock to protect private sample action data from concurrent
      modification during dump and init.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7728495
    • net: sched: act_pedit: remove dependency on rtnl lock · 67b0c1a3
      Vlad Buslov authored
      Rearrange pedit init code to only access pedit action data while holding
      tcf spinlock. Change keys allocation type to atomic to allow it to execute
      while holding tcf spinlock. Take tcf spinlock in dump function when
      accessing pedit action data.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67b0c1a3
    • net: sched: act_ipt: remove dependency on rtnl lock · ff25276d
      Vlad Buslov authored
      Use tcf spinlock to protect ipt action private data from concurrent
      modification during dump. Ipt init already takes tcf spinlock when
      modifying ipt state.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff25276d
    • net: sched: act_ife: remove dependency on rtnl lock · 54d0d423
      Vlad Buslov authored
      Use tcf spinlock and rcu to protect params pointer from concurrent
      modification during dump and init. Use rcu swap operation to reassign
      params pointer under protection of tcf lock. (old params value is not used
      by init, so there is no need of standalone rcu dereference step)
      
      Ife action has meta-actions that are compiled as standalone modules. Rtnl
      mutex must be released while loading a kernel module. In order to support
      execution without rtnl mutex, propagate 'rtnl_held' argument to meta action
      loading functions. When requesting meta action module, conditionally
      release rtnl lock depending on 'rtnl_held' argument.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54d0d423
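The conditional release driven by the 'rtnl_held' argument can be sketched as follows. This is a user-space Python model with assumed names: the module-level `rtnl` lock stands in for the rtnl mutex, and `load_meta_module()` and its `loader` callback are hypothetical, not functions from the patch.

```python
import threading

rtnl = threading.Lock()  # stand-in for the rtnl mutex


def load_meta_module(name, rtnl_held, loader):
    """Call loader(name), dropping the rtnl stand-in first if it is held.

    rtnl_held tells this function whether the caller entered with the lock
    taken; the lock is only released and re-taken in that case.
    """
    if rtnl_held:
        rtnl.release()
    try:
        return loader(name)
    finally:
        # Restore the caller's locking state before returning.
        if rtnl_held:
            rtnl.acquire()
```

A caller on the locked path passes `rtnl_held=True` and gets the lock back on return; a caller on an unlocked path passes `False` and the lock is never touched. In both cases the loader runs with the lock not held.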
    • net: sched: act_gact: remove dependency on rtnl lock · e8917f43
      Vlad Buslov authored
      Use tcf spinlock to protect gact action private state from concurrent
      modification during dump and init. Remove rtnl assertion that is no longer
      necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8917f43
    • net: sched: act_csum: remove dependency on rtnl lock · b6a2b971
      Vlad Buslov authored
      Use tcf lock to protect csum action struct private data from concurrent
      modification in init and dump. Use rcu swap operation to reassign params
      pointer under protection of tcf lock. (old params value is not used by
      init, so there is no need of standalone rcu dereference step)
      
      Remove rtnl assertion that is no longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6a2b971
    • net: sched: act_bpf: remove dependency on rtnl lock · 2142236b
      Vlad Buslov authored
      Use tcf spinlock to protect bpf action private data from concurrent
      modification during dump and init. Remove rtnl lock assertion that is no
      longer necessary.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2142236b
    • Merge branch 'net-sctp-Avoid-allocating-high-order-memory-with-kmalloc' · 2b14e1ea
      David S. Miller authored
      Konstantin Khorenko says:
      
      ====================
      net/sctp: Avoid allocating high order memory with kmalloc()
      
      Each SCTP association can have up to 65535 input and output streams.
      For each stream type an array of sctp_stream_in or sctp_stream_out
      structures is allocated using kmalloc_array() function. This function
      allocates physically contiguous memory regions, so this can lead
      to allocation of memory regions of very high order, i.e.:
      
        sizeof(struct sctp_stream_out) == 24,
        ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
        which means 9th memory order.
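
The arithmetic in the example above can be checked with a few lines of Python. The constants are taken from the text; the helper names `pages_needed()` and `allocation_order()` are illustrative. Note that rounding the page count up gives 384 rather than the truncated 383, and the resulting order is 9 either way.

```python
PAGE_SIZE = 4096       # bytes per page, as in the example above
STREAM_OUT_SIZE = 24   # sizeof(struct sctp_stream_out) as quoted in the text
MAX_STREAMS = 65535


def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Number of whole pages required to hold nbytes (rounded up)."""
    return -(-nbytes // page_size)


def allocation_order(nbytes, page_size=PAGE_SIZE):
    """Smallest order such that 2**order contiguous pages hold nbytes."""
    pages = pages_needed(nbytes, page_size)
    return (pages - 1).bit_length()


total = MAX_STREAMS * STREAM_OUT_SIZE
print(total, pages_needed(total), allocation_order(total))
```

An order-9 request means 512 physically contiguous pages, which is the kind of allocation that becomes hard to satisfy once memory is fragmented.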
      
      This can lead to memory allocation failures on systems
      under memory stress.
      
      We actually do not need these arrays of memory to be physically
      contiguous. A possible simple solution would be to use kvmalloc()
      instead of kmalloc(), as kvmalloc() can allocate physically scattered
      pages if contiguous pages are not available. But the problem
      is that the allocation can happen in a softirq context with
      GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario.
      
      So the other possible solution is to use flexible arrays instead of
      contiguous arrays of memory so that the memory would be allocated
      on a per-page basis.
      
      This patchset replaces kmalloc_array() with flex_array usage.
      It consists of two parts:
      
        * First patch is preparatory - it mechanically wraps all direct
          access to assoc->stream.out[] and assoc->stream.in[] arrays
          with SCTP_SO() and SCTP_SI() wrappers so that later a direct
          array access could be easily changed to an access to a
          flex_array (or any other possible alternative).
        * Second patch replaces kmalloc_array() with flex_array usage.
      
      v2 changes:
       sctp_stream_in() users are updated to provide stream as an argument,
       sctp_stream_{in,out}_ptr() are now just sctp_stream_{in,out}().
      
      v3 changes:
       Move type changes struct sctp_stream_out -> flex_array to next patch.
       Make sctp_stream_{in,out}() static inline and move them to a header.
      
      Performance results (single stream):
      ====================================
        * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread)
        * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
                RAM: 32 Gb
      
        * netperf: taken from https://github.com/HewlettPackard/netperf.git,
      	     compiled from sources with sctp support
        * netperf server and client are run on the same node
        * ip link set lo mtu 1500
      
      The script used to run tests:
       # cat run_tests.sh
       #!/bin/bash
      
      for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do
        echo "TEST: $test";
        for i in `seq 1 3`; do
          echo "Iteration: $i";
          set -x
          netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 \
                  -l 60 -- -m 1452;
          set +x
        done
      done
      ================================================
      
      Results (a bit reformatted to be more readable):
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      				v4.18-rc7	v4.18-rc7 + fixes
      TEST: SCTP_STREAM
      212992 212992   1452    60.21	1125.52		1247.04
      212992 212992   1452    60.20	1376.38		1149.95
      212992 212992   1452    60.20	1131.40		1163.85
      TEST: SCTP_STREAM_MANY
      212992 212992   1452    60.00	1111.00		1310.05
      212992 212992   1452    60.00	1188.55		1130.50
      212992 212992   1452    60.00	1108.06		1162.50
      
      ===========
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      					v4.18-rc7	v4.18-rc7 + fixes
      TEST: SCTP_RR
      212992 212992 1        1       60.00	45486.98	46089.43
      212992 212992 1        1       60.00	45584.18	45994.21
      212992 212992 1        1       60.00	45703.86	45720.84
      TEST: SCTP_RR_MANY
      212992 212992 1        1       60.00	40.75		40.77
      212992 212992 1        1       60.00	40.58		40.08
      212992 212992 1        1       60.00	39.98		39.97
      
      Performance results for many streams:
      =====================================
         * Kernel: v4.18-rc8 - stock and with 2 patches v3
         * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
                 RAM: 32 Gb
      
         * sctp_test: https://github.com/sctp/lksctp-tools
         * both server and client are run on the same node
         * ip link set lo mtu 1500
         * sysctl -w vm.max_map_count=65530000 (need it to make memory fragmented)
      
      The script used to run tests:
      =============================
       # cat run_sctp_test.sh
       #!/bin/bash
      
      set -x
      
      uname -r
      ip link set lo mtu 1500
      swapoff -a
      
      free
      cat /proc/buddyinfo
      
      ./src/apps/sctp_test -H 127.0.0.1 -P 22222 -l -d 0 &
      sleep 3
      
      time ./src/apps/sctp_test -H 127.0.0.1 -P 22221 -h 127.0.0.1 -p 22222 \
               -s -c 1 -M 65535 -T -t 1 -x 100000 -d 0 1>/dev/null
      
      killall -9 lt-sctp_test
      ===============================
      
      Results (a bit reformatted to be more readable):
      
      1) ms stock kernel v4.18-rc8, no memory fragmentation
      	test 1		test 2		test 3
      real    0m14.715s	0m14.593s	0m15.954s
      user    0m0.954s	0m0.955s	0m0.854s
      sys     0m13.388s	0m12.537s	0m13.749s
      
      2) kernel with fixes, no memory fragmentation
      	test 1		test 2		test 3
      real    0m14.959s	0m14.693s	0m14.762s
      user    0m0.948s	0m0.921s	0m0.929s
      sys     0m13.538s	0m13.225s	0m13.217s
      
      3) kernel with fixes, memory fragmented
      'free':
                     total        used        free      shared  buff/cache   available
      Mem:       32906008    30555200      302740         764     2048068      266452
      Mem:       32906008    30379948      541436         764     1984624      442376
      Mem:       32906008    30717312      262380         764     1926316      109908
      
      /proc/buddyinfo:
      Node 0, zone   Normal  40773     37     34     29      0      0      0      0      0      0      0
      Node 0, zone   Normal 100332     68      8      4      2      1      1      0      0      0      0
      Node 0, zone   Normal  31113      7      2      1      0      0      0      0      0      0      0
      
      	test 1		test 2		test 3
      real    0m14.159s	0m15.252s	0m15.826s
      user    0m0.839s	0m1.004s	0m1.048s
      sys     0m11.827s	0m14.240s	0m14.778s
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b14e1ea
    • net/sctp: Replace in/out stream arrays with flex_array · 0d493b4d
      Konstantin Khorenko authored
      This patch replaces physically contiguous memory arrays
      allocated using kmalloc_array() with flexible arrays.
      This avoids memory allocation failures on
      systems under memory stress.
      Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
      Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d493b4d
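The idea of allocating the stream array page by page and reaching elements through an accessor can be modelled in a few lines. This is a user-space sketch, not the kernel flex_array API: `PagedArray` and its methods are hypothetical, and Python lists stand in for separately allocated pages. The 24-byte element size is the sizeof(struct sctp_stream_out) quoted in the merge message.

```python
PAGE_SIZE = 4096


class PagedArray:
    """Fixed-size records stored in independently allocated pages."""

    def __init__(self, elem_size, count):
        self.per_page = PAGE_SIZE // elem_size
        self.count = count
        npages = -(-count // self.per_page)  # ceiling division
        # Each page is a separate allocation; no page needs to be
        # physically adjacent to any other.
        self.pages = [[None] * self.per_page for _ in range(npages)]

    def _locate(self, index):
        # Map a flat element index to (page number, slot within page).
        if not 0 <= index < self.count:
            raise IndexError(index)
        return divmod(index, self.per_page)

    def get(self, index):
        page, slot = self._locate(index)
        return self.pages[page][slot]

    def put(self, index, value):
        page, slot = self._locate(index)
        self.pages[page][slot] = value
```

With 24-byte elements, 170 fit in one page, so 65535 streams need 386 single-page allocations instead of one order-9 block. Callers only use `get()`/`put()`, which is the role the accessor wrappers play for the stream arrays.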
    • net/sctp: Make wrappers for accessing in/out streams · 05364ca0
      Konstantin Khorenko authored
      This patch introduces wrappers for accessing in/out streams indirectly.
      This makes it possible to replace physically contiguous memory arrays
      of streams with flexible arrays (or maybe any other appropriate
      mechanism) which do memory allocation on a per-page basis.
      Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
      Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05364ca0
    • tc: Update README and add config · b70f1f3a
      Keara Leibovitz authored
      Updated README.
      
      Added config file that contains the minimum required features enabled to
      run the tests currently present in the kernel.
      This must be updated when new unittests are created and require their own
      modules.
      Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b70f1f3a
    • Merge branch 'l2tp-rework-pppol2tp-ioctl-handling' · 3305f9a9
      David S. Miller authored
      Guillaume Nault says:
      
      ====================
      l2tp: rework pppol2tp ioctl handling
      
      The current ioctl() handling code can be simplified. It tests for
      non-relevant conditions and uselessly holds sockets. Once useless
      code is removed, it becomes even simpler to let pppol2tp_ioctl() handle
      commands directly, rather than dispatch them to pppol2tp_tunnel_ioctl()
      or pppol2tp_session_ioctl(). That is the approach taken by this series.
      
      Patch #1 and #2 define helper functions aimed at simplifying the rest
      of the patch set.
      
      Patch #3 drops useless tests in pppol2tp_ioctl() and avoids holding a
      refcount on the socket.
      
      Patches #4, #5 and #6 are the core of the series. They let
      pppol2tp_ioctl() handle all ioctls and drop the tunnel and session
      specific functions.
      
      Then patch #7 brings a little bit of consolidation.

      Finally, patch #8 takes advantage of the simplified code to make
      pppol2tp sockets compatible with dev_ioctl(). Certainly not a killer
      feature, but it is trivial and it is always nice to see l2tp getting
      better integration with the rest of the stack.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3305f9a9
    • l2tp: let pppol2tp_ioctl() fallback to dev_ioctl() · 4f5f85e9
      Guillaume Nault authored
      Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl()
      handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX.
      PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring
      this mechanism.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f5f85e9
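The fallback mechanism this commit relies on, a protocol handler returning -ENOIOCTLCMD so that a more generic layer gets a chance, can be sketched as a small dispatch model. The functions below are hypothetical Python stand-ins, not kernel code: commands are represented by their names rather than their numeric values, and the final -ENOTTY result for a command nobody handles is an assumption of this model.

```python
import errno

ENOIOCTLCMD = 515  # kernel-internal "no ioctl command" code


def proto_ioctl(cmd):
    """Hypothetical protocol handler: knows only its own commands."""
    if cmd in ("PPPIOCGMRU", "PPPIOCSMRU", "PPPIOCGFLAGS", "PPPIOCSFLAGS"):
        return 0
    # Unknown here: report that, so the caller can try a generic handler.
    return -ENOIOCTLCMD


def generic_ioctl(cmd):
    """Hypothetical generic layer, standing in for dev_ioctl()."""
    if cmd in ("SIOCGIFNAME", "SIOCGIFINDEX"):
        return 0
    return -errno.ENOTTY


def sock_ioctl(cmd):
    """Try the protocol handler first, then fall back to the generic one."""
    ret = proto_ioctl(cmd)
    if ret == -ENOIOCTLCMD:
        ret = generic_ioctl(cmd)
    return ret
```

If the protocol handler returned some other error for unknown commands, the fallback branch would never run, which is why returning -ENOIOCTLCMD matters.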
    • l2tp: zero out stats in pppol2tp_copy_stats() · 7390ed8a
      Guillaume Nault authored
      Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it
      manually every time.
      
      While there, constify 'stats'.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7390ed8a
    • l2tp: remove pppol2tp_session_ioctl() · b0e29063
      Guillaume Nault authored
      pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS
      on session sockets. We just need to copy the stats and set ->session_id.
      
      As a side effect of sharing session and tunnel code, ->using_ipsec is
      properly set even when the request was made using a session socket.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0e29063
    • l2tp: remove pppol2tp_tunnel_ioctl() · 528534f0
      Guillaume Nault authored
      Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a
      tunnel. This one is a bit special because the caller may use the tunnel
      socket to retrieve statistics of one of its sessions. If the session_id
      is set, the corresponding session's statistics are returned, instead of
      those of the tunnel. This is handled by the new
      pppol2tp_tunnel_copy_stats() helper function.
      
      Set ->tunnel_id and ->using_ipsec out of the conditional, so
      that it can be used by the 'else' branch in the following patch.
      We cannot do that for ->session_id, because tunnel sockets have to
      report the value that was originally passed in 'stats.session_id',
      while session sockets have to report their own session_id.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      528534f0
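The selection rule described in this commit message (a request on a tunnel socket returns the tunnel's statistics unless session_id is set, in which case that session's statistics are returned) can be modelled as below. The dictionaries, the `tunnel_copy_stats()` name, and the `LookupError` for an unknown session are assumptions made for illustration; they do not reproduce the kernel helper or its error codes.

```python
def tunnel_copy_stats(tunnel, req):
    """Answer a stats request made on a tunnel socket (hypothetical model).

    req keeps the session_id that was originally passed in, as the commit
    message requires for tunnel sockets.
    """
    out = dict(req)
    out["tunnel_id"] = tunnel["tunnel_id"]
    sid = req.get("session_id", 0)
    if sid == 0:
        # No session requested: report the tunnel's own counters.
        out["stats"] = dict(tunnel["stats"])
        return out
    session = tunnel["sessions"].get(sid)
    if session is None:
        raise LookupError(sid)  # placeholder for the kernel's error path
    out["stats"] = dict(session["stats"])
    return out
```

For example, with a tunnel holding session 3, a request with `session_id` 0 yields the tunnel counters and a request with `session_id` 3 yields that session's counters, with `session_id` echoed back unchanged.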
    • l2tp: handle PPPIOC[GS]MRU and PPPIOC[GS]FLAGS in pppol2tp_ioctl() · 79e6760e
      Guillaume Nault authored
      Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on
      pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79e6760e
    • l2tp: simplify pppol2tp_ioctl() · bdd0292f
      Guillaume Nault authored
        * Drop test on 'sk': sock->sk cannot be NULL, or pppox_ioctl() could
          not have called us.
      
        * Drop test on 'SOCK_DEAD' state: if this flag was set, the socket
          would be in the process of being released and no ioctl could be
          running anymore.
      
        * Drop test on 'PPPOX_*' state: we depend on ->sk_user_data to get
          the session structure. If it is non-NULL, then the socket is
          connected. Testing for PPPOX_* is redundant.
      
        * Retrieve session using ->sk_user_data directly, instead of going
          through pppol2tp_sock_to_session(). This avoids grabbing a useless
          reference on the socket.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdd0292f
    • l2tp: split l2tp_session_get() · 01e28b92
      Guillaume Nault authored
      l2tp_session_get() is used for two different purposes. If 'tunnel' is
      NULL, the session is searched globally in the supplied network
      namespace. Otherwise it is searched exclusively in the tunnel context.
      
      Callers always know the context in which they need to search the
      session. But some of them do provide both a namespace and a tunnel,
      making the semantic of the call unclear.
      
      This patch defines l2tp_tunnel_get_session() for lookups done in a
      tunnel and restricts l2tp_session_get() to namespace searches.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01e28b92
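The difference between the two lookup scopes can be shown with a toy model. The data layout and both functions below are hypothetical Python stand-ins named after the helpers in the commit message; they only illustrate why a caller should state which scope it means.

```python
def session_get(net, session_id):
    """Namespace-wide lookup (model of l2tp_session_get())."""
    return net["sessions"].get(session_id)


def tunnel_get_session(tunnel, session_id):
    """Tunnel-scoped lookup (model of l2tp_tunnel_get_session())."""
    return tunnel["sessions"].get(session_id)


# Two tunnels in one namespace; the namespace index sees every session.
tunnel_a = {"sessions": {1: "session-1"}}
tunnel_b = {"sessions": {2: "session-2"}}
net = {"sessions": {**tunnel_a["sessions"], **tunnel_b["sessions"]}}
```

Here `session_get(net, 2)` finds the session while `tunnel_get_session(tunnel_a, 2)` does not, so a single function accepting both a namespace and a tunnel leaves it unclear which answer the caller expects.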
    • l2tp: define l2tp_tunnel_uses_xfrm() · d6a61ec9
      Guillaume Nault authored
      Use helper function to figure out if a tunnel is using ipsec.
      Also, avoid accessing ->sk_policy directly since it's RCU protected.
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6a61ec9
    • Merge branch 'netsec-driver-improvements' · 8a8982d1
      David S. Miller authored
      Ilias Apalodimas says:
      
      ====================
      netsec driver improvements
      
      This patchset introduces some improvements on socionext netsec driver.
       - patch 1/2, avoids unneeded MMIO reads on the Rx path
       - patch 2/2, is adjusting the numbers of descriptors used
      
      Changes since v1:
       - Move dma_rmb() to protect descriptor accesses until the device
       has updated the NETSEC_RX_PKT_OWN_FIELD bit
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a8982d1
    • net: socionext: Increase descriptors to 256 · b6311b7b
      Ilias Apalodimas authored
      Increasing descriptors to 256 from 128 and adjusting the NAPI weight
      to 64 increases performance on Rx by ~20% on 64-byte packets.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6311b7b
    • net: socionext: Use descriptor info instead of MMIO reads on Rx · 63ae7949
      Ilias Apalodimas authored
      MMIO reads for remaining packets in queue occur (at least) twice per
      invocation of netsec_process_rx(). We can use the packet descriptor to
      identify if it's owned by the hardware and break out, avoiding the more
      expensive MMIO read operations. This yields a ~2% increase in pps on the
      Rx path when tested with 64-byte packets.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63ae7949
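The Rx loop described in this commit stops when it reaches a descriptor still owned by the hardware, instead of asking the device how many packets remain. The following is a user-space Python model of that loop under assumptions: the ring layout, the `hw_owned` and `data` field names, and `process_rx()` are hypothetical, and the memory-ordering barrier a real driver needs after reading the ownership bit has no equivalent here.

```python
def process_rx(ring, head, budget):
    """Consume descriptors from head until one is hardware-owned.

    Returns (packets, new_head). Stops early when the budget is used up.
    """
    packets = []
    done = 0
    while done < budget:
        desc = ring[head]
        if desc["hw_owned"]:
            # The device has not filled this slot yet: nothing more to
            # read, and no register access was needed to find that out.
            break
        packets.append(desc["data"])
        desc["hw_owned"] = True  # hand the slot back to the device
        head = (head + 1) % len(ring)
        done += 1
    return packets, head
```

The ownership flag travels with the descriptor in ordinary memory, so checking it costs one memory read per descriptor rather than a device register read per poll.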
    • vxge: remove set but not used variable 'req_out', 'status' and 'ret' · 78aca3bb
      YueHaibing authored
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      drivers/net/ethernet/neterion/vxge/vxge-config.c:1097:6: warning:
       variable 'ret' set but not used [-Wunused-but-set-variable]
      drivers/net/ethernet/neterion/vxge/vxge-config.c:2263:6: warning:
       variable 'req_out' set but not used [-Wunused-but-set-variable]
      drivers/net/ethernet/neterion/vxge/vxge-config.c:2262:22: warning:
       variable 'status' set but not used [-Wunused-but-set-variable]
      drivers/net/ethernet/neterion/vxge/vxge-config.c:2360:22: warning:
       variable 'status' set but not used [-Wunused-but-set-variable]
        enum vxge_hw_status status = VXGE_HW_OK;
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      78aca3bb
    • Merge branch 'virtio_net-Expand-affinity-to-arbitrary-numbers-of-cpu-and-vq' · 29afde50
      David S. Miller authored
      Caleb Raitto says:
      
      ====================
      virtio_net: Expand affinity to arbitrary numbers of cpu and vq
      
      Virtio-net tries to pin each virtual queue rx and tx interrupt to a cpu if
      there are as many queues as cpus.
      
      Expand this heuristic to configure a reasonable affinity setting also
      when the number of cpus != the number of virtual queues.
      
      Patch 1 allows vqs to take an affinity mask with more than 1 cpu.
      Patch 2 generalizes the algorithm in virtnet_set_affinity beyond
      the case where #cpus == #vqs.
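
One way to spread CPUs over queues when the counts differ is to give each queue a contiguous group of CPUs, with the first few queues taking one extra CPU when the division leaves a remainder. The Python sketch below does that; `spread_cpus()` and `xps_mask()` are illustrative names, and whether virtnet_set_affinity computes its groups exactly this way is an assumption. The sketch does produce the same tx queue masks as the test output in this message for 16 CPUs with 16, 15 and 8 queues.

```python
def spread_cpus(num_cpus, num_queues):
    """Return one list of CPU ids per queue."""
    stride = max(num_cpus // num_queues, 1)
    # Leftover CPUs are handed out one each to the first queues.
    stragglers = num_cpus % num_queues if num_cpus >= num_queues else 0
    groups = []
    cpu = 0
    for q in range(num_queues):
        size = stride + (1 if q < stragglers else 0)
        groups.append([(cpu + k) % num_cpus for k in range(size)])
        cpu = (cpu + size) % num_cpus
    return groups


def xps_mask(cpus):
    """Format a CPU list as a 4-digit hex bitmask."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "04x")


for queues in (16, 15, 8):
    print(queues, [xps_mask(g) for g in spread_cpus(16, queues)])
```

With 15 queues the first queue gets CPUs 0-1 (mask 0003) and the rest get one CPU each; with 8 queues every queue gets a pair (0003, 000c, and so on).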
      
      v2 changes:
      Renamed "virtio_net: Make vp_set_vq_affinity() take a mask." to
      "virtio: Make vp_set_vq_affinity() take a mask."
      
      Tested:
      
      [InstanceSetup]
      set_multiqueue = false
      
      $ cd /proc/irq
      $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
      0-15
      0
      0
      1
      1
      2
      2
      3
      3
      4
      4
      5
      5
      6
      6
      7
      7
      8
      8
      9
      9
      10
      10
      11
      11
      12
      12
      13
      13
      14
      14
      15
      15
      0-15
      0-15
      0-15
      0-15
      
      $ cd /sys/class/net/eth0/queues/
      $ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done
      0001
      0002
      0004
      0008
      0010
      0020
      0040
      0080
      0100
      0200
      0400
      0800
      1000
      2000
      4000
      8000
      
      $ sudo ethtool -L eth0 combined 15
      
      $ cd /proc/irq
      $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
      0-15
      0-1
      0-1
      2
      2
      3
      3
      4
      4
      5
      5
      6
      6
      7
      7
      8
      8
      9
      9
      10
      10
      11
      11
      12
      12
      13
      13
      14
      14
      15
      15
      15
      15
      0-15
      0-15
      0-15
      0-15
      
      $ cd /sys/class/net/eth0/queues/
      $ for i in `seq 0 14` ; do sudo grep ".*" tx-$i/xps_cpus; done
      0003
      0004
      0008
      0010
      0020
      0040
      0080
      0100
      0200
      0400
      0800
      1000
      2000
      4000
      8000
      
      $ sudo ethtool -L eth0 combined 8
      
      $ cd /proc/irq
      $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
      0-15
      0-1
      0-1
      2-3
      2-3
      4-5
      4-5
      6-7
      6-7
      8-9
      8-9
      10-11
      10-11
      12-13
      12-13
      14-15
      14-15
      9
      9
      10
      10
      11
      11
      12
      12
      13
      13
      14
      14
      15
      15
      15
      15
      0-15
      0-15
      0-15
      0-15
      
      $ cd /sys/class/net/eth0/queues/
      $ for i in `seq 0 7` ; do sudo grep ".*" tx-$i/xps_cpus; done
      0003
      000c
      0030
      00c0
      0300
      0c00
      3000
      c000
      
      $ sudo ethtool -L eth0 combined 16
      $ sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu15/online"
      
      $ cd /proc/irq
      $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
      0-15
      0
      0
      1
      1
      2
      2
      3
      3
      4
      4
      5
      5
      6
      6
      7
      7
      8
      8
      9
      9
      10
      10
      11
      11
      12
      12
      13
      13
      14
      14
      0
      0
      0-15
      0-15
      0-15
      0-15
      
      $ cd /sys/class/net/eth0/queues/
      $ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done
      0001
      0002
      0004
      0008
      0010
      0020
      0040
      0080
      0100
      0200
      0400
      0800
      1000
      2000
      4000
      0001
      
      $ for i in `seq 8 15`; \
      do sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu$i/online"; done
      
      $ cd /proc/irq
      $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list;  done
      0-15
      0
      0
      1
      1
      2
      2
      3
      3
      4
      4
      5
      5
      6
      6
      7
      7
      0
      0
      1
      1
      2
      2
      3
      3
      4
      4
      5
      5
      6
      6
      7
      7
      0-15
      0-15
      0-15
      0-15
      
      $ cd /sys/class/net/eth0/queues/
      $ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done
      0001
      0002
      0004
      0008
      0010
      0020
      0040
      0080
      0001
      0002
      0004
      0008
      0010
      0020
      0040
      0080
      ====================
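
The test transcript above prints affinities in two notations: `smp_affinity_list` uses CPU ranges such as `0-1`, while `xps_cpus` uses a hex bitmap such as `0003`. A minimal sketch for comparing the two follows. The helper names `parse_cpu_list` and `parse_hex_mask` are hypothetical and not part of the patch set.

```python
def parse_cpu_list(text):
    """Parse an smp_affinity_list value such as "0-1" or "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def parse_hex_mask(text):
    """Parse an xps_cpus value such as "000c" into a set of CPU ids."""
    # Wider masks are printed as comma-separated 32-bit groups; drop the commas.
    bits = int(text.strip().replace(",", ""), 16)
    # Bit n set means CPU n is part of the mask.
    return {cpu for cpu in range(bits.bit_length()) if bits & (1 << cpu)}
```

With these helpers, the IRQ affinity `2-3` and the XPS mask `000c` from the `combined 8` run decode to the same CPU set, which is the property the transcript is meant to demonstrate.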
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29afde50
    • Caleb Raitto's avatar
      virtio_net: Stripe queue affinities across cores. · 2ca653d6
      Caleb Raitto authored
      Always set the affinity hint, even if #cpu != #vq.
      
      Handle the case where #cpu > #vq (including when #cpu % #vq != 0) and
      when #vq > #cpu (including when #vq % #cpu != 0).
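
The striping described above can be sketched in Python as follows. This is an illustrative model reconstructed from the test output in the cover letter, not the driver code itself. The names `stripe_affinity` and `xps_mask`, and the four-digit mask width, are assumptions made for the example.

```python
def stripe_affinity(online_cpus, num_queues):
    """Return one list of CPU ids per queue, striping queues across online CPUs."""
    cpus = sorted(online_cpus)
    num_cpu = len(cpus)
    # Every queue gets at least one CPU; with more CPUs than queues,
    # every queue gets num_cpu // num_queues CPUs.
    stride = max(num_cpu // num_queues, 1)
    # Leftover CPUs (only when #cpu >= #vq) go one each to the first queues.
    stragglers = num_cpu % num_queues if num_cpu >= num_queues else 0

    groups = []
    pos = 0  # index into cpus; wraps around when #vq > #cpu
    for queue in range(num_queues):
        group_size = stride + (1 if queue < stragglers else 0)
        groups.append([cpus[(pos + j) % num_cpu] for j in range(group_size)])
        pos = (pos + group_size) % num_cpu
    return groups


def xps_mask(cpu_group, width=4):
    """Format a CPU group as a zero-padded hex bitmap (width is an assumption)."""
    bits = 0
    for cpu in cpu_group:
        bits |= 1 << cpu
    return format(bits, "0{}x".format(width))
```

For example, `[xps_mask(g) for g in stripe_affinity(range(16), 8)]` yields `0003`, `000c`, `0030` and so on, matching the `combined 8` run in the cover letter. With 15 online CPUs and 16 queues, the last queue wraps back to CPU 0.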
      Signed-off-by: Caleb Raitto <caraitto@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Jon Olson <jonolson@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ca653d6
    • Caleb Raitto's avatar
      virtio: Make vp_set_vq_affinity() take a mask. · 19e226e8
      Caleb Raitto authored
      Make vp_set_vq_affinity() take a cpumask instead of taking a single CPU.
      
      If there are fewer queues than cores, queue affinity should be able to
      map to multiple cores.
      
      Link: https://patchwork.ozlabs.org/patch/948149/
      Suggested-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Caleb Raitto <caraitto@google.com>
      Acked-by: Gonglei <arei.gonglei@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19e226e8
    • Bryan Whitehead's avatar
      lan743x: lan743x: Add PTP support · 07624df1
      Bryan Whitehead authored
      PTP support includes:
          Ingress and egress timestamping.
          One step timestamping available.
          PTP clock support.
          Periodic output support.
      Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      07624df1
    • David S. Miller's avatar
      Merge branch 'tcp-new-mechanism-to-ACK-immediately' · 217e502b
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      tcp: new mechanism to ACK immediately
      
      This patch set is a follow-up feature improvement to the recent fixes
      for the performance issues in ECN (delayed) ACKs. Many of the fixes use
      the tcp_enter_quickack_mode routine to force immediate ACKs. However,
      the routine also resets interactive session tracking. This is not ideal
      because these immediate ACKs are required by protocol specifics
      unrelated to the interactive nature of the application.
      
      This patch set introduces a new flag to send a one-time immediate ACK
      without changing the status of interactive session tracking. With this
      patch set the immediate ACKs are generated upon these protocol states:
      
      1) When a hole is repaired
      2) When CE status changes between subsequent data packets received
      3) When a data packet carries CWR flag
      ====================
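
The difference between the two approaches described in the cover letter can be illustrated with a toy model. This is a sketch under stated assumptions, not kernel code. The class `AckState`, its fields, and the flag values are hypothetical. Only the names `tcp_enter_quickack_mode` and `ACK_NOW` come from the commit messages.

```python
ACK_SCHED = 0x1  # illustrative value: an ACK is pending
ACK_NOW = 0x2    # illustrative value: send the pending ACK immediately, once


class AckState:
    """Toy model of a receiver's delayed-ACK bookkeeping."""

    def __init__(self):
        self.pending = 0
        self.quick = 0        # remaining quick (non-delayed) ACKs
        self.pingpong = True  # True while the session is tracked as interactive

    def enter_quickack_mode(self, max_quickacks=2):
        # Models the older approach: forces immediate ACKs, but also
        # resets interactive session tracking as a side effect.
        self.quick = max_quickacks
        self.pingpong = False

    def mark_ack_now(self):
        # Models the new approach: request exactly one immediate ACK
        # and leave session tracking untouched.
        self.pending |= ACK_SCHED | ACK_NOW

    def should_ack_immediately(self):
        return bool(self.pending & ACK_NOW) or self.quick > 0

    def ack_sent(self):
        # Sending an ACK clears the one-time flag and uses up a quick ACK.
        self.pending = 0
        if self.quick > 0:
            self.quick -= 1
```

In this model, `mark_ack_now()` leaves `pingpong` unchanged and stops forcing ACKs after a single `ack_sent()`, whereas `enter_quickack_mode()` clears `pingpong`.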
      Signed-off-by: David S. Miller <davem@davemloft.net>
      217e502b
    • Yuchung Cheng's avatar
      tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag · fd2123a3
      Yuchung Cheng authored
      Previously, commit 9aee4000 ("tcp: ack immediately when a cwr
      packet arrives") called tcp_enter_quickack_mode to force sending
      two immediate ACKs upon receiving a packet with the CWR flag. As a
      side effect, it also reset the delayed ACK timer and interactive
      session tracking. This patch removes that side effect by using the
      new ACK_NOW flag to force an immediate ACK.
      
      Packetdrill to demonstrate:
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
        +.1 < [ect0] . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 < [ect0] . 1:1001(1000) ack 1 win 257
         +0 > [ect01] . 1:1(0) ack 1001
      
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 1:2(1) ack 1001
      
         +0 < [ect0] . 1001:2001(1000) ack 2 win 257
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 2:3(1) ack 2001
      
         +0 < [ect0] . 2001:3001(1000) ack 3 win 257
         +0 < [ect0] . 3001:4001(1000) ack 3 win 257
         // Ack delayed ...
      
         +.01 < [ce] P. 4001:4501(500) ack 3 win 257
         +0 > [ect01] . 3:3(0) ack 4001
         +0 > [ect01] E. 3:3(0) ack 4501
      
      +.001 read(4, ..., 4500) = 4500
         +0 write(4, ..., 1) = 1
         +0 > [ect01] PE. 3:4(1) ack 4501 win 100
      
       +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
         // No delayed ACK on CWR flag
         +0 > [ect01] . 4:4(0) ack 5501
      
       +.31 < [ect0] . 5501:6501(1000) ack 4 win 257
         +0 > [ect01] . 4:4(0) ack 6501
      
      Fixes: 9aee4000 ("tcp: ack immediately when a cwr packet arrives")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd2123a3
    • Yuchung Cheng's avatar
      tcp: always ACK immediately on hole repairs · 15bdd568
      Yuchung Cheng authored
      RFC 5681 sec 4.2:
        To provide feedback to senders recovering from losses, the receiver
        SHOULD send an immediate ACK when it receives a data segment that
        fills in all or part of a gap in the sequence space.
      
      When a gap is partially filled, __tcp_ack_snd_check already checks
      the out-of-order queue and correctly sends an immediate ACK. However,
      when a gap is fully filled, the previous implementation only resets
      pingpong mode, which does not guarantee an immediate ACK because the
      quick ACK counter may be zero. This patch addresses this issue by
      marking the one-time immediate ACK flag instead.
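
The RFC 5681 rule quoted above can be illustrated with a toy receiver that tracks the next expected sequence number and an out-of-order queue. This is a simplified sketch, not the kernel logic. The class `Receiver` and its methods are hypothetical, and the model assumes no duplicate or overlapping segments.

```python
class Receiver:
    """Toy in-order/out-of-order reassembly model."""

    def __init__(self, rcv_nxt=0):
        self.rcv_nxt = rcv_nxt  # next expected sequence number
        self.ofo = []           # sorted (start, end) ranges received out of order

    def receive(self, start, end):
        """Accept segment [start, end); return True if an immediate ACK is warranted."""
        if start > self.rcv_nxt:
            # Segment arrives beyond a hole: queue it and ACK immediately.
            self.ofo.append((start, end))
            self.ofo.sort()
            return True

        had_gap = bool(self.ofo)
        self.rcv_nxt = max(self.rcv_nxt, end)
        # Drain queued ranges that are now contiguous with rcv_nxt.
        while self.ofo and self.ofo[0][0] <= self.rcv_nxt:
            _, queued_end = self.ofo.pop(0)
            self.rcv_nxt = max(self.rcv_nxt, queued_end)
        # In-order data that arrives while a gap exists fills all or part
        # of that gap, so it warrants an immediate ACK in both cases.
        return had_gap
```

Note that when the gap is fully filled the out-of-order queue is empty after the segment is processed, so a check made only against the queue at that point would miss the case. The model records `had_gap` before draining for that reason.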
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15bdd568