1. 31 Oct, 2019 34 commits
    • mlxsw: spectrum: Use port_module_max_width to compute base port index · 013da297
      Jiri Pirko authored
      Instead of using constant value, use port_module_max_width which is
      aligned with the cluster size.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Remember split base local port and use it in unsplit · 49185277
      Jiri Pirko authored
      Don't compute the original base local port during unsplit, rather
      remember it in mlxsw_sp_port structure during split port creation.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Introduce resource for getting offset of 4 lanes split port · 038784a9
      Jiri Pirko authored
      In Spectrum-3 the modules have 8 lanes, so split by count 2 results in
      two split ports each of 4 lanes. Add a resource that can be used to
      obtain local port offset in that case.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Push getting offsets of split ports into a helper · d0846ce9
      Jiri Pirko authored
      Get local port offsets of split ports in a separate helper function and
      use it in both the split and unsplit functions.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Add sanity checks into module info get · c8fc10dc
      Jiri Pirko authored
      Driver assumes certain values in the PMLP register. Add checks that
      verify that PMLP register provides fitting values.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Pass mapping values in port mapping structure · 35896d96
      Jiri Pirko authored
      Pass the port mapping structure down to create, module_map and other
      functions instead of individual values.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Use mapping of port being split for creating split ports · 7b39fa5b
      Jiri Pirko authored
      Instead of the constant max width value, use the actual width of the
      port. Also, don't pass the module value; use the value stored in the
      same structure.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Replace port_to_module array with array of structs · 4a7f970f
      Jiri Pirko authored
      Store the initial PMLP register configuration into array of structures
      instead of just simple array of module numbers.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Distinguish between unsplittable and split port · 26a6befa
      Jiri Pirko authored
      Currently, when a user requests a split, they cannot tell whether the
      port cannot be split because it is already split or because it cannot
      be split at all. Add a separate check of the split flag to distinguish
      the two cases. Also add a check forbidding a split when the maximal
      width is 1.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Move max_width check up before count check · 2e6a2d7b
      Jiri Pirko authored
      The fact that the port cannot be split further should be checked before
      checking the count, so move it.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
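
      The two split-check commits above (26a6befa and 2e6a2d7b) describe an
      ordering of validation steps: whether the port can be split at all is
      decided before the requested count is examined. A minimal Python model
      of that ordering might look as follows. It is not driver code: the
      function name, the returned strings, the relative order of the first
      two checks, and the rule for a valid count are all assumptions made for
      illustration.

```python
def check_split(max_width, already_split, count):
    """Return None if the split may proceed, else a short reason string.

    Hypothetical model only: names, messages and the count rule are
    assumptions, not taken from the mlxsw driver.
    """
    # A module whose maximal width is 1 can never be split.
    if max_width == 1:
        return "port cannot be split"
    # Report an already-split port separately from an unsplittable one.
    if already_split:
        return "port is already split"
    # Only now look at the requested count (assumed rule: it must be at
    # least 2, no larger than max_width, and divide max_width evenly).
    if count < 2 or count > max_width or max_width % count:
        return "invalid split count"
    return None
```

      With this ordering, a request with a bad count on an already split
      port reports the split state rather than the count.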
    • mlxsw: spectrum: Use PMTM register to get max module width · 25911e1b
      Jiri Pirko authored
      Currently the max module width is hard-coded according to ASIC type.
      That is not entirely correct, as the max module width might differ
      per-board. Use PMTM register to query FW for maximal width of a module.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Port Module Type Mapping Register · a513b1a5
      Jiri Pirko authored
      The PMTM allows query or configuration of module types.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Extend PMLP tx/rx lane value size to 4 bits · 94e76837
      Jiri Pirko authored
      The tx/rx lane fields got extended to 4 bits, update the reg field
      description accordingly.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4/l2t: Simplify 't4_l2e_free()' and '_t4_l2e_free()' · d74361dc
      Christophe JAILLET authored
      Use '__skb_queue_purge()' instead of re-implementing it.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Control-action-percpu-counters-allocation-by-netlink-flag' · d86784fe
      David S. Miller authored
      Vlad Buslov says:
      
      ====================
      Control action percpu counters allocation by netlink flag
      
      Currently, significant fraction of CPU time during TC filter allocation
      is spent in percpu allocator. Moreover, percpu allocator is protected
      with single global mutex which negates any potential to improve its
      performance by means of recent developments in TC filter update API that
      removed rtnl lock for some Qdiscs and classifiers. In order to
      significantly improve filter update rate and reduce memory usage we
      would like to allow users to skip percpu counters allocation for
      specific action if they don't expect high traffic rate hitting the
      action, which is a reasonable expectation for hardware-offloaded setup.
      In that case any potential gains to software fast-path performance
      gained by usage of percpu-allocated counters compared to regular integer
      counters protected by spinlock are not important, but amount of
      additional CPU and memory consumed by them is significant.
      
      In order to allow configuring action counters allocation type at
      runtime, implement following changes:
      
      - Implement helper functions to update the action counters and use them
        in affected actions instead of updating counters directly. This step
        abstracts actions implementation from counter types that are being
        used for particular action instance at runtime.
      
      - Modify the new helpers to use percpu counters if they were allocated
        during action initialization and use regular counters otherwise.
      
      - Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add
        TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update
        hardware-offloaded actions to not allocate percpu counters when the
        flag is set.
      
      With these changes, users that prefer action update slow-path speed over
      software fast-path speed can dynamically request actions to skip percpu
      counters allocation without affecting other users.
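
      The helper behaviour described in the list above (use percpu counters
      when they were allocated, otherwise fall back to a regular counter
      under a lock) can be modelled in user space. The sketch below is a
      Python illustration, not kernel code; the class and method names are
      hypothetical, and a threading.Lock stands in for the action spinlock.

```python
import threading


class ActionCounters:
    """User-space model of an action's byte/packet counters.

    percpu=True keeps one slot per CPU and updates it without a lock;
    percpu=False keeps a single counter pair guarded by a lock. All names
    here are hypothetical and only illustrate the fallback logic.
    """

    def __init__(self, ncpus, percpu=True):
        # Per-CPU slots cost memory proportional to the CPU count.
        self.slots = [[0, 0] for _ in range(ncpus)] if percpu else None
        self.plain = [0, 0]  # [bytes, packets]
        self.lock = threading.Lock()

    def update(self, cpu, nbytes, npackets=1):
        if self.slots is not None:
            # Fast path: each CPU touches only its own slot.
            slot = self.slots[cpu]
            slot[0] += nbytes
            slot[1] += npackets
        else:
            # Fallback: shared counters, serialized by the lock.
            with self.lock:
                self.plain[0] += nbytes
                self.plain[1] += npackets

    def totals(self):
        if self.slots is not None:
            return (sum(s[0] for s in self.slots),
                    sum(s[1] for s in self.slots))
        return tuple(self.plain)
```

      Callers use update() the same way in both modes, which is the point of
      routing all counter updates through helpers.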
      
      Now, let's look at actual performance gains provided by this change.
      Simple test is used to measure insertion rate - iproute2 TC is executed
      in parallel by xargs in batch mode, its total execution time is measured
      by shell builtin "time" command. The command runs 20 concurrent tc
      instances, each with its own batch file with 100k rules:
      
      $ time ls add* | xargs -n 1 -P 20 sudo tc -b
      
      Two main rule profiles are tested. First is simple L2 flower classifier
      with single gact drop action. The configuration is chosen as worst case
      scenario because with single-action rules pressure on percpu allocator
      is minimized. Example rule:
      
      filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw
          src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop
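
      The batch files consumed by "tc -b" contain one such rule per line. A
      small generator for files of this shape could look like the sketch
      below. How the original test varied its rules is not stated, so the
      per-rule handle and source MAC used here are assumptions, as is the
      function name.

```python
def gact_drop_rules(dev, count, start=0):
    """Return `count` tc batch lines shaped like the example rule above.

    Assumption: rules differ in their handle and in the last four octets
    of the source MAC; the source text does not say how rules were varied.
    """
    lines = []
    for i in range(start, start + count):
        # Spread the rule index over the last four MAC octets.
        octets = [(i >> shift) & 0xFF for shift in (24, 16, 8, 0)]
        src_mac = "e4:11:" + ":".join(f"{o:x}" for o in octets)
        lines.append(
            f"filter add dev {dev} protocol ip ingress prio 1 "
            f"handle {i + 1} flower skip_hw src_mac {src_mac} "
            f"dst_mac e4:12:0:0:0:0 action drop"
        )
    return lines
```

      Writing each instance's list to its own file whose name starts with
      "add" would match the "ls add*" glob in the timing command above.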
      
      Second profile is typical real-world scenario that uses flower
      classifier with some L2-4 fields and two actions (tunnel_key+mirred).
      Example rule:
      
      filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
          skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip
          192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port
          1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port
          4789 action mirred egress redirect dev vxlan1
      
       Profile           |        percpu |     no_percpu | X improvement
                         | (k rules/sec) | (k rules/sec) |
      -------------------+---------------+---------------+---------------
       Gact drop         |           203 |           259 |          1.28
       tunnel_key+mirred |            92 |           204 |          2.22
      
      For simple drop action removing percpu allocation leads to ~28%
      insertion rate improvement (259 vs. 203 k rules/sec). Perf profiles
      highlight the bottlenecks.
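
      The improvement column of the table can be rechecked from the two rate
      columns. The snippet below is only arithmetic on the numbers quoted in
      the table; the variable and function names are illustrative.

```python
# Insertion rates from the table above, in thousands of rules per second:
# (with percpu counters, without percpu counters).
rates = {
    "gact drop": (203, 259),
    "tunnel_key+mirred": (92, 204),
}


def improvement(percpu, no_percpu):
    """Return (ratio, percent gain) of the no_percpu rate over percpu."""
    ratio = no_percpu / percpu
    return round(ratio, 2), round((ratio - 1) * 100, 1)


results = {name: improvement(*pair) for name, pair in rates.items()}
```

      The ratios come out at 1.28 and 2.22, matching the table, which
      corresponds to gains of roughly 28% and 122%.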
      
      Perf profile of run with percpu allocation (gact drop):
      
      + 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64
      + 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64
      + 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg
      + 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg
      + 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
      + 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg
      + 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg
      + 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast
      + 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb
      + 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
      + 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter
      + 75.73% 0.65% tc [cls_flower] [k] fl_change
      + 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate
      + 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init
      + 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1
      + 70.41% 0.04% tc [act_gact] [k] tcf_gact_init
      + 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
      + 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create
      - 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc
        - 49.05% pcpu_alloc
          + 39.35% __mutex_lock.isra.0
            4.99% memset_erms
          + 2.16% pcpu_alloc_area
        + 2.17% __libc_sendmsg
      + 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock
      + 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
      + 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert
      + 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify
      + 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner
      + 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms
      + 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node
      
      Here bottleneck is clearly in pcpu_alloc() function that takes more than
      half CPU time, which is mostly wasted busy-waiting for internal percpu
      allocator global lock.
      
      With percpu allocation removed (gact drop):
      
      + 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64
      + 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64
      + 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg
      + 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg
      + 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg
      + 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
      + 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg
      + 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast
      + 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
      + 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
      + 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
      + 75.13% 0.84% tc [cls_flower] [k] fl_change
      + 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate
      + 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init
      + 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1
      - 68.80% 0.10% tc [act_gact] [k] tcf_gact_init
        - 68.70% tcf_gact_init
          + 36.08% tcf_idr_check_alloc
          + 31.88% tcf_idr_insert
      + 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
      + 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock
      + 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
      + 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert
      
      The gact actions (like all other actions types) are inserted in single
      idr instance protected by global (per namespace) lock that becomes new
      bottleneck with such simple rule profile and prevents achieving 2x+
      performance increase that can be expected by looking at profiling data
      for insertion action with percpu counter.
      
      Perf profile of run with percpu allocation (tunnel_key+mirred):
      
      + 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64
      + 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64
      + 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg
      + 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg
      + 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg
      + 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg
      + 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg
      + 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast
      + 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
      + 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
      + 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
      + 83.41% 0.33% tc [cls_flower] [k] fl_change
      + 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate
      + 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init
      + 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1
      - 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc
        - 71.42% pcpu_alloc
          + 61.41% __mutex_lock.isra.0
            5.02% memset_erms
          + 2.93% pcpu_alloc_area
        + 2.16% __libc_sendmsg
      + 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create
      + 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
      + 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock
      + 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init
      + 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init
      + 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init
      + 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms
      
      With two times more actions pressure on percpu allocator doubles, so now
      it takes ~74% of CPU execution time.
      
      With percpu allocation removed (tunnel_key+mirred):
      
      + 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64
      + 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64
      + 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg
      + 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg
      + 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
      + 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
      + 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg
      + 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast
      + 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
      + 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
      + 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
      + 73.99% 0.63% tc [cls_flower] [k] fl_change
      + 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate
      + 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init
      + 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1
      + 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
      + 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock
      - 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init
        - 35.81% tunnel_key_init
          + 15.95% tcf_idr_check_alloc
          + 13.91% tcf_idr_insert
          - 4.70% dst_cache_init
            + 4.68% pcpu_alloc
      + 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
      + 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init
      + 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert
      + 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32
      + 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free
      + 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner
      + 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify
      
      With percpu allocation removed insertion rate is increased by ~120%.
      Such rule profile scales much better than simple single action because
      both types of actions were competing for single lock in percpu
      allocator, but not for action idr lock, which is per-action. Note that
      percpu allocator is still used by dst_cache in tunnel_key actions and
      consumes 4.68% CPU time. Dst_cache seems like good opportunity for
      further insertion rate optimization but is not addressed by this change.
      
      Another improvement provided by this change is significantly reduced
      memory usage. The test is implemented by sampling "used memory" value
      from "vmstat -s" command output. Following table includes memory usage
      measurements for same two configurations that were used for measuring
      insertion rate:
      
       Profile           | Mem per rule | Mem per rule no_percpu | Less memory used
                         |         (KB) |                   (KB) |             (KB)
      -------------------+--------------+------------------------+------------------
       Gact drop         |         3.91 |                   2.51 |              1.4
       tunnel_key+mirred |         6.73 |                   3.91 |              2.8
      
      Results indicate that memory usage of percpu allocator per action is
      ~1.4 KB. Note that any measurement of percpu allocator memory usage is
      inherently tied to particular setup since memory usage is linear to
      number of cores in system. It is to be expected that on current top of
      the line servers percpu allocator memory usage will be 2-5x more than on
      24 CPUs setup that was used for testing.
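
      The per-action figure and the scaling remark can be reproduced from
      the table. The sketch below assumes strict proportionality to the CPU
      count (the text only says usage is linear in the number of cores) and
      uses illustrative names throughout.

```python
# Per-rule memory from the table above (KB), measured on the 24-CPU setup:
# (with percpu, without percpu, number of actions in the rule).
MEASURED_CPUS = 24
per_rule_kb = {
    "gact drop": (3.91, 2.51, 1),
    "tunnel_key+mirred": (6.73, 3.91, 2),
}


def saved_per_action_kb(percpu, no_percpu, actions):
    """Memory saved per action when percpu counters are skipped."""
    return (percpu - no_percpu) / actions


def scaled_kb(kb_at_measured, cpus):
    """Extrapolate a percpu cost to another CPU count.

    Assumption: cost is proportional to the CPU count, with no fixed part.
    """
    return kb_at_measured * cpus / MEASURED_CPUS
```

      Both profiles give about 1.4 KB per action; under the proportionality
      assumption a 48-CPU machine would spend about 2.8 KB per action.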
      
      Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory
      
      Patches applied on top of net-next branch:
      
      commit 2203cbf2 (net-next)
      Author: Russell King <rmk+kernel@armlinux.org.uk>
      Date: Tue Oct 15 11:38:39 2019 +0100
      
      net: sfp: move fwnode parsing into sfp-bus layer
      
      Changes V1 -> V2:
      
      - Include memory measurements.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tc-testing: implement tests for new fast_init action flag · 9ae6b787
      Vlad Buslov authored
      Add basic tests to verify action creation with new fast_init flag for all
      actions that support the flag.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: update action implementations to support flags · e3822678
      Vlad Buslov authored
      Extend struct tc_action with new "tcfa_flags" field. Set the field in
      tcf_idr_create() function and provide new helper
      tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
      value. Update individual hardware-offloaded actions init() to pass their
      "flags" argument to new helper in order to skip percpu stats allocation
      when user requested it through flags.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: extend TCA_ACT space with TCA_ACT_FLAGS · abbb0d33
      Vlad Buslov authored
      Extend TCA_ACT space with nla_bitfield32 flags. Add
      TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
      tcf_action_init_1() and pass resulting value as additional argument to
      a_o->init().
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: modify stats helper functions to support regular stats · 5e174d5e
      Vlad Buslov authored
      Modify stats update helper functions introduced in previous patches in this
      series to fall back to regular tc_action->tcfa_{b|q}stats if cpu stats are
      not allocated for the action argument. If regular non-percpu allocated
      counters are in use, then obtain action tcfa_lock while modifying them.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: don't expose action qstats to skb_tc_reinsert() · ef816f3c
      Vlad Buslov authored
      Previous commit introduced helper function for updating qstats and
      refactored set of actions to use the helpers, instead of modifying qstats
      directly. However, one of the affected actions exposes its qstats to
      skb_tc_reinsert(), which then modifies it.
      
      Refactor skb_tc_reinsert() to return integer error code and don't increment
      overlimit qstats in case of error, and use the returned error code in
      tcf_mirred_act() to manually increment the overlimit counter with new
      helper function.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: extract qstats update code into functions · 26b537a8
      Vlad Buslov authored
      Extract common code that increments cpu_qstats counters into standalone act
      API functions. Change hardware offloaded actions that use percpu counter
      allocation to use the new functions instead of accessing cpu_qstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: extract bstats update code into function · 5e1ad95b
      Vlad Buslov authored
      Extract common code that increments cpu_bstats counter into standalone act
      API function. Change hardware offloaded actions that use percpu counter
      allocation to use the new function instead of incrementing cpu_bstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: extract common action counters update code into function · c8ecebd0
      Vlad Buslov authored
      Currently, all implementations of tc_action_ops->stats_update() callback
      have almost exactly the same implementation of counters update
      code (besides gact which also updates drop counter). In order to simplify
      support for using both percpu-allocated and regular action counters
      depending on run-time flag in following patches, extract action counters
      update code into standalone function in act API.
      
      This commit doesn't change functionality.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qrtr: Simplify 'qrtr_tun_release()' · 21d8bd12
      Christophe JAILLET authored
      Use 'skb_queue_purge()' instead of re-implementing it.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · dba7bf03
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      1GbE Intel Wired LAN Driver Updates 2019-10-29
      
      This series contains updates to e1000e, igb, ixgbe and i40e drivers.
      
      Sasha adds support for Intel client platforms Comet Lake and Tiger Lake
      to the e1000e driver.  Also fixes a recently introduced compiler warning
      that appears when CONFIG_PM_SLEEP is not defined, by wrapping the code
      that requires this kernel configuration option.
      
      Alex fixes a potential race condition between network configuration and
      power management for e1000e, which is similar to a past issue in the igb
      driver.  Also provided a bit of code cleanup since the driver no longer
      checks for __E1000_DOWN.
      
      Josh Hunt adds UDP segmentation offload support for igb, ixgbe and i40e.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wimax: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops · 84e93d99
      zhong jiang authored
      It is clearer to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file
      operations rather than DEFINE_SIMPLE_ATTRIBUTE.
      
      It is detected with the help of coccinelle.
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: add ethtool pause configuration support · a2a1a13b
      Heiner Kallweit authored
      This patch adds glue logic to make pause settings per port
      configurable via ethtool.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: drop "vxlan" parameter in vxlan_fdb_alloc() · 1d7a5526
      Guillaume Nault authored
      This parameter has never been used.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell: add downshift support for 88E1145 · a319fb52
      Heiner Kallweit authored
      Add downshift support for 88E1145; it uses the same downshift
      configuration registers as 88E1111.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ICMP-flow-improvements' · 29f52875
      David S. Miller authored
      Matteo Croce says:
      
      ====================
      ICMP flow improvements
      
      This series improves the flow inspector handling of ICMP packets:
      The first two patches just add some comments in the code which would have saved
      me a few minutes of time, and refactor a piece of code.
      The third one adds to the flow inspector the capability to extract the
      Identifier field, if present, so echo requests and replies are classified
      as part of the same flow.
      The fourth patch uses the function introduced earlier in the bonding driver,
      so echo replies can be balanced across bonding slaves.
      
      v1 -> v2:
       - remove unused struct members
       - add a helper to check for the Id field
       - use a local flow_dissector_key in the bonding to avoid
         changing behaviour of the flow dissector
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: balance ICMP echoes in layer3+4 mode · 58deb77c
      Matteo Croce authored
      The bonding uses the L4 ports to balance flows between slaves. As the ICMP
      protocol has no ports, those packets are sent all to the same device:
      
          # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
          # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
          # ping -qc1 192.168.0.2
          1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
          1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
          # ping -qc1 192.168.0.2
          1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
          1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
          # ping -qc1 192.168.0.2
          1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
          1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64
      
      But some ICMP packets have an Identifier field which is used to match
      packets within sessions. Let's use this value in the hash function to
      balance these packets between bond slaves:
      
          # ping -qc1 192.168.0.2
          0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
          0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
          # ping -qc1 192.168.0.2
          1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
          1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64
      
      Also, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
      so we can balance pings encapsulated in a tunnel when using mode encap3+4:
      
          # ping -q 192.168.1.2 -c1
          0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
          0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
          # ping -q 192.168.1.2 -c1
          1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
          1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64
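      
      A minimal user-space sketch of the idea, not the bonding driver's actual
      code: the Identifier is folded into the flow hash in the place where L4
      ports would normally go. The struct, the function names and the mixing
      function (the MurmurHash3 32-bit finalizer) are assumptions made for
      illustration only.
      
      ```c
      #include <assert.h>
      #include <stdint.h>
      
      /* Hypothetical flow key; field names are illustrative, not the kernel's. */
      struct flow_key {
      	uint32_t src_ip;
      	uint32_t dst_ip;
      	uint16_t icmp_id;	/* Identifier of an echo request/reply, 0 if absent */
      };
      
      /* 32-bit finalizer from MurmurHash3, used here only as a generic mixer. */
      static uint32_t mix32(uint32_t h)
      {
      	h ^= h >> 16;
      	h *= 0x85ebca6bu;
      	h ^= h >> 13;
      	h *= 0xc2b2ae35u;
      	h ^= h >> 16;
      	return h;
      }
      
      /*
       * Pick a slave index in [0, n_slaves). XOR is symmetric, so swapping the
       * source and destination addresses yields the same index.
       */
      static unsigned int pick_slave(const struct flow_key *k, unsigned int n_slaves)
      {
      	assert(n_slaves > 0);
      
      	uint32_t h = k->src_ip ^ k->dst_ip;
      
      	h ^= k->icmp_id;	/* takes the place of the L4 ports */
      	return mix32(h) % n_slaves;
      }
      ```
      
      Without the Identifier, every ping between the same two hosts hashes to
      the same value. With it, separate ping sessions can land on different
      slaves.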
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58deb77c
    • Matteo Croce's avatar
      flow_dissector: extract more ICMP information · 5dec597e
      Matteo Croce authored
      The ICMP flow dissector currently parses only the Type and Code fields.
      Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
      is used to correlate packets.
      Add such field in flow_dissector_key_icmp and replace skb_flow_get_be16()
      with a more complex function which populates this field.
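      
      As a rough illustration, assuming plain ICMPv4 and a raw byte buffer
      rather than an skb: the Identifier sits in bytes 4-5 of the 8-byte ICMP
      header, in network byte order, and is only meaningful for some message
      types. All names below are hypothetical and are not the kernel's helpers.
      
      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>
      
      /* ICMPv4 message types that carry an Identifier (values from RFC 792). */
      enum {
      	SKETCH_ICMP_ECHOREPLY      = 0,
      	SKETCH_ICMP_ECHO           = 8,
      	SKETCH_ICMP_TIMESTAMP      = 13,
      	SKETCH_ICMP_TIMESTAMPREPLY = 14,
      };
      
      /* Hypothetical key holding the three fields of interest. */
      struct icmp_key {
      	uint8_t type;
      	uint8_t code;
      	uint16_t id;	/* host byte order, 0 when the type has no Identifier */
      };
      
      static bool icmp_type_has_id(uint8_t type)
      {
      	switch (type) {
      	case SKETCH_ICMP_ECHO:
      	case SKETCH_ICMP_ECHOREPLY:
      	case SKETCH_ICMP_TIMESTAMP:
      	case SKETCH_ICMP_TIMESTAMPREPLY:
      		return true;
      	default:
      		return false;
      	}
      }
      
      /* Fill *key from an ICMPv4 header; returns false if the buffer is too short. */
      static bool icmp_parse(const uint8_t *buf, size_t len, struct icmp_key *key)
      {
      	assert(buf && key);
      
      	if (len < 8)	/* type, code, checksum, rest-of-header */
      		return false;
      
      	key->type = buf[0];
      	key->code = buf[1];
      	key->id = icmp_type_has_id(buf[0]) ?
      		  (uint16_t)((buf[4] << 8) | buf[5]) : 0;
      	return true;
      }
      ```
      
      For types without an Identifier the field is left at zero, so such
      packets are still keyed on Type and Code only.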
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5dec597e
    • Matteo Croce's avatar
      flow_dissector: skip the ICMP dissector for non ICMP packets · 3b336d6f
      Matteo Croce authored
      FLOW_DISSECTOR_KEY_ICMP is checked for every packet, not only ICMP ones.
      Even if the test overhead is probably negligible, move the
      ICMP dissector code under the big 'switch(ip_proto)' so it gets called
      only for ICMP packets.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b336d6f
    • Matteo Croce's avatar
      flow_dissector: add meaningful comments · 98298e6c
      Matteo Croce authored
      Documents two pieces of code which can't be understood at a glance.
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98298e6c
  2. 30 Oct, 2019 6 commits
    • Roman Mashak's avatar
      tc-testing: fixed two failing pedit tests · c4917bfc
      Roman Mashak authored
      Two pedit tests were failing due to an incorrect operation value in
      matchPattern: it should be 'add', not 'val'. Fix it.
      Signed-off-by: Roman Mashak <mrv@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4917bfc
    • Jon Maloy's avatar
      tipc: add smart nagle feature · c0bceb97
      Jon Maloy authored
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue are transmitted when one of
      the following events or conditions occurs.
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
      
      This new feature gives a 50-100% improvement of throughput for small
      (i.e., less than MTU size) messages, while it might add up to one RTT
      to latency time when the socket is in nagle mode.
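      
      The rules above can be read as a small state machine. The sketch below
      is one simplified interpretation in user-space C, not the TIPC
      implementation: all names are hypothetical, the queue is reduced to a
      byte counter, and the SOCK_WAKEUP and shutdown triggers are left out.
      It also assumes that a message too large to bundle flushes the queue
      before being sent, which the description above does not spell out.
      
      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>
      
      #define NAGLE_START_THRESHOLD	4		/* "more than 4 messages" */
      #define NAGLE_MAX_QUEUED	(64 * 1024)	/* "more than 64 1k blocks" */
      
      struct nagle_state {
      	size_t maxnagle;	/* 0 disables bundling */
      	unsigned int unanswered;/* sent since the last data message from the peer */
      	bool nagle_mode;
      	bool ack_pending;	/* a message with 'ack_required' is outstanding */
      	size_t queued;		/* bytes held back in the write queue */
      };
      
      enum tx_action { TX_SEND_NOW, TX_QUEUED };
      
      /* Release the queue; its last buffer carries 'ack_required'. Returns bytes sent. */
      static size_t nagle_flush(struct nagle_state *s)
      {
      	size_t n = s->queued;
      
      	s->queued = 0;
      	if (n > 0)
      		s->ack_pending = true;
      	return n;
      }
      
      static enum tx_action nagle_send(struct nagle_state *s, size_t len)
      {
      	/* Invariant of this sketch: nothing is queued unless an ack is pending. */
      	assert(s->ack_pending || s->queued == 0);
      
      	if (s->maxnagle && s->unanswered > NAGLE_START_THRESHOLD)
      		s->nagle_mode = true;
      	s->unanswered++;
      
      	if (!s->nagle_mode || len >= s->maxnagle) {
      		nagle_flush(s);		/* assumption: keep ordering, then send */
      		return TX_SEND_NOW;
      	}
      	if (!s->ack_pending) {
      		/* No CONN_ACK to wait for: send directly with 'ack_required'. */
      		s->ack_pending = true;
      		return TX_SEND_NOW;
      	}
      	s->queued += len;
      	if (s->queued > NAGLE_MAX_QUEUED) {
      		nagle_flush(s);
      		return TX_SEND_NOW;
      	}
      	return TX_QUEUED;
      }
      
      /* CONN_ACK received: the outstanding ack is satisfied, release the queue. */
      static size_t nagle_on_conn_ack(struct nagle_state *s)
      {
      	s->ack_pending = false;
      	return nagle_flush(s);
      }
      
      /* Data message received: leave nagle mode and release the queue. */
      static size_t nagle_on_rx_data(struct nagle_state *s)
      {
      	s->nagle_mode = false;
      	s->unanswered = 0;
      	return nagle_flush(s);
      }
      ```
      
      In this reading, five small messages go out unbundled, the sixth is the
      first one sent in nagle mode (directly, with 'ack_required' set), and
      later ones accumulate until a CONN_ACK or a data message arrives.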
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0bceb97
    • David S. Miller's avatar
      Merge branch 'mlxsw-Update-firmware-version' · 6c814e8c
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Update firmware version
      
      This patch set updates the firmware version for Spectrum-1 and enforces
      a firmware version for Spectrum-2.
      
      The version adds support for querying port module type. It will be used
      by a followup patch set from Jiri to make port split code more generic.
      
      Patch #1 increases the size of an existing register in order to be
      compatible with the new firmware version. In the future the firmware
      will assign default values to fields not specified by the driver.
      
      Patch #2 temporarily increases the PCI reset timeout for SN3800 systems.
      Note that in normal cases the driver will need to wait no longer than 5
      seconds for the device to become ready following reset command.
      
      Patch #3 bumps the firmware version for Spectrum-1.
      
      Patch #4 enforces a minimum firmware version for Spectrum-2.
      
      v2:
      * Added patch #2
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c814e8c
    • Ido Schimmel's avatar
      mlxsw: Enforce firmware version for Spectrum-2 · a72afb68
      Ido Schimmel authored
      In a similar fashion to Spectrum-1, enforce a specific firmware version
      for Spectrum-2 so that the driver and firmware are always in sync with
      regards to new features.
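      
      A version check of this kind could look roughly like the sketch below.
      It is a generic illustration, not the mlxsw code: the struct, the
      function names and the policy (major must match exactly, minor.subminor
      must be at least the required value) are assumptions.
      
      ```c
      #include <assert.h>
      #include <stdio.h>
      
      /* Hypothetical three-part firmware revision, e.g. 13.2000.2308. */
      struct fw_rev {
      	unsigned int major;
      	unsigned int minor;
      	unsigned int subminor;
      };
      
      enum fw_check { FW_OK, FW_TOO_OLD, FW_MAJOR_MISMATCH };
      
      /* Parse "major.minor.subminor"; returns 0 on success, -1 on malformed input. */
      static int fw_rev_parse(const char *str, struct fw_rev *rev)
      {
      	char trailing;
      
      	assert(str && rev);
      	/* Exactly three fields must convert; a fourth match means trailing junk. */
      	return sscanf(str, "%u.%u.%u%c", &rev->major, &rev->minor,
      		      &rev->subminor, &trailing) == 3 ? 0 : -1;
      }
      
      /* Assumed policy: same major, and minor.subminor not older than required. */
      static enum fw_check fw_rev_check(const struct fw_rev *running,
      				  const struct fw_rev *required)
      {
      	assert(running && required);
      
      	if (running->major != required->major)
      		return FW_MAJOR_MISMATCH;
      	if (running->minor > required->minor)
      		return FW_OK;
      	if (running->minor == required->minor &&
      	    running->subminor >= required->subminor)
      		return FW_OK;
      	return FW_TOO_OLD;
      }
      ```
      
      A driver applying such a check would typically refuse to continue, or
      attempt a firmware update, when the result is not FW_OK.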
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a72afb68
    • Ido Schimmel's avatar
      mlxsw: Bump firmware version to 13.2000.2308 · 5fd2ef46
      Ido Schimmel authored
      The version adds support for querying port module type. It will be used
      by a followup patch set from Jiri to make port split code more generic.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fd2ef46
    • Ido Schimmel's avatar
      mlxsw: pci: Increase PCI reset timeout for SN3800 systems · ff298839
      Ido Schimmel authored
      SN3800 Spectrum-2 based systems have gearboxes that need to be
      initialized by the firmware during its initialization flow. In certain
      cases, the firmware might need to flash these gearboxes, which is
      currently a time-consuming process.
      
      In newer firmware versions, the firmware will not signal to the driver
      that it is ready until the gearboxes are flashed. Increase the PCI reset
      timeout for these situations. In normal cases, the driver will need to
      wait no longer than 5 seconds.
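      
      The effect of a larger timeout can be pictured with a generic
      poll-until-ready loop. This is only a sketch with hypothetical names and
      a simulated clock; it is not the mlxsw PCI code, and the timeout values
      used with it are made up apart from the 5 second figure quoted above.
      
      ```c
      #include <assert.h>
      #include <stdbool.h>
      
      /* Callback reporting whether the device has signalled readiness. */
      typedef bool (*ready_fn)(void *ctx);
      
      /*
       * Poll until the device is ready or timeout_ms has elapsed.
       * Returns the simulated time waited in ms, or -1 on timeout.
       */
      static long wait_ready(ready_fn ready, void *ctx, long timeout_ms, long step_ms)
      {
      	long waited = 0;
      
      	assert(step_ms > 0);
      	for (;;) {
      		if (ready(ctx))
      			return waited;
      		if (waited >= timeout_ms)
      			return -1;
      		/* A real driver would sleep for step_ms here. */
      		waited += step_ms;
      	}
      }
      
      /* Stand-in device that becomes ready after a fixed number of polls. */
      struct fake_dev {
      	int polls_left;
      };
      
      static bool fake_ready(void *ctx)
      {
      	struct fake_dev *dev = ctx;
      
      	if (dev->polls_left > 0) {
      		dev->polls_left--;
      		return false;
      	}
      	return true;
      }
      ```
      
      A device that is ready after three polls at a 1000 ms step returns well
      within a 5000 ms timeout, while one that needs ten polls (for example
      because gearboxes are being flashed) only succeeds when the caller
      passes a longer timeout.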
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff298839