1. 01 Jul, 2014 8 commits
    • Merge branch 'bnx2x-next' · b6fd8b7f
      David S. Miller authored
      Yuval Mintz says:
      
      ====================
      bnx2x: Enhancement patch series
      
      This patch series introduces the ability to propagate link parameters
      to VFs, as well as to control the VF link via the hypervisor.
      
      In addition, it contains 2 small improvements [one IOV-related and the
      other improves performance on machines with short cache lines].
      
      Please consider applying these patches to `net-next'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6fd8b7f
    • bnx2x: Fail probe of VFs using an old incompatible driver · ebf457f9
      Yuval Mintz authored
      There are Linux distributions where the inbox bnx2x driver contains SRIOV
      support but doesn't contain the changes introduced in b9871bcf
      "bnx2x: VF RSS support - PF side".
      
      A VF in a VM running that distribution over a new hypervisor will access
      incorrect addresses when trying to transmit packets, causing an attention
      in the hypervisor and making that VF inactive until FLRed.
      
      The driver in the VM has to be upgraded [no real way to overcome this], but
      due to the HW attention currently arising, upgrading the driver in the VM
      alone would not suffice [since the VF also needs to be FLRed if the previous
      driver was already loaded].
      
      This patch causes the PF to fail the acquire message from a VF running an
      old, problematic driver; the VF will then gracefully fail its probe,
      preventing the HW attention [and allowing a clean upgrade of the driver
      in the VM].
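
      A minimal sketch of the idea [every name below is hypothetical, not the
      actual bnx2x VF-PF channel code]: the PF looks at what the VF advertises
      in its acquire request and NACKs drivers that predate the RSS support,
      so the VF fails probe cleanly instead of programming bad addresses later:

        #include <errno.h>
        #include <stdint.h>

        #define VF_CAP_RSS (1u << 0)            /* hypothetical capability bit */

        struct vf_acquire_req {
                uint32_t caps;                  /* capabilities advertised by the VF */
        };

        /* PF-side check: reject VFs whose driver does not advertise RSS support */
        static int pf_check_acquire(const struct vf_acquire_req *req)
        {
                if (!(req->caps & VF_CAP_RSS))
                        return -EOPNOTSUPP;     /* NACK -> VF probe fails gracefully */
                return 0;                       /* OK -> continue the acquire flow */
        }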
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebf457f9
    • bnx2x: enlarge minimal alignment of data offset · 9927b514
      Dmitry Kravkov authored
      This improves the performance of the driver on machines where L1_CACHE_SHIFT
      yields a cache line of at most 32 bytes [HW was planned for 64-byte aligned
      fastpath data].
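
      For illustration only [a generic sketch, not the driver's actual macro]:
      the point is to round the fastpath data offset up to a 64-byte boundary
      even when the kernel's cache line is only 32 bytes:

        #include <stdint.h>

        #define FP_DATA_ALIGN 64u               /* HW expects 64-byte alignment */

        /* Round an offset up to the next 64-byte boundary. */
        static inline uint32_t fp_align(uint32_t off)
        {
                return (off + FP_DATA_ALIGN - 1) & ~(FP_DATA_ALIGN - 1);
        }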
      Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9927b514
    • bnx2x: VF can report link speed · 6495d15a
      Dmitry Kravkov authored
      Until now, VFs were oblivious to the actual configured link parameters.
      This patch does 2 things:
      
        1. It enables a PF to inform its VF, via the bulletin board, of the
           configured link parameters, and allows the VF to present that
           information.

        2. It adds support for `ndo_set_vf_link_state', allowing the hypervisor
           to set the VF link state [a minimal sketch of the idea follows below].
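
      As a rough, self-contained illustration [all types and names below are
      invented for this sketch and are not the driver's actual structures]:
      the hypervisor's "set VF link state" request is recorded by the PF and
      published to the VF through its bulletin board, which the VF then reads
      to report its link:

        #include <stdint.h>

        enum vf_link_state { VF_LINK_AUTO, VF_LINK_ENABLE, VF_LINK_DISABLE };

        struct vf_bulletin {                   /* per-VF data the PF publishes  */
                uint32_t link_speed;           /* copied from PF link params    */
                enum vf_link_state link_state; /* forced state set via the PF   */
        };

        /* PF side: record the admin-requested VF link state; with AUTO the VF
         * simply mirrors the PF's physical link parameters. */
        static void pf_publish_vf_link(struct vf_bulletin *bull,
                                       enum vf_link_state state,
                                       uint32_t pf_link_speed)
        {
                bull->link_state = state;
                bull->link_speed = (state == VF_LINK_DISABLE) ? 0 : pf_link_speed;
        }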
      Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6495d15a
    • Merge branch 'pktgen' · edd79ca8
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      Optimizing pktgen for single CPU performance
      
      This series focuses on optimizing "pktgen" for single CPU performance.
      
      V2-series:
       - Removed some patches
       - Doc real reason for TX ring buffer filling up
      
      NIC tuning for pktgen:
       http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
      
      General overload setup according to:
       http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
      
      Hardware:
       System: CPU E5-2630
       NIC: Intel ixgbe/82599 chip
      
      Testing done with net-next git tree on top of
       commit 6623b419 ("Merge branch 'master' of...jkirsher/net-next")
      
      Pktgen script exercising race condition:
       https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh
      
      Tool for measuring LOCK overhead:
       https://github.com/netoptimizer/network-testing/blob/master/src/overhead_cmpxchg.c
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edd79ca8
    • pktgen: RCU-ify "if_list" to remove lock in next_to_run() · 8788370a
      Jesper Dangaard Brouer authored
      The if_lock()/if_unlock() in next_to_run() adds a significant
      overhead, because it is called for every packet in the busy loop of
      pktgen_thread_worker().  (Thomas Graf originally pointed me at this
      lock problem.)
      
      Removing these two "LOCK" operations should in theory save us approx
      16ns (8ns x 2); as illustrated below, we do save 16ns when removing
      the locks and introducing RCU protection.
      
      Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
       (single CPU performance, ixgbe 10Gbit/s, E5-2630)
       * Prev   : 5684009 pps --> 175.93ns (1/5684009*10^9)
       * RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
       * Diff   : +588195 pps --> -16.50ns
      
      To understand this RCU patch, I describe the pktgen thread model
      below.
      
      In pktgen there are several kernel threads, but only one CPU runs each
      kernel thread.  Communication with the kernel threads is done through
      some thread control flags.  This allows the threads to change data
      structures at a known synchronization point; see the main thread func
      pktgen_thread_worker().
      
      Userspace changes are communicated through proc-file writes.  There
      are three types of changes, general control changes "pgctrl"
      (func:pgctrl_write), thread changes "kpktgend_X"
      (func:pktgen_thread_write), and interface config changes "ethX@N"
      (func:pktgen_if_write).
      
      Userspace "pgctrl" and "thread" changes are synchronized via the mutex
      pktgen_thread_lock, thus only a single userspace instance can run.
      The mutex is taken while the packet generator is running, by pgctrl
      "start".  Thus e.g. "add_device" cannot be invoked when pktgen is
      running/started.
      
      All "pgctrl" and all "thread" changes, except thread "add_device",
      communicate via the thread control flags.  The main problem is the
      exception "add_device", which modifies the thread's "if_list" directly.
      
      Fortunately "add_device" cannot be invoked while pktgen is running.
      But there exists a race between "rem_device_all" and "add_device"
      (which normally doesn't occur, because "rem_device_all" waits 125ms
      before returning).  Backgrounding "rem_device_all" and running
      "add_device" immediately afterwards allows the race to occur.
      
      The race affects the thread's "if_list" (its list of devices).  The
      if_lock is used for protecting this "if_list", while other readers are
      given lock-free access to the list under RCU read sections.
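
      Roughly, next_to_run() ends up with the following shape [a simplified
      sketch based on the description above, not the exact resulting code]:

        static struct pktgen_dev *next_to_run(struct pktgen_thread *t)
        {
                struct pktgen_dev *pkt_dev, *best = NULL;

                rcu_read_lock();                /* replaces if_lock()/if_unlock() */
                list_for_each_entry_rcu(pkt_dev, &t->if_list, list) {
                        if (!pkt_dev->running)
                                continue;
                        if (!best ||
                            ktime_compare(pkt_dev->next_tx, best->next_tx) <= 0)
                                best = pkt_dev;
                }
                rcu_read_unlock();
                return best;
        }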
      
      Note, interface config changes (via proc) can occur while pktgen is
      running, which worries me a bit.  I'm assuming proc_remove() takes
      appropriate locks, to ensure no writers exist after proc_remove()
      finishes.
      
      I've been running a script exercising the race condition (leading me
      to fix the proc_remove order) without any issues.  The script also
      exercises concurrent proc writes while the interface config is being
      removed.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8788370a
    • pktgen: avoid expensive set_current_state() call in loop · baac167b
      Jesper Dangaard Brouer authored
      Avoid calling set_current_state() inside the busy loop in
      pktgen_thread_worker().  In the case of pkt_dev->delay, it is still
      used/enabled in pktgen_xmit() via the spin() call.
      
      The set_current_state(TASK_INTERRUPTIBLE) call uses an xchg, which is
      implicitly LOCK prefixed.  I've measured the asm LOCK operation to take
      approx 8ns on this E5-2630 CPU.  The performance increase correlates
      with this measurement.
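
      The pattern, in simplified form [a sketch of the idea, not the exact
      diff to pktgen_thread_worker()]: keep the LOCK-prefixed xchg hidden in
      set_current_state() out of the hot path, and only pay for the sleeping
      machinery when the thread really has nothing to transmit:

        while (!kthread_should_stop()) {
                pkt_dev = next_to_run(t);

                if (likely(pkt_dev)) {
                        pktgen_xmit(pkt_dev);   /* hot path: no task-state xchg */
                        continue;
                }
                /* idle: sleeping (and its atomics) is acceptable here */
                wait_event_interruptible_timeout(t->queue, t->control != 0,
                                                 HZ / 10);
        }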
      
      Performance data with CLONE_SKB==100000, rx-usecs=30:
       (single CPU performance, ixgbe 10Gbit/s, E5-2630)
       * Prev:  5454050 pps --> 183.35ns (1/5454050*10^9)
       * Now:   5684009 pps --> 175.93ns (1/5684009*10^9)
       * Diff:  +229959 pps -->  -7.42ns
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      baac167b
    • pktgen: document tuning for max NIC performance · 9ceb87fc
      Jesper Dangaard Brouer authored
      Using pktgen I'm seeing the ixgbe driver "push back", due to the TX ring
      running full.  Thus, the TX ring is artificially limiting pktgen.
      (Diagnose via "ethtool -S"; look for the "tx_restart_queue" or "tx_busy"
      counters.)
      
      With ixgbe, the real reason behind the TX ring running full is that the
      TX ring is not being cleaned up fast enough.  The ixgbe driver combines
      TX+RX ring cleanups, and the cleanup interval is affected by the
      ethtool --coalesce setting of the "rx-usecs" parameter.
      
      Do not increase the default NIC TX ring buffer size or the default
      cleanup interval.  Instead, simply document that pktgen needs special
      NIC tuning for maximum packets-per-second performance.
      
      Performance results with pktgen with clone_skb=100000.
      TX ring size 512 (default), adjusting "rx-usecs":
       (Single CPU performance, E5-2630, ixgbe)
       - 3935002 pps - rx-usecs:  1 (irqs:  9346)
       - 5132350 pps - rx-usecs: 10 (irqs: 99157)
       - 5375111 pps - rx-usecs: 20 (irqs: 50154)
       - 5454050 pps - rx-usecs: 30 (irqs: 33872)
       - 5496320 pps - rx-usecs: 40 (irqs: 26197)
       - 5502510 pps - rx-usecs: 50 (irqs: 21527)
      
      TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
       - 3935002 pps - tx-size:  512
       - 5354401 pps - tx-size:  768
       - 5356847 pps - tx-size: 1024
       - 5327595 pps - tx-size: 1536
       - 5356779 pps - tx-size: 2048
       - 5353438 pps - tx-size: 4096
      
      Notice that after commit 6f25cd47 ("pktgen: fix xmit test for BQL enabled
      devices") pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores the
      BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allows us to put
      more pressure on the TX ring buffers.
      
      It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
      pktgen respects this via the call to netif_xmit_frozen_or_drv_stopped(txq).
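
      Sketched, the relevant check in pktgen's transmit path looks roughly
      like this [simplified; not a verbatim quote of pktgen_xmit()]:

        struct netdev_queue *txq = netdev_get_tx_queue(odev, queue_map);

        if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
                /* driver stopped the ring (e.g. via ixgbe_maybe_stop_tx());
                 * a BQL stack pause (QUEUE_STATE_STACK_XOFF) alone does not
                 * stop pktgen */
                pkt_dev->last_ok = 0;   /* back off and retry on the next pass */
        } else {
                /* hand the skb to the driver via ndo_start_xmit() as usual */
        }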
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ceb87fc