1. 20 Feb, 2016 15 commits
    • Merge branch 'bpf-get-stackid' · 80c804bf
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf_get_stackid() and stack_trace map
      
      This patch set introduces new map type to store stack traces and
      corresponding bpf_get_stackid() helper.
      BPF programs already can walk the stack via unrolled loop
      of bpf_probe_read()s which is ok for simple analysis, but it's
      not efficient and limited to <30 frames; beyond that the programs
      don't fit into MAX_BPF_STACK. With the bpf_get_stackid() helper
      the programs can collect up to PERF_MAX_STACK_DEPTH both
      user and kernel frames.
      Using stack traces as a key in a map turned out to be very useful
      for generating flame graphs, off-cpu graphs, waker and chain graphs.
      Patch 3 is a simplified version of 'offwaketime' tool which is
      described in detail here:
      http://brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
      
      Earlier versions of this patch used the save_stack_trace() helper,
      but 'unreliable' frames add too much noise and two equivalent
      stack traces produce different 'stackid's.
      Using lockdep style of storing frames with MAX_STACK_TRACE_ENTRIES is
      great for lockdep, but not acceptable for bpf, since the stack_trace
      map needs to be freed when the user presses Ctrl-C to stop the tool.
      The ftrace style with per_cpu(struct ftrace_stack) is great, but it's
      tightly coupled with ftrace ring buffer and has the same 'unreliable'
      noise. perf_event's perf_callchain() mechanism is also very efficient
      and it only needed minor generalization which is done in patch 1
      to be used by bpf stack_trace maps.
      Peter, please take a look at patch 1.
      If you're ok with it, I'd like to take the whole set via net-next.
      
      Patch 1 - generalization of perf_callchain()
      Patch 2 - stack_trace map done as a lock-less hashtable without a linked list
        to avoid a spinlock on insertion, which is the critical path when
        the bpf_get_stackid() helper is called for every task switch event
      Patch 3 - offwaketime example
      
      After the patch, the 'perf report' for an artificial 'sched_bench'
      benchmark that does pthread_cond_wait/signal while the 'offwaketime'
      example is running in the background:
       16.35%  swapper      [kernel.vmlinux]    [k] intel_idle
        2.18%  sched_bench  [kernel.vmlinux]    [k] __switch_to
        2.18%  sched_bench  libpthread-2.12.so  [.] pthread_cond_signal@@GLIBC_2.3.2
        1.72%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_unlock
        1.53%  sched_bench  [kernel.vmlinux]    [k] bpf_get_stackid
        1.44%  sched_bench  [kernel.vmlinux]    [k] entry_SYSCALL_64
        1.39%  sched_bench  [kernel.vmlinux]    [k] __call_rcu.constprop.73
        1.13%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_lock
        1.07%  sched_bench  libpthread-2.12.so  [.] pthread_cond_wait@@GLIBC_2.3.2
        1.07%  sched_bench  [kernel.vmlinux]    [k] hash_futex
        1.05%  sched_bench  [kernel.vmlinux]    [k] do_futex
        1.05%  sched_bench  [kernel.vmlinux]    [k] get_futex_key_refs.isra.13
      
      The hottest part of bpf_get_stackid() is inlined jhash2, so we may consider
      using some faster hash in the future, but it's good enough for now.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: offwaketime example · a6ffe7b9
      Alexei Starovoitov authored
      This is a simplified version of Brendan Gregg's offwaketime:
      This program shows kernel stack traces and task names that were blocked and
      "off-CPU", along with the stack traces and task names for the threads that woke
      them, and the total elapsed time from when they blocked to when they were woken
      up. The combined stacks, task names, and total time are summarized in kernel
      context for efficiency.
      
      Example:
      $ sudo ./offwaketime | flamegraph.pl > demo.svg
      Open demo.svg in the browser as a FlameGraph visualization.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce BPF_MAP_TYPE_STACK_TRACE · d5a3b1f6
      Alexei Starovoitov authored
      add new map type to store stack traces and corresponding helper
      bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - number of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error
      
      stackid is a 32-bit integer handle that can be further combined with
      other data (including other stackid) and used as a key into maps.
      
      Userspace will access stackmap using standard lookup/delete syscall commands to
      retrieve full stack trace for given stackid.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • perf: generalize perf_callchain · 568b329a
      Alexei Starovoitov authored
      . avoid walking the stack when there is no room left in the buffer
      . generalize get_perf_callchain() to be called from bpf helper
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use skb_postpush_rcsum instead of own implementations · 6b83d28a
      Daniel Borkmann authored
      Replace individual implementations with the recently introduced
      skb_postpush_rcsum() helper.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • phy: marvell/micrel: Fix Unpossible condition · 321b4d4b
      Andrew Lunn authored
      commit 2b2427d0 ("phy: micrel: Add ethtool statistics counters")
      from Dec 30, 2015, leads to the following static checker
      warning:
      
              drivers/net/phy/micrel.c:609 kszphy_get_stat()
              warn: unsigned 'val' is never less than zero.
      
      drivers/net/phy/micrel.c
         602  static u64 kszphy_get_stat(struct phy_device *phydev, int i)
         603  {
         604          struct kszphy_hw_stat stat = kszphy_hw_stats[i];
         605          struct kszphy_priv *priv = phydev->priv;
         606          u64 val;
         607
         608          val = phy_read(phydev, stat.reg);
         609          if (val < 0) {
                          ^^^^^^^
      Unpossible!
      
         610                  val = UINT64_MAX;
         611          } else {
         612                  val = val & ((1 << stat.bits) - 1);
         613                  priv->stats[i] += val;
         614                  val = priv->stats[i];
         615          }
         616
         617          return val;
         618  }
      
      The same problem exists in the Marvell driver. Fix both.
      
      Fixes: 2b2427d0 ("phy: micrel: Add ethtool statistics counters")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reported-by: Julia.Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ethtool-perqueue-params' · 2f860177
      David S. Miller authored
      Kan Liang says:
      
      ====================
      ethtool per queue parameters support
      
      Modern network interface controllers usually support multiple receive
      and transmit queues. Each queue may have its own parameters. For
      example, Intel XL710/X710 hardware supports per queue interrupt
      moderation. However, current ethtool does not support per queue
      parameters option. User has to set parameters for the whole NIC.
      This series extends ethtool to support per queue parameters option.
      
      Since the support of per queue parameters varies with different cards,
      it is impossible to address all cards in one patch. This series only
      supports per queue coalesce options on i40e driver. The framework used
      in the patch can be easily extended to other cards and parameters.
      
      The lib bitmap needs to be extended to facilitate exchanging queue bitmaps
      between user space and kernel space. Two patches from David's latest V8
      patch series are also cited in this series. You may refer to
      https://lkml.org/lkml/2016/2/9/919 for more details.
      
      Changes since V6:
       - Rebase on commit 76d13b56. Did minor change in patch 6.
      
      Changes since V5:
       - Add test_bitmap.c and bitmap.sh in the series. They were
         mistakenly left out previously.
       - Update the first two patches to David's latest V8 version. The changes
         include
            - bitmap u32 API returns number of bits copied, unit tests updated
            - module_exit in test_bitmap
       - Also change the mode of bitmap.sh to 755 according to Ben's suggestion
      
      Changes since V4:
       - Modify set/get_per_queue_coalesce function description
       - Change the queue number to be u32
       - Correct an error of calculating coalesce backup buffer address
       - Rename queue_num to n_queues
       - Don't log error message in __i40e_get_coalesce
      
      Changes since V3:
       - Based on David's lib bitmap.
       - ETHTOOL_PERQUEUE should be handled before the containing switch
       - Make the rollback code unconditional
       - some minor changes according to Ben's feedback
      
      Changes since V2:
       - Add queue-specific settings for interrupt moderation in i40e
      
      Changes since V1:
       - Checking the sub-command number to determine whether the command
         requires CAP_NET_ADMIN
       - Refine the struct ethtool_per_queue_op and improve the comments
       - Use bitmap functions to parse queue mask
       - Improve comments
       - Add rollback support
       - Correct the way to find the vector for specific queue.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e/ethtool: support coalesce setting by queue · f3757a4d
      Kan Liang authored
      This patch implements set_per_queue_coalesce for i40e driver.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e/ethtool: support coalesce getting by queue · be280bad
      Kan Liang authored
      This patch implements get_per_queue_coalesce for i40e driver.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: queue-specific settings for interrupt moderation · a75e8005
      Kan Liang authored
      For the i40e driver, each vector has its own ITR register. However, there
      is no concept of queue-specific settings in the driver proper. Only a
      global variable is used to store ITR values. That causes problems,
      especially when resetting the vector: the specific ITR values could be
      lost.
      This patch moves rx_itr_setting and tx_itr_setting to i40e_ring to store
      the specific ITR setting for each queue.
      i40e_get_coalesce and i40e_set_coalesce are also modified accordingly to
      support queue-specific settings. To stay compatible with old ethtool,
      if the user doesn't specify the queue number, i40e_get_coalesce will return
      queue 0's value, while i40e_set_coalesce will apply the value to all queues.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Acked-by: Shannon Nelson <shannon.nelson@intel.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ethtool: support set coalesce per queue · f38d138a
      Kan Liang authored
      This patch implements sub command ETHTOOL_SCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to
      set the coalesce parameters of each masked queue in the device driver. The
      requested coalesce information for each masked queue is stored in "data",
      which is copied from userspace.
      If setting coalesce in the device driver fails, the code tries to roll
      back the values already applied to specific queues.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ethtool: support get coalesce per queue · 421797b1
      Kan Liang authored
      This patch implements sub command ETHTOOL_GCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to
      get coalesce of each masked queue from device driver. Then the interrupt
      coalescing parameters will be copied back to user space one by one.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ethtool: introduce a new ioctl for per queue setting · ac2c7ad0
      Kan Liang authored
      Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting.
      The following patches will enable some SUB_COMMANDs for per queue
      setting.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • test_bitmap: unit tests for lib/bitmap.c · 5fd003f5
      David Decotigny authored
      This is mainly testing bitmap construction and conversion to/from u32[]
      for now.
      
      Tested:
        qemu i386, x86_64, ppc, ppc64 BE and LE, ARM.
      Signed-off-by: David Decotigny <decot@googlers.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lib/bitmap.c: conversion routines to/from u32 array · e52bc7c2
      David Decotigny authored
      Aimed at transferring bitmaps to/from user-space in a 32/64-bit agnostic
      way.
      
      Tested:
        unit tests (next patch) on qemu i386, x86_64, ppc, ppc64 BE and LE,
        ARM.
      Signed-off-by: David Decotigny <decot@googlers.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 19 Feb, 2016 25 commits