1. 22 Feb, 2016 26 commits
    • David S. Miller's avatar
      Merge branch 'qed-next' · 0162a583
      David S. Miller authored
      Yuval Mintz says:
      
      ====================
      qed*: Driver updates
      
      This contains various minor changes to the driver: changes to memory allocation,
      a fix for a small theoretical bug, and some mostly semantic changes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0162a583
    • Yuval Mintz's avatar
      d4ee5289
    • Yuval Mintz's avatar
      qed: Introduce DMA_REGPAIR_LE · 94494598
      Yuval Mintz authored
      The FW HSI contains regpairs, mostly for 64-bit address representations.
      Since the same paradigm is applied each time a regpair is filled, this
      introduces a new utility macro for setting such regpairs.
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94494598
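      For illustration, a minimal sketch of such a regpair-setting macro; the struct
      layout and helper macros shown here are assumptions based on the commit title,
      not a quote of the actual patch:

          /* A regpair carries a 64-bit value as two little-endian 32-bit
           * halves; the macro fills both halves in one place.
           */
          struct regpair {
                  __le32 lo;
                  __le32 hi;
          };

          #define DMA_LO_LE(x)            cpu_to_le32(lower_32_bits(x))
          #define DMA_HI_LE(x)            cpu_to_le32(upper_32_bits(x))

          #define DMA_REGPAIR_LE(x, val)  do {                            \
                          (x).hi = DMA_HI_LE((val));                      \
                          (x).lo = DMA_LO_LE((val));                      \
                  } while (0)

          /* e.g.: DMA_REGPAIR_LE(ramrod->pbl_addr, pbl_dma_addr);
           * 'ramrod' and 'pbl_dma_addr' are illustrative names.
           */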
    • Yuval Mintz's avatar
      qed: Change metadata needed for SPQ entries · 06f56b81
      Yuval Mintz authored
      Each configuration element sent via ramrod requires a Slow Path Queue
      entry. This slightly changes the way such an entry is configured, but
      the changes are mostly semantic [more parameters are gathered
      in a sub-struct instead of being passed directly].
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06f56b81
    • Yuval Mintz's avatar
      qed: Handle possible race in SB config · 0a0c5d3b
      Yuval Mintz authored
      Due to HW design, some of the memories are wide-bus, and access to those
      needs to be serialized on a per-HW-block level; a read/write to a
      given HW-block might break another read/write to wide-bus memory done at
      roughly the same time.
      
      Status block initialization in the CAU is done into such a wide-bus memory.
      This moves the initialization to DMAE, which is guaranteed to be
      safe to use on such memories.
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a0c5d3b
    • Yuval Mintz's avatar
      qed: Turn most GFP_ATOMIC into GFP_KERNEL · 60fffb3b
      Yuval Mintz authored
      The initial driver submission used GFP_ATOMIC almost exclusively when
      allocating memory. We now remedy this, using GFP_KERNEL where
      possible.
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60fffb3b
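      The typical shape of such a conversion, sketched on a hypothetical allocation
      site (p_info is an illustrative name):

          /* Before: atomic allocation even though the caller may sleep. */
          p_info = kzalloc(sizeof(*p_info), GFP_ATOMIC);

          /* After: in process context GFP_KERNEL is preferred, since the
           * allocator may then sleep and reclaim memory instead of failing.
           */
          p_info = kzalloc(sizeof(*p_info), GFP_KERNEL);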
    • David S. Miller's avatar
      Merge branch 'ipvlan-misc' · ea5b2f44
      David S. Miller authored
      Mahesh Bandewar says:
      
      ====================
      IPvlan misc patches
      
      This is a collection of unrelated patches for the IPvlan driver.
      a. The scrub_skb() changes are added to ensure that packets hit the
      NF_HOOKS in the master's ns in L3 mode.
      b. The u16 change is a bug fix, while
      c. the third patch groups tx/rx variables into a single cacheline.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea5b2f44
    • Mahesh Bandewar's avatar
      ipvlan: misc changes · ab5b7013
      Mahesh Bandewar authored
      1. Scope correction for a few functions that are used in a single file.
      2. Adjust variables that are used in the fast path to fit into a single cacheline.
      3. Update rcv_frame() to skip the shared check for frames coming over the wire.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab5b7013
    • Mahesh Bandewar's avatar
      ipvlan: mode is u16 · e93fbc5a
      Mahesh Bandewar authored
      The mode argument was erroneously defined as u32, but it has always
      been u16. Also use the ipvlan_set_mode() helper to set the mode instead
      of assigning it directly. This should avoid future erroneous assignments /
      updates.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e93fbc5a
    • Mahesh Bandewar's avatar
      ipvlan: scrub skb before routing in L3 mode. · c3aaa06d
      Mahesh Bandewar authored
      Scrub the skb before it hits the iptables hooks to ensure packets reach
      these hooks. Set the xnet param only when the packet is crossing the
      ns boundary, so if the IPvlan slave and master belong to the same ns,
      the param will be set to false.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3aaa06d
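      A hedged sketch of the described check using the existing skb_scrub_packet()
      helper; master_dev is an illustrative name:

          /* Scrub before the skb hits the NF hooks in the master's ns;
           * xnet is true only when slave and master sit in different netns.
           */
          bool xnet = !net_eq(dev_net(skb->dev), dev_net(master_dev));

          skb_scrub_packet(skb, xnet);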
    • David S. Miller's avatar
      Merge tag 'linux-can-next-for-4.6-20160220' of... · 86310cc4
      David S. Miller authored
      Merge tag 'linux-can-next-for-4.6-20160220' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
      
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can-next 2016-02-20
      
      this is a pull request of 9 patches for net-next/master.
      
      The first 3 patches are from Damien Riegel; they add support for the
      Technologic Systems IP core to the sja1000 driver. The next 6 patches by
      Marek Vasut (including one by me) first sort the CAN drivers'
      Kconfig and Makefiles and then add support for the IFI CANFD IP core.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86310cc4
    • David S. Miller's avatar
      Merge branch 'bpf-helper-improvements' · 9c572dc4
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      This set contains various updates for eBPF, i.e. the addition of a
      generic csum helper function and other misc bits that mostly improve
      existing helpers and ease programming with eBPF on cls_bpf. For more
      details, please see individual patches.
      
      Set is rebased on top of http://patchwork.ozlabs.org/patch/584465/.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c572dc4
    • Daniel Borkmann's avatar
      bpf: don't emit mov A,A on return · 6205b9cf
      Daniel Borkmann authored
      While debugging with bpf_jit_disasm I noticed emissions of 'mov %eax,%eax',
      and found that this comes from BPF_RET | BPF_A translations from classic
      BPF. Emitting this is unnecessary, as BPF_REG_A is already mapped onto
      BPF_REG_0; therefore only emit a mov when an immediate is used as the return value.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6205b9cf
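      Conceptually, the classic-to-eBPF return translation then becomes something
      like the sketch below (insn macros from the eBPF instruction helpers; the
      surrounding converter code is assumed):

          /* The classic accumulator A is mapped onto eBPF R0, so RET A
           * needs no move at all; only RET #imm has to materialize a value.
           */
          if (BPF_RVAL(fp->code) == BPF_K)
                  *insn++ = BPF_MOV32_IMM(BPF_REG_0, fp->k);
          /* else RET A: R0 already holds A, nothing to emit */
          *insn = BPF_EXIT_INSN();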
    • Daniel Borkmann's avatar
      bpf: fix csum update in bpf_l4_csum_replace helper for udp · 2f72959a
      Daniel Borkmann authored
      When using this helper for updating UDP checksums, we need to extend
      it in order to write CSUM_MANGLED_0 for csum computations that result
      in 0 as the sum. The reason we need this is that packets with a checksum
      could otherwise become incorrectly marked as packets without a checksum.
      Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should
      not turn packets without a checksum into ones with a checksum.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f72959a
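      The UDP special case boils down to logic along these lines (a sketch; variable
      names and the surrounding checks are assumptions):

          bool is_mmzero = flags & BPF_F_MARK_MANGLED_0;
          __sum16 *ptr = (__sum16 *)(skb->data + offset);

          /* 0 here means "no UDP checksum"; leave such packets alone. */
          if (is_mmzero && !*ptr)
                  return 0;

          inet_proto_csum_replace4(ptr, skb, from, to,
                                   !!(flags & BPF_F_PSEUDO_HDR));

          /* Fold a recomputed 0 into CSUM_MANGLED_0 so the packet is not
           * later treated as checksum-less.
           */
          if (is_mmzero && !*ptr)
                  *ptr = CSUM_MANGLED_0;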
    • Daniel Borkmann's avatar
      bpf: try harder on clones when writing into skb · 3697649f
      Daniel Borkmann authored
      When we're dealing with clones and the area is not writable, try
      harder and get a copy via pskb_expand_head(). Also replace other
      occurrences in tc actions with the new skb_try_make_writable().
      Reported-by: Ashhad Sheikh <ashhadsheikh394@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3697649f
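      The new helper is conceptually a thin wrapper; a sketch, assuming the usual
      0-on-success convention:

          /* Ensure the first write_len bytes of the skb can be written.
           * If the skb is cloned and its data is not clone-writable, get
           * a private copy via pskb_expand_head().
           */
          static inline int skb_try_make_writable(struct sk_buff *skb,
                                                  unsigned int write_len)
          {
                  return skb_cloned(skb) && !skb_clone_writable(skb, write_len) ?
                         pskb_expand_head(skb, 0, 0, GFP_ATOMIC) : 0;
          }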
    • Daniel Borkmann's avatar
      bpf: remove artificial bpf_skb_{load, store}_bytes buffer limitation · 21cafc1d
      Daniel Borkmann authored
      We currently limit bpf_skb_store_bytes() and bpf_skb_load_bytes()
      helpers to only store or load a maximum buffer of 16 bytes. Thus,
      loading, rewriting and storing headers require several bpf_skb_load_bytes()
      and bpf_skb_store_bytes() calls.
      
      Also here we can use a per-cpu scratch buffer instead in order to not
      pressure stack space any further. I do suspect that this limit was mainly
      set in place for this particular reason. So, ease program development
      by removing this limitation and make the scratchpad generic, so it can
      be reused.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21cafc1d
    • Daniel Borkmann's avatar
      bpf: add generic bpf_csum_diff helper · 7d672345
      Daniel Borkmann authored
      For L4 checksums, we currently have the bpf_l4_csum_replace() helper. It's
      limited to handling 2 and 4 byte changes in a header and feeds the
      from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When
      working with IPv6, for example, this makes it rather cumbersome to deal
      with, similarly when editing larger parts of a header.
      
      Instead, extend the API in a more generic way: For bpf_l4_csum_replace(),
      add a case for header field mask of 0 to change the checksum at a given
      offset through inet_proto_csum_replace_by_diff(), and provide a helper
      bpf_csum_diff() that can generically calculate a from/to diff for arbitrary
      amounts of data.
      
      This can be used in multiple ways: for the bpf_l4_csum_replace()-only
      part, this even provides us with the option to insert precalculated diffs
      from user space, e.g. from a map, or from bpf_csum_diff() during runtime.
      
      bpf_csum_diff() has an optional from/to stack buffer input, so we can
      calculate a diff by using a scratch buffer for scenarios where we're
      inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers
      don't need to be of equal size) data. Also, bpf_csum_diff() allows
      feeding a previous csum into csum_partial(), so the function can also be
      cascaded.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d672345
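      From a cls_bpf program the two pieces compose roughly as sketched below;
      csum_off and the helper declarations are assumed to come from the program's
      scaffolding:

          __be32 old_ip[4] = { 0 };   /* original IPv6 address */
          __be32 new_ip[4] = { 0 };   /* rewritten IPv6 address */
          long diff;

          /* Build a checksum diff over the 16 changed bytes... */
          diff = bpf_csum_diff(old_ip, sizeof(old_ip), new_ip, sizeof(new_ip), 0);
          if (diff < 0)
                  return TC_ACT_SHOT;

          /* ...and apply it: a field size of 0 in the flags tells
           * bpf_l4_csum_replace() to treat 'diff' as a ready-made diff
           * rather than a from/to value pair.
           */
          bpf_l4_csum_replace(skb, csum_off, 0, diff, BPF_F_PSEUDO_HDR);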
    • Daniel Borkmann's avatar
      bpf: add new arg_type that allows for 0 sized stack buffer · 8e2fe1d9
      Daniel Borkmann authored
      Currently, when we pass a buffer from the eBPF stack into a helper
      function, the function proto indicates argument types as ARG_PTR_TO_STACK
      and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1>
      must be of the latter type. Then, verifier checks whether the buffer
      points into eBPF stack, is initialized, etc. The verifier also guarantees
      that the constant value passed in R<X+1> is greater than 0, so helper
      functions don't need to test for it and can always assume a non-NULL
      initialized buffer as well as non-0 buffer size.
      
      This patch adds a new argument type, ARG_CONST_STACK_SIZE_OR_ZERO, that
      allows passing NULL as R<X> and 0 as R<X+1> into the helper function.
      Such helper functions, of course, then need to be able to handle these cases
      internally. The verifier guarantees that either R<X> == NULL && R<X+1> == 0
      or R<X> != NULL && R<X+1> != 0 (as in the ARG_CONST_STACK_SIZE case); any
      other combination fails to load.
      
      I went through various options of extending the verifier, and introducing
      the ARG_CONST_STACK_SIZE_OR_ZERO type seems to need the fewest changes
      to the verifier.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e2fe1d9
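      In a helper's function proto this looks roughly as follows (a sketch modeled
      on bpf_csum_diff(); the exact wiring is an assumption):

          static const struct bpf_func_proto bpf_csum_diff_proto = {
                  .func           = bpf_csum_diff,
                  .gpl_only       = false,
                  .ret_type       = RET_INTEGER,
                  .arg1_type      = ARG_PTR_TO_STACK,             /* from, may be NULL */
                  .arg2_type      = ARG_CONST_STACK_SIZE_OR_ZERO, /* from_size         */
                  .arg3_type      = ARG_PTR_TO_STACK,             /* to, may be NULL   */
                  .arg4_type      = ARG_CONST_STACK_SIZE_OR_ZERO, /* to_size           */
                  .arg5_type      = ARG_ANYTHING,                 /* seed              */
          };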
    • David S. Miller's avatar
      Merge branch 'geneve-vxlan-outer-checksum' · 8b393f83
      David S. Miller authored
      Alexander Duyck says:
      
      ====================
      GENEVE/VXLAN: Enable outer Tx checksum by default
      
      This patch series makes it so that we enable the outer Tx checksum for IPv4
      tunnels by default.  This makes the behavior consistent with how we were
      handling this for IPv6.  In addition I have updated the internal flags for
      these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
      match up well with the ZERO_CSUM6_TX flag which was already in use for
      IPv6.
      
      For most network devices this should be a net gain in terms of performance
      as having the outer header checksum present allows for devices to report
      CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
      to determine if the inner header checksum is valid.
      
      Below is some data I collected with ixgbe with an X540 that demonstrates
      this.  I located two PFs connected back to back in two different name
      spaces and then set up a pair of tunnels on each, one with checksum enabled
      and one without.
      
      Recv   Send    Send                          Utilization
      Socket Socket  Message  Elapsed              Send
      Size   Size    Size     Time     Throughput  local
      bytes  bytes   bytes    secs.    10^6bits/s  % S
      
      noudpcsum:
       87380  16384  16384    30.00      8898.67   12.80
      udpcsum:
       87380  16384  16384    30.00      9088.47   5.69
      
      The one spot where this may cause a performance regression is if the
      environment contains devices that can parse the inner headers and a device
      supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
      the case of such a device we have to fall back to using GSO to segment the
      tunnel instead of TSO and as a result we may take a performance hit as seen
      below with i40e.
      
      Recv   Send    Send                          Utilization
      Socket Socket  Message  Elapsed              Send
      Size   Size    Size     Time     Throughput  local
      bytes  bytes   bytes    secs.    10^6bits/s  % S
      
      noudpcsum:
       87380  16384  16384    30.00      9085.21   3.32
      udpcsum:
       87380  16384  16384    30.00      9089.23   5.54
      
      In addition it will be necessary to update iproute2 so that we don't
      provide the checksum attribute unless specified.  This way on older kernels
      which don't have local checksum offload we will default to disabling the
      outer checksum, and on newer kernels that have LCO we can default to
      enabling it.
      
      I also haven't investigated the effect this will have on OVS.  However I
      suspect the impact should be minimal as the worst case scenario should be
      that Tx checksumming will become enabled by default which should be
      consistent with the existing behavior for IPv6.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b393f83
    • Alexander Duyck's avatar
      VXLAN: Support outer IPv4 Tx checksums by default · 6ceb31ca
      Alexander Duyck authored
      This change makes it so that if UDP CSUM is not specified we will default
      to enabling it.  The main motivation behind this is the fact that with the
      use of outer checksum we can greatly improve the performance for VXLAN
      tunnels on devices that don't know how to parse tunnel headers.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ceb31ca
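      The inversion of the default can be sketched as below; the flag name follows
      the cover letter, and the netlink parsing context is an assumption:

          /* Outer UDP checksums are now on by default for IPv4: only an
           * explicit "checksum off" attribute sets the zero-checksum flag.
           */
          if (data[IFLA_VXLAN_UDP_CSUM] &&
              !nla_get_u8(data[IFLA_VXLAN_UDP_CSUM]))
                  conf.flags |= VXLAN_F_UDP_ZERO_CSUM_TX;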
    • Alexander Duyck's avatar
      GENEVE: Support outer IPv4 Tx checksums by default · 14f1f724
      Alexander Duyck authored
      This change makes it so that if UDP CSUM is not specified we will default
      to enabling it.  The main motivation behind this is the fact that with the
      use of outer checksum we can greatly improve the performance for GENEVE
      tunnels on hardware that doesn't know how to parse them.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14f1f724
    • David S. Miller's avatar
      Merge branch 'lwt-autoload' · 417b7ca4
      David S. Miller authored
      Robert Shearman says:
      
      ====================
      lwtunnel: autoload of lwt modules
      
      Changes since v1:
       - remove "LWTUNNEL_ENCAP_" prefix for the string form of the encaps
         used when requesting the module to reduce duplication, and don't
         bother returning strings for lwt modules using netdevices, both
         suggested by Jiri.
       - update commit message of first patch to clarify security
         implications, in response to Eric's comments.
      
      The lwt implementations using net devices can autoload using the
      existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
      that lwt modules not using net devices can use.
      
      Therefore, these patches add the ability to autoload modules
      registering lwt operations for lwt implementations not using a net
      device so that users don't have to manually load the modules.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      417b7ca4
    • Robert Shearman's avatar
      ila: autoload module · 84a8cbe4
      Robert Shearman authored
      Avoid users having to manually load the module by adding a module
      alias allowing it to be autoloaded by the lwt infra.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84a8cbe4
    • Robert Shearman's avatar
      mpls: autoload lwt module · b2b04edc
      Robert Shearman authored
      Avoid users having to manually load the module by adding a module
      alias allowing it to be autoloaded by the lwt infra.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2b04edc
    • Robert Shearman's avatar
      lwtunnel: autoload of lwt modules · 745041e2
      Robert Shearman authored
      The lwt implementations using net devices can autoload using the
      existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
      that lwt modules not using net devices can use.
      
      Therefore, add the ability to autoload modules registering lwt
      operations for lwt implementations not using a net device so that
      users don't have to manually load the modules.
      
      Only users with the CAP_NET_ADMIN capability can cause modules to be
      loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
      messages for users without this capability, and by
      lwtunnel_build_state not being called in response to RTM_GETxxx
      messages.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      745041e2
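      The mechanism follows the usual kernel autoload pattern; a hedged sketch of
      both sides, in which the alias format, the lwtun_encaps array and the
      encap_type_str variable are assumptions:

          /* Core side: if no ops are registered for this encap type, ask
           * modprobe for a module advertising a matching alias, then look
           * the ops up again.
           */
          if (!rcu_access_pointer(lwtun_encaps[encap_type]))
                  request_module("rtnl-lwt-%s", encap_type_str);

          /* Module side, e.g. in the MPLS iptunnel code: advertise the
           * alias the request above will ask for.
           */
          MODULE_ALIAS("rtnl-lwt-MPLS");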
    • Zhang Shengju's avatar
      vlan: turn on unicast filtering on vlan device · e817af27
      Zhang Shengju authored
      Currently the vlan device inherits the unicast filtering flag from the
      underlying device. If the underlying device doesn't support unicast
      filtering, this will put the vlan device into promiscuous mode when it's stacked.
      
      Turn on IFF_UNICAST_FLT on the vlan device in any case so that it does
      not go into promiscuous mode needlessly. If the underlying device does not
      support unicast filtering, that device will enter promiscuous mode instead.
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e817af27
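      The fix itself amounts to one flag in the vlan device setup path; a sketch:

          /* Claim unicast filtering support on the vlan device itself, so
           * stacking on top of it never forces the vlan device into
           * promiscuous mode; if the real device below lacks the
           * capability, only that device goes promiscuous.
           */
          dev->priv_flags |= IFF_UNICAST_FLT;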
  2. 20 Feb, 2016 14 commits
    • Marek Vasut's avatar
      can: ifi: Add IFI CANFD IP support · 0c4d9c94
      Marek Vasut authored
      The patch adds support for the IFI CAN/FD controller [1]. This driver
      currently supports sending and receiving both standard CAN and new
      CAN/FD frames. Both the ISO and BOSCH variants of CAN/FD are supported.
      
      [1] http://www.ifi-pld.de/IP/CANFD/canfd.html
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: Marc Kleine-Budde <mkl@pengutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Oliver Hartkopp <socketcan@hartkopp.net>
      Cc: Wolfgang Grandegger <wg@grandegger.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      0c4d9c94
    • Marek Vasut's avatar
      can: ifi: Add DT bindings for ifi,canfd · 36840646
      Marek Vasut authored
      Add device tree bindings for the I/F/I CANFD controller IP core.
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: Marc Kleine-Budde <mkl@pengutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Oliver Hartkopp <socketcan@hartkopp.net>
      Cc: Wolfgang Grandegger <wg@grandegger.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      36840646
    • Marek Vasut's avatar
      of: Add vendor prefix for I/F/I · 5afec080
      Marek Vasut authored
      Add vendor prefix for I/F/I, Ingenieurbüro Für IC-Technologie
      http://www.ifi-pld.de/
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: Marc Kleine-Budde <mkl@pengutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Oliver Hartkopp <socketcan@hartkopp.net>
      Cc: Wolfgang Grandegger <wg@grandegger.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      5afec080
    • Marek Vasut's avatar
      can: Makefile: Sort the Makefile · 47338082
      Marek Vasut authored
      Just sort the drivers in the Makefile, no functional change.
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: Marc Kleine-Budde <mkl@pengutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Oliver Hartkopp <socketcan@hartkopp.net>
      Cc: Wolfgang Grandegger <wg@grandegger.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      47338082
    • Marc Kleine-Budde's avatar
      can: Kconfig: sort drivers alphabetically · 26821162
      Marc Kleine-Budde authored
      Sort the drivers that are directly listed in the Kconfig alphabetically, no
      functional change.
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      26821162
    • Marek Vasut's avatar
      can: Kconfig: Sort the Kconfig includes · 83407c7f
      Marek Vasut authored
      Sort the Kconfig includes, no functional change.
      Signed-off-by: Marek Vasut <marex@denx.de>
      Cc: Marc Kleine-Budde <mkl@pengutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Oliver Hartkopp <socketcan@hartkopp.net>
      Cc: Wolfgang Grandegger <wg@grandegger.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      83407c7f
    • Damien Riegel's avatar
      can: sja1000: of: add compatibility with Technologic Systems version · dfb86c0d
      Damien Riegel authored
      Technologic Systems provides an IP core compatible with the SJA1000,
      instantiated in an FPGA. Because of a bus width issue, access to
      registers is made through a "window" that works like this:
      
          base + 0x0: address to read/write
          base + 0x2: 8-bit register value
      
      This commit adds a new compatible device, "technologic,sja1000", with
      read and write functions using the window mechanism.
      Signed-off-by: Damien Riegel <damien.riegel@savoirfairelinux.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      dfb86c0d
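      The window maps naturally onto a pair of register accessors; a sketch in which
      the function names are illustrative and only the register layout comes from
      the commit message:

          /* base + 0x0: register index (16 bit), base + 0x2: register value (8 bit) */
          static u8 ts_sja1000_read_reg(const struct sja1000_priv *priv, int reg)
          {
                  iowrite16(reg, priv->reg_base + 0x0);
                  return ioread8(priv->reg_base + 0x2);
          }

          static void ts_sja1000_write_reg(const struct sja1000_priv *priv,
                                           int reg, u8 val)
          {
                  iowrite16(reg, priv->reg_base + 0x0);
                  iowrite8(val, priv->reg_base + 0x2);
          }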
    • Damien Riegel's avatar
      can: sja1000: add documentation for Technologic Systems version · 83c26850
      Damien Riegel authored
      This commit adds documentation for the Technologic Systems version of
      SJA1000. The difference with the NXP version is in the way the registers
      are accessed.
      Signed-off-by: Damien Riegel <damien.riegel@savoirfairelinux.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      83c26850
    • Damien Riegel's avatar
      can: sja1000: of: add per-compatible init hook · f49cbe6b
      Damien Riegel authored
      This commit adds the capability to allocate and init private data
      embedded in the sja1000_priv structure on a per-compatible basis. The
      device node is passed as a parameter of the init callback to allow
      parsing of custom device tree properties.
      Signed-off-by: Damien Riegel <damien.riegel@savoirfairelinux.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      f49cbe6b
    • David S. Miller's avatar
      Merge branch 'bpf-get-stackid' · 80c804bf
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf_get_stackid() and stack_trace map
      
      This patch set introduces new map type to store stack traces and
      corresponding bpf_get_stackid() helper.
      BPF programs can already walk the stack via an unrolled loop
      of bpf_probe_read()s, which is ok for simple analysis, but it's
      not efficient and is limited to <30 frames, after which the programs
      don't fit into MAX_BPF_STACK. With the bpf_get_stackid() helper,
      programs can collect up to PERF_MAX_STACK_DEPTH frames, both
      user and kernel.
      Using stack traces as a key in a map turned out to be very useful
      for generating flame graphs, off-cpu graphs, waker and chain graphs.
      Patch 3 is a simplified version of 'offwaketime' tool which is
      described in detail here:
      http://brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
      
      Earlier versions of this patch were using the save_stack_trace() helper,
      but 'unreliable' frames add too much noise and two equivalent
      stack traces produce different 'stackid's.
      Using the lockdep style of storing frames with MAX_STACK_TRACE_ENTRIES is
      great for lockdep, but not acceptable for bpf, since the stack_trace
      map needs to be freed when the user Ctrl-Cs the tool.
      The ftrace style with per_cpu(struct ftrace_stack) is great, but it's
      tightly coupled with ftrace ring buffer and has the same 'unreliable'
      noise. perf_event's perf_callchain() mechanism is also very efficient
      and it only needed minor generalization which is done in patch 1
      to be used by bpf stack_trace maps.
      Peter, please take a look at patch 1.
      If you're ok with it, I'd like to take the whole set via net-next.
      
      Patch 1 - generalization of perf_callchain()
      Patch 2 - stack_trace map done as lock-less hashtable without link list
        to avoid spinlock on insertion which is critical path when
        bpf_get_stackid() helper is called for every task switch event
      Patch 3 - offwaketime example
      
      After the patch, the 'perf report' for an artificial 'sched_bench'
      benchmark that does pthread_cond_wait/signal, with the 'offwaketime'
      example running in the background:
       16.35%  swapper      [kernel.vmlinux]    [k] intel_idle
        2.18%  sched_bench  [kernel.vmlinux]    [k] __switch_to
        2.18%  sched_bench  libpthread-2.12.so  [.] pthread_cond_signal@@GLIBC_2.3.2
        1.72%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_unlock
        1.53%  sched_bench  [kernel.vmlinux]    [k] bpf_get_stackid
        1.44%  sched_bench  [kernel.vmlinux]    [k] entry_SYSCALL_64
        1.39%  sched_bench  [kernel.vmlinux]    [k] __call_rcu.constprop.73
        1.13%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_lock
        1.07%  sched_bench  libpthread-2.12.so  [.] pthread_cond_wait@@GLIBC_2.3.2
        1.07%  sched_bench  [kernel.vmlinux]    [k] hash_futex
        1.05%  sched_bench  [kernel.vmlinux]    [k] do_futex
        1.05%  sched_bench  [kernel.vmlinux]    [k] get_futex_key_refs.isra.13
      
      The hottest part of bpf_get_stackid() is the inlined jhash2, so we may consider
      using some faster hash in the future, but it's good enough for now.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80c804bf
    • Alexei Starovoitov's avatar
      samples/bpf: offwaketime example · a6ffe7b9
      Alexei Starovoitov authored
      This is a simplified version of Brendan Gregg's offwaketime:
      This program shows kernel stack traces and task names that were blocked and
      "off-CPU", along with the stack traces and task names for the threads that woke
      them, and the total elapsed time from when they blocked to when they were woken
      up. The combined stacks, task names, and total time are summarized in kernel
      context for efficiency.
      
      Example:
      $ sudo ./offwaketime | flamegraph.pl > demo.svg
      Open demo.svg in a browser as a FlameGraph visualization.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6ffe7b9
    • Alexei Starovoitov's avatar
      bpf: introduce BPF_MAP_TYPE_STACK_TRACE · d5a3b1f6
      Alexei Starovoitov authored
      add new map type to store stack traces and corresponding helper
      bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - number of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error
      
      stackid is a 32-bit integer handle that can be further combined with
      other data (including other stackid) and used as a key into maps.
      
      Userspace will access stackmap using standard lookup/delete syscall commands to
      retrieve full stack trace for given stackid.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5a3b1f6
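      Typical use from a tracing program, sketched in the samples/bpf style of the
      time (map definition macros, helper declarations and the probed function are
      assumptions):

          struct bpf_map_def SEC("maps") stackmap = {
                  .type        = BPF_MAP_TYPE_STACK_TRACE,
                  .key_size    = sizeof(u32),
                  .value_size  = PERF_MAX_STACK_DEPTH * sizeof(u64),
                  .max_entries = 10000,
          };

          SEC("kprobe/try_to_wake_up")
          int waker(struct pt_regs *ctx)
          {
                  /* Collect the current kernel stack; the returned id can
                   * be used as (part of) a key in other maps.
                   */
                  int stackid = bpf_get_stackid(ctx, &stackmap, BPF_F_FAST_STACK_CMP);

                  if (stackid < 0)
                          return 0;
                  /* aggregate by stackid here */
                  return 0;
          }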
    • Alexei Starovoitov's avatar
      perf: generalize perf_callchain · 568b329a
      Alexei Starovoitov authored
      . avoid walking the stack when there is no room left in the buffer
      . generalize get_perf_callchain() to be called from bpf helper
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      568b329a
    • Daniel Borkmann's avatar
      net: use skb_postpush_rcsum instead of own implementations · 6b83d28a
      Daniel Borkmann authored
      Replace individual implementations with the recently introduced
      skb_postpush_rcsum() helper.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b83d28a
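      For reference, the consolidated pattern, sketched (hdr and len are illustrative):

          /* After pushing 'len' bytes of header in front of the data, fold
           * the pushed bytes back into skb->csum for CHECKSUM_COMPLETE skbs
           * instead of open-coding the csum_partial()/csum_block_add() dance.
           */
          skb_push(skb, len);
          memcpy(skb->data, hdr, len);
          skb_postpush_rcsum(skb, skb->data, len);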