- 04 Nov, 2019 27 commits
-
-
Maciej Fijalkowski authored
At this point the ice driver is able to work on order 1 pages that are split into two 3k buffers. Let's reflect that when the user sets a new MTU size and XDP is present on the interface. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Maciej Fijalkowski authored
Driver is now prepared for building the skb around the existing Rx buffer, so introduce the ice_build_skb responsible for it. Make use of XDP's data_meta as well. I've observed around 30% less CPU consumption with build_skb Rx path, in comparison to legacy Rx. What stands behind such result is the avoidance of flow_dissector (which we were diving into via eth_get_headlen) and no memcpy calls. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Maciej Fijalkowski authored
Take into account the underlying architecture-specific settings and, based on that, calculate the possible padding that can be supplied. Typically, for x86 and standard MTU size we will end up with 192 bytes of headroom. This is the same behavior as our other drivers have and we can dedicate it for XDP purposes. Furthermore, introduce the Rx ring flag for indicating whether build_skb is used on a particular ring. Based on that, invoke the routines for padding calculation. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
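The 192-byte figure can be reproduced with a little arithmetic. The sketch below is a Python model, not driver code. It assumes 4 KiB pages used in two 2 KiB halves, a 1536-byte Rx buffer, and 320 bytes reserved at the tail for the aligned `struct skb_shared_info`; that last value is typical of x86-64 but depends on kernel version and configuration. The names `align_up` and `compute_pad` are hypothetical.

```python
def align_up(value: int, boundary: int) -> int:
    """Round value up to the next multiple of boundary (a power of two)."""
    return (value + boundary - 1) & ~(boundary - 1)


def compute_pad(rx_buf_len: int, page_size: int = 4096,
                shared_info_size: int = 320) -> int:
    """Bytes left over for headroom once buffer and skb tail data are placed.

    Assumptions (not taken from the commit message): the page is used in
    halves, and shared_info_size is the aligned size of skb_shared_info.
    """
    half_page = page_size // 2
    slot = align_up(rx_buf_len, half_page)   # space reserved per buffer
    usable = slot - shared_info_size         # what remains after the tailroom
    return usable - rx_buf_len               # leftover becomes headroom


if __name__ == "__main__":
    print(compute_pad(1536))  # prints 192 under the assumptions above
```

Under these assumptions the 2048-byte slot minus 320 bytes of tailroom leaves 1728 bytes, and 1728 minus the 1536-byte buffer gives the 192 bytes of headroom quoted in the message.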
-
Maciej Fijalkowski authored
Add an ethtool "legacy-rx" priv flag for toggling the Rx path. This control knob will be mainly used for build_skb usage as well as buffer size/MTU manipulation.

In preparation for adding build_skb support, take care of how we set the values of the max_frame and rx_buf_len fields of struct ice_vsi. Specifically, in this patch the mentioned fields are set to values that will allow us to provide headroom and tailroom in-place. This can be mostly broken down into the following:
- for legacy-rx "on" ethtool control knob, old behaviour is kept;
- for standard 1500 MTU size configure the buffer of size 1536, as network stack is expecting the NET_SKB_PAD to be provided and NET_IP_ALIGN can have a non-zero value (these can be typically equal to 32 and 2, respectively);
- for larger MTUs go with max_frame set to 9k and configure the 3k buffer in case when PAGE_SIZE of underlying arch is less than 8k; 3k buffer is implying the need for order 1 page, so that our page recycling scheme can still be applied;

With that said, substitute the hardcoded ICE_RXBUF_2048 and PAGE_SIZE values in the DMA API calls that we're making use of with rx_ring->rx_buf_len and ice_rx_pg_size(rx_ring). The latter is a newly introduced helper for determining the page size based on its order (which was figured out via ice_rx_pg_order). Last but not least, take care of truesize calculation.

In the followup patch the headroom/tailroom computation logic will be introduced.

This change aligns the buffer and frame configuration with other Intel drivers, most importantly with iavf.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
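The three cases listed in this message can be summarised as a small decision function. This is a Python model for illustration only, not the driver's code: the function name `choose_rx_buffer` is hypothetical, and the branch for architectures with pages of 8k or more is an assumption, since the message only describes the case where PAGE_SIZE is below 8k.

```python
RXBUF_1536 = 1536
RXBUF_2048 = 2048
RXBUF_3072 = 3072
ETH_DATA_LEN = 1500  # standard Ethernet MTU


def choose_rx_buffer(mtu: int, legacy_rx: bool, page_size: int = 4096):
    """Return (rx_buf_len, page_order) for a model Rx ring."""
    if legacy_rx:
        # Old behaviour is kept: fixed 2k buffer on an order-0 page.
        return RXBUF_2048, 0
    if page_size >= 8192:
        # Assumption: with large pages two 2k buffers fit comfortably,
        # so no higher-order allocation is needed.
        return RXBUF_2048, 0
    if mtu <= ETH_DATA_LEN:
        # Standard MTU: 1536-byte buffer leaves room for padding
        # inside half of a 4k page.
        return RXBUF_1536, 0
    # Larger MTUs: two 3k buffers do not fit in one 4k page, so an
    # order-1 (8k) page is needed to keep the half-page recycling scheme.
    return RXBUF_3072, 1


if __name__ == "__main__":
    for mtu, legacy in [(1500, False), (9000, False), (9000, True)]:
        print(mtu, legacy, choose_rx_buffer(mtu, legacy))
```

Returning the page order alongside the buffer length mirrors the point made in the message: the page size is derived from the order, so the two values have to be chosen together.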
-
Krzysztof Kazimierczak authored
Add zero copy AF_XDP support. This patch adds zero copy support for Tx and Rx; code for zero copy is added to ice_xsk.h and ice_xsk.c. For Tx, implement ndo_xsk_wakeup. As with other drivers, reuse existing XDP Tx queues for this task, since XDP_REDIRECT guarantees mutual exclusion between different NAPI contexts based on CPU ID. In turn, a netdev can XDP_REDIRECT to another netdev with a different NAPI context, since the operation is bound to a specific core and each core has its own hardware ring. For Rx, allocate frames as MEM_TYPE_ZERO_COPY on queues where AF_XDP is enabled. Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com> Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Krzysztof Kazimierczak authored
In preparation for AF_XDP, move functions that will be used both by skb and zero-copy paths to a new file called ice_txrx_lib.c. This allows us to avoid using ifdefs to control the staticness of said functions. Move other functions (ice_rx_csum, ice_rx_hash and ice_ptype_to_htype) called only by the moved ones to the new file as well. Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Maciej Fijalkowski authored
Add support for XDP. Implement ndo_bpf and ndo_xdp_xmit. Upon load of an XDP program, allocate additional Tx rings for dedicated XDP use. The following actions are supported: XDP_TX, XDP_DROP, XDP_REDIRECT, XDP_PASS, and XDP_ABORTED. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Maciej Fijalkowski authored
There's no reason for treating DCB as first class citizen when configuring the Tx queues and going through TCs. Reverse the logic and base the configuration logic on rings, which is the object of interest anyway. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Anirudh Venkataramanan authored
Remove a few uses of kernel configuration flags from ice_lib.c by introducing a new source file ice_base.c. Also move corresponding function prototypes from ice_lib.h to ice_base.h and include ice_base.h where required. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
David S. Miller authored
Saeed Mahameed says:

====================
mlx5-updates-2019-11-01

Misc updates for mlx5 netdev and core driver

1) Steering Core: Replace CRC32 internal implementation with standard kernel lib.
2) Steering Core: Support IPv4 and IPv6 mixed matcher.
3) Steering Core: Lockless FTE read lookups
4) TC: Bit sized fields rewrite support.
5) FPGA: Standalone FPGA support.
6) SRIOV: Reset VF parameters configurations on SRIOV disable.
7) netdev: Dump WQs wqe descriptors on CQE with error events.
8) MISC Cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
drivers/isdn/hardware/mISDN/mISDNisar.c:30:17: warning: faxmodulation_s defined but not used [-Wunused-const-variable=]

It is never used, so can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vincent Cheng authored
The IDT ClockMatrix (TM) family includes integrated devices that provide eight PLL channels. Each PLL channel can be independently configured as a frequency synthesizer, jitter attenuator, digitally controlled oscillator (DCO), or a digital phase lock loop (DPLL). Typically these devices are used as timing references and clock sources for PTP applications. This patch adds support for the device. Co-developed-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Vincent Cheng <vincent.cheng.xh@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vincent Cheng authored
Add device tree binding doc for the IDT ClockMatrix PTP clock. Signed-off-by: Vincent Cheng <vincent.cheng.xh@renesas.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Francesco Ruggeri authored
traceroute6 output can be confusing, in that it shows the address that a router would use to reach the sender, rather than the address the packet used to reach the router. Consider this case:

            ------------------------ N2
             |                    |
           ------              ------  N3  ----
           | R1 |              | R2 |------|H2|
           ------              ------      ----
             |                    |
            ------------------------ N1
                      |
                     ----
                     |H1|
                     ----

where H1's default route is through R1, and R1's default route is through R2 over N2. traceroute6 from H1 to H2 shows R2's address on N1 rather than on N2.

The script below can be used to reproduce this scenario.

traceroute6 output without this patch:

    traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
     1  2000:101::1 (2000:101::1)  0.036 ms  0.008 ms  0.006 ms
     2  2000:101::2 (2000:101::2)  0.011 ms  0.008 ms  0.007 ms
     3  2000:103::4 (2000:103::4)  0.013 ms  0.010 ms  0.009 ms

traceroute6 output with this patch:

    traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
     1  2000:101::1 (2000:101::1)  0.056 ms  0.019 ms  0.006 ms
     2  2000:102::2 (2000:102::2)  0.013 ms  0.008 ms  0.008 ms
     3  2000:103::4 (2000:103::4)  0.013 ms  0.009 ms  0.009 ms

#!/bin/bash
#
#        ------------------------ N2
#         |                    |
#       ------              ------  N3  ----
#       | R1 |              | R2 |------|H2|
#       ------              ------      ----
#         |                    |
#        ------------------------ N1
#                  |
#                 ----
#                 |H1|
#                 ----
#
# N1: 2000:101::/64
# N2: 2000:102::/64
# N3: 2000:103::/64
#
# R1's host part of address: 1
# R2's host part of address: 2
# H1's host part of address: 3
# H2's host part of address: 4
#
# For example:
# the IPv6 address of R1's interface on N2 is 2000:102::1/64
#
# Nets are implemented by macvlan interfaces (bridge mode) over
# dummy interfaces.
#

# Create net namespaces
ip netns add host1
ip netns add host2
ip netns add rtr1
ip netns add rtr2

# Create nets
ip link add net1 type dummy; ip link set net1 up
ip link add net2 type dummy; ip link set net2 up
ip link add net3 type dummy; ip link set net3 up

# Add interfaces to net1, move them to their namespaces
ip link add link net1 dev host1net1 type macvlan mode bridge
ip link set host1net1 netns host1
ip link add link net1 dev rtr1net1 type macvlan mode bridge
ip link set rtr1net1 netns rtr1
ip link add link net1 dev rtr2net1 type macvlan mode bridge
ip link set rtr2net1 netns rtr2

# Add interfaces to net2, move them to their namespaces
ip link add link net2 dev rtr1net2 type macvlan mode bridge
ip link set rtr1net2 netns rtr1
ip link add link net2 dev rtr2net2 type macvlan mode bridge
ip link set rtr2net2 netns rtr2

# Add interfaces to net3, move them to their namespaces
ip link add link net3 dev rtr2net3 type macvlan mode bridge
ip link set rtr2net3 netns rtr2
ip link add link net3 dev host2net3 type macvlan mode bridge
ip link set host2net3 netns host2

# Configure interfaces and routes in host1
ip netns exec host1 ip link set lo up
ip netns exec host1 ip link set host1net1 up
ip netns exec host1 ip -6 addr add 2000:101::3/64 dev host1net1
ip netns exec host1 ip -6 route add default via 2000:101::1

# Configure interfaces and routes in rtr1
ip netns exec rtr1 ip link set lo up
ip netns exec rtr1 ip link set rtr1net1 up
ip netns exec rtr1 ip -6 addr add 2000:101::1/64 dev rtr1net1
ip netns exec rtr1 ip link set rtr1net2 up
ip netns exec rtr1 ip -6 addr add 2000:102::1/64 dev rtr1net2
ip netns exec rtr1 ip -6 route add default via 2000:102::2
ip netns exec rtr1 sysctl net.ipv6.conf.all.forwarding=1

# Configure interfaces and routes in rtr2
ip netns exec rtr2 ip link set lo up
ip netns exec rtr2 ip link set rtr2net1 up
ip netns exec rtr2 ip -6 addr add 2000:101::2/64 dev rtr2net1
ip netns exec rtr2 ip link set rtr2net2 up
ip netns exec rtr2 ip -6 addr add 2000:102::2/64 dev rtr2net2
ip netns exec rtr2 ip link set rtr2net3 up
ip netns exec rtr2 ip -6 addr add 2000:103::2/64 dev rtr2net3
ip netns exec rtr2 sysctl net.ipv6.conf.all.forwarding=1

# Configure interfaces and routes in host2
ip netns exec host2 ip link set lo up
ip netns exec host2 ip link set host2net3 up
ip netns exec host2 ip -6 addr add 2000:103::4/64 dev host2net3
ip netns exec host2 ip -6 route add default via 2000:103::2

# Ping host2 from host1
ip netns exec host1 ping6 -c5 2000:103::4

# Traceroute host2 from host1
ip netns exec host1 traceroute6 2000:103::4

# Delete nets
ip link del net3
ip link del net2
ip link del net1

# Delete namespaces
ip netns del rtr2
ip netns del rtr1
ip netns del host2
ip netns del host1

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Original-patch-by: Honggang Xu <hxu@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tuong Lien authored
As mentioned in commit e95584a8 ("tipc: fix unlimited bundling of small messages"), the current message bundling algorithm is inefficient in that it can generate bundles of only one payload message, which causes unnecessary overhead for both the sender and receiver.

This commit re-designs the 'tipc_msg_make_bundle()' function (now named 'tipc_msg_try_bundle()'), so that when a message arrives first, we just check and keep a reference to it if the message is suitable for bundling. The message buffer will be put into the link backlog queue and processed as normal. Later on, when another one comes, we will make a bundle with the first message if possible, and so on. This way, a bundle, if really needed, will always consist of at least two payload messages. Otherwise, we let the first buffer go its way without any need for bundling, which reduces the overhead to zero.

Moreover, since we now have both messages in hand, we can even optimize the 'tipc_msg_bundle()' function and make a bundle of a very large (size ~ MSS) message and a small one, which is not possible with the current algorithm, e.g. [1400-byte message] + [10-byte message] (MTU = 1500).

Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
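The "keep a reference, bundle only when a second message shows up" idea can be sketched as a toy model. This is Python for illustration, not TIPC code: the `Link` class, the 40-byte bundle header, and the 4-byte alignment of bundled messages are all assumptions made for the sketch.

```python
BUNDLE_HDR = 40  # assumed size of the bundle header, in bytes


def align4(n: int) -> int:
    """Round n up to a multiple of 4 (assumed alignment inside a bundle)."""
    return (n + 3) & ~3


class Link:
    def __init__(self, mtu: int):
        self.mtu = mtu
        self.backlog = []      # each entry is a list of payload sizes
        self.candidate = None  # queued entry that may still accept messages

    def _bundle_size(self, sizes) -> int:
        return BUNDLE_HDR + sum(align4(s) for s in sizes)

    def send(self, size: int) -> bool:
        """Queue one message; return True if it joined an existing entry."""
        if self.candidate is not None:
            if self._bundle_size(self.candidate + [size]) <= self.mtu:
                # Second (or later) message: only now does a bundle exist.
                self.candidate.append(size)
                return True
        # First message: queue it as-is and just remember it.
        entry = [size]
        self.backlog.append(entry)
        fits_later = self._bundle_size(entry) < self.mtu
        self.candidate = entry if fits_later else None
        return False


if __name__ == "__main__":
    link = Link(mtu=1500)
    link.send(1400)
    link.send(10)
    print(link.backlog)
```

An entry holding a single size stands for a plain message that never pays the bundle header; an entry with two or more sizes stands for a bundle. With the numbers from the message, the 1400-byte and 10-byte messages end up in one entry.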
-
Francesco Ruggeri authored
Even with icmp_errors_use_inbound_ifaddr set, traceroute returns the primary address of the interface the packet was received on, even if the path goes through a secondary address. In the example:

                            1.0.3.1/24
     ---- 1.0.1.3/24    1.0.1.1/24 ---- 1.0.2.1/24    1.0.2.4/24 ----
     |H1|--------------------------|R1|--------------------------|H2|
     ----            N1            ----            N2            ----

where 1.0.3.1/24 is R1's primary address on N1, traceroute from H1 to H2 returns:

    traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
     1  1.0.3.1 (1.0.3.1)  0.018 ms  0.006 ms  0.006 ms
     2  1.0.2.4 (1.0.2.4)  0.021 ms  0.007 ms  0.007 ms

After applying this patch, it returns:

    traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
     1  1.0.1.1 (1.0.1.1)  0.033 ms  0.007 ms  0.006 ms
     2  1.0.2.4 (1.0.2.4)  0.011 ms  0.007 ms  0.007 ms

Original-patch-by: Bill Fenner <fenner@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Tonghao Zhang says:

====================
optimize openvswitch flow looking up

This patch series optimizes openvswitch for performance and simplifies the code.

Patch 1, 2, 4: Port Pravin B Shelar patches to linux upstream with little changes.
Patch 5, 6, 7: Optimize the flow looking up and simplify the flow hash.
Patch 8, 9: bug fixes.

The performance test is on Intel Xeon E5-2630 v4. The test topology is shown below:

    +-----------------------------------+
    |    +---------------------------+  |
    |    | eth0   ovs-switch    eth1 |  | Host0
    |    +---------------------------+  |
    +-----------------------------------+
          ^                       |
          |                       |
          |                       |
          |                       |
          |                       v
    +-----+----+             +----+-----+
    | netperf  | Host1       | netserver| Host2
    +----------+             +----------+

We use netperf to send 64B packets, and insert 255+ flow-masks:

    $ ovs-dpctl add-flow ovs-switch "in_port(1),eth(dst=00:01:00:00:00:00/ff:ff:ff:ff:ff:01),eth_type(0x0800),ipv4(frag=no)" 2
    ...
    $ ovs-dpctl add-flow ovs-switch "in_port(1),eth(dst=00:ff:00:00:00:00/ff:ff:ff:ff:ff:ff),eth_type(0x0800),ipv4(frag=no)" 2
    $
    $ netperf -t UDP_STREAM -H 2.2.2.200 -l 40 -- -m 18

* Without series patch, throughput 8.28Mbps
* With series patch, throughput 46.05Mbps

v6: some coding style fixes
v5: rewrite patch 8, release flow-mask when freeing flow
v4: access ma->count with READ_ONCE/WRITE_ONCE API. More information, see patch 5 comments.
v3: update ma point when realloc mask_array in patch 5
v2: simplify codes. e.g. use kfree_rcu instead of call_rcu
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
Use the specified functions to initialize the resources. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
Unlocking a mutex that is not locked is not allowed. Another kernel thread may be in the critical section while we unlock it because setting user_feature failed. Fixes: 95a7233c ("net: openvswitch: Set OvS recirc_id from tc chain index") Cc: Paul Blakey <paulb@mellanox.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
When we destroy the flow tables, which may contain the flow_mask, release the flow mask struct as well. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
In most cases *index < ma->max and the flow-mask is not NULL. Add likely()/unlikely() annotations for performance. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
Simplify the code and remove the unnecessary BUILD_BUG_ON. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
The full lookup on the flow table traverses the whole mask array. If the mask array is too large and the number of invalid flow-masks increases, performance will drop.

One bad case, for example: M means the flow-mask is valid and NULL means the flow-mask was deleted.

    +-------------------------------------------+
    | M | NULL | ...                  | NULL | M|
    +-------------------------------------------+

In that case, without this patch, openvswitch will traverse the whole mask array, because there is one flow-mask at the tail. This patch changes the way flow-masks are inserted and deleted, so that the mask array is kept as below, without any NULL hole. In the fast path, we can "break" out of the "for" loop (instead of "continue") in flow_lookup when we get a NULL flow-mask.

             "break"
                v
    +-------------------------------------------+
    | M | M |  NULL |...           | NULL | NULL|
    +-------------------------------------------+

This patch doesn't optimize the slow or control path, which still use ma->max to traverse. Slow path:
* tbl_mask_array_realloc
* ovs_flow_tbl_lookup_exact
* flow_mask_find

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
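A dense mask array that lets the lookup stop at the first NULL can be modelled in a few lines. The sketch below is a Python model, not the openvswitch code: the `MaskArray` class is hypothetical, and filling the hole with the last valid entry on delete is one way to keep the array free of holes, assumed here for illustration.

```python
class MaskArray:
    def __init__(self, capacity: int):
        self.masks = [None] * capacity  # None plays the role of NULL
        self.count = 0                  # number of valid masks, kept dense

    def insert(self, mask) -> None:
        if self.count == len(self.masks):
            raise OverflowError("array full; real code would reallocate")
        self.masks[self.count] = mask   # always append after the last valid one
        self.count += 1

    def delete(self, mask) -> None:
        i = self.masks.index(mask)      # raises ValueError if absent
        last = self.count - 1
        self.masks[i] = self.masks[last]  # move the tail entry into the hole
        self.masks[last] = None
        self.count -= 1

    def lookup(self, predicate):
        """Fast path: return (matching mask or None, slots visited)."""
        visited = 0
        for mask in self.masks:
            if mask is None:
                break                   # no holes, so nothing valid follows
            visited += 1
            if predicate(mask):
                return mask, visited
        return None, visited


if __name__ == "__main__":
    ma = MaskArray(capacity=8)
    for name in ("a", "b", "c"):
        ma.insert(name)
    ma.delete("a")
    print(ma.masks, ma.lookup(lambda m: m == "missing"))
```

A failed lookup in this model visits only the valid entries rather than all eight slots, which is the saving the message describes for the fast path.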
-
Tonghao Zhang authored
Port the code to upstream Linux with small changes.

Pravin B Shelar says:
| In case hash collision on mask cache, OVS does extra flow
| lookup. Following patch avoid it.

Link: https://github.com/openvswitch/ovs/commit/0e6efbe2712da03522532dc5e84806a96f6a0dd1
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
When creating and inserting flow-mask, if there is no available flow-mask, we realloc the mask array. When removing flow-mask, if necessary, we shrink mask array. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
Port the code to upstream Linux with small changes.

Pravin B Shelar says:
| mask caches index of mask in mask_list. On packet recv OVS
| need to traverse mask-list to get cached mask. Therefore array
| is better for retrieving cached mask. This also allows better
| cache replacement algorithm by directly checking mask's existence.

Link: https://github.com/openvswitch/ovs/commit/d49fc3ff53c65e4eca9cabd52ac63396746a7ef5
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
The idea for this optimization comes from a patch by Pravin B Shelar that was committed in the openvswitch community in 2014. In order to get high performance, I implement it again. Later patches will use it.

Pravin B Shelar says:
| On every packet OVS needs to lookup flow-table with every
| mask until it finds a match. The packet flow-key is first
| masked with mask in the list and then the masked key is
| looked up in flow-table. Therefore number of masks can
| affect packet processing performance.

Link: https://github.com/openvswitch/ovs/commit/5604935e4e1cbc16611d2d97f50b717aa31e8ec5
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
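The per-mask lookup described in the quote, and why remembering the mask that matched last time helps, can be sketched as follows. This is a Python model, not kernel code: keys and masks are plain integers, the `FlowTable` class is hypothetical, and keying the cache by a packet hash is an assumption about the shape of the optimization.

```python
class FlowTable:
    def __init__(self):
        self.masks = []   # list of integer bitmasks
        self.flows = {}   # (mask, masked key) -> action
        self.cache = {}   # packet hash -> index into self.masks (assumed)

    def add_flow(self, key: int, mask: int, action: str) -> None:
        if mask not in self.masks:
            self.masks.append(mask)
        self.flows[(mask, key & mask)] = action

    def _probe(self, key: int, index: int):
        mask = self.masks[index]
        # Mask the packet key first, then look the masked key up.
        return self.flows.get((mask, key & mask))

    def lookup(self, key: int, pkt_hash: int):
        """Return (action or None, number of masks probed)."""
        probes = 0
        hint = self.cache.get(pkt_hash)
        if hint is not None and hint < len(self.masks):
            probes += 1
            action = self._probe(key, hint)
            if action is not None:
                return action, probes
        # Fall back to trying every mask until one matches.
        for index in range(len(self.masks)):
            if index == hint:
                continue
            probes += 1
            action = self._probe(key, index)
            if action is not None:
                self.cache[pkt_hash] = index
                return action, probes
        return None, probes


if __name__ == "__main__":
    table = FlowTable()
    table.add_flow(0x0A000001, 0xFFFFFFFF, "exact")
    table.add_flow(0x0B000000, 0xFF000000, "prefix")
    print(table.lookup(0x0B010203, pkt_hash=7))
    print(table.lookup(0x0B010203, pkt_hash=7))
```

In this model the first lookup for a packet probes masks in order until one matches, while a repeat lookup with the same hash probes only the remembered mask, so the cost no longer grows with the number of masks.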
-
- 02 Nov, 2019 13 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller authored
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-11-02

The following pull-request contains BPF updates for your *net-next* tree.

We've added 30 non-merge commits during the last 7 day(s) which contain a total of 41 files changed, 1864 insertions(+), 474 deletions(-).

The main changes are:

1) Fix long standing user vs kernel access issue by introducing bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.
2) Accelerated xskmap lookup, from Björn and Maciej.
3) Support for automatic map pinning in libbpf, from Toke.
4) Cleanup of BTF-enabled raw tracepoints, from Alexei.
5) Various fixes to libbpf and selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
David S. Miller authored
The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
Daniel Borkmann says:

====================
This set adds probe_read_{user,kernel}(), probe_read_str_{user,kernel}() helpers, fixes probe_write_user() helper and selftests. For details please see individual patches.

Thanks!

v2 -> v3:
- noticed two more things that are fixed in here:
  - bpf uapi helper description used 'int size' for *_str helpers, now u32
  - we need TASK_SIZE_MAX + guard page on x86-64 in patch 2 otherwise we'll trigger the 00c42373 warn as well, so full range covered now

v1 -> v2:
- standardize unsafe_ptr terminology in uapi header comment (Andrii)
- probe_read_{user,kernel}[_str] naming scheme (Andrii)
- use global data in last test case, remove relaxed_maps (Andrii)
- add strict non-pagefault kernel read funcs to avoid warning in kernel probe read helpers (Alexei)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
Tested on x86-64 and Ilya was also kind enough to give it a spin on s390x, both passing with probe_user:OK there. The test is using the newly added bpf_probe_read_user() to dump sockaddr from connect call into .bss BPF map and overrides the user buffer via bpf_probe_write_user():

    # ./test_progs
    [...]
    #17 pkt_md_access:OK
    #18 probe_user:OK
    #19 prog_run_xattr:OK
    [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/90f449d8af25354e05080e82fc6e2d3179da30ea.1572649915.git.daniel@iogearbox.net
-
Daniel Borkmann authored
Use probe read *_{kernel,user}{,_str}() helpers instead of bpf_probe_read() or bpf_probe_read_user_str() for program tests where appropriate. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/4a61d4b71ce3765587d8ef5cb93afa18515e5b3e.1572649915.git.daniel@iogearbox.net
-
Daniel Borkmann authored
Use bpf_probe_read_user() helper instead of bpf_probe_read() for samples that attach to kprobes probing on user addresses. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/5b0144b3f8e031ec5e2438bd7de8d7877e63bf2f.1572649915.git.daniel@iogearbox.net
-
Daniel Borkmann authored
Commit 2a02759e ("bpf: Add support for BTF pointers to interpreter") explicitly states that the pointer to BTF object is a pointer to a kernel object or NULL. Therefore we should also switch to using the strict kernel probe helper which is restricted to kernel addresses only when architectures have non-overlapping address spaces. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/d2b90827837685424a4b8008dfe0460558abfada.1572649915.git.daniel@iogearbox.net
-
Daniel Borkmann authored
The current bpf_probe_read() and bpf_probe_read_str() helpers are broken in that they assume they can be used for probing memory access for kernel space addresses /as well as/ user space addresses.

However, plain use of probe_kernel_read() for both cases will always attempt to access kernel address space, given that the access is performed under KERNEL_DS, and some archs in fact have overlapping address spaces where a kernel pointer and user pointer would have the /same/ address value, so accessing application memory via bpf_probe_read{,_str}() would read garbage values.

Let's fix the BPF side by making use of the recently added 3d708182 ("uaccess: Add non-pagefault user-space read functions"). Unfortunately, the only way to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}() helpers are kept as-is to retain their current behavior.

The two *_user() variants always attempt the access with USER_DS set. The two *_kernel() variants will return -EFAULT when accessing user memory if the underlying architecture has non-overlapping address ranges, also avoiding throwing the kernel warning via 00c42373 ("x86-64: add warning for non-canonical user access address dereferences").

Fixes: a5e8c070 ("bpf: add bpf_probe_read_str helper")
Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
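Why a single helper cannot serve both kinds of pointer on an architecture with overlapping address spaces can be shown with a toy model. The Python below is purely illustrative and assumes two separate address spaces in which the same numeric address holds different data; the dictionaries and function names are hypothetical stand-ins, not the kernel helpers themselves.

```python
import errno

ADDR = 0x1000  # same numeric value is valid in both address spaces

# Toy model of an architecture with overlapping address spaces.
kernel_space = {ADDR: b"kernel data"}
user_space = {ADDR: b"user data"}


def probe_read_legacy(addr):
    """Old behaviour: the address is always resolved as a kernel address."""
    return kernel_space.get(addr, -errno.EFAULT)


def probe_read_user(addr):
    """Dedicated user variant: resolve against the user address space."""
    return user_space.get(addr, -errno.EFAULT)


def probe_read_kernel(addr):
    """Dedicated kernel variant: resolve against the kernel address space."""
    return kernel_space.get(addr, -errno.EFAULT)


if __name__ == "__main__":
    # A tracer holding a *user* pointer gets the wrong bytes from the
    # legacy helper, and the intended bytes from the user variant.
    print(probe_read_legacy(ADDR), probe_read_user(ADDR))
```

The caller knows which kind of pointer it holds; the helper cannot infer it from the address value, which is why the fix takes the form of separate helpers.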
-
Daniel Borkmann authored
Convert the bpf_probe_write_user() helper to probe_user_write() such that writes are not attempted under KERNEL_DS anymore which is buggy as kernel and user space pointers can have overlapping addresses. Also, given we have the access_ok() check inside probe_user_write(), the helper doesn't need to do it twice. Fixes: 96ae5227 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/841c461781874c07a0ee404a454c3bc0459eed30.1572649915.git.daniel@iogearbox.net
-
Daniel Borkmann authored
Add two new probe_kernel_read_strict() and strncpy_from_unsafe_strict() helpers which by default alias to the __probe_kernel_read() and the __strncpy_from_unsafe(), respectively, but can be overridden by archs which have non-overlapping address ranges for kernel space and user space in order to bail out with -EFAULT when attempting to probe user memory including non-canonical user access addresses [0]:

    4-level page tables:
      user-space mem: 0x0000000000000000 - 0x00007fffffffffff
      non-canonical:  0x0000800000000000 - 0xffff7fffffffffff

    5-level page tables:
      user-space mem: 0x0000000000000000 - 0x00ffffffffffffff
      non-canonical:  0x0100000000000000 - 0xfeffffffffffffff

The idea is that these helpers are complementary to the probe_user_read() and strncpy_from_unsafe_user() which probe user-only memory. Both added helpers here do the same, but for kernel-only addresses.

Both sets of helpers are going to be used for BPF tracing. They also explicitly avoid throwing the splat for non-canonical user addresses from 00c42373 ("x86-64: add warning for non-canonical user access address dereferences").

For compat, the current probe_kernel_read() and strncpy_from_unsafe() are left as-is.

[0] Documentation/x86/x86_64/mm.txt

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: x86@kernel.org
Link: https://lore.kernel.org/bpf/eefeefd769aa5a013531f491a71f0936779e916b.1572649915.git.daniel@iogearbox.net
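The address ranges listed in this message translate directly into a range check. The sketch below is a Python model of the "strict" idea, using only the boundaries quoted above; treating everything above the non-canonical range as kernel space is a simplification, and `classify` and `probe_kernel_read_strict_model` are hypothetical names.

```python
import errno

# Upper bounds taken from the ranges quoted in the commit message.
RANGES = {
    4: {"user_max": 0x00007FFFFFFFFFFF, "noncanon_max": 0xFFFF7FFFFFFFFFFF},
    5: {"user_max": 0x00FFFFFFFFFFFFFF, "noncanon_max": 0xFEFFFFFFFFFFFFFF},
}


def classify(addr: int, levels: int = 4) -> str:
    """Return 'user', 'non-canonical' or 'kernel' for a 64-bit address."""
    bounds = RANGES[levels]
    if addr <= bounds["user_max"]:
        return "user"
    if addr <= bounds["noncanon_max"]:
        return "non-canonical"
    return "kernel"  # simplification: the rest is treated as kernel space


def probe_kernel_read_strict_model(addr: int, memory: dict, levels: int = 4):
    """Bail out with -EFAULT unless the address is a kernel address."""
    if classify(addr, levels) != "kernel":
        return -errno.EFAULT
    return memory.get(addr, -errno.EFAULT)


if __name__ == "__main__":
    for a in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
        print(hex(a), classify(a))
```

Rejecting the non-canonical range up front is what lets the strict variant avoid the warning mentioned in the message, since such addresses are never dereferenced.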
-
Daniel Borkmann authored
Commit 3d708182 ("uaccess: Add non-pagefault user-space read functions") did not add a probe write function. Therefore, factor out a probe_write_common() helper with most of the logic of probe_kernel_write() except setting KERNEL_DS, and add a new probe_user_write() helper so it can be used from the BPF side. Again, on some archs, the user address space and kernel address space can co-exist and overlap, so in such a case, setting KERNEL_DS would mean that the given address is treated as being in kernel address space. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649915.git.daniel@iogearbox.net
-
Alexei Starovoitov authored
Toke Høiland-Jørgensen says:

====================
This series adds support to libbpf for reading 'pinning' settings from BTF-based map definitions. It introduces a new open option which can set the pinning path; if no path is set, /sys/fs/bpf is used as the default. Callers can customise the pinning between open and load by setting the pin path per map, and still get the automatic reuse feature.

The semantics of the pinning is similar to the iproute2 "PIN_GLOBAL" setting, and the eventual goal is to move the iproute2 implementation to be based on libbpf and the functions introduced in this series.

Changelog:

v6:
- Fix leak of struct bpf_object in selftest
- Make struct bpf_map arg const in bpf_map__is_pinned() and bpf_map__get_pin_path()

v5:
- Don't pin maps with pinning set, but with a value of LIBBPF_PIN_NONE
- Add a few more selftests:
  - Should not pin map with pinning set, but value LIBBPF_PIN_NONE
  - Should fail to load a map with an invalid pinning value
  - Should fail to re-use maps with parameter mismatch
- Alphabetise libbpf.map
- Whitespace and typo fixes

v4:
- Don't check key_type_id and value_type_id when checking for map reuse compatibility.
- Move building of map->pin_path into init_user_btf_map()
- Get rid of 'pinning' attribute in struct bpf_map
- Make sure we also create parent directory on auto-pin (new patch 3).
- Abort the selftest on error instead of attempting to continue.
- Support unpinning all pinned maps with bpf_object__unpin_maps(obj, NULL)
- Support pinning at map->pin_path with bpf_object__pin_maps(obj, NULL)
- Make re-pinning a map at the same path a noop
- Rename the open option to pin_root_path
- Add a bunch more self-tests for pin_maps(NULL) and unpin_maps(NULL)
- Fix a couple of smaller nits

v3:
- Drop bpf_object__pin_maps_opts() and just use an open option to customise the pin path; also don't touch bpf_object__{un,}pin_maps()
- Integrate pinning and reuse into bpf_object__create_maps() instead of having multiple loops though the map structure
- Make errors in map reuse and pinning fatal to the load procedure
- Add selftest to exercise pinning feature
- Rebase series to latest bpf-next

v2:
- Drop patch that adds mounting of bpffs
- Only support a single value of the pinning attribute
- Add patch to fixup error handling in reuse_fd()
- Implement the full automatic pinning and map reuse logic on load
====================

Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
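How the pin path is resolved under the rules in this cover letter can be sketched in a few lines. This is a Python model rather than libbpf code: `resolve_pin_path` is a hypothetical name, and both the `LIBBPF_PIN_BY_NAME` constant and the `<root>/<map name>` layout reflect my understanding of libbpf rather than anything stated in the text above.

```python
import posixpath

LIBBPF_PIN_NONE = 0
LIBBPF_PIN_BY_NAME = 1          # assumed name of the single supported value
DEFAULT_PIN_ROOT = "/sys/fs/bpf"


def resolve_pin_path(map_name: str, pinning: int, pin_root_path=None):
    """Return the path a map would be pinned at, or None if it is not pinned."""
    if pinning == LIBBPF_PIN_NONE:
        return None                                  # pinning set, but to "none"
    if pinning != LIBBPF_PIN_BY_NAME:
        raise ValueError(f"invalid pinning value {pinning}")  # load would fail
    root = pin_root_path or DEFAULT_PIN_ROOT         # open option or default
    return posixpath.join(root, map_name)            # assumed layout


if __name__ == "__main__":
    print(resolve_pin_path("my_map", LIBBPF_PIN_BY_NAME))
    print(resolve_pin_path("my_map", LIBBPF_PIN_BY_NAME, "/sys/fs/bpf/app"))
```

Because the path is a pure function of the root and the map name in this model, two programs that open an object with the same root arrive at the same path, which is what makes automatic reuse of an already pinned map possible.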
-
Toke Høiland-Jørgensen authored
This adds a new BPF selftest to exercise the new automatic map pinning code. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/157269298209.394725.15420085139296213182.stgit@toke.dk
-