- 25 May, 2018 35 commits
-
-
Petr Machata authored
Having a clsact qdisc on $h3 is useful in several tests, and will be useful in more tests to come. Move the registration from all the tests that need it into the topology file itself. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
For non-GRE mirroring tests, a functions along the lines of do_test_span_gre_dir_ips() and test_span_gre_dir_ips() are necessary, but such that they don't assume tunnels are involved. Extract the code from mirror_gre_lib.sh to mirror_lib.sh and convert to just use a given device without assuming it's named "h3-$tundev". Convert the two above-mentioned functions to wrappers that pass along the correct device name. Add test_span_dir() and fail_test_span_dir() to round up the API for use by following patches. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
Move generic parts of mirror_gre_topo_lib.sh into a new file mirror_topo_lib.sh. Reuse the functions in GRE topo, adding the tunnel devices as necessary. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller authored
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-05-24 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc). 2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers. 3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit. 4) Jiong Wang adds support for indirect and arithmetic shifts to NFP 5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible. 6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions. 7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT. 8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events. 9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Thomas Falcon says: ==================== ibmvnic: Failover hardening Introduce additional transport event hardening to handle events during device reset. In the driver's current state, if a transport event is received during device reset, it can cause the device to become unresponsive as invalid operations are processed as the backing device context changes. After a transport event, the device expects a request to begin the initialization process. If the driver is still processing a previously queued device reset in this state, it is likely to fail as firmware will reject any commands other than the one to initialize the client driver's Command-Response Queue. Instead of failing and becoming dormant, the driver will make one more attempt to recover and continue operation. This is achieved by setting a state flag, which if true will direct the driver to clean up all allocated resources and perform a hard reset in an attempt to bring the driver back to an operational state. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
Introduce a recovery hard reset to handle reset failure as a result of change of device context following a transport event, such as a backing device failover or partition migration. These operations reset the device context to its initial state. If this occurs during a reset, any initialization commands are likely to fail with an invalid state error as backing device firmware requests reinitialization. When this happens, make one more attempt by performing a hard reset, which frees any resources currently allocated and performs device initialization. If a transport event occurs during a device reset, a flag is set which will trigger a new hard reset following the completionof the current reset event. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
Set device resetting state at the earliest possible point: as soon as a reset is successfully scheduled. The reset state is toggled off when all resets have been processed to completion. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
Instead of having one initialization routine for all cases, create a separate, simpler function for standard initialization, such as during device probe. Use the original initialization function to handle device reset scenarios. The goal of this patch is to avoid having a single, cluttered init function to handle all possible scenarios. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
If setting the link state is not successful, print a warning with the resulting return code and return it to be handled by the caller. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
If device init is interrupted by a failover, set the init return code so that it can be checked and handled appropriately by the init routine. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
Check whether CRQ command is successful before awaiting a response from the management partition. If the command was not successful, the driver may hang waiting for a response that will never come. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
Introduce an "active" state for a IBM vNIC Command-Response Queue. A CRQ is considered active once it has initialized or linked with its partner by sending an initialization request and getting a successful response back from the management partition. Until this has happened, do not allow CRQ commands to be sent other than the initialization request. This change will avoid a protocol error in case of a device transport event occurring during a initialization. When the driver receives a transport event notification indicating that the backing hardware has changed and needs reinitialization, any further commands other than the initialization handshake with the VIOS management partition will result in an invalid state error. Instead of sending a command that will be returned with an error, print a warning and return an error that will be handled by the caller. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
Set adapter NAPI state as disabled if they are removed. This will allow them to be enabled again if reallocated in case of a hard reset. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Petr Machata says: ==================== selftests: forwarding: Additions to mirror-to-gretap tests This patchset is for a handful of edge cases in mirror-to-gretap scenarios: removal of mirrored-to netdevice (#1), removal of underlay route for tunnel remote endpoint (#2) and cessation of mirroring upon removal of flower mirroring rule (#3). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
Test that when flower-based mirror action is removed, mirroring stops. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
When underlay route is removed, the mirrored traffic should not be forwarded. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
Tests that the mirroring code catches up with deletion of a mirrored-to device. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
t4_prep_fw doesn't check for card_fw pointer before store the read data, which could lead to a NULL pointer dereference if kvzalloc failed. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
Jesper Dangaard Brouer says: ==================== This patchset change ndo_xdp_xmit API to take a bulk of xdp frames. When kernel is compiled with CONFIG_RETPOLINE, every indirect function pointer (branch) call hurts performance. For XDP this have a huge negative performance impact. This patchset reduce the needed (indirect) calls to ndo_xdp_xmit, but also prepares for further optimizations. The DMA APIs use of indirect function pointer calls is the primary source the regression. It is left for a followup patchset, to use bulking calls towards the DMA API (via the scatter-gatter calls). The other advantage of this API change is that drivers can easier amortize the cost of any sync/locking scheme, over the bulk of packets. The assumption of the current API is that the driver implemementing the NDO will also allocate a dedicated XDP TX queue for every CPU in the system. Which is not always possible or practical to configure. E.g. ixgbe cannot load an XDP program on a machine with more than 96 CPUs, due to limited hardware TX queues. E.g. virtio_net is hard to configure as it requires manually increasing the queues. E.g. tun driver chooses to use a per XDP frame producer lock modulo smp_processor_id over avail queues. I'm considered adding 'flags' to ndo_xdp_xmit, but it's not part of this patchset. This will be a followup patchset, once we know if this will be needed (e.g. for non-map xdp_redirect flush-flag, and if AF_XDP chooses to use ndo_xdp_xmit for TX). --- V5: Fixed up issues spotted by Daniel and John V4: Splitout the patches from 4 to 8 patches. I cannot split the driver changes from the NDO change, but I've tried to isolated the NDO change together with the driver change as much as possible. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
Update xdp_monitor to use the recently added err code introduced in tracepoint xdp:xdp_devmap_xmit, to show if the drop count is caused by some driver general delivery problem. Other kind of drops will likely just be more normal TX space issues. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
Extending tracepoint xdp:xdp_devmap_xmit in devmap with an err code allow people to easier identify the reason behind the ndo_xdp_xmit call to a given driver is failing. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
This patch change the API for ndo_xdp_xmit to support bulking xdp_frames. When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also the net_device->ndo_xdp_xmit() call. Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed performance improved: for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps With frames avail as a bulk inside the driver ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX. Testing without CONFIG_RETPOLINE show the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per frame producer locking, but instead amortize the locking cost over the bulk. V2: Fix compile errors reported by kbuild test robot <lkp@intel.com> V4: Isolated ndo, driver changes and callers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
When sending an xdp_frame through xdp_do_redirect call, then error cases can happen where the xdp_frame needs to be dropped, and returning an -errno code isn't sufficient/possible any-longer (e.g. for cpumap case). This is already fully supported, by simply calling xdp_return_frame. This patch is an optimization, which provides xdp_return_frame_rx_napi, which is a faster variant for these error cases. It take advantage of the protection provided by XDP RX running under NAPI protection. This change is mostly relevant for drivers using the page_pool allocator as it can take advantage of this. (Tested with mlx5). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
The xdp_monitor sample/tool is updated to use the new tracepoint xdp:xdp_devmap_xmit the previous patch just introduced. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
Notice how this allow us get XDP statistic without affecting the XDP performance, as tracepoint is no-longer activated on a per packet basis. V5: Spotted by John Fastabend. Fix 'sent' also counted 'drops' in this patch, a later patch corrected this, but it was a mistake in this intermediate step. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
Like cpumap create queue for xdp frames that will be bulked. For now, this patch simply invoke ndo_xdp_xmit foreach frame. This happens, either when the map flush operation is envoked, or when the limit DEV_MAP_BULK_SIZE is reached. V5: Avoid memleak on error path in dev_map_update_elem() Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jesper Dangaard Brouer authored
Functionality is the same, but the ndo_xdp_xmit call is now simply invoked from inside the devmap.c code. V2: Fix compile issue reported by kbuild test robot <lkp@intel.com> V5: Cleanups requested by Daniel - Newlines before func definition - Use BUILD_BUG_ON checks - Remove unnecessary use return value store in dev_map_enqueue Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Yonghong Song says: ==================== Currently, suppose a userspace application has loaded a bpf program and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection tool, e.g., bpftool, wants to show which bpf program is attached to which tracepoint/kprobe/uprobe. Such attachment information will be really useful to understand the overall bpf deployment in the system. There is a name field (16 bytes) for each program, which could be used to encode the attachment point. There are some drawbacks for this approaches. First, bpftool user (e.g., an admin) may not really understand the association between the name and the attachment point. Second, if one program is attached to multiple places, encoding a proper name which can imply all these attachments becomes difficult. This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY. Given a pid and fd, this command will return bpf related information to user space. Right now it only supports tracepoint/kprobe/uprobe perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return . prog_id . tracepoint name, or . k[ret]probe funcname + offset or kernel addr, or . u[ret]probe filename + offset to the userspace. The user can use "bpftool prog" to find more information about bpf program itself with prog_id. Patch #1 adds function perf_get_event() in kernel/events/core.c. Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY. Patch #3 syncs tools bpf.h header and also add bpf_task_fd_query() in the libbpf library for samples/selftests/bpftool to use. Patch #4 adds ksym_get_addr() utility function. Patch #5 add a test in samples/bpf for querying k[ret]probes and u[ret]probes. Patch #6 add a test in tools/testing/selftests/bpf for querying raw_tracepoint and tracepoint. Patch #7 add a new subcommand "perf" to bpftool. Changelogs: v4 -> v5: . return strlen(buf) instead of strlen(buf) + 1 in the attr.buf_len. As long as user provides non-empty buffer, it will be filed with empty string, truncated string, or full string based on the buffer size and the length of to-be-copied string. v3 -> v4: . made attr buf_len input/output. The length of actual buffter is written to buf_len so user space knows what is actually needed. If user provides a buffer with length >= 1 but less than required, do partial copy and return -ENOSPC. . code simplification with put_user. . changed query result attach_info to fd_type. . add tests at selftests/bpf to test zero len, null buf and insufficient buf. v2 -> v3: . made perf_get_event() return perf_event pointer const. this was to ensure that event fields are not meddled. . detect whether newly BPF_TASK_FD_QUERY is supported or not in "bpftool perf" and warn users if it is not. v1 -> v2: . changed bpf subcommand name from BPF_PERF_EVENT_QUERY to BPF_TASK_FD_QUERY. . fixed various "bpftool perf" issues and added documentation and auto-completion. ==================== Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
The new command "bpftool perf [show | list]" will traverse all processes under /proc, and if any fd is associated with a perf event, it will print out related perf event information. Documentation is also added. Below is an example to show the results using bcc commands. Running the following 4 bcc commands: kprobe: trace.py '__x64_sys_nanosleep' kretprobe: trace.py 'r::__x64_sys_nanosleep' tracepoint: trace.py 't:syscalls:sys_enter_nanosleep' uprobe: trace.py 'p:/home/yhs/a.out:main' The bpftool command line and result: $ bpftool perf pid 21711 fd 5: prog_id 5 kprobe func __x64_sys_write offset 0 pid 21765 fd 5: prog_id 7 kretprobe func __x64_sys_nanosleep offset 0 pid 21767 fd 5: prog_id 8 tracepoint sys_enter_nanosleep pid 21800 fd 5: prog_id 9 uprobe filename /home/yhs/a.out offset 1159 $ bpftool -j perf [{"pid":21711,"fd":5,"prog_id":5,"fd_type":"kprobe","func":"__x64_sys_write","offset":0}, \ {"pid":21765,"fd":5,"prog_id":7,"fd_type":"kretprobe","func":"__x64_sys_nanosleep","offset":0}, \ {"pid":21767,"fd":5,"prog_id":8,"fd_type":"tracepoint","tracepoint":"sys_enter_nanosleep"}, \ {"pid":21800,"fd":5,"prog_id":9,"fd_type":"uprobe","filename":"/home/yhs/a.out","offset":1159}] $ bpftool prog 5: kprobe name probe___x64_sys tag e495a0c82f2c7a8d gpl loaded_at 2018-05-15T04:46:37-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 4 7: kprobe name probe___x64_sys tag f2fdee479a503abf gpl loaded_at 2018-05-15T04:48:32-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 7 8: tracepoint name tracepoint__sys tag 5390badef2395fcf gpl loaded_at 2018-05-15T04:48:48-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 8 9: kprobe name probe_main_1 tag 0a87bdc2e2953b6d gpl loaded_at 2018-05-15T04:49:52-0700 uid 0 xlated 200B not jited memlock 4096B map_ids 9 $ ps ax | grep "python ./trace.py" 21711 pts/0 T 0:03 python ./trace.py __x64_sys_write 21765 pts/0 S+ 0:00 python ./trace.py r::__x64_sys_nanosleep 21767 pts/2 S+ 0:00 python ./trace.py t:syscalls:sys_enter_nanosleep 21800 pts/3 S+ 0:00 python ./trace.py p:/home/yhs/a.out:main 22374 pts/1 S+ 0:00 grep --color=auto python ./trace.py Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
The new tests are added to query perf_event information for raw_tracepoint and tracepoint attachment. For tracepoint, both syscalls and non-syscalls tracepoints are queries as they are treated slightly differently inside the kernel. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
This is mostly to test kprobe/uprobe which needs kernel headers. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
Given a kernel function name, ksym_get_addr() will return the kernel address for this function, or 0 if it cannot find this function name in /proc/kallsyms. This function will be used later when a kernel address is used to initiate a kprobe perf event. Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
Sync kernel header bpf.h to tools/include/uapi/linux/bpf.h and implement bpf_task_fd_query() in libbpf. The test programs in samples/bpf and tools/testing/selftests/bpf, and later bpftool will use this libbpf function to query kernel. Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
Currently, suppose a userspace application has loaded a bpf program and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection tool, e.g., bpftool, wants to show which bpf program is attached to which tracepoint/kprobe/uprobe. Such attachment information will be really useful to understand the overall bpf deployment in the system. There is a name field (16 bytes) for each program, which could be used to encode the attachment point. There are some drawbacks for this approaches. First, bpftool user (e.g., an admin) may not really understand the association between the name and the attachment point. Second, if one program is attached to multiple places, encoding a proper name which can imply all these attachments becomes difficult. This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY. Given a pid and fd, if the <pid, fd> is associated with a tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return . prog_id . tracepoint name, or . k[ret]probe funcname + offset or kernel addr, or . u[ret]probe filename + offset to the userspace. The user can use "bpftool prog" to find more information about bpf program itself with prog_id. Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
A new extern function, perf_get_event(), is added to return a perf event given a struct file. This function will be used in later patches. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 24 May, 2018 5 commits
-
-
Heiner Kallweit authored
In struct phy_device we have a number of flags being defined as type bool. Similar to e.g. struct pci_dev we can save some space by using bit-fields. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.open-mesh.org/linux-mergeDavid S. Miller authored
Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Disable batman-adv debugfs by default, by Sven Eckelmann - Improve handling mesh nodes with multicast optimizations disabled, by Linus Luessing - Avoid bool in structs, by Sven Eckelmann - Allocate less memory when debugfs is disabled, by Sven Eckelmann - Fix batadv_interface_tx return data type, by Luc Van Oostenryck - improve link speed handling for virtual interfaces, by Marek Lindner - Enable BATMAN V algorithm by default, by Marek Lindner ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Passing O_CREAT (00000100) to open means we should also pass file mode as the third parameter. Creating /dev/console as a regular file may not be helpful anyway, so simply drop the flag when opening debug_fd. Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
BPFILTER could have been enabled without INET causing this build error: ERROR: "bpfilter_process_sockopt" [net/bpfilter/bpfilter.ko] undefined! Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module") Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Daniel Borkmann authored
Mathieu Xhonneux says: ==================== As of Linux 4.14, it is possible to define advanced local processing for IPv6 packets with a Segment Routing Header through the seg6local LWT infrastructure. This LWT implements the network programming principles defined in the IETF "SRv6 Network Programming" draft. The implemented operations are generic, and it would be very interesting to be able to implement user-specific seg6local actions, without having to modify the kernel directly. To do so, this patchset adds an End.BPF action to seg6local, powered by some specific Segment Routing-related helpers, which provide SR functionalities that can be applied on the packet. This BPF hook would then allow to implement specific actions at native kernel speed such as OAM features, advanced SR SDN policies, SRv6 actions like Segment Routing Header (SRH) encapsulation depending on the content of the packet, etc. This patchset is divided in 6 patches, whose main features are : - A new seg6local action End.BPF with the corresponding new BPF program type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be passed to the LWT seg6local through netlink, the same way as the LWT BPF hook operates. - 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/ shrink a SRH and apply on a packet some of the generic SRv6 actions. - 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through encapsulation (via IPv6 encapsulation or inlining if the packet contains already an IPv6 header). As this patchset adds a new LWT BPF hook, I took into account the result of the discussions when the LWT BPF infrastructure got merged. Hence, the seg6local BPF hook doesn't allow write access to skb->data directly, only the SRH can be modified through specific helpers, which ensures that the integrity of the packet is maintained. More details are available in the related patches messages. The performances of this BPF hook have been assessed with the BPF JIT enabled on an Intel Xeon X3440 processors with 4 cores and 8 threads clocked at 2.53 GHz. No throughput losses are noted with the seg6local BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes) drops the throughput to 410kpps, and inlining a SRH via bpf_lwt_seg6_action drops the throughput to 420kpps. All throughputs are stable. Changelog: v2: move the SRH integrity state from skb->cb to a per-cpu buffer v3: - document helpers in man-page style - fix kbuild bugs - un-break BPF LWT out hook - bpf_push_seg6_encap is now static - preempt_enable is now called when the packet is dropped in input_action_end_bpf v4: fix kbuild bugs when CONFIG_IPV6=m v5: fix kbuild sparse warnings when CONFIG_IPV6=m v6: fix skb pointers-related bugs in helpers v7: - fix memory leak in error path of End.BPF setup - add freeing of BPF data in seg6_local_destroy_state - new enums SEG6_LOCAL_BPF_* instead of re-using ones of lwt bpf for netlink nested bpf attributes - SEG6_LOCAL_BPF_PROG attr now contains prog->aux->id when dumping state ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-