- 14 Jun, 2019 6 commits
-
-
Martin KaFai Lau authored
This patch adds a test for the new sockopt SO_REUSEPORT_DETACH_BPF. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Martin KaFai Lau authored
SO_DETACH_REUSEPORT_BPF is needed for the test in the next patch. It is defined in the socket.h. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Martin KaFai Lau authored
There is SO_ATTACH_REUSEPORT_[CE]BPF but there is no DETACH. This patch adds SO_DETACH_REUSEPORT_BPF sockopt. The same sockopt can be used to undo both SO_ATTACH_REUSEPORT_[CE]BPF. reseport_detach_prog() is added and it is mostly a mirror of the existing reuseport_attach_prog(). The differences are, it does not call reuseport_alloc() and returns -ENOENT when there is no old prog. Cc: Craig Gallek <kraig@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Andrii Nakryiko authored
Kernel internally checks that either key or value type ID is specified, before using btf_fd. Do the same in libbpf's map creation code for determining when to retry map creation w/o BTF. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: fba01a06 ("libbpf: use negative fd to specify missing BTF") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Dan Carpenter authored
The "len" variable needs to be signed for the error handling to work properly. Fixes: 596092ef ("selftests/bpf: enable all available cgroup v2 controllers") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Prashant Bhole authored
Recent commit included libbpf.h in selftests/bpf/bpf_util.h. Since some samples use bpf_util.h and samples/bpf/Makefile doesn't have libbpf.h path included, build was failing. Let's add the path in samples/bpf/Makefile. Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 12 Jun, 2019 1 commit
-
-
Valdis Klētnieks authored
Compiling kernel/bpf/core.c with W=1 causes a flood of warnings: kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init] 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true | ^~~~ kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL' 1087 | INSN_3(ALU, ADD, X), \ | ^~~~~~ kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP' 1202 | BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL), | ^~~~~~~~~~~~ kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]') 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true | ^~~~ kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL' 1087 | INSN_3(ALU, ADD, X), \ | ^~~~~~ kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP' 1202 | BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL), | ^~~~~~~~~~~~ 98 copies of the above. The attached patch silences the warnings, because we *know* we're overwriting the default initializer. That leaves bpf/core.c with only 6 other warnings, which become more visible in comparison. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 11 Jun, 2019 12 commits
-
-
Daniel Borkmann authored
Hechao Li says: ==================== Getting number of possible CPUs is commonly used for per-CPU BPF maps and perf_event_maps. Add a new API libbpf_num_possible_cpus() that helps user with per-CPU related operations and remove duplicate implementations in bpftool and selftests. v2: Save errno before calling pr_warning in case it is changed. v3: Make sure libbpf_num_possible_cpus never returns 0 so that user only has to check if ret value < 0. v4: Fix error code when reading 0 bytes from possible CPU file. v5: Fix selftests compliation issue. v6: Split commit to reuse libbpf_num_possible_cpus() into two commits: One commit to remove bpf_util.h from test BPF C programs. One commit to reuse libbpf_num_possible_cpus() in bpftools and bpf_util.h. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Hechao Li authored
Use the newly added bpf_num_possible_cpus() in bpftool and selftests and remove duplicate implementations. Signed-off-by: Hechao Li <hechaol@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Hechao Li authored
Though currently there is no problem including bpf_util.h in kernel space BPF C programs, in next patch in this stack, I will reuse libbpf_num_possible_cpus() in bpf_util.h thus include libbpf.h in it, which will cause BPF C programs compile error. Therefore I will first remove bpf_util.h from all test BPF programs. This can also make it clear that bpf_util.h is a user-space utility while bpf_helpers.h is a kernel space utility. Signed-off-by: Hechao Li <hechaol@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Hechao Li authored
Adding a new API libbpf_num_possible_cpus() that helps user with per-CPU map operations. Signed-off-by: Hechao Li <hechaol@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Hechao Li authored
An error "implicit declaration of function 'reallocarray'" can be thrown with the following steps: $ cd tools/testing/selftests/bpf $ make clean && make CC=<Path to GCC 4.8.5> $ make clean && make CC=<Path to GCC 7.x> The cause is that the feature folder generated by GCC 4.8.5 is not removed, leaving feature-reallocarray being 1, which causes reallocarray not defined when re-compliing with GCC 7.x. This diff adds feature folder to EXTRA_CLEAN to avoid this problem. v2: Rephrase the commit message. Signed-off-by: Hechao Li <hechaol@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Andrii Nakryiko authored
Fix signature of bpf_probe_read and bpf_probe_write_user to mark source pointer as const. This causes warnings during compilation for applications relying on those helpers. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Quentin reports that commit 07c3bbdb ("samples: bpf: print a warning about headers_install") is producing the false positive when make is invoked locally, from the samples/bpf/ directory. When make is run locally it hits the "all" target, which will recursively invoke make through the full build system. Speed up the "local" run which doesn't actually build anything, and avoid false positives by skipping all the probes if not in kbuild environment (cover both the new warning and the BTF probes). Reported-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Jonathan Lemon says: ==================== Currently, the AF_XDP code uses a separate map in order to determine if an xsk is bound to a queue. Have the xskmap lookup return a XDP_SOCK pointer on the kernel side, which the verifier uses to extract relevant values. Patches: 1 - adds XSK_SOCK type 2 - sync bpf.h with tools 3 - add tools selftest 4 - update lib/bpf, removing qidconf v4->v5: - xskmap lookup now returns XDP_SOCK type instead of pointer to element. - no changes lib/bpf/xsk.c v3->v4: - Clarify error handling path. v2->v3: - Use correct map type. ==================== Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jonathan Lemon authored
Use the recent change to XSKMAP bpf_map_lookup_elem() to test if there is a xsk present in the map instead of duplicating the work with qidconf. Fix things so callers using XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD bypass any internal bpf maps, so xsk_socket__{create|delete} works properly. Clean up error handling path. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Tested-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jonathan Lemon authored
Check that bpf_map_lookup_elem lookup and structure access operats correctly. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jonathan Lemon authored
Sync uapi/linux/bpf.h Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jonathan Lemon authored
Currently, the AF_XDP code uses a separate map in order to determine if an xsk is bound to a queue. Instead of doing this, have bpf_map_lookup_elem() return a xdp_sock. Rearrange some xdp_sock members to eliminate structure holes. Remove selftest - will be added back in later patch. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 06 Jun, 2019 2 commits
-
-
Roman Gushchin authored
Currently bpf_skb_cgroup_id() is not supported for CGROUP_SKB programs. An attempt to load such a program generates an error like this: libbpf: 0: (b7) r6 = 0 ... 9: (85) call bpf_skb_cgroup_id#79 unknown func bpf_skb_cgroup_id#79 There are no particular reasons for denying it, and we have some use cases where it might be useful. So let's add it to the list of allowed helpers. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jakub Kicinski authored
It seems like periodically someone posts patches to "fix" header includes. The issue is that samples expect the include path to have the uAPI headers (from usr/) first, and then tools/ headers, so that locally installed uAPI headers take precedence. This means that if users didn't run headers_install they will see all sort of strange compilation errors, e.g.: HOSTCC samples/bpf/test_lru_dist samples/bpf/test_lru_dist.c:39:8: error: redefinition of ‘struct list_head’ struct list_head { ^~~~~~~~~ In file included from samples/bpf/test_lru_dist.c:9:0: ../tools/include/linux/types.h:69:8: note: originally defined here struct list_head { ^~~~~~~~~ Try to detect this situation, and print a helpful warning. v2: just use HOSTCC (Jiong). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 04 Jun, 2019 2 commits
-
-
Colin Ian King authored
The variable err is assigned with the value -EINVAL that is never read and it is re-assigned a new value later on. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Colin Ian King authored
There is a spelling mistake in the help information, fix this. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 01 Jun, 2019 4 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller authored
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-05-31 The following pull-request contains BPF updates for your *net-next* tree. Lots of exciting new features in the first PR of this developement cycle! The main changes are: 1) misc verifier improvements, from Alexei. 2) bpftool can now convert btf to valid C, from Andrii. 3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs. This feature greatly improves BPF speed on 32-bit architectures. From Jiong. 4) cgroups will now auto-detach bpf programs. This fixes issue of thousands bpf programs got stuck in dying cgroups. From Roman. 5) new bpf_send_signal() helper, from Yonghong. 6) cgroup inet skb programs can signal CN to the stack, from Lawrence. 7) miscellaneous cleanups, from many developers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alan Maguire authored
xdping allows us to get latency estimates from XDP. Output looks like this: ./xdping -I eth4 192.168.55.8 Setting up XDP for eth4, please wait... XDP setup disrupts network connectivity, hit Ctrl+C to quit Normal ping RTT data [Ignore final RTT; it is distorted by XDP using the reply] PING 192.168.55.8 (192.168.55.8) from 192.168.55.7 eth4: 56(84) bytes of data. 64 bytes from 192.168.55.8: icmp_seq=1 ttl=64 time=0.302 ms 64 bytes from 192.168.55.8: icmp_seq=2 ttl=64 time=0.208 ms 64 bytes from 192.168.55.8: icmp_seq=3 ttl=64 time=0.163 ms 64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.275 ms 4 packets transmitted, 4 received, 0% packet loss, time 3079ms rtt min/avg/max/mdev = 0.163/0.237/0.302/0.054 ms XDP RTT data: 64 bytes from 192.168.55.8: icmp_seq=5 ttl=64 time=0.02808 ms 64 bytes from 192.168.55.8: icmp_seq=6 ttl=64 time=0.02804 ms 64 bytes from 192.168.55.8: icmp_seq=7 ttl=64 time=0.02815 ms 64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.02805 ms The xdping program loads the associated xdping_kern.o BPF program and attaches it to the specified interface. If run in client mode (the default), it will add a map entry keyed by the target IP address; this map will store RTT measurements, current sequence number etc. Finally in client mode the ping command is executed, and the xdping BPF program will use the last ICMP reply, reformulate it as an ICMP request with the next sequence number and XDP_TX it. After the reply to that request is received we can measure RTT and repeat until the desired number of measurements is made. This is why the sequence numbers in the normal ping are 1, 2, 3 and 8. We XDP_TX a modified version of ICMP reply 4 and keep doing this until we get the 4 replies we need; hence the networking stack only sees reply 8, where we have XDP_PASSed it upstream since we are done. In server mode (-s), xdping simply takes ICMP requests and replies to them in XDP rather than passing the request up to the networking stack. No map entry is required. xdping can be run in native XDP mode (the default, or specified via -N) or in skb mode (-S). A test program test_xdping.sh exercises some of these options. Note that native XDP does not seem to XDP_TX for veths, hence -N is not tested. Looking at the code, it looks like XDP_TX is supported so I'm not sure if that's expected. Running xdping in native mode for ixgbe as both client and server works fine. Changes since v4 - close fds on cleanup (Song Liu) Changes since v3 - fixed seq to be __be16 (Song Liu) - fixed fd checks in xdping.c (Song Liu) Changes since v2 - updated commit message to explain why seq number of last ICMP reply is 8 not 4 (Song Liu) - updated types of seq number, raddr and eliminated csum variable in xdpclient/xdpserver functions as it was not needed (Song Liu) - added XDPING_DEFAULT_COUNT definition and usage specification of default/max counts (Song Liu) Changes since v1 - moved from RFC to PATCH - removed unused variable in ipv4_csum() (Song Liu) - refactored ICMP checks into icmp_check() function called by client and server programs and reworked client and server programs due to lack of shared code (Song Liu) - added checks to ensure that SKB and native mode are not requested together (Song Liu) Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queueDavid S. Miller authored
Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2019-05-31 This series contains updates to the iavf driver. Nathan Chancellor converts the use of gnu_printf to printf. Aleksandr modifies the driver to limit the number of RSS queues to the number of online CPUs in order to avoid creating misconfigured RSS queues. Gustavo A. R. Silva converts a couple of instances where sizeof() can be replaced with struct_size(). Alice makes the remaining changes to the iavf driver to cleanup all the old "i40evf" references in the driver to iavf, including the file names that still contained the old driver reference. There was no functional changes made, just cosmetic to reduce any confusion going forward now that the iavf driver is the virtual function driver for both i40e and ice drivers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiong Wang authored
There has been quite a few progress around the two steps mentioned in the answer to the following question: Q: BPF 32-bit subregister requirements This patch updates the answer to reflect what has been done. v2: - Add missing full stop. (Song Liu) - Minor tweak on one sentence. (Song Liu) v1: - Integrated rephrase from Quentin and Jakub Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 31 May, 2019 13 commits
-
-
Alexei Starovoitov authored
Roman Gushchin says: ==================== During my work on memcg-based memory accounting for bpf maps I've done some cleanups and refactorings of the existing memlock rlimit-based code. It makes it more robust, unifies size to pages conversion, size checks and corresponding error codes. Also it adds coverage for cgroup local storage and socket local storage maps. It looks like some preliminary work on the mm side might be required to start working on the memcg-based accounting, so I'm sending these patches as a separate patchset. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Roman Gushchin authored
Most bpf map types doing similar checks and bytes to pages conversion during memory allocation and charging. Let's unify these checks by moving them into bpf_map_charge_init(). Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Roman Gushchin authored
In order to unify the existing memlock charging code with the memcg-based memory accounting, which will be added later, let's rework the current scheme. Currently the following design is used: 1) .alloc() callback optionally checks if the allocation will likely succeed using bpf_map_precharge_memlock() 2) .alloc() performs actual allocations 3) .alloc() callback calculates map cost and sets map.memory.pages 4) map_create() calls bpf_map_init_memlock() which sets map.memory.user and performs actual charging; in case of failure the map is destroyed <map is in use> 1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which performs uncharge and releases the user 2) .map_free() callback releases the memory The scheme can be simplified and made more robust: 1) .alloc() calculates map cost and calls bpf_map_charge_init() 2) bpf_map_charge_init() sets map.memory.user and performs actual charge 3) .alloc() performs actual allocations <map is in use> 1) .map_free() callback releases the memory 2) bpf_map_charge_finish() performs uncharge and releases the user The new scheme also allows to reuse bpf_map_charge_init()/finish() functions for memcg-based accounting. Because charges are performed before actual allocations and uncharges after freeing the memory, no bogus memory pressure can be created. In cases when the map structure is not available (e.g. it's not created yet, or is already destroyed), on-stack bpf_map_memory structure is used. The charge can be transferred with the bpf_map_charge_move() function. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Roman Gushchin authored
Group "user" and "pages" fields of bpf_map into the bpf_map_memory structure. Later it can be extended with "memcg" and other related information. The main reason for a such change (beside cosmetics) is to pass bpf_map_memory structure to charging functions before the actual allocation of bpf_map. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Roman Gushchin authored
Socket local storage maps lack the memlock precharge check, which is performed before the memory allocation for most other bpf map types. Let's add it in order to unify all map types. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Roman Gushchin authored
Cgroup local storage maps lack the memlock precharge check, which is performed before the memory allocation for most other bpf map types. Let's add it in order to unify all map types. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Lawrence Brakmo says: ==================== This patchset adds support for propagating congestion notifications (cn) to TCP from cgroup inet skb egress BPF programs. Current cgroup skb BPF programs cannot trigger TCP congestion window reductions, even when they drop a packet. This patch-set adds support for cgroup skb BPF programs to send congestion notifications in the return value when the packets are TCP packets. Rather than the current 1 for keeping the packet and 0 for dropping it, they can now return: NET_XMIT_SUCCESS (0) - continue with packet output NET_XMIT_DROP (1) - drop packet and do cn NET_XMIT_CN (2) - continue with packet output and do cn -EPERM - drop packet Finally, HBM programs are modified to collect and return more statistics. There has been some discussion regarding the best place to manage bandwidths. Some believe this should be done in the qdisc where it can also be managed with a BPF program. We believe there are advantages for doing it with a BPF program in the cgroup/skb callback. For example, it reduces overheads in the cases where there is on primary workload and one or more secondary workloads, where each workload is running on its own cgroupv2. In this scenario, we only need to throttle the secondary workloads and there is no overhead for the primary workload since there will be no BPF program attached to its cgroup. Regardless, we agree that this mechanism should not penalize those that are not using it. We tested this by doing 1 byte req/reply RPCs over loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs. Each test was repeated 50 times with a 1 minute delay between each set of 10. We then calculated the average RPCs/sec over the 50 tests. We compare upstream with upstream + patchset and no BPF program as well as upstream + patchset and a BPF program that just returns ALLOW_PKT. Here are the results: upstream 80937 RPCs/sec upstream + patches, no BPF program 80894 RPCs/sec upstream + patches, BPF program 80634 RPCs/sec These numbers indicate that there is no penalty for these patches The use of congestion notifications improves the performance of HBM when using Cubic. Without congestion notifications, Cubic will not decrease its cwnd and HBM will need to drop a large percentage of the packets. The following results are obtained for rate limits of 1Gbps, between two servers using netperf, and only one flow. We also show how reducing the max delayed ACK timer can improve the performance when using Cubic. Command used was: ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \ -s=<server running netserver> where: <rate> is 1000 --no_cn specifies no cwr notifications dctcp uses dctcp Cubic DCTCP Lim, DA Mbps cwnd cred drops Mbps cwnd cred drops -------- ---- ---- ---- ----- ---- ---- ---- ----- 1G, 40 35 462 -320 67% 995 1 -212 0.05% 1G, 40,cn 736 9 -78 0.07 995 1 -212 0.05 1G, 5,cn 941 2 -189 0.13 995 1 -212 0.05 Notes: --no_cn has no effect with DCTCP Lim = rate limit DA = maximum delay ack timer cred = credit in packets drops = % packets dropped v1->v2: Insures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3 New egress values apply to all protocols, not just TCP Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers Removed changes to __tcp_transmit_skb (patch 5), no longer needed Removed sample use of EDT v2->v3: Removed the probe timer related changes v3->v4: Replaced preempt_enable_no_resched() by preempt_enable() in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro ==================== Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
brakmo authored
Adds more stats to HBM, including average cwnd and rtt of all TCP flows, percents of packets that are ecn ce marked and distribution of return values. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
brakmo authored
Update hbm_out_kern.c to support returning cn notifications. Also updates relevant files to allow disabling cn notifications. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
brakmo authored
Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning congestion notifications from the BPF programs. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
brakmo authored
For egress packets, __cgroup_bpf_fun_filter_skb() will now call BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY() in order to propagate congestion notifications (cn) requests to TCP callers. For egress packets, this function can return: NET_XMIT_SUCCESS (0) - continue with packet output NET_XMIT_DROP (1) - drop packet and notify TCP to call cwr NET_XMIT_CN (2) - continue with packet output and notify TCP to call cwr -EPERM - drop packet For ingress packets, this function will return -EPERM if any attached program was found and if it returned != 1 during execution. Otherwise 0 is returned. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
brakmo authored
Allows cgroup inet skb programs to return values in the range [0, 3]. The second bit is used to deterine if congestion occurred and higher level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr() The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS at load time if it uses the new return values (i.e. 2 or 3). The expected_attach_type is currently not enforced for BPF_PROG_TYPE_CGROUP_SKB. e.g Meaning the current bpf_prog with expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to BPF_CGROUP_INET_INGRESS. Blindly enforcing expected_attach_type will break backward compatibility. This patch adds a enforce_expected_attach_type bit to only enforce the expected_attach_type when it uses the new return value. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
brakmo authored
Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by __cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can request cwr for TCP packets. Current cgroup skb programs can only return 0 or 1 (0 to drop the packet. This macro changes the behavior so the low order bit indicates whether the packet should be dropped (0) or not (1) and the next bit is used for congestion notification (cn). Hence, new allowed return values of CGROUP EGRESS BPF programs are: 0: drop packet 1: keep packet 2: drop packet and call cwr 3: keep packet and call cwr This macro then converts it to one of NET_XMIT values or -EPERM that has the effect of dropping the packet with no cn. 0: NET_XMIT_SUCCESS skb should be transmitted (no cn) 1: NET_XMIT_DROP skb should be dropped and cwr called 2: NET_XMIT_CN skb should be transmitted and cwr called 3: -EPERM skb should be dropped (no cn) Note that when more than one BPF program is called, the packet is dropped if at least one of programs requests it be dropped, and there is cn if at least one program returns cn. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-