- 19 Jul, 2023 39 commits
-
-
Daniel Borkmann authored
This adds a generic layer called bpf_mprog which can be reused by different attachment layers to enable multi-program attachment and dependency resolution. In-kernel users of the bpf_mprog don't need to care about the dependency resolution internals, they can just consume it with few API calls. The initial idea of having a generic API sparked out of discussion [0] from an earlier revision of this work where tc's priority was reused and exposed via BPF uapi as a way to coordinate dependencies among tc BPF programs, similar as-is for classic tc BPF. The feedback was that priority provides a bad user experience and is hard to use [1], e.g.: I cannot help but feel that priority logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". [...] Priority gets exposed everywhere in uapi all the way to bpftool when it's right there for users to understand. And that's the main problem with it. The user don't want to and don't need to be aware of it, but uapi forces them to pick the priority. [...] Your cover letter [0] example proves that in real life different service pick the same priority. They simply don't know any better. Priority is an unnecessary magic that apps _have_ to pick, so they just copy-paste and everyone ends up using the same. The course of the discussion showed more and more the need for a generic, reusable API where the "same look and feel" can be applied for various other program types beyond just tc BPF, for example XDP today does not have multi- program support in kernel, but also there was interest around this API for improving management of cgroup program types. Such common multi-program management concept is useful for BPF management daemons or user space BPF applications coordinating internally about their attachments. Both from Cilium and Meta side [2], we've collected the following requirements for a generic attach/detach/query API for multi-progs which has been implemented as part of this work: - Support prog-based attach/detach and link API - Dependency directives (can also be combined): - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none} - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user space application does not need CAP_SYS_ADMIN to retrieve foreign fds via bpf_*_get_fd_by_id() - BPF_F_LINK flag as {prog,link} toggle - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and BPF_F_AFTER will just append for attaching - Enforced only at attach time - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their own infra for replacing their internal prog - If no flags are set, then it's default append behavior for attaching - Internal revision counter and optionally being able to pass expected_revision - User space application can query current state with revision, and pass it along for attachment to assert current state before doing updates - Query also gets extension for link_ids array and link_attach_flags: - prog_ids are always filled with program IDs - link_ids are filled with link IDs when link was used, otherwise 0 - {prog,link}_attach_flags for holding {prog,link}-specific flags - Must be easy to integrate/reuse for in-kernel users The uapi-side changes needed for supporting bpf_mprog are rather minimal, consisting of the additions of the attachment flags, revision counter, and expanding existing union with relative_{fd,id} member. The bpf_mprog framework consists of an bpf_mprog_entry object which holds an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path structure) is part of bpf_mprog_bundle. Both have been separated, so that fast-path gets efficient packing of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen instead of linked list or other structures to remove unnecessary indirections for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry is populated and then just swapped which avoids additional allocations that could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to dynamic allocation if necessary at a point in future. Locking is deferred to the in-kernel user of bpf_mprog, for example, in case of tcx which uses this API in the next patch, it piggybacks on rtnl. An extensive test suite for checking all aspects of this API for prog-based attach/detach and link API comes as BPF selftests in this series. Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog management. [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdfSigned-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.netSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Maciej Fijalkowski says: ==================== xsk: multi-buffer support v6->v7: - rebase...[Alexei] v5->v6: - update bpf_xdp_query_opts__last_field in patch 10 [Alexei] v4->v5: - align options argument size to match options from xdp_desc [Benjamin] - cleanup skb from xdp_sock on socket termination [Toke] - introduce new netlink attribute for letting user space know about Tx frag limit; this substitutes xdp_features flag previously dedicated for setting ZC multi-buffer support [Toke, Jakub] - include i40e ZC multi-buffer support - enable TOO_MANY_FRAGS for ZC on xskxceiver; this is now possible due to netlink attribute mentioned two bullets above v3->v4: -rely on ynl for adding new xdp_features flag [Jakub] - move xskb_list to xsk_buff_pool v2->v3: - Fix issue with the next valid packet getting dropped after an invalid packet with MAX_SKB_FRAGS + 1 frags [Magnus] - query NETDEV_XDP_ACT_ZC_SG flag within xskxceiver and act on it - remove redundant include in xsk.c [kernel test robot] - s/NETDEV_XDP_ACT_NDO_ZC_SG/NETDEV_XDP_ACT_ZC_SG + kernel doc [Magnus, Simon] v1->v2: - fix spelling issues in commit messages [Simon] - remove XSK_DESC_MAX_FRAGS, use MAX_SKB_FRAGS instead [Stan, Alexei] - add documentation patch - fix build error from kernel test robot on patch 10 This series of patches add multi-buffer support for AF_XDP. XDP and various NIC drivers already have support for multi-buffer packets. With this patch set, programs using AF_XDP sockets can now also receive and transmit multi-buffer packets both in copy as well as zero-copy mode. ZC multi-buffer implementation is based on ice driver. Some definitions to put us all on the same page: * A packet consists of one or more frames * A descriptor in one of the AF_XDP rings always refers to a single frame. In the case the packet consists of a single frame, the descriptor refers to the whole packet. To represent a packet consisting of multiple frames, we introduce a new flag called XDP_PKT_CONTD in the options field of the Rx and Tx descriptors. If it is true (1) the packet continues with the next descriptor and if it is false (0) it means this is the last descriptor of the packet. Why the reverse logic of end-of-packet (eop) flag found in many NICs? Just to preserve compatibility with non-multi-buffer applications that have this bit set to false for all packets on Rx, and the apps set the options field to zero for Tx, as anything else will be treated as an invalid descriptor. These are the semantics for producing packets onto XSK Tx ring consisting of multiple frames: * When an invalid descriptor is found, all the other descriptors/frames of this packet are marked as invalid and not completed. The next descriptor is treated as the start of a new packet, even if this was not the intent (because we cannot guess the intent). As before, if your program is producing invalid descriptors you have a bug that must be fixed. * Zero length descriptors are treated as invalid descriptors. * For copy mode, the maximum supported number of frames in a packet is equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all descriptors accumulated so far are dropped and treated as invalid. To produce an application that will work on any system regardless of this config setting, limit the number of frags to 18, as the minimum value of the config is 17. * For zero-copy mode, the limit is up to what the NIC HW supports. User space can discover this via newly introduced NETDEV_A_DEV_XDP_ZC_MAX_SEGS netlink attribute. Here is an example Tx path pseudo-code (using libxdp interfaces for simplicity) ignoring that the umem is finite in size, and that we eventually will run out of packets to send. Also assumes pkts.addr points to a valid location in the umem. void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts, int batch_size) { u32 idx, i, pkt_nb = 0; xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx); for (i = 0; i < batch_size;) { u64 addr = pkts[pkt_nb].addr; u32 len = pkts[pkt_nb].size; do { struct xdp_desc *tx_desc; tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++); tx_desc->addr = addr; if (len > xsk_frame_size) { tx_desc->len = xsk_frame_size; tx_desc->options |= XDP_PKT_CONTD; } else { tx_desc->len = len; tx_desc->options = 0; pkt_nb++; } len -= tx_desc->len; addr += xsk_frame_size; if (i == batch_size) { /* Remember len, addr, pkt_nb for next * iteration. Skipped for simplicity. */ break; } } while (len); } xsk_ring_prod__submit(&xsk->tx, i); } On the Rx path in copy mode, the xsk core copies the XDP data into multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as detailed before. Zero-copy mode in order to avoid the copies has to maintain a chain of xdp_buff_xsk structs that represent whole packet. This is because what actually is redirected is the xdp_buff and we currently have no equivalent mechanism that is used for copy mode (embedded skb_shared_info in xdp_buff) to carry the frags. This means xdp_buff_xsk grows in size but these members are at the end and should not be touched when data path is not dealing with fragmented packets. This solution kept us within assumed performance impact, hence we decided to proceed with it. When the application gets a descriptor with the XDP_PKT_CONTD flag set to one, it means that the packet consists of multiple buffers and it continues with the next buffer in the following descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it means that this is the last buffer of the packet. AF_XDP guarantees that only a complete packet (all frames in the packet) is sent to the application. If application reads a batch of descriptors, using for example the libxdp interfaces, it is not guaranteed that the batch will end with a full packet. It might end in the middle of a packet and the rest of the buffers of that packet will arrive at the beginning of the next batch, since the libxdp interface does not read the whole ring (unless you have an enormous batch size or a very small ring size). Here is a simple Rx path pseudo-code example (using libxdp interfaces for simplicity). Error paths have been excluded for simplicity: void rx_packets(struct xsk_socket_info *xsk) { static bool new_packet = true; u32 idx_rx = 0, idx_fq = 0; static char *pkt; int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx); xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq); for (int i = 0; i < rcvd; i++) { struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++); char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr); bool eop = !(desc->options & XDP_PKT_CONTD); if (new_packet) pkt = frag; else add_frag_to_pkt(pkt, frag); if (eop) process_pkt(pkt); new_packet = eop; *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr; } xsk_ring_prod__submit(&xsk->umem->fq, rcvd); xsk_ring_cons__release(&xsk->rx, rcvd); } We had to introduce a new bind flag (XDP_USE_SG) on the AF_XDP level to enable multi-buffer support. The reason we need to differentiate between non multi-buffer and multi-buffer is the behaviour when the kernel gets a packet that is larger than the frame size. Without multi-buffer, this packet is dropped and marked in the stats. With multi-buffer on, we want to split it up into multiple frames instead. At the start, we thought that riding on the .frags section name of the XDP program was a good idea. You do not have to introduce yet another flag and all AF_XDP users must load an XDP program anyway to get any traffic up to the socket, so why not just say that the XDP program decides if the AF_XDP socket should get multi-buffer packets or not? The problem is that we can create an AF_XDP socket that is Tx only and that works without having to load an XDP program at all. Another problem is that the XDP program might change during the execution, so we would have to check this for every single packet. Here is the observed throughput when compared to a codebase without any multi-buffer changes and measured with xdpsock for 64B packets. Apparently ZC Tx takes a hit from explicit zero length descriptors validation. Overall, in terms of ZC performance, there is a room for improvement, but for now we think this work is in a good shape in terms of correctness and functionality. We were targetting for up to 5% overhead though. Note that ZC performance drops come from core + driver support being combined, whereas copy mode had already driver support in place. Mode rxdrop l2fwd txonly ice-zc -4% -7% -6% i40e-zc -7% -6% -7% drv -1.2% 0% +2% skb -0.6% -1% +2% Thank you, Tirthendu, Magnus and Maciej ==================== Link: https://lore.kernel.org/r/20230719132421.584801-1-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
Currently, when running ZC test suite, after finishing first run of test suite and then switching to busy-poll tests within xskxceiver, such errors are observed: libbpf: Kernel error message: ice: MTU is too large for linear frames and XDP prog does not support frags 1..26 libbpf: Kernel error message: Native and generic XDP can't be active at the same time Error attaching XDP program not ok 1 [xskxceiver.c:xsk_reattach_xdp:1568]: ERROR: 17/"File exists" this is because test suite ends with 9k MTU and native xdp program being loaded. Busy-poll tests start non-multi-buffer tests for generic mode. To fix this, let us introduce bash function that will reset NIC settings to default (e.g. 1500 MTU and no xdp progs loaded) so that test suite can continue without interrupts. It also means that after busy-poll tests NIC will have those default settings, whereas right now it is left with 9k MTU and xdp prog loaded in native mode. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-25-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Magnus Karlsson authored
Add a test that will exercise maximum number of supported fragments. This number depends on mode of the test - for SKB and DRV it will be 18 whereas for ZC this is defined by a value from NETDEV_A_DEV_XDP_ZC_MAX_SEGS netlink attribute. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # made use of new netlink attribute Link: https://lore.kernel.org/r/20230719132421.584801-24-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Magnus Karlsson authored
Enable the already existing metadata copy test to also run in multi-buffer mode with 9K packets. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-23-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Magnus Karlsson authored
Add a test that produces lots of nasty descriptors testing the corner cases of the descriptor validation. Some of these descriptors are valid and some are not as indicated by the valid flag. For a description of all the test combinations, please see the code. To stress the API, we need to be able to generate combinations of descriptors that make little sense. A new verbatim mode is introduced for the packet_stream to accomplish this. In this mode, all packets in the packet_stream are sent as is. We do not try to chop them up into frames that are of the right size that we know are going to work as we would normally do. The packets are just written into the Tx ring even if we know they make no sense. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # adjusted valid flags for frags Link: https://lore.kernel.org/r/20230719132421.584801-22-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Magnus Karlsson authored
Add a test for multi-buffer AF_XDP when using unaligned mode. The test sends 4096 9K-buffers. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-21-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Magnus Karlsson authored
Add the first basic multi-buffer test that sends a stream of 9K packets and validates that they are received at the other end. In order to enable sending and receiving multi-buffer packets, code that sets the MTU is introduced as well as modifications to the XDP programs so that they signal that they are multi-buffer enabled. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-20-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Magnus Karlsson authored
Add the ability to send and receive packets that are larger than the size of a umem frame, using the AF_XDP /XDP multi-buffer support. There are three pieces of code that need to be changed to achieve this: the Rx path, the Tx path, and the validation logic. Both the Rx path and Tx could only deal with a single fragment per packet. The Tx path is extended with a new function called pkt_nb_frags() that can be used to retrieve the number of fragments a packet will consume. We then create these many fragments in a loop and fill the N-1 first ones to the max size limit to use the buffer space efficiently, and the Nth one with whatever data that is left. This goes on until we have filled in at the most BATCH_SIZE worth of descriptors and fragments. If we detect that the next packet would lead to BATCH_SIZE number of fragments sent being exceeded, we do not send this packet and finish the batch. This packet is instead sent in the next iteration of BATCH_SIZE fragments. For Rx, we loop over all fragments we receive as usual, but for every descriptor that we receive we call a new validation function called is_frag_valid() to validate the consistency of this fragment. The code then checks if the packet continues in the next frame. If so, it loops over the next packet and performs the same validation. once we have received the last fragment of the packet we also call the function is_pkt_valid() to validate the packet as a whole. If we get to the end of the batch and we are not at the end of the current packet, we back out the partial packet and end the loop. Once we get into the receive loop next time, we start over from the beginning of that packet. This so the code becomes simpler at the cost of some performance. The validation function is_frag_valid() checks that the sequence and packet numbers are correct at the start and end of each fragment. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-19-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Magnus Karlsson authored
Add AF_XDP multi-buffer support documentation including two pseudo-code samples. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-18-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
Set eop bit in TX desc command only for the last descriptor of the packet and do not set for all preceding descriptors. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-17-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
Most of this patch is about actually supporting XDP_TX action. Pure Tx ZC support is only about looking at XDP_PKT_CONTD presence at options field and based on that generating EOP bit on Tx HW descriptor. This is that simple due to the implementation on xsk_tx_peek_release_desc_batch() where we are making sure that last produced descriptor is an EOP one. Overwrite xdp_zc_max_segs with a value that defines max scatter-gatter count on Tx side that HW can handle. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-16-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
Modify xskq_cons_read_desc_batch() in a way that each processed descriptor will be checked if it is an EOP one or not and act accordingly to that. Change the behavior of mentioned function to break the processing when stumbling upon invalid descriptor instead of skipping it. Furthermore, let us give only full packets down to ZC driver. With these two assumptions ZC drivers will not have to take care of an intermediate state of incomplete frames, which will simplify its implementations a lot. Last but not least, stop processing when count of frags would exceed max supported segments on underlying device. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-15-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
This patch is inspired from the multi-buffer support in non-zc path for i40e as well as from the patch to support zc on ice. Each subsequent frag is added to skb_shared_info of the first frag for possible xdp_prog use as well to xsk buffer list for accessing the buffers in af_xdp. For XDP_PASS, new pages are allocated for frags and contents are copied from memory backed by xsk_buff_pool. Replace next_to_clean with next_to_process as done in non-zc path and advance it for every buffer and change the semantics of next_to_clean to point to the first buffer of a packet. Driver will use next_to_process in the same way next_to_clean was used previously. For the non multi-buffer case, next_to_process and next_to_clean will always be the same since each packet consists of a single buffer. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-14-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
This support is strongly inspired by work that introduced multi-buffer support to regular Rx data path in ice. There are some differences, though. When adding a frag, besides adding it to skb_shared_info, use also fresh xsk_buff_add_frag() helper. Reason for doing both things is that we can not rule out the fact that AF_XDP pipeline could use XDP program that needs to access frame fragments. Without them being in skb_shared_info it will not be possible. Another difference is that XDP_PASS has to allocate a new pages for each frags and copy contents from memory backed by xsk_buff_pool. chain_len that is used for programming HW Rx descriptors no longer has to be limited to 1 when xsk_pool is present - remove this restriction. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-13-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
Given that skb_shared_info relies on skb_frag_t, in order to support xskb chaining, introduce xdp_buff_xsk::xskb_list_node and xsk_buff_pool::xskb_list. This is needed so ZC drivers can add frags as xskb nodes which will make it possible to handle it both when producing AF_XDP Rx descriptors as well as freeing/recycling all the frags that a single frame carries. Speaking of latter, update xsk_buff_free() to take care of list nodes. For the former (adding as frags), introduce xsk_buff_add_frag() for ZC drivers usage that is going to be used to add a frag to xskb list from pool. xsk_buff_get_frag() will be utilized by XDP_TX and, on contrary, will return xdp_buff. One of the previous patches added a wrapper for ZC Rx so implement xskb list walk and production of Rx descriptors there. On bind() path, bail out if socket wants to use ZC multi-buffer but underlying netdev does not support it. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-12-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will carry maximum fragments that underlying ZC driver is able to handle on TX side. It is going to be included in netlink response only when driver supports ZC. Any value higher than 1 implies multi-buffer ZC support on underlying device. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
Descriptors with zero length are not supported by many NICs. To preserve uniform behavior discard any zero length desc as invvalid desc. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-10-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
For transmitting an AF_XDP packet, allocate skb while processing the first desc and copy data to it. The 'XDP_PKT_CONTD' flag in 'options' field of the desc indicates the EOP status of the packet. If the current desc is not EOP, store the skb, release the current desc and go on to read the next descs. Allocate a page for each subsequent desc, copy data to it and add it as a frag in the skb stored in xsk. On processing EOP, transmit the skb with frags. Addresses contained in descs have been already queued in consumer queue and skb destructor updated the completion count. On transmit failure cancel the releases, clear the descs from the completion queue and consume the skb for retrying packet transmission. For any invalid descriptor (invalid length/address/options) in the middle of a packet, all pending descriptors will be dropped by xsk core along with the invalid one and the next descriptor is treated as the start of a new packet. Maximum supported frames for a packet is MAX_SKB_FRAGS + 1. If it is exceeded, all descriptors accumulated so far are dropped. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-9-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
Drivers are used to check for EOP bit whereas AF_XDP operates on inverted logic - user space indicates that current frag is not the last one and packet continues. For AF_XDP core needs, add xp_mb_desc() that will simply test XDP_PKT_CONTD from xdp_desc::options, but in order to preserve drivers default behavior, introduce an interface for ZC drivers that will negate xp_mb_desc() result and therefore make it easier to test EOP bit from during production of HW Tx descriptors. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-8-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
In Tx path, xsk core reserves space for each desc to be transmitted in the completion queue and it's address contained in it is stored in the skb destructor arg. After successful transmission the skb destructor submits the addr marking completion. To handle multiple descriptors per packet, now along with reserving space for each descriptor, the corresponding address is also stored in completion queue. The number of pending descriptors are stored in skb destructor arg and is used by the skb destructor to update completions. Introduce 'skb' in xdp_sock to store a partially built packet when __xsk_generic_xmit() must return before it sees the EOP descriptor for the current packet so that packet building can resume in next call of __xsk_generic_xmit(). Helper functions are introduced to set and get the pending descriptors in the skb destructor arg. Also, wrappers are introduced for storing descriptor addresses, submitting and cancelling (for unsuccessful transmissions) the number of completions. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-7-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
Add multi-buffer support for AF_XDP by extending the XDP multi-buffer support to be reflected in user-space when a packet is redirected to an AF_XDP socket. In the XDP implementation, the NIC driver builds the xdp_buff from the first frag of the packet and adds any subsequent frags in the skb_shinfo area of the xdp_buff. In AF_XDP core, XDP buffers are allocated from xdp_sock's pool and data is copied from the driver's xdp_buff and frags. Once an allocated XDP buffer is full and there is still data to be copied, the 'XDP_PKT_CONTD' flag in'options' field of the corresponding xdp ring descriptor is set and passed to the application. When application sees the aforementioned flag set it knows there is pending data for this packet that will be carried in the following descriptors. If there is no more data to be copied, the flag in 'options' field is cleared for that descriptor signalling EOP to the application. If application reads a batch of descriptors using for example the libxdp interfaces, it is not guaranteed that the batch will end with a full packet. It might end in the middle of a packet and the rest of the frames of that packet will arrive at the beginning of the next batch. AF_XDP ensures that only a complete packet (along with all its frags) is sent to application. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-6-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
If the data in xdp_buff exceeds the xsk frame length, the packet needs to be dropped. This check is currently being done in __xsk_rcv(). Move the described logic to xsk_rcv_check() so that such a xdp_buff will only be dropped if the application does not support multi-buffer (absence of XDP_USE_SG bind flag). This is applicable for all cases: copy mode, zero copy mode as well as skb mode. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-5-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maciej Fijalkowski authored
Currently, __xsk_rcv_zc() is a function that is responsible for producing AF_XDP Rx descriptors. It is used by both copy and zero-copy mode. Both of these modes are going to differ when multi-buffer support is going to be added. ZC will work on a chain of xdp_buff_xsk structs whereas copy-mode is going to utilize skb_shared_info contents. This means that ZC-specific changes would affect the copy mode. Let's modify __xsk_rcv_zc() to work directly on xdp_buff_xsk so the callsites have to retrieve this from xdp_buff. Also, introduce xsk_rcv_zc() which will carry all the needed later changes for supporting multi-buffer on ZC side that do not apply to copy mode. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-4-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
As of now xsk core drops any xdp_buff with data size greater than the xsk frame_size as set by the af_xdp application. With multi-buffer support introduced in the next patch xsk core can now split those buffers into multiple descriptors provided the af_xdp application can handle them. Such capability of the application needs to be independent of the xdp_prog's frag support capability since there are cases where even a single xdp_buffer may need to be split into multiple descriptors owing to a smaller xsk frame size. For e.g., with NIC rx_buffer size set to 4kB, a 3kB packet will constitute of a single buffer and so will be sent as such to AF_XDP layer irrespective of 'xdp.frags' capability of the XDP program. Now if the xsk frame size is set to 2kB by the AF_XDP application, then the packet will need to be split into 2 descriptors if AF_XDP application can handle multi-buffer, else it needs to be dropped. Applications can now advertise their frag handling capability to xsk core so that xsk core can decide if it should drop or split xdp_buffs that exceed xsk frame size. This is done using a new 'XSK_USE_SG' bind flag for the xdp socket. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-3-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tirthendu Sarkar authored
Use the 'options' field in xdp_desc as a packet continuity marker. Since 'options' field was unused till now and was expected to be set to 0, the 'eop' descriptor will have it set to 0, while the non-eop descriptors will have to set it to 1. This ensures legacy applications continue to work without needing any change for single-buffer packets. Add helper functions and extend xskq_prod_reserve_desc() to use the 'options' field. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-2-maciej.fijalkowski@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Menglong Dong authored
As Dan Carpenter reported, the variable "first_off" which is passed to clean_stack_garbage() in save_args() can be uninitialized, which can cause runtime warnings with KMEMsan. Therefore, init it with 0. Fixes: 473e3150 ("bpf, x86: allow function arguments up to 12 for TRACING") Cc: Hao Peng <flyingpeng@tencent.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/bpf/09784025-a812-493f-9829-5e26c8691e07@moroto.mountain/Signed-off-by: Menglong Dong <imagedong@tencent.com> Link: https://lore.kernel.org/r/20230719110330.2007949-1-imagedong@tencent.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Anton Protopopov says: ==================== allow bpf_map_sum_elem_count for all program types This series is a follow up to the recent change [1] which added per-cpu insert/delete statistics for maps. The bpf_map_sum_elem_count kfunc presented in the original series was only available to tracing programs, so let's make it available to all. The first patch makes types listed in the reg2btf_ids[] array to be considered trusted by kfuncs. The second patch allows to treat CONST_PTR_TO_MAP as trusted pointers from kfunc's point of view by adding it to the reg2btf_ids[] array. The third patch adds missing const to the map argument of the bpf_map_sum_elem_count kfunc. The fourth patch registers the bpf_map_sum_elem_count for all programs, and patches selftests correspondingly. [1] https://lore.kernel.org/bpf/20230705160139.19967-1-aspsk@isovalent.com/ v1 -> v2: * treat the whole reg2btf_ids array as trusted (Alexei) ==================== Link: https://lore.kernel.org/r/20230719092952.41202-1-aspsk@isovalent.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Anton Protopopov authored
Register the bpf_map_sum_elem_count func for all programs, and update the map_ptr subtest of the test_progs test to test the new functionality. The usage is allowed as long as the pointer to the map is trusted (when using tracing programs) or is a const pointer to map, as in the following example: struct { __uint(type, BPF_MAP_TYPE_HASH); ... } hash SEC(".maps"); ... static inline int some_bpf_prog(void) { struct bpf_map *map = (struct bpf_map *)&hash; __s64 count; count = bpf_map_sum_elem_count(map); ... } Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-5-aspsk@isovalent.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Anton Protopopov authored
We use the map pointer only to read the counter values, no locking involved, so mark the argument as const. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-4-aspsk@isovalent.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Anton Protopopov authored
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs. This, in turn, allows users to execute trusted kfuncs which accept `struct bpf_map *` arguments from non-tracing programs. While exporting the btf_bpf_map_id variable, save some bytes by defining it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST (which is u32[64]). Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-3-aspsk@isovalent.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Anton Protopopov authored
The reg2btf_ids array contains a list of types for which we can (and need) to find a corresponding static BTF id. All the types in the list can be considered as trusted for purposes of kfuncs. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-2-aspsk@isovalent.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Geliang Tang authored
The code using btf_vmlinux in bpf_tcp_ca is removed by the commit 9f0265e9 ("bpf: Require only one of cong_avoid() and cong_control() from a TCP CC") so drop this useless btf_vmlinux declaration. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/4d38da4eadaba476bd92ffcd7a5a03a5e28745c0.1689582557.git.geliang.tang@suse.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Anh Tuan Phan authored
Update samples/bpf/README.rst to add pahole to the build dependencies list. Add the reference to "Documentation/process/changes.rst" for minimum version required so that the version required will not be outdated in the future. Signed-off-by: Anh Tuan Phan <tuananhlfc@gmail.com> Link: https://lore.kernel.org/r/aecaf7a2-9100-cd5b-5cf4-91e5dbb2c90d@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Dave Marchevsky says: ==================== BPF Refcount followups 2: owner field This series adds an 'owner' field to bpf_{list,rb}_node structs, to be used by the runtime to determine whether insertion or removal operations are valid in shared ownership scenarios. Both the races which the series fixes and the fix itself are inspired by Kumar's suggestions in [0]. Aside from insertion and removal having more reasons to fail, there are no user-facing changes as a result of this series. * Patch 1 reverts disabling of bpf_refcount_acquire so that the fixed logic can be exercised by CI. It should _not_ be applied. * Patch 2 adds internal definitions of bpf_{rb,list}_node so that their fields are easier to access. * Patch 3 is the meat of the series - it adds 'owner' field and enforcement of correct owner to insertion and removal helpers. * Patch 4 adds a test based on Kumar's examples. * Patch 5 disables the test until bpf_refcount_acquire is re-enabled. * Patch 6 reverts disabling of test added in this series logic can be exercised by CI. It should _not_ be applied. [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2/ Changelog: v1 -> v2: lore.kernel.org/bpf/20230711175945.3298231-1-davemarchevsky@fb.com/ Patch 2 ("Introduce internal definitions for UAPI-opaque bpf_{rb,list}_node") * Rename bpf_{rb,list}_node_internal -> bpf_{list,rb}_node_kern (Alexei) Patch 3 ("bpf: Add 'owner' field to bpf_{list,rb}_node") * WARN_ON_ONCE in __bpf_list_del when node has wrong owner. This shouldn't happen, but worth checking regardless (Alexei, offline convo) * Continue previous patch's renaming changes ==================== Link: https://lore.kernel.org/r/20230718083813.3416104-1-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
The test added in previous patch will fail with bpf_refcount_acquire disabled. Until all races are fixed and bpf_refcount_acquire is re-enabled on bpf-next, disable the test so CI doesn't complain. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230718083813.3416104-6-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
This patch adds a runnable version of one of the races described by Kumar in [0]. Specifically, this interleaving: (rbtree1 and list head protected by lock1, rbtree2 protected by lock2) Prog A Prog B ====================================== n = bpf_obj_new(...) m = bpf_refcount_acquire(n) kptr_xchg(map, m) m = kptr_xchg(map, NULL) lock(lock2) bpf_rbtree_add(rbtree2, m->r, less) unlock(lock2) lock(lock1) bpf_list_push_back(head, n->l) /* make n non-owning ref */ bpf_rbtree_remove(rbtree1, n->r) unlock(lock1) The above interleaving, the node's struct bpf_rb_node *r can be used to add it to either rbtree1 or rbtree2, which are protected by different locks. If the node has been added to rbtree2, we should not be allowed to remove it while holding rbtree1's lock. Before changes in the previous patch in this series, the rbtree_remove in the second part of Prog A would succeed as the verifier has no way of knowing which tree owns a particular node at verification time. The addition of 'owner' field results in bpf_rbtree_remove correctly failing. The test added in this patch splits "Prog A" above into two separate BPF programs - A1 and A2 - and uses a second mapval + kptr_xchg to pass n from A1 to A2 similarly to the pass from A1 to B. If the test is run without the fix applied, the remove will succeed. Kumar's example had the two programs running on separate CPUs. This patch doesn't do this as it's not necessary to exercise the broken behavior / validate fixed behavior. [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230718083813.3416104-5-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
As described by Kumar in [0], in shared ownership scenarios it is necessary to do runtime tracking of {rb,list} node ownership - and synchronize updates using this ownership information - in order to prevent races. This patch adds an 'owner' field to struct bpf_list_node and bpf_rb_node to implement such runtime tracking. The owner field is a void * that describes the ownership state of a node. It can have the following values: NULL - the node is not owned by any data structure BPF_PTR_POISON - the node is in the process of being added to a data structure ptr_to_root - the pointee is a data structure 'root' (bpf_rb_root / bpf_list_head) which owns this node The field is initially NULL (set by bpf_obj_init_field default behavior) and transitions states in the following sequence: Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root Removal: ptr_to_root -> NULL Before a node has been successfully inserted, it is not protected by any root's lock, and therefore two programs can attempt to add the same node to different roots simultaneously. For this reason the intermediate BPF_PTR_POISON state is necessary. For removal, the node is protected by some root's lock so this intermediate hop isn't necessary. Note that bpf_list_pop_{front,back} helpers don't need to check owner before removing as the node-to-be-removed is not passed in as input and is instead taken directly from the list. Do the check anyways and WARN_ON_ONCE in this unexpected scenario. Selftest changes in this patch are entirely mechanical: some BTF tests have hardcoded struct sizes for structs that contain bpf_{list,rb}_node fields, those were adjusted to account for the new sizes. Selftest additions to validate the owner field are added in a further patch in the series. [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
Structs bpf_rb_node and bpf_list_node are opaquely defined in uapi/linux/bpf.h, as BPF program writers are not expected to touch their fields - nor does the verifier allow them to do so. Currently these structs are simple wrappers around structs rb_node and list_head and linked_list / rbtree implementation just casts and passes to library functions for those data structures. Later patches in this series, though, will add an "owner" field to bpf_{rb,list}_node, such that they're not just wrapping an underlying node type. Moreover, the bpf linked_list and rbtree implementations will deal with these owner pointers directly in a few different places. To avoid having to do void *owner = (void*)bpf_list_node + sizeof(struct list_head) with opaque UAPI node types, add bpf_{list,rb}_node_kern struct definitions to internal headers and modify linked_list and rbtree to use the internal types where appropriate. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 17 Jul, 2023 1 commit
-
-
David S. Miller authored
Luo Jie says: ==================== net: phy: at803x: support qca8081 1G version chip This patch series add supporting qca8081 1G version chip, the 1G version chip can be identified by the register mmd7.0x901d bit0. In addition, qca8081 does not support 1000BaseX mode and the sgmii fifo reset is added on the link changed, which assert the fifo on the link down, deassert the fifo on the link up. Changes in v1: * switch to use genphy_c45_pma_read_abilities. * remove the patch [remove 1000BaseX mode of qca8081]. * move the sgmii fifo reset to link_change_notify. Changes in v2: * split the qca8081 1G chip support patch. * improve the slave seed config, disable it if master preferred. Changes in v3: * fix the comments. * add the help function qca808x_has_fast_retrain_or_slave_seed. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-