Commit 6e98b09d authored by Linus Torvalds

Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
     default value allows for better BIG TCP performance

   - Reduce compound page head access for zero-copy data transfers

   - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
     possible

   - Threaded NAPI improvements, adding defer skb free support and
     unneeded softirq avoidance

   - Address dst_entry reference count scalability issues, via false
     sharing avoidance and optimized refcount tracking

   - Add lockless accesses annotation to sk_err[_soft]

   - Optimize again the skb struct layout

   - Extend the skb drop reasons to make them usable by multiple
     subsystems

   - Better const qualifier awareness for socket casts

  BPF:

   - Add skb and XDP typed dynptrs which allow BPF programs for more
     ergonomic and less brittle iteration through data and
     variable-sized accesses

   - Add a new BPF netfilter program type and minimal support to hook
     BPF programs to netfilter hooks such as prerouting or forward

   - Add more precise memory usage reporting for all BPF map types

   - Add support for using {FOU,GUE} encap with an ipip device
     operating in collect_md mode and add a set of BPF kfuncs for
     controlling encap params

   - Allow BPF programs to detect at load time whether a particular
     kfunc exists or not, and also add support for this in light
     skeleton

   - Bigger batch of BPF verifier improvements to prepare for upcoming
     BPF open-coded iterators allowing for less restrictive looping
     capabilities

   - Rework RCU enforcement in the verifier, add kptr_rcu and enforce
     BPF programs to NULL-check before passing such pointers into kfunc

   - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
     in local storage maps

   - Enable RCU semantics for task BPF kptrs and allow referenced kptr
     tasks to be stored in BPF maps

   - Add support for refcounted local kptrs to the verifier for allowing
     shared ownership, useful for adding a node to both the BPF list and
     rbtree

   - Add BPF verifier support for ST instructions in
     convert_ctx_access() which will help new -mcpu=v4 clang flag to
     start emitting them

   - Add ARM32 USDT support to libbpf

   - Improve bpftool's visual program dump which produces the control
     flow graph in a DOT format by adding C source inline annotations

  Protocols:

   - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
     indicates the provenance of the IP address

   - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition

   - Add the handshake upcall mechanism, allowing the user-space to
     implement generic TLS handshake on kernel's behalf

   - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
     resilience to node failures

   - SCTP: add support for Fair Capacity and Weighted Fair Queueing
     schedulers

   - MPTCP: delay first subflow allocation up to its first usage. This
     will allow for later better LSM interaction

   - xfrm: Remove inner/outer modes from input/output path. These are
     not needed anymore

   - WiFi:
      - reduced neighbor report (RNR) handling for AP mode
      - HW timestamping support
      - support for randomized auth/deauth TA for PASN privacy
      - per-link debugfs for multi-link
      - TC offload support for mac80211 drivers
      - mac80211 mesh fast-xmit and fast-rx support
      - enable Wi-Fi 7 (EHT) mesh support

  Netfilter:

   - Add nf_tables 'brouting' support, to force a packet to be routed
     instead of being bridged

   - Update bridge netfilter and ovs conntrack helpers to handle IPv6
     Jumbo packets properly, i.e. fetch the packet length from
     hop-by-hop extension header. This is needed for BIG TCP support

   - The iptables 32bit compat interface isn't compiled in by default
     anymore

   - Move ip(6)tables builtin icmp matches to the udptcp one. This has
     the advantage that icmp/icmpv6 match doesn't load the
     iptables/ip6tables modules anymore when iptables-nft is used

   - Extended netlink error report for netdevice in flowtables and
     netdev/chains. Allow incrementally adding/deleting devices to/from a
     netdev basechain. Allow creating a netdev chain without a device

  Driver API:

   - Remove redundant Device Control Error Reporting Enable, as PCI core
     has already error reporting enabled at enumeration time

   - Move Multicast DB netlink handlers to core, allowing devices other
     than bridge to use them

   - Allow the page_pool to directly recycle the pages from safely
     localized NAPI

   - Implement lockless TX queue stop/wake combo macros, allowing for
     further code de-duplication and sanitization

   - Add YNL support for user headers and struct attrs

   - Add partial YNL specification for devlink

   - Add partial YNL specification for ethtool

   - Add tc-mqprio and tc-taprio support for preemptible traffic classes

   - Add tx push buf len param to ethtool, specifies the maximum number
     of bytes of a transmitted packet a driver can push directly to the
     underlying device

   - Add basic LED support for switch/phy

   - Add NAPI documentation, stop relying on external links

   - Convert dsa_master_ioctl() to netdev notifier. This is
     preparatory work to make the hardware timestamping layer selectable
     by user space

   - Add transceiver support and improve the error messages for CAN-FD
     controllers

  New hardware / drivers:

   - Ethernet:
      - AMD/Pensando core device support
      - MediaTek MT7981 SoC
      - MediaTek MT7988 SoC
      - Broadcom BCM53134 embedded switch
      - Texas Instruments CPSW9G ethernet switch
      - Qualcomm EMAC3 DWMAC ethernet
      - StarFive JH7110 SoC
      - NXP CBTX ethernet PHY

   - WiFi:
      - Apple M1 Pro/Max devices
      - RealTek rtl8710bu/rtl8188gu
      - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset

   - Bluetooth:
      - Realtek RTL8821CS, RTL8851B, RTL8852BS
      - Mediatek MT7663, MT7922
      - NXP w8997
      - Actions Semi ATS2851
      - QTI WCN6855
      - Marvell 88W8997

   - Can:
      - STMicroelectronics bxcan stm32f429

  Drivers:

   - Ethernet NICs:
      - Intel (1G, igc):
         - add tracking and reporting of QBV config errors
         - add support for configuring max SDU for each Tx queue
      - Intel (100G, ice):
         - refactor mailbox overflow detection to support Scalable IOV
         - GNSS interface optimization
      - Intel (i40e):
         - support XDP multi-buffer
      - nVidia/Mellanox:
         - add the support for linux bridge multicast offload
         - enable TC offload for egress and ingress MACVLAN over bond
         - add support for VxLAN GBP encap/decap flows offload
         - extend packet offload to fully support libreswan
         - support tunnel mode in mlx5 IPsec packet offload
         - extend XDP multi-buffer support
         - support MACsec VLAN offload
         - add support for dynamic msix vectors allocation
         - drop RX page_cache and fully use page_pool
         - implement thermal zone to report NIC temperature
      - Netronome/Corigine:
         - add support for multi-zone conntrack offload
      - Solarflare/Xilinx:
         - support offloading TC VLAN push/pop actions to the MAE
         - support TC decap rules
         - support unicast PTP

   - Other NICs:
      - Broadcom (bnxt): enforce software based freq adjustments only on
        shared PHC NIC
      - RealTek (r8169): refactor to address ASPM issues during NAPI poll
      - Micrel (lan8841): add support for PTP_PF_PEROUT
      - Cadence (macb): enable PTP unicast
      - Engleder (tsnep): add XDP socket zero-copy support
      - virtio-net: implement exact header length guest feature
      - veth: add page_pool support for page recycling
      - vxlan: add MDB data path support
      - gve: add XDP support for GQI-QPL format
      - geneve: accept every ethertype
      - macvlan: allow some packets to bypass broadcast queue
      - mana: add support for jumbo frame

   - Ethernet high-speed switches:
      - Microchip (sparx5): Add support for TC flower templates

   - Ethernet embedded switches:
      - Broadcom (b53):
         - configure 6318 and 63268 RGMII ports
      - Marvell (mv88e6xxx):
         - faster C45 bus scan
      - Microchip:
         - lan966x:
            - add support for IS1 VCAP
            - better TX/RX from/to CPU performance
         - ksz9477: add ETS Qdisc support
         - ksz8: enhance static MAC table operations and error handling
         - sama7g5: add PTP capability
      - NXP (ocelot):
         - add support for external ports
         - add support for preemptible traffic classes
      - Texas Instruments:
         - add CPSWxG SGMII support for J7200 and J721E

   - Intel WiFi (iwlwifi):
      - preparation for Wi-Fi 7 EHT and multi-link support
      - EHT (Wi-Fi 7) sniffer support
      - hardware timestamping support for some devices/firmwares
      - TX beacon protection on newer hardware

   - Qualcomm 802.11ax WiFi (ath11k):
      - MU-MIMO parameters support
      - ack signal support for management packets

   - RealTek WiFi (rtw88):
      - SDIO bus support
      - better support for some SDIO devices (e.g. MAC address from
        efuse)

   - RealTek WiFi (rtw89):
      - HW scan support for 8852b
      - better support for 6 GHz scanning
      - support for various newer firmware APIs
      - framework firmware backwards compatibility

   - MediaTek WiFi (mt76):
      - P2P support
      - mesh A-MSDU support
      - EHT (Wi-Fi 7) support
      - coredump support"

* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
  net: phy: hide the PHYLIB_LEDS knob
  net: phy: marvell-88x2222: remove unnecessary (void*) conversions
  tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
  net: amd: Fix link leak when verifying config failed
  net: phy: marvell: Fix inconsistent indenting in led_blink_set
  lan966x: Don't use xdp_frame when action is XDP_TX
  tsnep: Add XDP socket zero-copy TX support
  tsnep: Add XDP socket zero-copy RX support
  tsnep: Move skb receive action to separate function
  tsnep: Add functions for queue enable/disable
  tsnep: Rework TX/RX queue initialization
  tsnep: Replace modulo operation with mask
  net: phy: dp83867: Add led_brightness_set support
  net: phy: Fix reading LED reg property
  drivers: nfc: nfcsim: remove return value check of `dev_dir`
  net: phy: dp83867: Remove unnecessary (void*) conversions
  net: ethtool: coalesce: try to make user settings stick twice
  net: mana: Check if netdev/napi_alloc_frag returns single page
  net: mana: Rename mana_refill_rxoob and remove some empty lines
  net: veth: add page_pool stats
  ...
parents b68ee1c6 9b78d919

...@@ -418,7 +418,6 @@ That is, the recovery API only requires that: ...@@ -418,7 +418,6 @@ That is, the recovery API only requires that:
- drivers/next/e100.c - drivers/next/e100.c
- drivers/net/e1000 - drivers/net/e1000
- drivers/net/e1000e - drivers/net/e1000e
- drivers/net/ixgb
- drivers/net/ixgbe - drivers/net/ixgbe
- drivers/net/cxgb3 - drivers/net/cxgb3
- drivers/net/s2io.c - drivers/net/s2io.c
......
...@@ -314,7 +314,7 @@ Q: What is the compatibility story for special BPF types in map values? ...@@ -314,7 +314,7 @@ Q: What is the compatibility story for special BPF types in map values?
Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
values (when using BTF support for BPF maps). This allows to use helpers for values (when using BTF support for BPF maps). This allows to use helpers for
such objects on these fields inside map values. Users are also allowed to embed such objects on these fields inside map values. Users are also allowed to embed
pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will the pointers to some kernel types (with __kptr_untrusted and __kptr BTF tags). Will the
kernel preserve backwards compatibility for these features? kernel preserve backwards compatibility for these features?
A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else: A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
...@@ -324,7 +324,7 @@ For struct types that have been added already, like bpf_spin_lock and bpf_timer, ...@@ -324,7 +324,7 @@ For struct types that have been added already, like bpf_spin_lock and bpf_timer,
the kernel will preserve backwards compatibility, as they are part of UAPI. the kernel will preserve backwards compatibility, as they are part of UAPI.
For kptrs, they are also part of UAPI, but only with respect to the kptr For kptrs, they are also part of UAPI, but only with respect to the kptr
mechanism. The types that you can use with a __kptr and __kptr_ref tagged mechanism. The types that you can use with a __kptr_untrusted and __kptr tagged
pointer in your struct are NOT part of the UAPI contract. The supported types can pointer in your struct are NOT part of the UAPI contract. The supported types can
and will change across kernel releases. However, operations like accessing kptr and will change across kernel releases. However, operations like accessing kptr
fields and bpf_kptr_xchg() helper will continue to be supported across kernel fields and bpf_kptr_xchg() helper will continue to be supported across kernel
......
...@@ -128,7 +128,8 @@ into the bpf-next tree will make their way into net-next tree. net and ...@@ -128,7 +128,8 @@ into the bpf-next tree will make their way into net-next tree. net and
net-next are both run by David S. Miller. From there, they will go net-next are both run by David S. Miller. From there, they will go
into the kernel mainline tree run by Linus Torvalds. To read up on the into the kernel mainline tree run by Linus Torvalds. To read up on the
process of net and net-next being merged into the mainline tree, see process of net and net-next being merged into the mainline tree, see
the :ref:`netdev-FAQ` the documentation on netdev subsystem at
Documentation/process/maintainer-netdev.rst.
...@@ -147,7 +148,8 @@ request):: ...@@ -147,7 +148,8 @@ request)::
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to? Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to?
--------------------------------------------------------------------------------- ---------------------------------------------------------------------------------
A: The process is the very same as described in the :ref:`netdev-FAQ`, A: The process is the very same as described in the netdev subsystem
documentation at Documentation/process/maintainer-netdev.rst,
so please read up on it. The subject line must indicate whether the so please read up on it. The subject line must indicate whether the
patch is a fix or rather "next-like" content in order to let the patch is a fix or rather "next-like" content in order to let the
maintainers know whether it is targeted at bpf or bpf-next. maintainers know whether it is targeted at bpf or bpf-next.
...@@ -206,8 +208,9 @@ ii) run extensive BPF test suite and ...@@ -206,8 +208,9 @@ ii) run extensive BPF test suite and
Once the BPF pull request was accepted by David S. Miller, then Once the BPF pull request was accepted by David S. Miller, then
the patches end up in net or net-next tree, respectively, and the patches end up in net or net-next tree, respectively, and
make their way from there further into mainline. Again, see the make their way from there further into mainline. Again, see the
:ref:`netdev-FAQ` for additional information e.g. on how often they are documentation for netdev subsystem at
merged to mainline. Documentation/process/maintainer-netdev.rst for additional information
e.g. on how often they are merged to mainline.
Q: How long do I need to wait for feedback on my BPF patches? Q: How long do I need to wait for feedback on my BPF patches?
------------------------------------------------------------- -------------------------------------------------------------
...@@ -230,7 +233,8 @@ Q: Are patches applied to bpf-next when the merge window is open? ...@@ -230,7 +233,8 @@ Q: Are patches applied to bpf-next when the merge window is open?
----------------------------------------------------------------- -----------------------------------------------------------------
A: For the time when the merge window is open, bpf-next will not be A: For the time when the merge window is open, bpf-next will not be
processed. This is roughly analogous to net-next patch processing, processed. This is roughly analogous to net-next patch processing,
so feel free to read up on the :ref:`netdev-FAQ` about further details. so feel free to read up on the netdev docs at
Documentation/process/maintainer-netdev.rst about further details.
During those two weeks of merge window, we might ask you to resend During those two weeks of merge window, we might ask you to resend
your patch series once bpf-next is open again. Once Linus released your patch series once bpf-next is open again. Once Linus released
...@@ -394,7 +398,8 @@ netdev kernel mailing list in Cc and ask for the fix to be queued up: ...@@ -394,7 +398,8 @@ netdev kernel mailing list in Cc and ask for the fix to be queued up:
netdev@vger.kernel.org netdev@vger.kernel.org
The process in general is the same as on netdev itself, see also the The process in general is the same as on netdev itself, see also the
:ref:`netdev-FAQ`. the documentation on networking subsystem at
Documentation/process/maintainer-netdev.rst.
Q: Do you also backport to kernels not currently maintained as stable? Q: Do you also backport to kernels not currently maintained as stable?
---------------------------------------------------------------------- ----------------------------------------------------------------------
...@@ -410,7 +415,7 @@ Q: The BPF patch I am about to submit needs to go to stable as well ...@@ -410,7 +415,7 @@ Q: The BPF patch I am about to submit needs to go to stable as well
What should I do? What should I do?
A: The same rules apply as with netdev patch submissions in general, see A: The same rules apply as with netdev patch submissions in general, see
the :ref:`netdev-FAQ`. the netdev docs at Documentation/process/maintainer-netdev.rst.
Never add "``Cc: stable@vger.kernel.org``" to the patch description, but Never add "``Cc: stable@vger.kernel.org``" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done ask the BPF maintainers to queue the patches instead. This can be done
...@@ -684,7 +689,6 @@ when: ...@@ -684,7 +689,6 @@ when:
.. Links .. Links
.. _netdev-FAQ: Documentation/process/maintainer-netdev.rst
.. _selftests: .. _selftests:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/
......
...@@ -20,6 +20,12 @@ Arithmetic instructions ...@@ -20,6 +20,12 @@ Arithmetic instructions
For CPU versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with For CPU versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
``-Xclang -target-feature -Xclang +alu32``. In CPU version 3, support is automatically included. ``-Xclang -target-feature -Xclang +alu32``. In CPU version 3, support is automatically included.
Jump instructions
=================
If ``-O0`` is used, Clang will generate the ``BPF_CALL | BPF_X | BPF_JMP`` (0x8d)
instruction, which is not supported by the Linux kernel verifier.
Atomic operations Atomic operations
================= =================
......
...@@ -51,7 +51,7 @@ For example: ...@@ -51,7 +51,7 @@ For example:
.. code-block:: c .. code-block:: c
struct cpumask_map_value { struct cpumask_map_value {
struct bpf_cpumask __kptr_ref * cpumask; struct bpf_cpumask __kptr * cpumask;
}; };
struct array_map { struct array_map {
...@@ -117,18 +117,13 @@ For example: ...@@ -117,18 +117,13 @@ For example:
As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can
also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in
a map, the reference can be removed from the map with bpf_kptr_xchg(), or a map, the reference can be removed from the map with bpf_kptr_xchg(), or
opportunistically acquired with bpf_cpumask_kptr_get(): opportunistically acquired using RCU:
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_kptr_get
Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
.. code-block:: c .. code-block:: c
/* struct containing the struct bpf_cpumask kptr which is stored in the map. */ /* struct containing the struct bpf_cpumask kptr which is stored in the map. */
struct cpumasks_kfunc_map_value { struct cpumasks_kfunc_map_value {
struct bpf_cpumask __kptr_ref * bpf_cpumask; struct bpf_cpumask __kptr * bpf_cpumask;
}; };
/* The map containing struct cpumasks_kfunc_map_value entries. */ /* The map containing struct cpumasks_kfunc_map_value entries. */
...@@ -144,7 +139,7 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map: ...@@ -144,7 +139,7 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
/** /**
* A simple example tracepoint program showing how a * A simple example tracepoint program showing how a
* struct bpf_cpumask * kptr that is stored in a map can * struct bpf_cpumask * kptr that is stored in a map can
* be acquired using the bpf_cpumask_kptr_get() kfunc. * be passed to kfuncs using RCU protection.
*/ */
SEC("tp_btf/cgroup_mkdir") SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path) int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
...@@ -158,26 +153,21 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map: ...@@ -158,26 +153,21 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
if (!v) if (!v)
return -ENOENT; return -ENOENT;
bpf_rcu_read_lock();
/* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */ /* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */
kptr = bpf_cpumask_kptr_get(&v->cpumask); kptr = v->cpumask;
if (!kptr) if (!kptr) {
/* If no bpf_cpumask was present in the map, it's because /* If no bpf_cpumask was present in the map, it's because
* we're racing with another CPU that removed it with * we're racing with another CPU that removed it with
* bpf_kptr_xchg() between the bpf_map_lookup_elem() * bpf_kptr_xchg() between the bpf_map_lookup_elem()
* above, and our call to bpf_cpumask_kptr_get(). * above, and our load of the pointer from the map.
* bpf_cpumask_kptr_get() internally safely handles this
* race, and will return NULL if the cpumask is no longer
* present in the map by the time we invoke the kfunc.
*/ */
bpf_rcu_read_unlock();
return -EBUSY; return -EBUSY;
}
/* Free the reference we just took above. Note that the bpf_cpumask_setall(kptr);
* original struct bpf_cpumask * kptr is still in the map. It will bpf_rcu_read_unlock();
* be freed either at a later time if another context deletes
* it from the map, or automatically by the BPF subsystem if
* it's still present when the map is destroyed.
*/
bpf_cpumask_release(kptr);
return 0; return 0;
} }
......
...@@ -11,7 +11,8 @@ Documentation conventions ...@@ -11,7 +11,8 @@ Documentation conventions
========================= =========================
For brevity, this document uses the type notion "u64", "u32", etc. For brevity, this document uses the type notion "u64", "u32", etc.
to mean an unsigned integer whose width is the specified number of bits. to mean an unsigned integer whose width is the specified number of bits,
and "s32", etc. to mean a signed integer of the specified number of bits.
Registers and calling convention Registers and calling convention
================================ ================================
...@@ -38,14 +39,11 @@ eBPF has two instruction encodings: ...@@ -38,14 +39,11 @@ eBPF has two instruction encodings:
* the wide instruction encoding, which appends a second 64-bit immediate (i.e., * the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
constant) value after the basic instruction for a total of 128 bits. constant) value after the basic instruction for a total of 128 bits.
-The basic instruction encoding is as follows, where MSB and LSB mean the most significant
-bits and least significant bits, respectively:
-============= ======= ======= ======= ============
-32 bits (MSB) 16 bits 4 bits  4 bits  8 bits (LSB)
-============= ======= ======= ======= ============
-imm           offset  src_reg dst_reg opcode
-============= ======= ======= ======= ============
+The fields conforming an encoded basic instruction are stored in the
+following order::
+  opcode:8 src_reg:4 dst_reg:4 offset:16 imm:32 // In little-endian BPF.
+  opcode:8 dst_reg:4 src_reg:4 offset:16 imm:32 // In big-endian BPF.
**imm** **imm**
signed integer immediate value signed integer immediate value
...@@ -63,6 +61,18 @@ imm offset src_reg dst_reg opcode ...@@ -63,6 +61,18 @@ imm offset src_reg dst_reg opcode
**opcode** **opcode**
operation to perform operation to perform
Note that the contents of multi-byte fields ('imm' and 'offset') are
stored using big-endian byte ordering in big-endian BPF and
little-endian byte ordering in little-endian BPF.
For example::
  opcode                 offset imm          assembly
         src_reg dst_reg
  07     0       1       00 00  44 33 22 11  r1 += 0x11223344 // little
         dst_reg src_reg
  07     1       0       00 00  11 22 33 44  r1 += 0x11223344 // big
Note that most instructions do not use all of the fields. Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero. Unused fields shall be cleared to zero.
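The little-endian row of the example above can be checked with a minimal C sketch. It assumes an 8-byte buffer holding one basic instruction in little-endian BPF; the names `bpf_insn_fields` and `decode_le_insn` are hypothetical and do not appear in the kernel sources.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields; not a kernel type. */
struct bpf_insn_fields {
	uint8_t opcode;
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t offset;
	int32_t imm;
};

/* Decode one basic instruction stored in little-endian BPF byte order. */
static struct bpf_insn_fields decode_le_insn(const uint8_t b[8])
{
	struct bpf_insn_fields f;

	f.opcode = b[0];
	/* Little-endian BPF: src_reg is the high nibble, dst_reg the low nibble. */
	f.src_reg = b[1] >> 4;
	f.dst_reg = b[1] & 0x0f;
	/* Multi-byte fields are stored least significant byte first. */
	f.offset = (int16_t)(uint16_t)(b[2] | (b[3] << 8));
	f.imm = (int32_t)((uint32_t)b[4] | ((uint32_t)b[5] << 8) |
			  ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24));
	return f;
}
```

Feeding it the bytes `07 01 00 00 44 33 22 11` yields opcode 0x07, dst_reg 1, src_reg 0, offset 0 and imm 0x11223344, matching `r1 += 0x11223344`. A big-endian decoder would swap the two nibbles and reverse the byte order of `offset` and `imm`.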
...@@ -72,18 +82,23 @@ The 64 bits following the basic instruction contain a pseudo instruction ...@@ -72,18 +82,23 @@ The 64 bits following the basic instruction contain a pseudo instruction
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero, using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
and imm containing the high 32 bits of the immediate value. and imm containing the high 32 bits of the immediate value.
-================= ==================
-64 bits (MSB)     64 bits (LSB)
-================= ==================
-basic instruction pseudo instruction
-================= ==================
+This is depicted in the following figure::
+         basic_instruction
+  .-----------------------------.
+  |                             |
+  code:8 regs:8 offset:16 imm:32 unused:32 imm:32
+                                  |              |
+                                  '--------------'
+                                 pseudo instruction
Thus the 64-bit immediate value is constructed as follows: Thus the 64-bit immediate value is constructed as follows:
imm64 = (next_imm << 32) | imm imm64 = (next_imm << 32) | imm
where 'next_imm' refers to the imm value of the pseudo instruction where 'next_imm' refers to the imm value of the pseudo instruction
following the basic instruction. following the basic instruction. The unused bytes in the pseudo
instruction are reserved and shall be cleared to zero.
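The imm64 construction can be sketched in C as follows. This is an illustration only: the function name `bpf_imm64` is hypothetical, and zero-extending the low `imm` before the OR is an assumption about how the formula is meant to be read, made so that a negative low half does not overwrite the high half.

```c
#include <stdint.h>

/* Combine the two 32-bit imm fields of a wide instruction into imm64. */
static uint64_t bpf_imm64(int32_t imm, int32_t next_imm)
{
	/*
	 * Cast through uint32_t so that a negative imm is not
	 * sign-extended into the upper 32 bits.
	 */
	return ((uint64_t)(uint32_t)next_imm << 32) | (uint32_t)imm;
}
```

For instance, `imm = 0x11223344` with `next_imm = 0x55667788` gives `0x5566778811223344`.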
Instruction classes Instruction classes
------------------- -------------------
...@@ -228,28 +243,58 @@ Jump instructions ...@@ -228,28 +243,58 @@ Jump instructions
otherwise identical operations. otherwise identical operations.
The 'code' field encodes the operation as below: The 'code' field encodes the operation as below:
-======== ===== ========================= ============
-code     value description               notes
-======== ===== ========================= ============
-BPF_JA   0x00  PC += off                 BPF_JMP only
-BPF_JEQ  0x10  PC += off if dst == src
-BPF_JGT  0x20  PC += off if dst > src    unsigned
-BPF_JGE  0x30  PC += off if dst >= src   unsigned
-BPF_JSET 0x40  PC += off if dst & src
-BPF_JNE  0x50  PC += off if dst != src
-BPF_JSGT 0x60  PC += off if dst > src    signed
-BPF_JSGE 0x70  PC += off if dst >= src   signed
-BPF_CALL 0x80  function call
-BPF_EXIT 0x90  function / program return BPF_JMP only
-BPF_JLT  0xa0  PC += off if dst < src    unsigned
-BPF_JLE  0xb0  PC += off if dst <= src   unsigned
-BPF_JSLT 0xc0  PC += off if dst < src    signed
-BPF_JSLE 0xd0  PC += off if dst <= src   signed
-======== ===== ========================= ============
+======== ===== === =========================================== =========================================
+code     value src description                                 notes
+======== ===== === =========================================== =========================================
+BPF_JA   0x0   0x0 PC += offset                                BPF_JMP only
+BPF_JEQ  0x1   any PC += offset if dst == src
+BPF_JGT  0x2   any PC += offset if dst > src                   unsigned
+BPF_JGE  0x3   any PC += offset if dst >= src                  unsigned
+BPF_JSET 0x4   any PC += offset if dst & src
+BPF_JNE  0x5   any PC += offset if dst != src
+BPF_JSGT 0x6   any PC += offset if dst > src                   signed
+BPF_JSGE 0x7   any PC += offset if dst >= src                  signed
+BPF_CALL 0x8   0x0 call helper function by address             see `Helper functions`_
+BPF_CALL 0x8   0x1 call PC += offset                           see `Program-local functions`_
+BPF_CALL 0x8   0x2 call helper function by BTF ID              see `Helper functions`_
+BPF_EXIT 0x9   0x0 return                                      BPF_JMP only
+BPF_JLT  0xa   any PC += offset if dst < src                   unsigned
+BPF_JLE  0xb   any PC += offset if dst <= src                  unsigned
+BPF_JSLT 0xc   any PC += offset if dst < src                   signed
+BPF_JSLE 0xd   any PC += offset if dst <= src                  signed
+======== ===== === =========================================== =========================================
The eBPF program needs to store the return value into register R0 before doing a The eBPF program needs to store the return value into register R0 before doing a
BPF_EXIT. ``BPF_EXIT``.
Example:
``BPF_JSGE | BPF_X | BPF_JMP32`` (0x7e) means::
if (s32)dst s>= (s32)src goto +offset
where 's>=' indicates a signed '>=' comparison.
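To show how the 0x7e value arises, here is a minimal C sketch that assembles a jump opcode from a 4-bit code, the source bit and the instruction class. The `EX_` names and the helper `bpf_jmp_opcode` are hypothetical; the class values (0x05, 0x06) and the `BPF_X` source bit (0x08) are assumed from the instruction-class and source definitions elsewhere in the instruction-set document, which are not part of this diff.

```c
#include <stdint.h>

/* Hypothetical names; values assumed from the instruction-set document. */
#define EX_CLASS_JMP   0x05  /* BPF_JMP: 64-bit jump class */
#define EX_CLASS_JMP32 0x06  /* BPF_JMP32: 32-bit jump class */
#define EX_SRC_X       0x08  /* BPF_X: use src_reg as the source operand */
#define EX_CODE_JSGE   0x7   /* 4-bit code for BPF_JSGE */
#define EX_CODE_CALL   0x8   /* 4-bit code for BPF_CALL */

/* Build the opcode byte: code in bits 7-4, source in bit 3, class in bits 2-0. */
static uint8_t bpf_jmp_opcode(uint8_t code, uint8_t source, uint8_t insn_class)
{
	return (uint8_t)((code << 4) | source | insn_class);
}
```

With these assumptions, `bpf_jmp_opcode(EX_CODE_JSGE, EX_SRC_X, EX_CLASS_JMP32)` evaluates to 0x7e, and `bpf_jmp_opcode(EX_CODE_CALL, EX_SRC_X, EX_CLASS_JMP)` to 0x8d, the encoding that the Clang note in this diff says is emitted at `-O0`.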
Helper functions
~~~~~~~~~~~~~~~~
Helper functions are a concept whereby BPF programs can call into a
set of function calls exposed by the underlying platform.
Historically, each helper function was identified by an address
encoded in the imm field. The available helper functions may differ
for each program type, but address values are unique across all program types.
Platforms that support the BPF Type Format (BTF) support identifying
a helper function by a BTF ID encoded in the imm field, where the BTF ID
identifies the helper name and type.
Program-local functions
~~~~~~~~~~~~~~~~~~~~~~~
Program-local functions are functions exposed by the same BPF program as the
caller, and are referenced by offset from the call instruction, similar to
``BPF_JA``. A ``BPF_EXIT`` within the program-local function will return to
the caller.
Load and store instructions Load and store instructions
=========================== ===========================
...@@ -371,14 +416,56 @@ and loaded back to ``R0``. ...@@ -371,14 +416,56 @@ and loaded back to ``R0``.
----------------------------- -----------------------------
Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
encoding defined in `Instruction encoding`_, and use the 'src' field of the
basic instruction to hold an opcode subtype.
The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
with opcode subtypes in the 'src' field, using new terms such as "map"
defined further below:
========================= ====== === ========================================= =========== ==============
opcode construction opcode src pseudocode imm type dst type
========================= ====== === ========================================= =========== ==============
BPF_IMM | BPF_DW | BPF_LD 0x18 0x0 dst = imm64 integer integer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x1 dst = map_by_fd(imm) map fd map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x4 dst = code_addr(imm) integer code pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x5 dst = map_by_idx(imm) map index map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
========================= ====== === ========================================= =========== ==============
where
* map_by_fd(imm) means to convert a 32-bit file descriptor into an address of a map (see `Maps`_)
* map_by_idx(imm) means to convert a 32-bit index into an address of a map
* map_val(map) gets the address of the first value in a given map
* var_addr(imm) gets the address of a platform variable (see `Platform Variables`_) with a given id
* code_addr(imm) gets the address of the instruction at a specified relative offset in number of (64-bit) instructions
* the 'imm type' can be used by disassemblers for display
* the 'dst type' can be used for verification and JIT compilation purposes
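As a rough illustration of the wide encoding, the sketch below builds the plain ``dst = imm64`` form (src subtype 0x0) in two consecutive 64-bit instruction slots. ``struct insn`` mirrors the layout of the kernel's ``struct bpf_insn``, while ``emit_ld_imm64()`` and ``ld_imm64_value()`` are hypothetical helper names. It assumes the low 32 bits of the immediate go in the first slot's 'imm' field and the high 32 bits in the second slot's 'imm' field:

```c
#include <stdint.h>

/* Mirrors the layout of struct bpf_insn in the Linux UAPI headers. */
struct insn {
	uint8_t code;        /* opcode */
	uint8_t dst_reg : 4; /* destination register */
	uint8_t src_reg : 4; /* source register; opcode subtype for BPF_IMM */
	int16_t off;         /* signed offset */
	int32_t imm;         /* signed 32-bit immediate */
};

#define BPF_LD  0x00 /* instruction class */
#define BPF_DW  0x18 /* size: double word (64 bits) */
#define BPF_IMM 0x00 /* mode: immediate */

/* Fill two slots with 'dst = imm64' (src subtype 0x0). */
static void emit_ld_imm64(struct insn out[2], uint8_t dst, uint64_t imm64)
{
	out[0] = (struct insn){
		.code = BPF_LD | BPF_DW | BPF_IMM, /* 0x18 */
		.dst_reg = dst & 0xf,
		.src_reg = 0x0,
		.imm = (int32_t)(uint32_t)imm64,         /* low 32 bits */
	};
	/* The second slot only carries the upper half of the immediate. */
	out[1] = (struct insn){
		.imm = (int32_t)(uint32_t)(imm64 >> 32), /* high 32 bits */
	};
}

/* Reassemble the 64-bit immediate from the two slots. */
static uint64_t ld_imm64_value(const struct insn in[2])
{
	return (uint64_t)(uint32_t)in[0].imm |
	       ((uint64_t)(uint32_t)in[1].imm << 32);
}
```

For the other subtypes in the table, the two 'imm' fields carry separate values (for example a map fd and an offset) rather than the halves of one integer, so this sketch does not apply to them unchanged.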
Maps
~~~~
Maps are shared memory regions accessible by eBPF programs on some platforms.
A map can have various semantics as defined in a separate document, and may or
may not have a single contiguous memory region, but the 'map_val(map)' is
currently only defined for maps that do have a single contiguous memory region.
Each map can have a file descriptor (fd) if supported by the platform, where
'map_by_fd(imm)' means to get the map with the specified file descriptor. Each
BPF program can also be defined to use a set of maps associated with the
program at load time, and 'map_by_idx(imm)' means to get the map with the given
index in the set associated with the BPF program containing the instruction.
Platform Variables
~~~~~~~~~~~~~~~~~~
Platform variables are memory regions, identified by integer ids, exposed by
the runtime and accessible by BPF programs on some platforms. The
'var_addr(imm)' operation means to get the address of the memory region
identified by the given id.
Legacy BPF Packet access instructions Legacy BPF Packet access instructions
------------------------------------- -------------------------------------
......
...@@ -100,6 +100,23 @@ Hence, whenever a constant scalar argument is accepted by a kfunc which is not a ...@@ -100,6 +100,23 @@ Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
size parameter, and the value of the constant matters for program safety, __k size parameter, and the value of the constant matters for program safety, __k
suffix should be used. suffix should be used.
2.2.2 __uninit Annotation
-------------------------
This annotation is used to indicate that the argument will be treated as
uninitialized.
An example is given below::
__bpf_kfunc int bpf_dynptr_from_skb(..., struct bpf_dynptr_kern *ptr__uninit)
{
...
}
Here, the dynptr will be treated as an uninitialized dynptr. Without this
annotation, the verifier will reject the program if the dynptr passed in is
not initialized.
.. _BPF_kfunc_nodef: .. _BPF_kfunc_nodef:
2.3 Using an existing kernel function 2.3 Using an existing kernel function
...@@ -162,20 +179,12 @@ both are orthogonal to each other. ...@@ -162,20 +179,12 @@ both are orthogonal to each other.
--------------------- ---------------------
The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
passed in to it. There can be only one referenced pointer that can be passed
in. All copies of the pointer being released are invalidated as a result of
invoking kfunc with this flag. KF_RELEASE kfuncs automatically receive the
protection afforded by the KF_TRUSTED_ARGS flag described below.
2.4.4 KF_KPTR_GET flag
----------------------
The KF_KPTR_GET flag is used to indicate that the kfunc takes the first argument
as a pointer to kptr, safely increments the refcount of the object it points to,
and returns a reference to the user. The rest of the arguments may be normal
arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with
KF_ACQUIRE and KF_RET_NULL flags.
2.4.5 KF_TRUSTED_ARGS flag 2.4.4 KF_TRUSTED_ARGS flag
-------------------------- --------------------------
The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
...@@ -187,7 +196,7 @@ exception described below). ...@@ -187,7 +196,7 @@ exception described below).
There are two types of pointers to kernel objects which are considered "valid": There are two types of pointers to kernel objects which are considered "valid":
1. Pointers which are passed as tracepoint or struct_ops callback arguments. 1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc. 2. Pointers which were returned from a KF_ACQUIRE kfunc.
Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset. KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
...@@ -214,13 +223,13 @@ In other words, you must: ...@@ -214,13 +223,13 @@ In other words, you must:
2. Specify the type and name of the trusted nested field. This field must match 2. Specify the type and name of the trusted nested field. This field must match
the field in the original type definition exactly. the field in the original type definition exactly.
2.4.6 KF_SLEEPABLE flag 2.4.5 KF_SLEEPABLE flag
----------------------- -----------------------
The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
be called by sleepable BPF programs (BPF_F_SLEEPABLE). be called by sleepable BPF programs (BPF_F_SLEEPABLE).
2.4.7 KF_DESTRUCTIVE flag 2.4.6 KF_DESTRUCTIVE flag
-------------------------- --------------------------
The KF_DESTRUCTIVE flag is used to indicate functions calling which is The KF_DESTRUCTIVE flag is used to indicate functions calling which is
...@@ -229,18 +238,20 @@ rebooting or panicking. Due to this additional restrictions apply to these ...@@ -229,18 +238,20 @@ rebooting or panicking. Due to this additional restrictions apply to these
calls. At the moment they only require CAP_SYS_BOOT capability, but more can be calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
added later. added later.
2.4.8 KF_RCU flag 2.4.7 KF_RCU flag
----------------- -----------------
The KF_RCU flag is a weaker version of KF_TRUSTED_ARGS. The kfuncs marked with
KF_RCU expect either PTR_TRUSTED or MEM_RCU arguments. The verifier guarantees
that the objects are valid and there is no use-after-free. The pointers are not
NULL, but the object's refcount could have reached zero. The kfuncs need to
consider doing refcnt != 0 check, especially when returning a KF_ACQUIRE
pointer. Note as well that a KF_ACQUIRE kfunc that is KF_RCU should very likely
also be KF_RET_NULL.
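The reason a KF_ACQUIRE kfunc on an RCU-protected object usually has to be able to return NULL can be sketched in user-space C11. This is an analogy, not kernel code: ``struct obj`` and ``obj_acquire_not_zero()`` are hypothetical names, and the loop is only similar in spirit to the kernel's ``refcount_inc_not_zero()``:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical reference-counted object. */
struct obj {
	atomic_int refcnt;
};

/*
 * Take a reference only while the count is still non-zero. An object
 * whose count already dropped to zero is being torn down, so the
 * caller gets NULL instead of a reference.
 */
static struct obj *obj_acquire_not_zero(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return o;
	}
	return NULL;
}
```

The memory stays valid for the duration of the RCU read section, which is why inspecting the count is safe, but a caller still has to handle the NULL result.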
.. _KF_deprecated_flag: .. _KF_deprecated_flag:
2.4.9 KF_DEPRECATED flag 2.4.8 KF_DEPRECATED flag
------------------------ ------------------------
The KF_DEPRECATED flag is used for kfuncs which are scheduled to be The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
...@@ -451,13 +462,50 @@ struct_ops callback arg. For example: ...@@ -451,13 +462,50 @@ struct_ops callback arg. For example:
struct task_struct *acquired; struct task_struct *acquired;
acquired = bpf_task_acquire(task); acquired = bpf_task_acquire(task);
if (acquired)
/*
* In a typical program you'd do something like store
* the task in a map, and the map will automatically
* release it later. Here, we release it manually.
*/
bpf_task_release(acquired);
return 0;
}
References acquired on ``struct task_struct *`` objects are RCU protected.
Therefore, when in an RCU read region, you can obtain a pointer to a task
embedded in a map value without having to acquire a reference:
.. code-block:: c
#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
private(TASK) static struct task_struct *global;
/**
* A trivial example showing how to access a task stored
* in a map using RCU.
*/
SEC("tp_btf/task_newtask")
int BPF_PROG(task_rcu_read_example, struct task_struct *task, u64 clone_flags)
{
struct task_struct *local_copy;
bpf_rcu_read_lock();
local_copy = global;
if (local_copy)
/*
* We could also pass local_copy to kfuncs or helper functions here,
* as we're guaranteed that local_copy will be valid until we exit
* the RCU read region below.
*/
bpf_printk("Global task %s is valid", local_copy->comm);
else
bpf_printk("No global task found");
bpf_rcu_read_unlock();
/* At this point we can no longer reference local_copy. */
return 0;
}
...@@ -515,80 +563,16 @@ bpf_task_release() respectively, so we won't provide examples for them. ...@@ -515,80 +563,16 @@ bpf_task_release() respectively, so we won't provide examples for them.
---- ----
You may also acquire a reference to a ``struct cgroup`` kptr that's already Other kfuncs available for interacting with ``struct cgroup *`` objects are
stored in a map using bpf_cgroup_kptr_get(): bpf_cgroup_ancestor() and bpf_cgroup_from_id(), allowing callers to access
the ancestor of a cgroup and find a cgroup by its ID, respectively. Both
return a cgroup kptr.
.. kernel-doc:: kernel/bpf/helpers.c .. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_kptr_get :identifiers: bpf_cgroup_ancestor
Here's an example of how it can be used:
.. code-block:: c
/* struct containing the struct task_struct kptr which is actually stored in the map. */
struct __cgroups_kfunc_map_value {
struct cgroup __kptr_ref * cgroup;
};
/* The map containing struct __cgroups_kfunc_map_value entries. */
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, int);
__type(value, struct __cgroups_kfunc_map_value);
__uint(max_entries, 1);
} __cgroups_kfunc_map SEC(".maps");
/* ... */
/**
* A simple example tracepoint program showing how a
* struct cgroup kptr that is stored in a map can
* be acquired using the bpf_cgroup_kptr_get() kfunc.
*/
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
{
struct cgroup *kptr;
struct __cgroups_kfunc_map_value *v;
s32 id = cgrp->self.id;
/* Assume a cgroup kptr was previously stored in the map. */
v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
if (!v)
return -ENOENT;
/* Acquire a reference to the cgroup kptr that's already stored in the map. */
kptr = bpf_cgroup_kptr_get(&v->cgroup);
if (!kptr)
/* If no cgroup was present in the map, it's because
* we're racing with another CPU that removed it with
* bpf_kptr_xchg() between the bpf_map_lookup_elem()
* above, and our call to bpf_cgroup_kptr_get().
* bpf_cgroup_kptr_get() internally safely handles this
* race, and will return NULL if the task is no longer
* present in the map by the time we invoke the kfunc.
*/
return -EBUSY;
/* Free the reference we just took above. Note that the
* original struct cgroup kptr is still in the map. It will
* be freed either at a later time if another context deletes
* it from the map, or automatically by the BPF subsystem if
* it's still present when the map is destroyed.
*/
bpf_cgroup_release(kptr);
return 0;
}
----
Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
and return it as a cgroup kptr.
.. kernel-doc:: kernel/bpf/helpers.c .. kernel-doc:: kernel/bpf/helpers.c
:identifiers: bpf_cgroup_ancestor :identifiers: bpf_cgroup_from_id
Eventually, BPF should be updated to allow this to happen with a normal memory Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in load in the program itself. This is currently not possible without more work in
......
...@@ -2,23 +2,32 @@ ...@@ -2,23 +2,32 @@
.. _libbpf: .. _libbpf:
======
libbpf libbpf
====== ======
If you are looking to develop BPF applications using the libbpf library, this
directory contains important documentation that you should read.
To get started, it is recommended to begin with the :doc:`libbpf Overview
<libbpf_overview>` document, which provides a high-level understanding of the
libbpf APIs and their usage. This will give you a solid foundation to start
exploring and utilizing the various features of libbpf to develop your BPF
applications.
.. toctree:: .. toctree::
:maxdepth: 1 :maxdepth: 1
libbpf_overview
API Documentation <https://libbpf.readthedocs.io/en/latest/api.html> API Documentation <https://libbpf.readthedocs.io/en/latest/api.html>
program_types program_types
libbpf_naming_convention libbpf_naming_convention
libbpf_build libbpf_build
This is documentation for libbpf, a userspace library for loading and
interacting with bpf programs.
All general BPF questions, including kernel functionality, libbpf APIs and their
application, should be sent to the bpf@vger.kernel.org mailing list. You can
`subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the mailing list
and search its `archive <https://lore.kernel.org/bpf/>`_. Please search the archive
before asking new questions. It may be that this was already addressed or
answered before.
.. SPDX-License-Identifier: GPL-2.0
===============
libbpf Overview
===============
libbpf is a C-based library containing a BPF loader that takes compiled BPF
object files and prepares and loads them into the Linux kernel. libbpf does the
heavy lifting of loading, verifying, and attaching BPF programs to various
kernel hooks, allowing BPF application developers to focus only on BPF program
correctness and performance.
The following are the high-level features supported by libbpf:
* Provides high-level and low-level APIs for user space programs to interact
with BPF programs. The low-level APIs wrap all the bpf system call
functionality, which is useful when users need more fine-grained control
over the interactions between user space and BPF programs.
* Provides overall support for the BPF object skeleton generated by bpftool.
The skeleton file simplifies the process for the user space programs to access
global variables and work with BPF programs.
* Provides BPF-side APIs, including BPF helper definitions, BPF maps support,
and tracing helpers, allowing developers to simplify BPF code writing.
* Supports the BPF CO-RE mechanism, enabling BPF developers to write portable
BPF programs that can be compiled once and run across different kernel
versions.
This document will delve into the above concepts in detail, providing a deeper
understanding of the capabilities and advantages of libbpf and how it can help
you develop BPF applications efficiently.
BPF App Lifecycle and libbpf APIs
==================================
A BPF application consists of one or more BPF programs (either cooperating or
completely independent), BPF maps, and global variables. The global
variables are shared between all BPF programs, which allows them to cooperate on
a common set of data. libbpf provides APIs that user space programs can use to
manipulate the BPF programs by triggering different phases of a BPF application
lifecycle.
The following list gives a brief overview of each phase in the BPF application
lifecycle:
* **Open phase**: In this phase, libbpf parses the BPF
object file and discovers BPF maps, BPF programs, and global variables. After
a BPF app is opened, user space apps can make additional adjustments
(setting BPF program types, if necessary; pre-setting initial values for
global variables, etc.) before all the entities are created and loaded.
* **Load phase**: In the load phase, libbpf creates BPF
maps, resolves various relocations, and verifies and loads BPF programs into
the kernel. At this point, libbpf validates all the parts of a BPF application
and loads the BPF program into the kernel, but no BPF program has yet been
executed. After the load phase, it’s possible to set up the initial BPF map
state without racing with the BPF program code execution.
* **Attachment phase**: In this phase, libbpf
attaches BPF programs to various BPF hook points (e.g., tracepoints, kprobes,
cgroup hooks, network packet processing pipeline, etc.). During this
phase, BPF programs perform useful work such as processing
packets, or updating BPF maps and global variables that can be read from user
space.
* **Tear down phase**: In the tear down phase,
libbpf detaches BPF programs and unloads them from the kernel. BPF maps are
destroyed, and all the resources used by the BPF app are freed.
BPF Object Skeleton File
========================
The BPF skeleton is an alternative interface to libbpf APIs for working with BPF
objects. Skeleton code abstracts away generic libbpf APIs to significantly
simplify code for manipulating BPF programs from user space. Skeleton code
includes a bytecode representation of the BPF object file, simplifying the
process of distributing your BPF code. With BPF bytecode embedded, there are no
extra files to deploy along with your application binary.
You can generate the skeleton header file (``.skel.h``) for a specific object
file by passing the BPF object to bpftool. The generated BPF skeleton
provides the following custom functions that correspond to the BPF lifecycle,
each of them prefixed with the specific object name:
* ``<name>__open()`` – creates and opens BPF application (``<name>`` stands for
the specific bpf object name)
* ``<name>__load()`` – instantiates, loads, and verifies BPF application parts
* ``<name>__attach()`` – attaches all auto-attachable BPF programs (it’s
optional, you can have more control by using libbpf APIs directly)
* ``<name>__destroy()`` – detaches all BPF programs and
frees up all used resources
Using the skeleton code is the recommended way to work with BPF programs. Keep
in mind, BPF skeleton provides access to the underlying BPF object, so whatever
was possible to do with generic libbpf APIs is still possible even when the BPF
skeleton is used. It's an additive convenience feature, with no syscalls, and no
cumbersome code.
Other Advantages of Using Skeleton File
---------------------------------------
* BPF skeleton provides an interface for user space programs to work with BPF
global variables. The skeleton code memory maps global variables as a struct
into user space. The struct interface allows user space programs to initialize
BPF programs before the BPF load phase and fetch and update data from user
space afterward.
* The ``skel.h`` file reflects the object file structure by listing out the
available maps, programs, etc. BPF skeleton provides direct access to all the
BPF maps and BPF programs as struct fields. This eliminates the need for
string-based lookups with ``bpf_object_find_map_by_name()`` and
``bpf_object_find_program_by_name()`` APIs, reducing errors due to BPF source
code and user-space code getting out of sync.
* The embedded bytecode representation of the object file ensures that the
skeleton and the BPF object file are always in sync.
BPF Helpers
===========
libbpf provides BPF-side APIs that BPF programs can use to interact with the
system. The BPF helpers definition allows developers to use them in BPF code as
any other plain C function. For example, there are helper functions to print
debugging messages, get the time since the system was booted, interact with BPF
maps, manipulate network packets, etc.
For a complete description of what the helpers do, the arguments they take, and
the return value, see the `bpf-helpers
<https://man7.org/linux/man-pages/man7/bpf-helpers.7.html>`_ man page.
BPF CO-RE (Compile Once – Run Everywhere)
=========================================
BPF programs work in the kernel space and have access to kernel memory and data
structures. One limitation that BPF applications come across is the lack of
portability across different kernel versions and configurations. `BCC
<https://github.com/iovisor/bcc/>`_ is one of the solutions for BPF
portability. However, it comes with runtime overhead and a large binary size
from embedding the compiler with the application.
libbpf steps up the BPF program portability by supporting the BPF CO-RE concept.
BPF CO-RE brings together BTF type information, libbpf, and the compiler to
produce a single executable binary that you can run on multiple kernel versions
and configurations.
To make BPF programs portable, libbpf relies on the BTF type information of the
running kernel. The kernel also exposes this self-describing authoritative BTF
information through ``sysfs`` at ``/sys/kernel/btf/vmlinux``.
You can generate the BTF information for the running kernel with the following
command:
::
$ bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
The command generates a ``vmlinux.h`` header file with all kernel types
(:doc:`BTF types <../btf>`) that the running kernel uses. Including
``vmlinux.h`` in your BPF program eliminates dependency on system-wide kernel
headers.
libbpf enables portability of BPF programs by looking at the BPF program’s
recorded BTF type and relocation information and matching them to BTF
information (vmlinux) provided by the running kernel. libbpf then resolves and
matches all the types and fields, and updates necessary offsets and other
relocatable data to ensure that BPF program’s logic functions correctly for a
specific kernel on the host. The BPF CO-RE concept thus eliminates overhead
associated with BPF development and allows developers to write portable BPF
applications without modifications and runtime source code compilation on the
target machine.
The following code snippet shows how to read the parent field of a kernel
``task_struct`` using BPF CO-RE and libbpf. The basic helper to read a field in a
CO-RE relocatable manner is ``bpf_core_read(dst, sz, src)``, which will read
``sz`` bytes from the field referenced by ``src`` into the memory pointed to by
``dst``.
.. code-block:: C
:emphasize-lines: 6
//...
struct task_struct *task = (void *)bpf_get_current_task();
struct task_struct *parent_task;
int err;
err = bpf_core_read(&parent_task, sizeof(void *), &task->parent);
if (err) {
/* handle error */
}
/* parent_task contains the value of task->parent pointer */
In the code snippet, we first get a pointer to the current ``task_struct`` using
``bpf_get_current_task()``. We then use ``bpf_core_read()`` to read the parent
field of the task struct into the ``parent_task`` variable. ``bpf_core_read()`` is
just like the ``bpf_probe_read_kernel()`` BPF helper, except it records information
about the field that should be relocated on the target kernel. That is, if the
``parent`` field gets shifted to a different offset within
``struct task_struct`` due to some new field added in front of it, libbpf will
automatically adjust the actual offset to the proper value.
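A toy user-space model may help show why resolving the field offset for the target kernel matters. It is not how libbpf implements relocations: the two struct layouts, ``read_at()`` and ``parent_offset()`` are all hypothetical names, and the "loader" is reduced to picking an offset per layout:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts of "the same" struct on two kernel builds. */
struct task_v1 {
	int pid;
	void *parent;
};

struct task_v2 {
	int pid;
	long new_field; /* added in front of 'parent', shifting its offset */
	void *parent;
};

/*
 * Hypothetical stand-in for a relocated read: the field offset is a
 * runtime parameter resolved for the target layout, not a constant
 * baked in when the program was compiled.
 */
static void read_at(void *dst, size_t sz, const void *base, size_t off)
{
	memcpy(dst, (const char *)base + off, sz);
}

/* Offset of 'parent' for a given layout, as a loader might resolve it. */
static size_t parent_offset(int layout_version)
{
	return layout_version == 1 ? offsetof(struct task_v1, parent)
				   : offsetof(struct task_v2, parent);
}
```

Reading a ``struct task_v2`` with the offset taken from ``struct task_v1`` would return part of ``new_field`` instead of the pointer, which is the class of breakage that the recorded relocation information lets libbpf avoid.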
Getting Started with libbpf
===========================
Check out the `libbpf-bootstrap <https://github.com/libbpf/libbpf-bootstrap>`_
repository with simple examples of using libbpf to build various BPF
applications.
See also `libbpf API documentation
<https://libbpf.readthedocs.io/en/latest/api.html>`_.
libbpf and Rust
===============
If you are building BPF applications in Rust, it is recommended to use the
`Libbpf-rs <https://github.com/libbpf/libbpf-rs>`_ library instead of bindgen
bindings directly to libbpf. Libbpf-rs wraps libbpf functionality in
Rust-idiomatic interfaces and provides the libbpf-cargo plugin to handle BPF code
compilation and skeleton generation. Using Libbpf-rs will make building the user
space part of the BPF application easier. Note that the BPF programs themselves
must still be written in plain C.
Additional Documentation
========================
* `Program types and ELF Sections <https://libbpf.readthedocs.io/en/latest/program_types.html>`_
* `API naming convention <https://libbpf.readthedocs.io/en/latest/libbpf_naming_convention.html>`_
* `Building libbpf <https://libbpf.readthedocs.io/en/latest/libbpf_build.html>`_
* `API documentation Convention <https://libbpf.readthedocs.io/en/latest/libbpf_naming_convention.html#api-documentation-convention>`_
...@@ -12,6 +12,36 @@ Byte swap instructions ...@@ -12,6 +12,36 @@ Byte swap instructions
``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and ``BPF_TO_BE`` respectively. ``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and ``BPF_TO_BE`` respectively.
Jump instructions
=================
``BPF_CALL | BPF_X | BPF_JMP`` (0x8d), where the helper function
integer would be read from a specified register, is not currently supported
by the verifier. Any programs with this instruction will fail to load
until such support is added.
Maps
====
Linux only supports the 'map_val(map)' operation on array maps with a single element.
Linux uses an fd_array to store maps associated with a BPF program. Thus,
map_by_idx(imm) uses the fd at that index in the array.
Variables
=========
The following 64-bit immediate instruction specifies that a variable address,
which corresponds to some integer stored in the 'imm' field, should be loaded:
========================= ====== === ========================================= =========== ==============
opcode construction opcode src pseudocode imm type dst type
========================= ====== === ========================================= =========== ==============
BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
========================= ====== === ========================================= =========== ==============
On Linux, this integer is a BTF ID.
Legacy BPF Packet access instructions Legacy BPF Packet access instructions
===================================== =====================================
......
...@@ -11,9 +11,9 @@ maps are accessed from BPF programs via BPF helpers which are documented in the ...@@ -11,9 +11,9 @@ maps are accessed from BPF programs via BPF helpers which are documented in the
`man-pages`_ for `bpf-helpers(7)`_. `man-pages`_ for `bpf-helpers(7)`_.
BPF maps are accessed from user space via the ``bpf`` syscall, which provides
commands to create maps, look up elements, update elements and delete elements.
More details of the BPF syscall are available in `ebpf-syscall`_ and in the
`man-pages`_ for `bpf(2)`_.
Map Types Map Types
========= =========
...@@ -79,3 +79,4 @@ Find and delete element by key in a given map using ``attr->map_fd``, ...@@ -79,3 +79,4 @@ Find and delete element by key in a given map using ``attr->map_fd``,
.. _man-pages: https://www.kernel.org/doc/man-pages/ .. _man-pages: https://www.kernel.org/doc/man-pages/
.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html .. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html .. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
.. _ebpf-syscall: https://docs.kernel.org/userspace-api/ebpf/syscall.html
...@@ -20,6 +20,7 @@ properties: ...@@ -20,6 +20,7 @@ properties:
items: items:
- enum: - enum:
- mediatek,mt7622-wed - mediatek,mt7622-wed
- mediatek,mt7981-wed
- mediatek,mt7986-wed - mediatek,mt7986-wed
- const: syscon - const: syscon
......
MediaTek SGMIISYS controller
============================
The MediaTek SGMIISYS controller provides various clocks to the system.
Required Properties:
- compatible: Should be:
- "mediatek,mt7622-sgmiisys", "syscon"
- "mediatek,mt7629-sgmiisys", "syscon"
- "mediatek,mt7981-sgmiisys_0", "syscon"
- "mediatek,mt7981-sgmiisys_1", "syscon"
- "mediatek,mt7986-sgmiisys_0", "syscon"
- "mediatek,mt7986-sgmiisys_1", "syscon"
- #clock-cells: Must be 1
The SGMIISYS controller uses the common clk binding from
Documentation/devicetree/bindings/clock/clock-bindings.txt
The available clocks are defined in dt-bindings/clock/mt*-clk.h.
Example:
sgmiisys: sgmiisys@1b128000 {
compatible = "mediatek,mt7622-sgmiisys", "syscon";
reg = <0 0x1b128000 0 0x1000>;
#clock-cells = <1>;
};
...@@ -20,6 +20,7 @@ properties: ...@@ -20,6 +20,7 @@ properties:
- st,stm32-syscfg - st,stm32-syscfg
- st,stm32-power-config - st,stm32-power-config
- st,stm32-tamp - st,stm32-tamp
- st,stm32f4-gcan
- const: syscon - const: syscon
- items: - items:
- const: st,stm32-tamp - const: st,stm32-tamp
...@@ -42,6 +43,7 @@ if: ...@@ -42,6 +43,7 @@ if:
contains: contains:
enum: enum:
- st,stm32mp157-syscfg - st,stm32mp157-syscfg
- st,stm32f4-gcan
then: then:
required: required:
- clocks - clocks
......
...@@ -16,7 +16,7 @@ description: | ...@@ -16,7 +16,7 @@ description: |
operation modes at 10/100 Mb/s data transfer rates. operation modes at 10/100 Mb/s data transfer rates.
allOf: allOf:
- $ref: "ethernet-controller.yaml#" - $ref: ethernet-controller.yaml#
properties: properties:
compatible: compatible:
......
...@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# ...@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Allwinner A10 EMAC Ethernet Controller title: Allwinner A10 EMAC Ethernet Controller
allOf: allOf:
- $ref: "ethernet-controller.yaml#" - $ref: ethernet-controller.yaml#
maintainers: maintainers:
- Chen-Yu Tsai <wens@csie.org> - Chen-Yu Tsai <wens@csie.org>
......
...@@ -11,7 +11,7 @@ maintainers: ...@@ -11,7 +11,7 @@ maintainers:
- Maxime Ripard <mripard@kernel.org> - Maxime Ripard <mripard@kernel.org>
allOf: allOf:
- $ref: "mdio.yaml#" - $ref: mdio.yaml#
# Select every compatible, including the deprecated ones. This way, we # Select every compatible, including the deprecated ones. This way, we
# will be able to report a warning when we have that compatible, since # will be able to report a warning when we have that compatible, since
......
...@@ -66,7 +66,7 @@ required: ...@@ -66,7 +66,7 @@ required:
- tx-fifo-depth - tx-fifo-depth
allOf: allOf:
- $ref: "ethernet-controller.yaml#" - $ref: ethernet-controller.yaml#
- if: - if:
properties: properties:
compatible: compatible:
......
...@@ -2,8 +2,8 @@ ...@@ -2,8 +2,8 @@
# Copyright 2019 BayLibre, SAS # Copyright 2019 BayLibre, SAS
%YAML 1.2 %YAML 1.2
--- ---
$id: "http://devicetree.org/schemas/net/amlogic,meson-dwmac.yaml#" $id: http://devicetree.org/schemas/net/amlogic,meson-dwmac.yaml#
$schema: "http://devicetree.org/meta-schemas/core.yaml#" $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Amlogic Meson DWMAC Ethernet controller title: Amlogic Meson DWMAC Ethernet controller
......
...@@ -15,7 +15,7 @@ description: |+ ...@@ -15,7 +15,7 @@ description: |+
MAC. MAC.
allOf: allOf:
- $ref: "mdio.yaml#" - $ref: mdio.yaml#
properties: properties:
compatible: compatible:
......