1. 16 Oct, 2023 1 commit
  2. 13 Oct, 2023 8 commits
    • Merge branch 'Open-coded task_vma iter' · 0e10fd4b
      Andrii Nakryiko authored
      Dave Marchevsky says:
      
      ====================
      At Meta we have a profiling daemon which periodically collects
      information on many hosts. This collection usually involves grabbing
      stacks (user and kernel) using perf_event BPF progs and later symbolicating
      them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
      remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
      those cases we must fall back to digging around in /proc/PID/maps to map
      virtual address to (binary, offset). The /proc/PID/maps digging does not
      occur synchronously with stack collection, so the process might already
      be gone, in which case it won't have /proc/PID/maps and we will fail to
      symbolicate.
      
      This 'exited process problem' doesn't occur very often as
      most of the prod services we care to profile are long-lived daemons, but
      there are enough use cases to warrant a workaround: a BPF program which
      can be optionally loaded at data collection time and essentially walks
      /proc/PID/maps. Currently this is done by walking the vma list:
      
        struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
        mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
      
      Since commit 763ecb03 ("mm: remove the vma linked list") there's no
      longer a vma linked list to walk. Walking the vma maple tree is not as
      simple as hopping struct vm_area_struct->vm_next. Luckily,
      commit f39af059 ("mm: add VMA iterator"), another commit in that series,
      added struct vma_iterator and the for_each_vma macro for easy vma iteration.
      If similar functionality were exposed to BPF programs, it would be perfect
      for our use case.
      
      This series adds such functionality, specifically a BPF equivalent of
      for_each_vma using the open-coded iterator style.
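
      To make the intended usage concrete, here is a minimal sketch of the BPF
      side (assuming the kfunc declarations from selftests' bpf_experimental.h
      and the bpf_for_each() convenience macro used by the selftests; section
      name and the per-vma work are illustrative):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include "bpf_experimental.h"  /* bpf_iter_task_vma_{new,next,destroy} */
        #include "bpf_misc.h"          /* bpf_for_each() */

        char _license[] SEC("license") = "GPL";

        SEC("raw_tp/sys_enter")
        int dump_task_vmas(const void *ctx)
        {
                struct task_struct *task = bpf_get_current_task_btf();
                struct vm_area_struct *vma;

                /* Expands to bpf_iter_task_vma_new/_next/_destroy calls. */
                bpf_for_each(task_vma, vma, task, 0) {
                        /* e.g. record vma->vm_start / vma->vm_end into a map */
                }
                return 0;
        }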
      
      Notes:
        * This approach was chosen after discussion on a previous series [0] which
          attempted to solve the same problem by adding a BPF_F_VMA_NEXT flag to
          bpf_find_vma.
        * Unlike the task_vma bpf_iter, the open-coded iterator kfuncs here do not
          drop the vma read lock between iterations. See Alexei's response in [0].
        * The [vsyscall] page isn't really part of task->mm's vmas, but
          /proc/PID/maps returns information about it anyways. The vma iter added
          here does not do the same. See comment on selftest in patch 3.
        * bpf_iter_task_vma allocates a _data struct - which contains, among
          other things, struct vma_iterator - using the BPF allocator, and
          keeps a pointer to that bpf_iter_task_vma_data. This is done to
          prevent changes to struct ma_state - which is wrapped by struct
          vma_iterator - from necessitating changes to the uapi struct
          bpf_iter_task_vma.
      
      Changelog:
      
      v6 -> v7: https://lore.kernel.org/bpf/20231010185944.3888849-1-davemarchevsky@fb.com/
      
      Patch numbers correspond to their position in v6
      
      Patch 2 ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c")
        * Add Andrii ack
      Patch 3 ("bpf: Introduce task_vma open-coded iterator kfuncs")
        * Add Andrii ack
        * Add missing __diag_ignore_all for -Wmissing-prototypes (Song)
      Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter")
        * Remove two unnecessary header includes (Andrii)
        * Remove extraneous !vmas_seen check (Andrii)
      New Patch ("bpf: Add BPF_KFUNC_{START,END}_defs macros")
        * After talking to Andrii, this is an attempt to clean up __diag_ignore_all
          spam everywhere kfuncs are defined. If nontrivial changes are needed,
          let's apply the other 4 and I'll respin as a standalone patch.
      
      v5 -> v6: https://lore.kernel.org/bpf/20231010175637.3405682-1-davemarchevsky@fb.com/
      
      Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter")
        * Remove extraneous blank line. I did this manually to the .patch file
          for v5, which caused BPF CI to complain about failing to apply the
          series
      
      v4 -> v5: https://lore.kernel.org/bpf/20231002195341.2940874-1-davemarchevsky@fb.com/
      
      Patch numbers correspond to their position in v4
      
      New Patch ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c")
        * Patch 2's renaming of this selftest, and associated changes in the
          userspace runner, are split out into this separate commit (Andrii)
      
      Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs")
        * Remove bpf_iter_task_vma kfuncs from libbpf's bpf_helpers.h, they'll be
          added to selftests' bpf_experimental.h in selftests patch below (Andrii)
        * Split bpf_iter_task_vma.c renaming into separate commit (Andrii)
      
      Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter")
        * Add bpf_iter_task_vma kfuncs to bpf_experimental.h (Andrii)
        * Remove '?' from prog SEC, open_and_load the skel in one operation (Andrii)
        * Ensure that fclose() always happens in test runner (Andrii)
        * Use global var w/ 1000 (vm_start, vm_end) structs instead of two
          MAP_TYPE_ARRAY's w/ 1k u64s each (Andrii)
      
      v3 -> v4: https://lore.kernel.org/bpf/20230822050558.2937659-1-davemarchevsky@fb.com/
      
      Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num")
        * Add Andrii ack
      Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs")
        * Mark bpf_iter_task_vma_new args KF_RCU and remove now-unnecessary !task
          check (Yonghong)
          * Although KF_RCU is a function-level flag, in reality it only applies to
            the task_struct *task parameter, as the other two params are a scalar int
            and a specially-handled KF_ARG_PTR_TO_ITER
        * Remove struct bpf_iter_task_vma definition from uapi headers, define in
          kernel/bpf/task_iter.c instead (Andrii)
      Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter")
        * Use a local var when looping over vmas to track map idx. Update vmas_seen
          global after done iterating. Don't start iterating or update vmas_seen if
          vmas_seen global is nonzero. (Andrii)
        * Move getpgid() call to correct spot - above skel detach. (Andrii)
      
      v2 -> v3: https://lore.kernel.org/bpf/20230821173415.1970776-1-davemarchevsky@fb.com/
      
      Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num")
        * Add Yonghong ack
      
      Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs")
        * UAPI bpf header and tools/ version should match
        * Add bpf_iter_task_vma_kern_data which bpf_iter_task_vma_kern points to,
          bpf_mem_alloc/free it instead of just vma_iterator. (Alexei)
          * Inner data ptr == NULL implies initialization failed
      
      v1 -> v2: https://lore.kernel.org/bpf/20230810183513.684836-1-davemarchevsky@fb.com/
        * Patch 1
          * Now removes the unnecessary BTF_TYPE_EMIT instead of changing the
            type (Yonghong)
        * Patch 2
          * Don't do unnecessary BTF_TYPE_EMIT (Yonghong)
          * Bump task refcount to prevent ->mm reuse (Yonghong)
          * Keep a pointer to vma_iterator in bpf_iter_task_vma, alloc/free
            via BPF mem allocator (Yonghong, Stanislav)
        * Patch 3
      
        [0]: https://lore.kernel.org/bpf/20230801145414.418145-1-davemarchevsky@fb.com/
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • selftests/bpf: Add tests for open-coded task_vma iter · e0e1a7a5
      Dave Marchevsky authored
      The open-coded task_vma iter added earlier in this series allows for
      natural iteration over a task's vmas using existing open-coded iter
      infrastructure, specifically bpf_for_each.
      
      This patch adds a test demonstrating this pattern and validating
      correctness. The vma->vm_start and vma->vm_end addresses of the first
      1000 vmas are recorded and compared to /proc/PID/maps output. As
      expected, both see the same vmas and addresses - with the exception of
      the [vsyscall] vma - which is explained in a comment in the prog_tests
      program.
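
      A rough userspace-side sketch of that comparison, assuming the BPF program
      filled an array of (vm_start, vm_end) pairs (struct and field names here
      are illustrative rather than the selftest's actual ones):

        #include <stdio.h>

        struct vm_range { unsigned long vm_start, vm_end; };

        /* Returns 0 if the first nr recorded ranges match /proc/self/maps. */
        static int compare_with_proc_maps(const struct vm_range *recorded, int nr)
        {
                FILE *f = fopen("/proc/self/maps", "r");
                char line[4096];
                int i = 0;

                if (!f)
                        return -1;
                while (i < nr && fgets(line, sizeof(line), f)) {
                        unsigned long start, end;

                        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                                break;
                        /* [vsyscall] shows up in /proc/PID/maps but not in the
                         * task's vma tree, so a real test must skip that line. */
                        if (recorded[i].vm_start != start ||
                            recorded[i].vm_end != end)
                                break;
                        i++;
                }
                fclose(f);
                return i == nr ? 0 : -1;
        }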
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231013204426.1074286-5-davemarchevsky@fb.com
    • bpf: Introduce task_vma open-coded iterator kfuncs · 4ac45468
      Dave Marchevsky authored
      This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow
      creation and manipulation of struct bpf_iter_task_vma in open-coded
      iterator style. BPF programs can use these kfuncs directly or through
      bpf_for_each macro for natural-looking iteration of all task vmas.
      
      The implementation borrows heavily from bpf_find_vma helper's locking -
      differing only in that it holds the mmap_read lock for all iterations
      while the helper only executes its provided callback on a maximum of 1
      vma. Aside from locking, struct vma_iterator and vma_next do all the
      heavy lifting.
      
      A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the
      only field in struct bpf_iter_task_vma. This is because the inner data
      struct contains a struct vma_iterator (not a pointer), whose size is likely
      to change under us. If bpf_iter_task_vma_kern contained the vma_iterator
      directly, such a change would require a change in the opaque
      bpf_iter_task_vma struct's size. So it is better to allocate the
      vma_iterator using the BPF allocator, and since that allocation must
      already succeed, we might as well allocate all iter fields there, thereby
      freezing struct bpf_iter_task_vma's size.
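
      A sketch of the layering described above (the kern-side field list is
      illustrative; the point is the opaque, fixed-size uapi struct wrapping a
      single pointer to BPF-allocated data):

        /* uapi-visible: opaque and fixed-size */
        struct bpf_iter_task_vma {
                __u64 __opaque[1];
        } __attribute__((aligned(8)));

        /* BPF-allocated data; free to grow if struct vma_iterator changes */
        struct bpf_iter_task_vma_kern_data {
                struct task_struct *task;
                struct mm_struct *mm;
                struct vma_iterator vmi;
                /* ... */
        };

        /* kernel-side view of the opaque struct: just a pointer */
        struct bpf_iter_task_vma_kern {
                struct bpf_iter_task_vma_kern_data *data;
        } __attribute__((aligned(8)));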
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com
    • selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c · 45b38941
      Dave Marchevsky authored
      Further patches in this series will add a struct bpf_iter_task_vma,
      which will result in a name collision with the selftest prog renamed in
      this patch. Rename the selftest to avoid the collision.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231013204426.1074286-3-davemarchevsky@fb.com
    • bpf: Don't explicitly emit BTF for struct btf_iter_num · f10ca5da
      Dave Marchevsky authored
      Commit 6018e1f4 ("bpf: implement numbers iterator") added the
      BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num
      doesn't exist, so only a forward declaration is emitted in BTF:
      
        FWD 'btf_iter_num' fwd_kind=struct
      
      That commit was probably hoping to ensure that struct bpf_iter_num is
      emitted in vmlinux BTF. A previous version of this patch changed the
      line to emit the correct type, but Yonghong confirmed that it would
      definitely be emitted regardless in [0], so this patch simply removes
      the line.
      
      This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't
      causing any issues that I noticed, aside from mild confusion when I
      looked through the code.
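
      For reference, BTF_TYPE_EMIT() merely evaluates a cast of a null pointer
      to the given type, so naming a type that is never defined produces only a
      FWD in BTF. The removed line was presumably of this shape:

        /* include/linux/btf.h */
        #define BTF_TYPE_EMIT(type) ((void)(type *)0)

        /* Removed by this patch: struct btf_iter_num is never defined, and the
         * intended struct bpf_iter_num is emitted into BTF regardless. */
        BTF_TYPE_EMIT(struct btf_iter_num);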
      
        [0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231013204426.1074286-2-davemarchevsky@fb.com
    • bpf: Change syscall_nr type to int in struct syscall_tp_t · ba8ea723
      Artem Savkov authored
      linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
      for lazy preemption")) that adds an extra member to struct trace_entry.
      This causes the offset of the args field in struct trace_event_raw_sys_enter
      to differ from the one in struct syscall_trace_enter:
      
      struct trace_event_raw_sys_enter {
              struct trace_entry         ent;                  /*     0    12 */
      
              /* XXX last struct has 3 bytes of padding */
              /* XXX 4 bytes hole, try to pack */
      
              long int                   id;                   /*    16     8 */
              long unsigned int          args[6];              /*    24    48 */
              /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
              char                       __data[];             /*    72     0 */
      
              /* size: 72, cachelines: 2, members: 4 */
              /* sum members: 68, holes: 1, sum holes: 4 */
              /* paddings: 1, sum paddings: 3 */
              /* last cacheline: 8 bytes */
      };
      
      struct syscall_trace_enter {
              struct trace_entry         ent;                  /*     0    12 */
      
              /* XXX last struct has 3 bytes of padding */
      
              int                        nr;                   /*    12     4 */
              long unsigned int          args[];               /*    16     0 */
      
              /* size: 16, cachelines: 1, members: 3 */
              /* paddings: 1, sum paddings: 3 */
              /* last cacheline: 16 bytes */
      };
      
      This, in turn, causes perf_event_set_bpf_prog() to fail while running the
      bpf test_profiler testcase, because max_ctx_offset is calculated based on
      the former struct, while off is based on the latter:
      
        if (is_tracepoint || is_syscall_tp) {
                int off = trace_event_get_offsets(event->tp_event);

                if (prog->aux->max_ctx_offset > off)
                        return -EACCES;
        }
      
      What the bpf program actually gets is a pointer to struct
      syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
      the problem by aligning struct syscall_tp_t with struct
      syscall_trace_(enter|exit) and changing the tests to use these structs
      to dereference the context.
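
      The shape of the change in kernel/trace/trace_syscalls.c is roughly the
      following (shown for the enter-side struct, attribute details elided; the
      exit-side struct gets the same treatment):

        struct syscall_tp_t {
                struct trace_entry      ent;
                int                     syscall_nr;  /* was: unsigned long */
                unsigned long           args[SYSCALL_DEFINE_MAXARGS];
        };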
      Signed-off-by: Artem Savkov <asavkov@redhat.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Link: https://lore.kernel.org/bpf/20231013054219.172920-1-asavkov@redhat.com
    • net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set · 9c1292ec
      Martin KaFai Lau authored
      It was reported that there is a compiler warning about the unused variable
      "sin_addr_len" in af_inet.c when CONFIG_CGROUP_BPF is not set.
      This patch addresses it in the same way as the ipv6 counterpart
      in inet6_getname(): "return sin_addr_len;"
      instead of "return sizeof(*sin);".
      
      Fixes: fefba7d1 ("bpf: Propagate modified uaddrlen from cgroup sockaddr programs")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/bpf/20231013185702.3993710-1-martin.lau@linux.dev
      Closes: https://lore.kernel.org/bpf/20231013114007.2fb09691@canb.auug.org.au/
    • bpf: Avoid unnecessary audit log for CPU security mitigations · 236334ae
      Yafang Shao authored
      Check cpu_mitigations_off() first to avoid calling capable() when mitigations
      are off. This avoids generating an unnecessary audit log entry.
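
      The ordering described above looks roughly like this (using
      bpf_bypass_spec_v1() as the example helper; the v4 variant is analogous,
      and the specific capability check shown is illustrative):

        static inline bool bpf_bypass_spec_v1(void)
        {
                /* Check cpu_mitigations_off() first so that the capability
                 * check - which can emit an audit record - is short-circuited
                 * when mitigations are globally disabled. */
                return cpu_mitigations_off() || perfmon_capable();
        }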
      
      Fixes: bc5bc309 ("bpf: Inherit system settings for CPU security mitigations")
      Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/CAEf4Bza6UVUWqcWQ-66weZ-nMDr+TFU3Mtq=dumZFD-pSqU7Ow@mail.gmail.com/
      Link: https://lore.kernel.org/bpf/20231013083916.4199-1-laoar.shao@gmail.com
  3. 12 Oct, 2023 7 commits
    • Merge branch 'Add cgroup sockaddr hooks for unix sockets' · d2dc885b
      Martin KaFai Lau authored
      Daan De Meyer says:
      
      ====================
      Changes since v10:
      
      * Removed extra check from bpf_sock_addr_set_sun_path() again in favor of
        calling unix_validate_addr() everywhere in af_unix.c before calling the hooks.
      
      Changes since v9:
      
      * Renamed bpf_sock_addr_set_unix_addr() to bpf_sock_addr_set_sun_path() and
        renamed arguments to match the new name.
      * Added an extra check to bpf_sock_addr_set_sun_path() to disallow changing the
        address of an unnamed unix socket.
      * Removed unnecessary NULL check on uaddrlen in
        __cgroup_bpf_run_filter_sock_addr().
      
      Changes since v8:
      
      * Added missing test programs to last patch
      
      Changes since v7:
      
      * Fixed formatting nit in comment
      * Renamed from cgroup/connectun to cgroup/connect_unix (and similar for all
        other hooks)
      
      Changes since v6:
      
      * Actually removed bpf_bind() helper for AF_UNIX hooks.
      * Fixed merge conflict
      * Updated comment to mention uaddrlen is read-only for AF_INET[6]
      * Removed unnecessary forward declaration of struct sock_addr_test
      * Removed unused BPF_CGROUP_RUN_PROG_UNIX_CONNECT()
      * Fixed formatting nit reported by checkpatch
      * Added more information to commit message about recvmsg() on connected socket
      
      Changes since v5:
      
      * Fixed kernel version in bpftool documentation (6.3 => 6.7).
      * Added connection mode socket recvmsg() test.
      * Removed bpf_bind() helper for AF_UNIX hooks.
      * Added missing getpeernameun and getsocknameun BPF test programs.
      * Added note for bind() test being unused currently.
      
      Changes since v4:
      
      * Dropped support for intercepting bind() as when using bind() with unix sockets
        and a pathname sockaddr, bind() will create an inode in the filesystem that
        needs to be cleaned up. If the address is rewritten, users might try to clean
        up the wrong file and leak the actual socket file in the filesystem.
      * Changed bpf_sock_addr_set_unix_addr() to use BTF_KFUNC_HOOK_CGROUP_SKB instead
        of BTF_KFUNC_HOOK_COMMON.
      * Removed unix socket related changes from BPF_CGROUP_PRE_CONNECT_ENABLED() as
        unix sockets do not support pre-connect.
      * Added tests for getpeernameun and getsocknameun hooks.
      * We now disallow an empty sockaddr in bpf_sock_addr_set_unix_addr() similar to
        unix_validate_addr().
      * Removed unnecessary cgroup_bpf_enabled() checks
      * Removed unnecessary error checks
      
      Changes since v3:
      
      * Renamed bpf_sock_addr_set_addr() to bpf_sock_addr_set_unix_addr() and
        made it only operate on AF_UNIX sockaddrs. This is because for the other
        families, users usually want to configure more than just the address so
        a generic interface will not fit the bill here. e.g. for AF_INET and AF_INET6,
        users would generally also want to be able to configure the port which the
        current interface doesn't support. So we expose an AF_UNIX specific function
        instead.
      * Made the tests in the new sock addr tests more generic (similar to test_sock_addr.c),
        this should make it easier to migrate the other sock addr tests in the future.
      * Removed the new kfunc hook and attached to BTF_KFUNC_HOOK_COMMON instead
      * Set uaddrlen to 0 when the family is AF_UNSPEC
      * Pass in the addrlen to the hook from IPv6 code
      * Fixed mount directory mkdir() to ignore EEXIST
      
      Changes since v2:
      
      * Configuring the sock addr is now done via a new kfunc bpf_sock_addr_set()
      * The addrlen is exposed as u32 in bpf_sock_addr_kern
      * Selftests are updated to use the new kfunc
      * Selftests are now added as a new sock_addr test in prog_tests/
      * Added BTF_KFUNC_HOOK_SOCK_ADDR for BPF_PROG_TYPE_CGROUP_SOCK_ADDR
      * __cgroup_bpf_run_filter_sock_addr() now returns the modified addrlen
      
      Changes since v1:
      
      * Split into multiple patches instead of one single patch
      * Added unix support for all socket address hooks instead of only connect()
      * Switched approach to expose the socket address length to the bpf hook
        instead of recalculating the socket address length in kernelspace to
        properly support abstract unix socket addresses
      * Modified socket address hook tests to calculate the socket address length
        once and pass it around everywhere instead of recalculating the actual unix
        socket address length on demand.
      * Added some missing section name tests for getpeername()/getsockname()
      
      This patch series extends the cgroup sockaddr hooks to include support for unix
      sockets. To add support for unix sockets, struct bpf_sock_addr_kern is extended
      to expose the socket address length to the bpf program. Along with that, a new
      kfunc bpf_sock_addr_set_unix_addr() is added to safely allow modifying an
      AF_UNIX sockaddr from bpf programs.
      
      I intend to use these new hooks in systemd to reimplement the LogNamespace=
      feature, which allows running multiple instances of systemd-journald to
      process the logs of different services. systemd-journald also processes
      syslog messages, so currently, using log namespaces means all services running
      in the same log namespace have to live in the same private mount namespace.
      Only then can systemd mount the journal namespace's associated syslog socket
      over /dev/log to properly direct syslog messages from all services running
      in that log namespace to the correct systemd-journald instance. We want to
      relax this requirement so that processes running in disjoint mount namespaces
      can still run in the same log namespace. To achieve this, we can use these
      new hooks to rewrite the socket address of any connect(), sendto(), ...
      syscalls targeting /dev/log to the socket address of the journal namespace's
      syslog socket instead, which transparently performs the redirection without
      requiring use of a mount namespace and mounting over /dev/log.
      
      Aside from the above use case, these hooks can more generally be used to
      transparently redirect unix sockets to different addresses as required by
      services.
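
      As a concrete illustration of that redirection, a minimal BPF-side sketch
      follows (the kfunc declarations mirror what the selftests use; the target
      path and program name are illustrative):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        void *bpf_cast_to_kern_ctx(void *obj) __ksym;
        int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
                                       const u8 *sun_path, u32 sun_path__sz) __ksym;

        #define NEW_PATH "/run/log-namespace/dev-log"  /* illustrative target */

        char _license[] SEC("license") = "GPL";

        SEC("cgroup/connect_unix")
        int rewrite_connect_unix(struct bpf_sock_addr *ctx)
        {
                struct bpf_sock_addr_kern *sa_kern = bpf_cast_to_kern_ctx(ctx);

                /* Rewrite the destination sun_path; the hook also updates the
                 * (now writable) address length as described above. */
                if (bpf_sock_addr_set_sun_path(sa_kern, (const u8 *)NEW_PATH,
                                               sizeof(NEW_PATH) - 1))
                        return 0;       /* reject the connect() on failure */

                return 1;               /* allow the rewritten connect() */
        }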
      ====================
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • selftests/bpf: Add tests for cgroup unix socket address hooks · 82ab6b50
      Daan De Meyer authored
      These selftests are written in prog_tests style instead of adding
      them to the existing test_sock_addr tests. Migrating the existing
      sock addr tests to prog_tests style is left for future work. This
      commit adds support for testing bind() sockaddr hooks, even though
      there's no unix socket sockaddr hook for bind(). We leave this code
      intact for when the INET and INET6 tests, which do support intercepting
      bind(), are migrated in the future.
      Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
      Link: https://lore.kernel.org/r/20231011185113.140426-10-daan.j.demeyer@gmail.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • selftests/bpf: Make sure mount directory exists · af2752ed
      Daan De Meyer authored
      The mount directory for the selftests cgroup tree might
      not exist, so make sure it does by creating it ourselves
      if needed.
      Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
      Link: https://lore.kernel.org/r/20231011185113.140426-9-daan.j.demeyer@gmail.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • documentation/bpf: Document cgroup unix socket address hooks · 3243fef6
      Daan De Meyer authored
      Update the documentation to mention the new cgroup unix sockaddr
      hooks.
      Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
      Link: https://lore.kernel.org/r/20231011185113.140426-8-daan.j.demeyer@gmail.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpftool: Add support for cgroup unix socket address hooks · 8b3cba98
      Daan De Meyer authored
      Add the necessary plumbing to hook up the new cgroup unix sockaddr
      hooks into bpftool.
      Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
      Acked-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/r/20231011185113.140426-7-daan.j.demeyer@gmail.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • libbpf: Add support for cgroup unix socket address hooks · bf90438c
      Daan De Meyer authored
      Add the necessary plumbing to hook up the new cgroup unix sockaddr
      hooks into libbpf.
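
      For completeness, a minimal userspace sketch of loading and attaching such
      a program with libbpf (object path, program name and error handling are
      illustrative):

        #include <fcntl.h>
        #include <bpf/libbpf.h>

        int attach_connect_unix(const char *obj_path, const char *cgroup_path)
        {
                struct bpf_object *obj;
                struct bpf_program *prog;
                int cg_fd;

                obj = bpf_object__open_file(obj_path, NULL);
                if (!obj || bpf_object__load(obj))
                        return -1;

                prog = bpf_object__find_program_by_name(obj, "rewrite_connect_unix");
                cg_fd = open(cgroup_path, O_RDONLY);
                if (!prog || cg_fd < 0)
                        return -1;

                /* The SEC("cgroup/connect_unix") section name determines the
                 * expected attach type for this cgroup link. */
                return bpf_program__attach_cgroup(prog, cg_fd) ? 0 : -1;
        }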
      Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
      Link: https://lore.kernel.org/r/20231011185113.140426-6-daan.j.demeyer@gmail.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Implement cgroup sockaddr hooks for unix sockets · 859051dd
      Daan De Meyer authored
      These hooks allow intercepting connect(), getsockname(),
      getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
      socket hooks get write access to the address length because the
      address length is not fixed when dealing with unix sockets and
      needs to be modified when a unix socket address is modified by
      the hook. Because abstract unix socket addresses start with a
      NUL byte, we cannot recalculate the socket address length in
      kernelspace after running the hook by calculating the length of
      the unix socket path using strlen().

      These hooks can be used when users want to multiplex syscalls to a
      single unix socket across multiple different processes behind the
      scenes by redirecting the connect() and other syscalls to
      process-specific sockets.
      
      We do not implement support for intercepting bind() because when
      using bind() with unix sockets with a pathname address, this creates
      an inode in the filesystem which must be cleaned up. If we rewrite
      the address, the user might try to clean up the wrong file, leaking
      the socket in the filesystem where it is never cleaned up. Until we
      figure out a solution for this (and a use case for intercepting bind()),
      we opt to not allow rewriting the sockaddr in bind() calls.
      
      We also implement recvmsg() support for connected streams so that
      after a connect() that is modified by a sockaddr hook, any corresponding
      recvmsg() on the connected socket can also be modified to make the
      connected program think it is connected to the "intended" remote.
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
      Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
  4. 11 Oct, 2023 3 commits
  5. 09 Oct, 2023 7 commits
  6. 06 Oct, 2023 8 commits
  7. 04 Oct, 2023 6 commits