    uprobes: prepare uprobe args buffer lazily · 1b8f85de
    Andrii Nakryiko authored
    uprobe_cpu_buffer and corresponding logic to store uprobe args into it
    are used for uprobes/uretprobes that are created through tracefs or
    perf events.
    
    BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't
    need uprobe_cpu_buffer and associated data. For BPF-only use cases this
    buffer handling and preparation is pure overhead. At the same time,
    BPF-only uprobe/uretprobe usage is very common in practice. Also, in
    a lot of cases applications are very sensitive to performance overhead,
    as they might be tracing very high-frequency functions like
    malloc()/free(), so every bit of performance improvement matters.
    
    All that is to say that this uprobe_cpu_buffer preparation is an
    unnecessary overhead that each BPF user of uprobes/uretprobes has to pay.
    This patch changes this by making uprobe_cpu_buffer preparation
    optional. It will happen only if either a tracefs-based or perf event-based
    uprobe/uretprobe consumer is registered for a given uprobe/uretprobe. For
    BPF-only use cases this step will be skipped.
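
    The lazy-preparation idea described above can be sketched in user space
    as follows. This is only an illustration, not the kernel code from this
    patch: struct args_buf, struct hit_ctx, enum consumer_kind,
    get_args_buf() and dispatch_hit() are all hypothetical names, and the
    memset() merely stands in for the real argument-fetching work.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the args buffer; not the kernel's type. */
struct args_buf {
	char data[64];
	size_t len;
};

/* Hypothetical per-hit state: the buffer is unset until first requested. */
struct hit_ctx {
	struct args_buf storage;
	struct args_buf *buf;	/* NULL means "not prepared yet" */
	int prepare_calls;	/* how many times the costly step ran */
};

enum consumer_kind { CONSUMER_BPF, CONSUMER_TRACEFS, CONSUMER_PERF };

/* Prepare the buffer on first use and reuse it for later consumers. */
static struct args_buf *get_args_buf(struct hit_ctx *ctx)
{
	assert(ctx != NULL);
	if (ctx->buf)
		return ctx->buf;

	/* Stand-in for the costly argument fetching and storing. */
	memset(ctx->storage.data, 0, sizeof(ctx->storage.data));
	ctx->storage.len = sizeof(ctx->storage.data);
	ctx->prepare_calls++;
	ctx->buf = &ctx->storage;
	return ctx->buf;
}

/* Walk the consumers of one probe hit; only non-BPF ones touch the buffer. */
static void dispatch_hit(struct hit_ctx *ctx,
			 const enum consumer_kind *consumers, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (consumers[i] == CONSUMER_BPF)
			continue;	/* BPF consumer: buffer not needed */
		(void)get_args_buf(ctx);
	}
}
```

    With only CONSUMER_BPF entries, prepare_calls stays at 0 for the hit;
    with any mix that includes a tracefs or perf consumer, the buffer is
    prepared exactly once and shared by the remaining consumers.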
    
    We used the uprobe/uretprobe benchmark that is part of BPF selftests (see [0])
    to estimate the improvements. We have 3 uprobe and 3 uretprobe
    scenarios, which vary the instruction that is replaced by the uprobe: nop
    (fastest uprobe case), `push rbp` (typical case), and non-simulated
    `ret` instruction (slowest case). The benchmark thread constantly calls
    a user space function in a tight loop. The user space function has an
    attached BPF uprobe or uretprobe program doing nothing but atomic counter
    increments to count the number of triggering calls. The benchmark emits
    throughput in millions of executions per second.
    
    BEFORE these changes
    ====================
    uprobe-nop     :    2.657 ± 0.024M/s
    uprobe-push    :    2.499 ± 0.018M/s
    uprobe-ret     :    1.100 ± 0.006M/s
    uretprobe-nop  :    1.356 ± 0.004M/s
    uretprobe-push :    1.317 ± 0.019M/s
    uretprobe-ret  :    0.785 ± 0.007M/s
    
    AFTER these changes
    ===================
    uprobe-nop     :    2.732 ± 0.022M/s (+2.8%)
    uprobe-push    :    2.621 ± 0.016M/s (+4.9%)
    uprobe-ret     :    1.105 ± 0.007M/s (+0.5%)
    uretprobe-nop  :    1.396 ± 0.007M/s (+2.9%)
    uretprobe-push :    1.347 ± 0.008M/s (+2.3%)
    uretprobe-ret  :    0.800 ± 0.006M/s (+1.9%)
    
    So the improvements on this particular machine seem to be between 2% and 5%.
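
    For reference, the percentages in parentheses are consistent with a plain
    relative change of the mean throughput. A minimal sketch of that
    arithmetic, where pct_gain() is a hypothetical helper name and the ±
    error terms are ignored:

```c
#include <assert.h>

/* Relative throughput gain in percent: (after - before) / before * 100. */
static double pct_gain(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

    For example, pct_gain(2.499, 2.621) comes out at roughly 4.9, matching
    the uprobe-push row.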
    
      [0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c
    
    Reviewed-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/all/20240318181728.2795838-3-andrii@kernel.org/
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
trace_uprobe.c 38.2 KB