    uprobes: add speculative lockless system-wide uprobe filter check · cdf355cc
    Andrii Nakryiko authored
    BPF-based uprobe/uretprobe use cases very commonly attach system-wide
    (not PID-specific) probes. In this case the uprobe's
    trace_uprobe_filter->nr_systemwide counter is bumped at registration
    time, and the actual filtering is short-circuited when the
    uprobe/uretprobe is triggered.
    
    This is a great optimization. The only issue is that, just to get to
    checking this counter, the uprobe subsystem takes the read side of
    trace_uprobe_filter->rwlock. This is noticeable in profiles and is yet
    another point of contention when a uprobe is triggered on multiple CPUs
    simultaneously.
    
    This patch moves the nr_systemwide check outside of the filter list's
    rwlock scope, as the rwlock is meant to protect list modification, while
    the nr_systemwide-based check is already speculative and racy, despite
    the lock (as discussed in [0]). trace_uprobe_filter_remove() and
    trace_uprobe_filter_add() already check filter->nr_systemwide
    explicitly outside of __uprobe_perf_filter(), so no modifications are
    required there.
    
    Confirmed with BPF selftests-based benchmarks.
    
    BEFORE (based on changes in previous patch)
    ===========================================
    uprobe-nop     :    2.732 ± 0.022M/s
    uprobe-push    :    2.621 ± 0.016M/s
    uprobe-ret     :    1.105 ± 0.007M/s
    uretprobe-nop  :    1.396 ± 0.007M/s
    uretprobe-push :    1.347 ± 0.008M/s
    uretprobe-ret  :    0.800 ± 0.006M/s
    
    AFTER
    =====
    uprobe-nop     :    2.878 ± 0.017M/s (+5.5%, total +8.3%)
    uprobe-push    :    2.753 ± 0.013M/s (+5.3%, total +10.2%)
    uprobe-ret     :    1.142 ± 0.010M/s (+3.8%, total +3.8%)
    uretprobe-nop  :    1.444 ± 0.008M/s (+3.5%, total +6.5%)
    uretprobe-push :    1.410 ± 0.010M/s (+4.8%, total +7.1%)
    uretprobe-ret  :    0.816 ± 0.002M/s (+2.0%, total +3.9%)
    
    In the above, the first percentage is relative to the previous patch
    (lazy uprobe buffer optimization), while the "total" percentage is
    relative to a kernel without any of the changes in this patch set.
    
    As can be seen, we get about a 4% - 10% speedup in total with both the
    lazy uprobe buffer and speculative filter check optimizations.
    
      [0] https://lore.kernel.org/bpf/20240313131926.GA19986@redhat.com/

    Reviewed-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/all/20240318181728.2795838-4-andrii@kernel.org/
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
trace_uprobe.c 38.3 KB