    arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS · baaf553d
    Mark Rutland authored
    This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64.
    This allows each ftrace callsite to provide an ftrace_ops to the common
    ftrace trampoline, allowing each callsite to invoke distinct tracer
    functions without the need to fall back to list processing or to
    allocate custom trampolines for each callsite. This significantly speeds
    up cases where multiple distinct trace functions are used and callsites
    are mostly traced by a single tracer.
    
    The main idea is to place a pointer to the ftrace_ops as a literal at a
    fixed offset from the function entry point, which can be recovered by
    the common ftrace trampoline. Using a 64-bit literal avoids branch range
    limitations, and permits the ops to be swapped atomically without
    special considerations that apply to code-patching. In future this will
    also allow for the implementation of DYNAMIC_FTRACE_WITH_DIRECT_CALLS
    without branch range limitations by using additional fields in struct
    ftrace_ops.
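The literal scheme described above can be modelled in plain user-space C. This is only a sketch under stated assumptions: `struct sketch_ops`, `struct sketch_callsite`, and every `sketch_*` helper are hypothetical names invented for illustration, the "function" is a byte array rather than real code, and the offset used is whatever the toy struct layout yields, not the offset the arm64 implementation uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ftrace_ops: only the callback is modelled. */
struct sketch_ops {
	void (*func)(unsigned long ip, struct sketch_ops *ops);
	unsigned long hits;
};

/* Toy function image: a 64-bit literal placed directly before the entry point. */
struct sketch_callsite {
	_Atomic uint64_t ops_literal;
	unsigned char text[16];	/* stands in for the function body */
};

/* Fixed distance from the entry point back to the literal (toy layout only). */
#define SKETCH_LITERAL_OFFSET offsetof(struct sketch_callsite, text)

static unsigned char *sketch_entry(struct sketch_callsite *cs)
{
	return (unsigned char *)cs + offsetof(struct sketch_callsite, text);
}

/* Swapping the ops is one aligned 64-bit store; no instructions are patched. */
static void sketch_set_ops(struct sketch_callsite *cs, struct sketch_ops *ops)
{
	atomic_store_explicit(&cs->ops_literal, (uint64_t)(uintptr_t)ops,
			      memory_order_release);
}

/* Common trampoline: recover the ops from the literal, then call its func. */
static void sketch_trampoline(unsigned char *entry)
{
	_Atomic uint64_t *literal =
		(_Atomic uint64_t *)(void *)(entry - SKETCH_LITERAL_OFFSET);
	uint64_t raw = atomic_load_explicit(literal, memory_order_acquire);
	struct sketch_ops *ops = (struct sketch_ops *)(uintptr_t)raw;

	assert(ops != NULL);
	ops->func((unsigned long)(uintptr_t)entry, ops);
}

static void sketch_count(unsigned long ip, struct sketch_ops *ops)
{
	(void)ip;
	ops->hits++;
}

/* One call reaches ops "a", then two calls reach ops "b" after the swap. */
static unsigned long sketch_demo(void)
{
	struct sketch_ops a = { sketch_count, 0 }, b = { sketch_count, 0 };
	struct sketch_callsite cs = { 0 };

	sketch_set_ops(&cs, &a);
	sketch_trampoline(sketch_entry(&cs));
	sketch_set_ops(&cs, &b);
	sketch_trampoline(sketch_entry(&cs));
	sketch_trampoline(sketch_entry(&cs));
	return a.hits * 10 + b.hits;
}
```

Because the trampoline loads a full pointer rather than branching to a patched target, the ops can live anywhere in the address space, which is the branch-range point made above.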
    
    As noted in the core patch adding support for
    DYNAMIC_FTRACE_WITH_CALL_OPS, this approach allows for directly invoking
    ftrace_ops::func even for ftrace_ops which are dynamically-allocated (or
    part of a module), without going via ftrace_ops_list_func.
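As a rough illustration of why bypassing list processing matters, the sketch below contrasts the two dispatch styles. It is a deliberately simplified model: `struct demo_ops`, its single-address `filter_ip`, and all `demo_*` functions are hypothetical, and the real ftrace_ops_list_func does considerably more than this loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct ftrace_ops. */
struct demo_ops {
	void (*func)(unsigned long ip, struct demo_ops *ops);
	unsigned long filter_ip;	/* toy filter: one traced address */
	unsigned long hits;
	struct demo_ops *next;
};

/* Counts filter evaluations so the cost of each style can be compared. */
static unsigned long demo_filter_checks;

static bool demo_ops_test(const struct demo_ops *ops, unsigned long ip)
{
	demo_filter_checks++;
	return ops->filter_ip == ip;
}

/* List processing: every registered ops is checked on every traced call. */
static void demo_list_func(struct demo_ops *head, unsigned long ip)
{
	for (struct demo_ops *ops = head; ops; ops = ops->next)
		if (demo_ops_test(ops, ip))
			ops->func(ip, ops);
}

/* Per-callsite ops: the callsite already names its tracer, so no walk. */
static void demo_call_ops(struct demo_ops *callsite_ops, unsigned long ip)
{
	assert(callsite_ops != NULL);
	callsite_ops->func(ip, callsite_ops);
}

static void demo_count(unsigned long ip, struct demo_ops *ops)
{
	(void)ip;
	ops->hits++;
}

#define DEMO_MAX_OPS 128

/* Filter checks needed for one call with 1 relevant + n irrelevant tracers. */
static unsigned long demo_list_cost(size_t n_irrelevant)
{
	static struct demo_ops pool[DEMO_MAX_OPS];
	size_t n = n_irrelevant + 1;

	assert(n <= DEMO_MAX_OPS);
	for (size_t i = 0; i < n; i++) {
		pool[i].func = demo_count;
		pool[i].filter_ip = (i == 0) ? 0x1000 : 0x2000 + i;
		pool[i].hits = 0;
		pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
	}
	demo_filter_checks = 0;
	demo_list_func(&pool[0], 0x1000);
	assert(pool[0].hits == 1);
	return demo_filter_checks;
}
```

In this model the list walk performs one filter check per registered ops, while `demo_call_ops()` performs none regardless of how many other tracers exist.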
    
    Currently, this approach is not compatible with CLANG_CFI, as the
    presence/absence of pre-function NOPs changes the offset of the
    pre-function type hash, and there's no existing mechanism to ensure a
    consistent offset for instrumented and uninstrumented functions. When
    CLANG_CFI is enabled, the existing scheme with a global ops->func
    pointer is used, and there should be no functional change. I am
    currently working with others to allow the two to work together in
    future (though this will likely require updated compiler support).
    
    I've benchmarked this with the ftrace_ops sample module [1], which is
    not currently upstream, but available at:
    
      https://lore.kernel.org/lkml/20230103124912.2948963-1-mark.rutland@arm.com
      git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git ftrace-ops-sample-20230109
    
    Using that module I measured the total time taken for 100,000 calls to a
    trivial instrumented function, with a number of tracers enabled with
    relevant filters (which would apply to the instrumented function) and a
    number of tracers enabled with irrelevant filters (which would not apply
    to the instrumented function). I tested on an M1 MacBook Pro, running
    under an HVF-accelerated QEMU VM (i.e. on real hardware).
    
    Before this patch:
    
      Number of tracers     || Total time  | Per-call average time (ns)
      Relevant | Irrelevant || (ns)        | Total        | Overhead
      =========+============++=============+==============+============
             0 |          0 ||      94,583 |         0.95 |           -
             0 |          1 ||      93,709 |         0.94 |           -
             0 |          2 ||      93,666 |         0.94 |           -
             0 |         10 ||      93,709 |         0.94 |           -
             0 |        100 ||      93,792 |         0.94 |           -
      ---------+------------++-------------+--------------+------------
             1 |          1 ||   6,467,833 |        64.68 |       63.73
             1 |          2 ||   7,509,708 |        75.10 |       74.15
             1 |         10 ||  23,786,792 |       237.87 |      236.92
             1 |        100 || 106,432,500 |     1,064.33 |     1063.38
      ---------+------------++-------------+--------------+------------
             1 |          0 ||   1,431,875 |        14.32 |       13.37
             2 |          0 ||   6,456,334 |        64.56 |       63.62
            10 |          0 ||  22,717,000 |       227.17 |      226.22
           100 |          0 || 103,293,667 |     1,032.94 |     1031.99
      ---------+------------++-------------+--------------+------------
    
      Note: per-call overhead is estimated relative to the baseline case
      with 0 relevant tracers and 0 irrelevant tracers.
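The per-call columns can be reproduced from the total times with simple arithmetic. The helpers below are a hypothetical sketch (the names `per_call_ns` and `overhead_ns` are not from the benchmark module) and assume the 100,000 calls stated above.

```c
#include <assert.h>

/* Average time per call, given the total time for `calls` invocations. */
static double per_call_ns(double total_ns, double calls)
{
	assert(calls > 0);
	return total_ns / calls;
}

/* Overhead relative to the baseline run with no tracers enabled. */
static double overhead_ns(double total_ns, double baseline_total_ns,
			  double calls)
{
	return per_call_ns(total_ns, calls) -
	       per_call_ns(baseline_total_ns, calls);
}
```

For example, `overhead_ns(1431875, 94583, 100000)` evaluates to roughly 13.37, matching the row with one relevant tracer and no irrelevant tracers.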
    
    After this patch:
    
      Number of tracers     || Total time  | Per-call average time (ns)
      Relevant | Irrelevant || (ns)        | Total        | Overhead
      =========+============++=============+==============+============
             0 |          0 ||      94,541 |         0.95 |           -
             0 |          1 ||      93,666 |         0.94 |           -
             0 |          2 ||      93,709 |         0.94 |           -
             0 |         10 ||      93,667 |         0.94 |           -
             0 |        100 ||      93,792 |         0.94 |           -
      ---------+------------++-------------+--------------+------------
             1 |          1 ||     281,000 |         2.81 |        1.86
             1 |          2 ||     281,042 |         2.81 |        1.87
             1 |         10 ||     280,958 |         2.81 |        1.86
             1 |        100 ||     281,250 |         2.81 |        1.87
      ---------+------------++-------------+--------------+------------
             1 |          0 ||     280,959 |         2.81 |        1.86
             2 |          0 ||   6,502,708 |        65.03 |       64.08
            10 |          0 ||  18,681,209 |       186.81 |      185.87
           100 |          0 || 103,550,458 |     1,035.50 |     1034.56
      ---------+------------++-------------+--------------+------------
    
      Note: per-call overhead is estimated relative to the baseline case
      with 0 relevant tracers and 0 irrelevant tracers.
    
    As can be seen from the above:
    
    a) Whenever there is a single relevant tracer function associated with a
       tracee, the overhead of invoking the tracer is constant, and does not
       scale with the number of tracers which are *not* associated with that
       tracee.
    
    b) The overhead for a single relevant tracer has dropped to ~1/7 of the
       overhead prior to this series (from 13.37ns to 1.86ns). This is
       largely due to permitting calls to dynamically-allocated ftrace_ops
       without going through ftrace_ops_list_func.
    
    I've run the ftrace selftests from v6.2-rc3, which report:
    
    | # of passed:  110
    | # of failed:  0
    | # of unresolved:  3
    | # of untested:  0
    | # of unsupported:  0
    | # of xfailed:  1
    | # of undefined(test bug):  0
    
    ... where the unresolved entries were the tests for DIRECT functions
    (which are not supported), and the checkbashisms selftest (which is
    irrelevant here):
    
    | [8] Test ftrace direct functions against tracers        [UNRESOLVED]
    | [9] Test ftrace direct functions against kprobes        [UNRESOLVED]
    | [62] Meta-selftest: Checkbashisms       [UNRESOLVED]
    
    ... with all other tests passing (or failing as expected).

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Cc: Florent Revest <revest@chromium.org>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20230123134603.1064407-9-mark.rutland@arm.com
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>