1. 14 Oct, 2008 13 commits
    • Steven Rostedt's avatar
      ftrace: create __mcount_loc section · 8da3821b
      Steven Rostedt authored
      This patch creates a section in the kernel called "__mcount_loc".
      This will hold a list of pointers to the mcount relocation for
      each call site of mcount.
      
      For example:
      
      objdump -dr init/main.o
      [...]
      Disassembly of section .text:
      
      0000000000000000 <do_one_initcall>:
         0:   55                      push   %rbp
      [...]
      000000000000017b <init_post>:
       17b:   55                      push   %rbp
       17c:   48 89 e5                mov    %rsp,%rbp
       17f:   53                      push   %rbx
       180:   48 83 ec 08             sub    $0x8,%rsp
       184:   e8 00 00 00 00          callq  189 <init_post+0xe>
                              185: R_X86_64_PC32      mcount+0xfffffffffffffffc
      [...]
      
      We will add a section to point to each function call.
      
         .section __mcount_loc,"a",@progbits
      [...]
         .quad .text + 0x185
      [...]
      
      The offset to of the mcount call site in init_post is an offset from
      the start of the section, and not the start of the function init_post.
      The mcount relocation is at the call site 0x185 from the start of the
      .text section.
      
        .text + 0x185  == init_post + 0xa
      
      We need a way to add this __mcount_loc section in a way that we do not
      lose the relocations after final link.  The .text section here will
      be attached to all other .text sections after final link and the
      offsets will be meaningless.  We need to keep track of where these
      .text sections are.
      
      To do this, we use the start of the first function in the section.
      do_one_initcall.  We can make a tmp.s file with this function as a reference
      to the start of the .text section.
      
         .section __mcount_loc,"a",@progbits
      [...]
         .quad do_one_initcall + 0x185
      [...]
      
      Then we can compile the tmp.s into a tmp.o
      
        gcc -c tmp.s -o tmp.o
      
      And link it into back into main.o.
      
        ld -r main.o tmp.o -o tmp_main.o
        mv tmp_main.o main.o
      
      But we have a problem.  What happens if the first function in a section
      is not exported, and is a static function. The linker will not let
      the tmp.o use it.  This case exists in main.o as well.
      
      Disassembly of section .init.text:
      
      0000000000000000 <set_reset_devices>:
         0:   55                      push   %rbp
         1:   48 89 e5                mov    %rsp,%rbp
         4:   e8 00 00 00 00          callq  9 <set_reset_devices+0x9>
                              5: R_X86_64_PC32        mcount+0xfffffffffffffffc
      
      The first function in .init.text is a static function.
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 t set_reset_devices
      
      The lowercase 't' means that set_reset_devices is local and is not exported.
      If we simply try to link the tmp.o with the set_reset_devices we end
      up with two symbols: one local and one global.
      
       .section __mcount_loc,"a",@progbits
       .quad set_reset_devices + 0x10
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 t set_reset_devices
                       U set_reset_devices
      
      We still have an undefined reference to set_reset_devices, and if we try
      to compile the kernel, we will end up with an undefined reference to
      set_reset_devices, or even worst, it could be exported someplace else,
      and then we will have a reference to the wrong location.
      
      To handle this case, we make an intermediate step using objcopy.
      We convert set_reset_devices into a global exported symbol before linking
      it with tmp.o and set it back afterwards.
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 T set_reset_devices
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 T set_reset_devices
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 t set_reset_devices
      
      Now we have a section in main.o called __mcount_loc that we can place
      somewhere in the kernel using vmlinux.ld.S and access it to convert
      all these locations that call mcount into nops before starting SMP
      and thus, eliminating the need to do this with kstop_machine.
      
      Note, A well documented perl script (scripts/recordmcount.pl) is used
      to do all this in one location.
      Signed-off-by: default avatarSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      8da3821b
    • Ingo Molnar's avatar
      ftrace: mark lapic_wd_event() notrace · 8b1fa1d7
      Ingo Molnar authored
      it can be called in the NMI path:
      
      [    0.645999] calling  ftrace_dynamic_init+0x0/0xd6
      [    0.647521] ------------[ cut here ]------------
      [    0.647521] WARNING: at kernel/trace/ftrace.c:348 ftrace_record_ip+0x4e/0x252()
      [    0.647521] Modules linked in:
      [    0.647521] Pid: 15, comm: kstop1 Not tainted 2.6.27-rc1-tip #22686
      [    0.647521]
      [    0.647521] Call Trace:
      [    0.647521]  <NMI>  [<ffffffff8024593f>] warn_on_slowpath+0x5d/0x84
      [    0.647521]  [<ffffffff80220b99>] ? lapic_wd_event+0xb/0x5c
      [    0.647521]  [<ffffffff80287b3b>] ftrace_record_ip+0x4e/0x252
      [    0.647521]  [<ffffffff80211274>] mcount_call+0x5/0x31
      [    0.647521]  [<ffffffff80220b9e>] ? lapic_wd_event+0x10/0x5c
      [    0.647521]  [<ffffffff8083f3ec>] nmi_watchdog_tick+0x19d/0x1ad
      [    0.647521]  [<ffffffff8083e875>] default_do_nmi+0x75/0x1e3
      [    0.647521]  [<ffffffff8083f0b3>] do_nmi+0x5d/0x94
      [    0.647521]  [<ffffffff8083e2d2>] nmi+0xa2/0xc2
      [    0.647521]  [<ffffffff802b48c3>] ? check_bytes_and_report+0x11/0xcc
      [    0.647521]  <<EOE>>  [<ffffffff80211274>] ? mcount_call+0x5/0x31
      [    0.647521]  [<ffffffff802b49df>] check_object+0x61/0x1b0
      [    0.647521]  [<ffffffff802b502a>] __slab_free+0x169/0x2ae
      [    0.647521]  [<ffffffff80242dbf>] ? __cleanup_sighand+0x25/0x27
      [    0.647521]  [<ffffffff80242dbf>] ? __cleanup_sighand+0x25/0x27
      [    0.647521]  [<ffffffff802b60cd>] kmem_cache_free+0x85/0xb9
      [    0.647521]  [<ffffffff80242dbf>] __cleanup_sighand+0x25/0x27
      [    0.647521]  [<ffffffff80247b3d>] release_task+0x256/0x339
      [    0.647521]  [<ffffffff802490b4>] do_exit+0x764/0x7ef
      [    0.647521]  [<ffffffff8027624c>] __xchg+0x0/0x38
      [    0.647521]  [<ffffffff8027619a>] ? stop_cpu+0x0/0xb2
      [    0.647521]  [<ffffffff8027619a>] ? stop_cpu+0x0/0xb2
      [    0.647521]  [<ffffffff8025922f>] kthread+0x4e/0x7b
      [    0.647521]  [<ffffffff80212979>] child_rip+0xa/0x11
      [    0.647521]  [<ffffffff80211c17>] ? restore_args+0x0/0x30
      [    0.647521]  [<ffffffff802283a5>] ? native_load_tls+0x14/0x2e
      [    0.647521]  [<ffffffff802591e1>] ? kthread+0x0/0x7b
      [    0.647521]  [<ffffffff8021296f>] ? child_rip+0x0/0x11
      [    0.647521]
      [    0.647521] ---[ end trace 4eaa2a86a8e2da22 ]---
      [    0.672032] initcall ftrace_dynamic_init+0x0/0xd6 returned 0 after 19 msecs
      
      also mark it no-kprobes while at it.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      8b1fa1d7
    • Ingo Molnar's avatar
      ftrace: ignore functions that cannot be kprobe-ed · 36dcd67a
      Ingo Molnar authored
      kprobes already has an extensive list of annotations for functions
      that should not be instrumented. Add notrace annotations to these
      functions as well.
      
      This is particularly useful for functions called by the NMI path.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      36dcd67a
    • Mathieu Desnoyers's avatar
      tracepoints: use TABLE_SIZE macro · 9795302a
      Mathieu Desnoyers authored
      Steven Rostedt suggested:
      
      | Wouldn't it look nicer to have: (TRACEPOINT_TABLE_SIZE - 1) ?
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      9795302a
    • Pekka Paalanen's avatar
      x86: fix mmiotrace 8-bit register decoding · 611b1597
      Pekka Paalanen authored
      When SIL, DIL, BPL or SPL registers were used in MMIO, the datum
      was extracted from AH, BH, CH, or DH, which are incorrect.
      Signed-off-by: default avatarPekka Paalanen <pq@iki.fi>
      Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
      Cc: "Steven Rostedt" <srostedt@redhat.com>
      Cc: proski@gnu.org
      Cc: "Pekka Enberg"
      	<penberg@cs.helsinki.fi>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      611b1597
    • Ingo Molnar's avatar
      tracing: clean up tracepoints kconfig structure · 5f87f112
      Ingo Molnar authored
      do not expose users to CONFIG_TRACEPOINTS - tracers can select it
      just fine.
      
      update ftrace to select CONFIG_TRACEPOINTS.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      5f87f112
    • Ingo Molnar's avatar
      sched: clean up tracepoints · cf569a93
      Ingo Molnar authored
      make it a bit more structured hence more readable.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      cf569a93
    • Ingo Molnar's avatar
      tracing: disable tracepoints by default · fa340d9c
      Ingo Molnar authored
      while it's arguably low overhead, we dont enable new features by default.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      fa340d9c
    • Mathieu Desnoyers's avatar
      ftrace: port to tracepoints · b07c3f19
      Mathieu Desnoyers authored
      Porting the trace_mark() used by ftrace to tracepoints. (cleanup)
      
      Changelog :
      - Change error messages : marker -> tracepoint
      
      [ mingo@elte.hu: conflict resolutions ]
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: default avatar'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      b07c3f19
    • Mathieu Desnoyers's avatar
      tracing, sched: LTTng instrumentation - scheduler · 0a16b607
      Mathieu Desnoyers authored
      Instrument the scheduler activity (sched_switch, migration, wakeups,
      wait for a task, signal delivery) and process/thread
      creation/destruction (fork, exit, kthread stop). Actually, kthread
      creation is not instrumented in this patch because it is architecture
      dependent. It allows to connect tracers such as ftrace which detects
      scheduling latencies, good/bad scheduler decisions. Tools like LTTng can
      export this scheduler information along with instrumentation of the rest
      of the kernel activity to perform post-mortem analysis on the scheduler
      activity.
      
      About the performance impact of tracepoints (which is comparable to
      markers), even without immediate values optimizations, tests done by
      Hideo Aoki on ia64 show no regression. His test case was using hackbench
      on a kernel where scheduler instrumentation (about 5 events in code
      scheduler code) was added. See the "Tracepoints" patch header for
      performance result detail.
      
      Changelog :
      
      - Change instrumentation location and parameter to match ftrace
        instrumentation, previously done with kernel markers.
      
      [ mingo@elte.hu: conflict resolutions ]
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: default avatar'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      0a16b607
    • Mathieu Desnoyers's avatar
      tracing: tracepoints, samples · 4a089752
      Mathieu Desnoyers authored
      Tracepoint example code under samples/.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: default avatar'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      4a089752
    • Mathieu Desnoyers's avatar
      tracing: tracepoints, documentation · 24b8d831
      Mathieu Desnoyers authored
      Documentation of tracepoint usage.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: default avatar'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      24b8d831
    • Mathieu Desnoyers's avatar
      tracing: Kernel Tracepoints · 97e1c18e
      Mathieu Desnoyers authored
      Implementation of kernel tracepoints. Inspired from the Linux Kernel
      Markers. Allows complete typing verification by declaring both tracing
      statement inline functions and probe registration/unregistration static
      inline functions within the same macro "DEFINE_TRACE". No format string
      is required. See the tracepoint Documentation and Samples patches for
      usage examples.
      
      Taken from the documentation patch :
      
      "A tracepoint placed in code provides a hook to call a function (probe)
      that you can provide at runtime. A tracepoint can be "on" (a probe is
      connected to it) or "off" (no probe is attached). When a tracepoint is
      "off" it has no effect, except for adding a tiny time penalty (checking
      a condition for a branch) and space penalty (adding a few bytes for the
      function call at the end of the instrumented function and adds a data
      structure in a separate section).  When a tracepoint is "on", the
      function you provide is called each time the tracepoint is executed, in
      the execution context of the caller. When the function provided ends its
      execution, it returns to the caller (continuing from the tracepoint
      site).
      
      You can put tracepoints at important locations in the code. They are
      lightweight hooks that can pass an arbitrary number of parameters, which
      prototypes are described in a tracepoint declaration placed in a header
      file."
      
      Addition and removal of tracepoints is synchronized by RCU using the
      scheduler (and preempt_disable) as guarantees to find a quiescent state
      (this is really RCU "classic"). The update side uses rcu_barrier_sched()
      with call_rcu_sched() and the read/execute side uses
      "preempt_disable()/preempt_enable()".
      
      We make sure the previous array containing probes, which has been
      scheduled for deletion by the rcu callback, is indeed freed before we
      proceed to the next update. It therefore limits the rate of modification
      of a single tracepoint to one update per RCU period. The objective here
      is to permit fast batch add/removal of probes on _different_
      tracepoints.
      
      Changelog :
      - Use #name ":" #proto as string to identify the tracepoint in the
        tracepoint table. This will make sure not type mismatch happens due to
        connexion of a probe with the wrong type to a tracepoint declared with
        the same name in a different header.
      - Add tracepoint_entry_free_old.
      - Change __TO_TRACE to get rid of the 'i' iterator.
      
      Masami Hiramatsu <mhiramat@redhat.com> :
      Tested on x86-64.
      
      Performance impact of a tracepoint : same as markers, except that it
      adds about 70 bytes of instructions in an unlikely branch of each
      instrumented function (the for loop, the stack setup and the function
      call). It currently adds a memory read, a test and a conditional branch
      at the instrumentation site (in the hot path). Immediate values will
      eventually change this into a load immediate, test and branch, which
      removes the memory read which will make the i-cache impact smaller
      (changing the memory read for a load immediate removes 3-4 bytes per
      site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it
      also saves the d-cache hit).
      
      About the performance impact of tracepoints (which is comparable to
      markers), even without immediate values optimizations, tests done by
      Hideo Aoki on ia64 show no regression. His test case was using hackbench
      on a kernel where scheduler instrumentation (about 5 events in code
      scheduler code) was added.
      
      Quoting Hideo Aoki about Markers :
      
      I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
      tree, which includes several markers for LTTng, using an ia64 server.
      
      While the immediate trace mark feature isn't implemented on ia64, there
      is no major performance regression. So, I think that we don't have any
      issues to propose merging marker point patches into Linus's tree from
      the viewpoint of performance impact.
      
      I prepared two kernels to evaluate. The first one was compiled without
      CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS.
      
      I downloaded the original hackbench from the following URL:
      http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
      
      I ran hackbench 5 times in each condition and calculated the average and
      difference between the kernels.
      
          The parameter of hackbench: every 50 from 50 to 800
          The number of CPUs of the server: 2, 4, and 8
      
      Below is the results. As you can see, major performance regression
      wasn't found in any case. Even if number of processes increases,
      differences between marker-enabled kernel and marker- disabled kernel
      doesn't increase. Moreover, if number of CPUs increases, the differences
      doesn't increase either.
      
      Curiously, marker-enabled kernel is better than marker-disabled kernel
      in more than half cases, although I guess it comes from the difference
      of memory access pattern.
      
      * 2 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      4.811   |       4.872  |  +0.061  |  +1.27  |
            100 |      9.854   |      10.309  |  +0.454  |  +4.61  |
            150 |     15.602   |      15.040  |  -0.562  |  -3.6   |
            200 |     20.489   |      20.380  |  -0.109  |  -0.53  |
            250 |     25.798   |      25.652  |  -0.146  |  -0.56  |
            300 |     31.260   |      30.797  |  -0.463  |  -1.48  |
            350 |     36.121   |      35.770  |  -0.351  |  -0.97  |
            400 |     42.288   |      42.102  |  -0.186  |  -0.44  |
            450 |     47.778   |      47.253  |  -0.526  |  -1.1   |
            500 |     51.953   |      52.278  |  +0.325  |  +0.63  |
            550 |     58.401   |      57.700  |  -0.701  |  -1.2   |
            600 |     63.334   |      63.222  |  -0.112  |  -0.18  |
            650 |     68.816   |      68.511  |  -0.306  |  -0.44  |
            700 |     74.667   |      74.088  |  -0.579  |  -0.78  |
            750 |     78.612   |      79.582  |  +0.970  |  +1.23  |
            800 |     85.431   |      85.263  |  -0.168  |  -0.2   |
      --------------------------------------------------------------
      
      * 4 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.586   |       2.584  |  -0.003  |  -0.1   |
            100 |      5.254   |       5.283  |  +0.030  |  +0.56  |
            150 |      8.012   |       8.074  |  +0.061  |  +0.76  |
            200 |     11.172   |      11.000  |  -0.172  |  -1.54  |
            250 |     13.917   |      14.036  |  +0.119  |  +0.86  |
            300 |     16.905   |      16.543  |  -0.362  |  -2.14  |
            350 |     19.901   |      20.036  |  +0.135  |  +0.68  |
            400 |     22.908   |      23.094  |  +0.186  |  +0.81  |
            450 |     26.273   |      26.101  |  -0.172  |  -0.66  |
            500 |     29.554   |      29.092  |  -0.461  |  -1.56  |
            550 |     32.377   |      32.274  |  -0.103  |  -0.32  |
            600 |     35.855   |      35.322  |  -0.533  |  -1.49  |
            650 |     39.192   |      38.388  |  -0.804  |  -2.05  |
            700 |     41.744   |      41.719  |  -0.025  |  -0.06  |
            750 |     45.016   |      44.496  |  -0.520  |  -1.16  |
            800 |     48.212   |      47.603  |  -0.609  |  -1.26  |
      --------------------------------------------------------------
      
      * 8 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.094   |       2.072  |  -0.022  |  -1.07  |
            100 |      4.162   |       4.273  |  +0.111  |  +2.66  |
            150 |      6.485   |       6.540  |  +0.055  |  +0.84  |
            200 |      8.556   |       8.478  |  -0.078  |  -0.91  |
            250 |     10.458   |      10.258  |  -0.200  |  -1.91  |
            300 |     12.425   |      12.750  |  +0.325  |  +2.62  |
            350 |     14.807   |      14.839  |  +0.032  |  +0.22  |
            400 |     16.801   |      16.959  |  +0.158  |  +0.94  |
            450 |     19.478   |      19.009  |  -0.470  |  -2.41  |
            500 |     21.296   |      21.504  |  +0.208  |  +0.98  |
            550 |     23.842   |      23.979  |  +0.137  |  +0.57  |
            600 |     26.309   |      26.111  |  -0.198  |  -0.75  |
            650 |     28.705   |      28.446  |  -0.259  |  -0.9   |
            700 |     31.233   |      31.394  |  +0.161  |  +0.52  |
            750 |     34.064   |      33.720  |  -0.344  |  -1.01  |
            800 |     36.320   |      36.114  |  -0.206  |  -0.57  |
      --------------------------------------------------------------
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: default avatarMasami Hiramatsu <mhiramat@redhat.com>
      Acked-by: default avatar'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      97e1c18e
  2. 13 Oct, 2008 27 commits