- 07 Aug, 2020 1 commit
-
-
Steven Rostedt (VMware) authored
As trace_array_printk() used with not global instances will not add noise to the main buffer, they are OK to have in the kernel (unlike trace_printk()). This require the subsystem to create their own tracing instance, and the trace_array_printk() only writes into those instances. Add trace_array_init_printk() to initialize the trace_printk() buffers without printing out the WARNING message. Reported-by: Sean Paul <sean@poorly.run> Reviewed-by: Sean Paul <sean@poorly.run> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 06 Aug, 2020 1 commit
-
-
Muchun Song authored
Fix compiler warning(as show below) for !CONFIG_KPROBES_ON_FTRACE. kernel/kprobes.c: In function 'kill_kprobe': kernel/kprobes.c:1116:33: warning: statement with no effect [-Wunused-value] 1116 | #define disarm_kprobe_ftrace(p) (-ENODEV) | ^ kernel/kprobes.c:2154:3: note: in expansion of macro 'disarm_kprobe_ftrace' 2154 | disarm_kprobe_ftrace(p); Link: https://lore.kernel.org/r/20200805142136.0331f7ea@canb.auug.org.au Link: https://lkml.kernel.org/r/20200805172046.19066-1-songmuchun@bytedance.comReported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 0cb2f137 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 05 Aug, 2020 1 commit
-
-
Steven Rostedt (VMware) authored
On exit, if a process is preempted after the trace_sched_process_exit() tracepoint but before the process is done exiting, then when it gets scheduled in, the function tracers will not filter it properly against the function tracing pid filters. That is because the function tracing pid filters hooks to the sched_process_exit() tracepoint to remove the exiting task's pid from the filter list. Because the filtering happens at the sched_switch tracepoint, when the exiting task schedules back in to finish up the exit, it will no longer be in the function pid filtering tables. This was noticeable in the notrace self tests on a preemptable kernel, as the tests would fail as it exits and preempted after being taken off the notrace filter table and on scheduling back in it would not be in the notrace list, and then the ending of the exit function would trace. The test detected this and would fail. Cc: stable@vger.kernel.org Cc: Namhyung Kim <namhyung@kernel.org> Fixes: 1e10486f ("ftrace: Add 'function-fork' trace option") Fixes: c37775d5 ("tracing: Add infrastructure to allow set_event_pid to follow children" Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 04 Aug, 2020 1 commit
-
-
Masami Hiramatsu authored
Since the parse_args() stops parsing at '--', bootconfig_params() will never get the '--' as param and initargs_found never be true. In the result, if we pass some init arguments via the bootconfig, those are always appended to the kernel command line with '--' even if the kernel command line already has '--'. To fix this correctly, check the return value of parse_args() and set initargs_found true if the return value is not an error but a valid address. Link: https://lkml.kernel.org/r/159650953285.270383.14822353843556363851.stgit@devnote2 Fixes: f61872bb ("bootconfig: Use parse_args() to find bootconfig and '--'") Cc: stable@vger.kernel.org Reported-by: Arvind Sankar <nivedita@alum.mit.edu> Suggested-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 03 Aug, 2020 10 commits
-
-
Masami Hiramatsu authored
Add a sentence about bootconfig override operator (":=") to bootconfig.rst. Link: https://lkml.kernel.org/r/159482884682.126704.7198860675721719878.stgit@devnote2Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Masami Hiramatsu authored
Add some testcases and examples for value override operator. Link: https://lkml.kernel.org/r/159482883824.126704.2166030493721357163.stgit@devnote2Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Masami Hiramatsu authored
Add the value override operator (":=") support to the bootconfig. This value override operator will be useful for the bootloaders which will only update the existing bootconfig according to the bootloader boot options. Without this override operator, the bootloader needs to parse the existing bootconfig and update it. However, with this assignment, it can just append the updated (partial) bootconfig text at the tail of existing one without parsing it. (Of course, it must update the size, checksum and magic, but that will be done easily) Link: https://lkml.kernel.org/r/159482882954.126704.16209517125614438640.stgit@devnote2Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Masami Hiramatsu authored
Remove show_registers() function prototype because this function has been renamed by commit 57da8b96 ("x86: Avoid double stack traces with show_regs()"), and commit 80006dbe ("kprobes/x86: Remove jprobe implementation") has removed the caller in kprobes. So this doesn't exist anymore. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Peng Fan authored
In the function trace_uprobe_register(), the statement "return 0;" out of switch case is dead code, remove it. Link: https://lkml.kernel.org/r/1595561064-29186-1-git-send-email-fanpeng@loongson.cnSigned-off-by: Peng Fan <fanpeng@loongson.cn> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Muchun Song authored
We found a case of kernel panic on our server. The stack trace is as follows(omit some irrelevant information): BUG: kernel NULL pointer dereference, address: 0000000000000080 RIP: 0010:kprobe_ftrace_handler+0x5e/0xe0 RSP: 0018:ffffb512c6550998 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff8e9d16eea018 RCX: 0000000000000000 RDX: ffffffffbe1179c0 RSI: ffffffffc0535564 RDI: ffffffffc0534ec0 RBP: ffffffffc0534ec1 R08: ffff8e9d1bbb0f00 R09: 0000000000000004 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8e9d1f797060 R14: 000000000000bacc R15: ffff8e9ce13eca00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000080 CR3: 00000008453d0005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ftrace_ops_assist_func+0x56/0xe0 ftrace_call+0x5/0x34 tcpa_statistic_send+0x5/0x130 [ttcp_engine] The tcpa_statistic_send is the function being kprobed. After analysis, the root cause is that the fourth parameter regs of kprobe_ftrace_handler is NULL. Why regs is NULL? We use the crash tool to analyze the kdump. crash> dis tcpa_statistic_send -r <tcpa_statistic_send>: callq 0xffffffffbd8018c0 <ftrace_caller> The tcpa_statistic_send calls ftrace_caller instead of ftrace_regs_caller. So it is reasonable that the fourth parameter regs of kprobe_ftrace_handler is NULL. In theory, we should call the ftrace_regs_caller instead of the ftrace_caller. After in-depth analysis, we found a reproducible path. Writing a simple kernel module which starts a periodic timer. The timer's handler is named 'kprobe_test_timer_handler'. The module name is kprobe_test.ko. 1) insmod kprobe_test.ko 2) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}' 3) echo 0 > /proc/sys/kernel/ftrace_enabled 4) rmmod kprobe_test 5) stop step 2) kprobe 6) insmod kprobe_test.ko 7) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}' We mark the kprobe as GONE but not disarm the kprobe in the step 4). The step 5) also do not disarm the kprobe when unregister kprobe. So we do not remove the ip from the filter. In this case, when the module loads again in the step 6), we will replace the code to ftrace_caller via the ftrace_module_enable(). When we register kprobe again, we will not replace ftrace_caller to ftrace_regs_caller because the ftrace is disabled in the step 3). So the step 7) will trigger kernel panic. Fix this problem by disarming the kprobe when the module is going away. Link: https://lkml.kernel.org/r/20200728064536.24405-1-songmuchun@bytedance.com Cc: stable@vger.kernel.org Fixes: ae6aa16f ("kprobes: introduce ftrace based optimization") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Josef Bacik authored
I was attempting to use pid filtering with function_graph, but it wasn't allowing anything to make it through. Turns out ftrace_trace_task returns false if ftrace_ignore_pid is not-empty, which isn't correct anymore. We're now setting it to FTRACE_PID_IGNORE if we need to ignore that pid, otherwise it's set to the pid (which is weird considering the name) or to FTRACE_PID_TRACE. Fix the check to check for != FTRACE_PID_IGNORE. With this we can now use function_graph with pid filtering. Link: https://lkml.kernel.org/r/20200725005048.1790-1-josef@toxicpanda.com Fixes: 717e3f5e ("ftrace: Make function trace pid filtering a bit more exact") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Nick Desaulniers authored
Just a small cleanup while I was touching this header. compiler_attributes.h does feature detection of these __attributes__(()) and provides more concise ways to invoke them. Link: https://lkml.kernel.org/r/20200730224555.2142154-3-ndesaulniers@google.comAcked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Nick Desaulniers authored
__tracepoint_string's have their string data stored in .rodata, and an address to that data stored in the "__tracepoint_str" section. Functions that refer to those strings refer to the symbol of the address. Compiler optimization can replace those address references with references directly to the string data. If the address doesn't appear to have other uses, then it appears dead to the compiler and is removed. This can break the /tracing/printk_formats sysfs node which iterates the addresses stored in the "__tracepoint_str" section. Like other strings stored in custom sections in this header, mark these __used to inform the compiler that there are other non-obvious users of the address, so they should still be emitted. Link: https://lkml.kernel.org/r/20200730224555.2142154-2-ndesaulniers@google.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: stable@vger.kernel.org Fixes: 102c9323 ("tracing: Add __tracepoint_string() to export string pointers") Reported-by: Tim Murray <timmurray@google.com> Reported-by: Simon MacMullen <simonmacm@google.com> Suggested-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Zhaoyang Huang authored
High order memory stuff within trace could introduce OOM, use kvzalloc instead. Please find the bellowing for the call stack we run across in an android system. The scenario happens when traced_probes is woken up to get a large quantity of trace even if free memory is even higher than watermark_low. traced_probes invoked oom-killer: gfp_mask=0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null), order=2, oom_score_adj=-1 traced_probes cpuset=system-background mems_allowed=0 CPU: 3 PID: 588 Comm: traced_probes Tainted: G W O 4.14.181 #1 Hardware name: Generic DT based system (unwind_backtrace) from [<c010d824>] (show_stack+0x20/0x24) (show_stack) from [<c0b2e174>] (dump_stack+0xa8/0xec) (dump_stack) from [<c027d584>] (dump_header+0x9c/0x220) (dump_header) from [<c027cfe4>] (oom_kill_process+0xc0/0x5c4) (oom_kill_process) from [<c027cb94>] (out_of_memory+0x220/0x310) (out_of_memory) from [<c02816bc>] (__alloc_pages_nodemask+0xff8/0x13a4) (__alloc_pages_nodemask) from [<c02a6a1c>] (kmalloc_order+0x30/0x48) (kmalloc_order) from [<c02a6a64>] (kmalloc_order_trace+0x30/0x118) (kmalloc_order_trace) from [<c0223d7c>] (tracing_buffers_open+0x50/0xfc) (tracing_buffers_open) from [<c02e6f58>] (do_dentry_open+0x278/0x34c) (do_dentry_open) from [<c02e70d0>] (vfs_open+0x50/0x70) (vfs_open) from [<c02f7c24>] (path_openat+0x5fc/0x169c) (path_openat) from [<c02f75c4>] (do_filp_open+0x94/0xf8) (do_filp_open) from [<c02e7650>] (do_sys_open+0x168/0x26c) (do_sys_open) from [<c02e77bc>] (SyS_openat+0x34/0x38) (SyS_openat) from [<c0108bc0>] (ret_fast_syscall+0x0/0x28) Link: https://lkml.kernel.org/r/1596155265-32365-1-git-send-email-zhaoyang.huang@unisoc.comSigned-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 31 Jul, 2020 2 commits
-
-
Vincent Whitchurch authored
This comment describes the behaviour before commit 2a820bf7 ("tracing: Use percpu stack trace buffer more intelligently"). Since that commit, interrupts and NMIs do use the per-cpu stacks so the comment is no longer correct. Remove it. (Note that the FTRACE_STACK_SIZE mentioned in the comment has never existed, it probably should have said FTRACE_STACK_ENTRIES.) Link: https://lkml.kernel.org/r/20200727092840.18659-1-vincent.whitchurch@axis.comSigned-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Chengming Zhou authored
When inserting a module, we find all ftrace_ops referencing it on the ftrace_ops_list. But FTRACE_OPS_FL_DIRECT and FTRACE_OPS_FL_IPMODIFY flags are special, and should not be set automatically. So warn and skip ftrace_ops that have these two flags set and adding new code. Also check if only one ftrace_ops references the module, in which case we can use a trampoline as an optimization. Link: https://lkml.kernel.org/r/20200728180554.65203-2-zhouchengming@bytedance.comSigned-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 30 Jul, 2020 3 commits
-
-
Chengming Zhou authored
When module loaded and enabled, we will use __ftrace_replace_code for module if any ftrace_ops referenced it found. But we will get wrong ftrace_addr for module rec in ftrace_get_addr_new, because rec->flags has not been setup correctly. It can cause the callback function of a ftrace_ops has FTRACE_OPS_FL_SAVE_REGS to be called with pt_regs set to NULL. So setup correct FTRACE_FL_REGS flags for rec when we call referenced_filters to find ftrace_ops references it. Link: https://lkml.kernel.org/r/20200728180554.65203-1-zhouchengming@bytedance.com Cc: stable@vger.kernel.org Fixes: 8c4f3c3f ("ftrace: Check module functions being traced on reload") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Kevin Hao authored
In calculation of the cpu mask for the hwlat kernel thread, the wrong cpu mask is used instead of the tracing_cpumask, this causes the tracing/tracing_cpumask useless for hwlat tracer. Fixes it. Link: https://lkml.kernel.org/r/20200730082318.42584-2-haokexin@gmail.com Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 0330f7aa ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs") Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Kevin Hao authored
We have set 'current_mask' to '&save_cpumask' in its declaration, so there is no need to assign again. Link: https://lkml.kernel.org/r/20200730082318.42584-1-haokexin@gmail.comSigned-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 09 Jul, 2020 2 commits
-
-
Wei Yang authored
Static defined trace_event->type stops at (__TRACE_LAST_TYPE - 1) and dynamic trace_event->type starts from (__TRACE_LAST_TYPE + 1). To save one trace_event->type index, let's use __TRACE_LAST_TYPE. Link: https://lkml.kernel.org/r/20200703020612.12930-3-richard.weiyang@linux.alibaba.comSigned-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Wei Yang authored
The value to be used and compared in trace_search_list() is "last + 1". Let's just define next to be "last + 1" instead of doing the addition each time. Link: https://lkml.kernel.org/r/20200703020612.12930-2-richard.weiyang@linux.alibaba.comSigned-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 02 Jul, 2020 4 commits
-
-
Steven Rostedt (VMware) authored
After tweaking the ring buffer to be a bit faster, a warning is triggering on one of my machines, and causing my tests to fail. This warning is caused when the delta (current time stamp minus previous time stamp), is larger than the max time held by the ring buffer (59 bits). If the clock were to go backwards slightly, this would then easily trigger this warning. The machine that it triggered on, the clock did go backwards by around 450 nanoseconds, and this happened after a recalibration of the TSC clock. Now that the ring buffer is faster, it detects this, and the delta that is used larger than the max, the warning is triggered and my test fails. To handle the clock going backwards, look at the saved before and after time stamps. If they are the same, it means that the current event did not interrupt another event, and that those timestamp are of a previous event that was recorded. If the max delta is triggered, look at those time stamps, make sure they are the same, then use them to compare with the current timestamp. If the current timestamp is less than the before/after time stamps, then that means the clock being used went backward. Print out a message that this has happened, but do not warn about it (and only print the message once). Still do the warning if the delta is indeed larger than what can be used. Also remove the unneeded KERN_WARNING from the WARN_ONCE() print. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
After doing some benchmarks and examining the code, I found that the ring buffer clock calls were quite expensive, and noticed that it uses retpolines. This is because the ring buffer clock is programmable, and can be set. But in most cases it simply uses the fastest ns unit clock which is the trace_clock_local(). For RETPOLINE builds, checking if the ring buffer clock is set to trace_clock_local() and then calling it directly has brought the time of an event on my i7 box from an average of 93 nanoseconds an event down to 83 nanoseconds an event, and the minimum time from 81 nanoseconds to 68 nanoseconds! Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
Make a helper function rb_add_timestamp() that moves the adding of the extended time stamps into its own function. Also, remove the noinline and inline for the functions it calls, as recent benchmarks appear they do not make a difference (just let gcc decide). Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
Reorganize a little the logic to handle adding the absolute time stamp, extended and forced time stamps, in such a way to remove a branch or two. This is just a micro optimization. Also add before and after time stamps to the rb_event_info structure to display those values in the rb_check_timestamps() code, if something were to go wrong. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 30 Jun, 2020 10 commits
-
-
Steven Rostedt (VMware) authored
It is the uncommon case where an event crosses a sub buffer boundary (page) mark that check at the end of reserving an event as unlikely. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Nicholas Piggin authored
On a 144 thread system, `perf ftrace` takes about 20 seconds to start up, due to calling synchronize_rcu() for each CPU. cat /proc/108560/stack 0xc0003e7eb336f470 __switch_to+0x2e0/0x480 __wait_rcu_gp+0x20c/0x220 synchronize_rcu+0x9c/0xc0 ring_buffer_reset_cpu+0x88/0x2e0 tracing_reset_online_cpus+0x84/0xe0 tracing_open+0x1d4/0x1f0 On a system with 10x more threads, it starts to become an annoyance. Batch these up so we disable all the per-cpu buffers first, then synchronize_rcu() once, then reset each of the buffers. This brings the time down to about 0.5s. Link: https://lkml.kernel.org/r/20200625053403.2386972-1-npiggin@gmail.comTested-by: Anton Blanchard <anton@ozlabs.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
After a discussion with the new time algorithm to have nested events still have proper time keeping but required using local64_t atomic operations. Mathieu was concerned about the performance this would have on 32 bit machines, as in most cases, atomic 64 bit operations on them can be expensive. As the ring buffer's timing needs do not require full features of local64_t, a wrapper is made to implement a new rb_time_t operation that uses two longs on 32 bit machines but still uses the local64_t operations on 64 bit machines. There's a switch that can be made in the file to force 64 bit to use the 32 bit version just for testing purposes. All reads do not need to succeed if a read happened while the stamp being read is in the process of being updated. The requirement is that all reads must succed that were done by an interrupting event (where this event was interrupted by another event that did the write). Or if the event itself did the write first. That is: rb_time_set(t, x) followed by rb_time_read(t) will always succeed (even if it gets interrupted by another event that writes to t. The result of the read will be either the previous set, or a set performed by an interrupting event. If the read is done by an event that interrupted another event that was in the process of setting the time stamp, and no other event came along to write to that time stamp, it will fail and the rb_time_read() will return that it failed (the value to read will be undefined). A set will always write to the time stamp and return with a valid time stamp, such that any read after it will be valid. A cmpxchg may fail if it interrupted an event that was in the process of updating the time stamp just like the reads do. Other than that, it will act like a normal cmpxchg. The way this works is that the rb_time_t is made of of three fields. A cnt, that gets updated atomically everyting a modification is made. A top that represents the most significant 30 bits of the time, and a bottom to represent the least significant 30 bits of the time. Notice, that the time values is only 60 bits long (where the ring buffer only uses 59 bits, which gives us 18 years of nanoseconds!). The top two bits of both the top and bottom is a 2 bit counter that gets set by the value of the least two significant bits of the cnt. A read of the top and the bottom where both the top and bottom have the same most significant top 2 bits, are considered a match and a valid 60 bit number can be created from it. If they do not match, then the number is considered invalid, and this must only happen if an event interrupted another event in the midst of updating the time stamp. This is only used for 32 bits machines as 64 bit machines can get better performance out of the local64_t. This has been tested heavily by forcing 64 bit to use this logic. Link: https://lore.kernel.org/r/20200625225345.18cf5881@oasis.local.home Link: http://lkml.kernel.org/r/20200629025259.309232719@goodmis.orgInspired-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
Instead of calling out the absolute test for each time to check if the ring buffer wants absolute time stamps for all its recording, incorporate it with the add_timestamp field and turn it into flags for faster processing between wanting a absolute tag and needing to force one. Link: http://lkml.kernel.org/r/20200629025259.154892368@goodmis.orgSigned-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
Up until now, if an event is interrupted while it is recorded by an interrupt, and that interrupt records events, the time of those events will all be the same. This is because events only record the delta of the time since the previous event (or beginning of a page), and to handle updating the time keeping for that of nested events is extremely racy. After years of thinking about this and several failed attempts, I finally have a solution to solve this puzzle. The problem is that you need to atomically calculate the delta and then update the time stamp you made the delta from, as well as then record it into the buffer, all this while at any time an interrupt can come in and do the same thing. This is easy to solve with heavy weight atomics, but that would be detrimental to the performance of the ring buffer. The current state of affairs sacrificed the time deltas for nested events for performance. The reason for previous failed attempts at solving this puzzle was because I was trying to completely avoid slow atomic operations like cmpxchg. I final came to the conclusion to always avoid cmpxchg is not possible, which is why those previous attempts always failed. But it is possible to pick one path (the most common case) and avoid cmpxchg in that path, which is the "fast path". The most common case is that an event will not be interrupted and have other events added into it. An event can detect if it has interrupted another event, and for these cases we can make it the slow path and use the heavy operations like cmpxchg. One more player was added to the game that made this possible, and that is the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59 bit time stamp. (Of course this breaks if a machine is running for more than 18 years without a reboot!). There's barrier() placements around for being paranoid, even when they are not needed because of other atomic functions near by. But those should not hurt, as if they are not needed, they basically become a nop. Note, this also makes the race window much smaller, which means there are less slow paths to slow down the performance. The basic idea is that there's two main paths taken. 1) Not being interrupted between time stamps and reserving buffer space. In this case, the time stamps taken are true to the location in the buffer. 2) Was interrupted by another path between taking time stamps and reserving buffer space. The objective is to know what the delta is from the last reserved location in the buffer. As it is possible to detect if an event is interrupting another event before reserving data, space is added to the length to be reserved to inject a full time stamp along with the event being reserved. When an event is not interrupted, the write stamp is always the time of the last event written to the buffer. In path 1, there's two sub paths we care about: a) The event did not interrupt another event. b) The event interrupted another event. In case a, as the write stamp was read and known to be correct, the delta between the current time stamp and the write stamp is the delta between the current event and the previously recorded event. In case b, extra space was reserved to just put the full time stamp into the buffer. Which is done, as stated, in this path the time stamp taken is known to match the location in the buffer. In path 2, there's also two sub paths we care about: a) The event was not interrupted by another event since it reserved space on the buffer and re-reading the write stamp. b) The event was interrupted by another event. In case a, the write stamp is that of the last event that interrupted this event between taking the time stamps and reserving. As no event came in after re-reading the write stamp, that event is known to be the time of the event directly before this event and the delta can be the new time stamp and the write stamp. In case b, one or more events came in between reserving the event and re-reading he write stamp. Since this event's buffer reservation is between other events at this path, there's no way to know what the delta is. But because an event interrupted this event after it started, its fine to just give a zero delta, and take the same time stamp as the events that happened within the event being recorded. Here's the implementation of the design of this solution: All this is per cpu, and only needs to worry about nested events (not parallel events). The players: write_tail: The index in the buffer where new events can be written to. It is incremented via local_add() to reserve space for a new event. before_stamp: A time stamp set by all events before reserving space. write_stamp: A time stamp updated by events after it has successfully reserved space. /* Save the current position of write */ [A] w = local_read(write_tail); barrier(); /* Read both before and write stamps before touching anything */ before = local_read(before_stamp); after = local_read(write_stamp); barrier(); /* * If before and after are the same, then this event is not * interrupting a time update. If it is, then reserve space for adding * a full time stamp (this can turn into a time extend which is * just an extended time delta but fill up the extra space). */ if (after != before) abs = true; ts = clock(); /* Now update the before_stamp (everyone does this!) */ [B] local_set(before_stamp, ts); /* Now reserve space on the buffer */ [C] write = local_add_return(len, write_tail); /* Set tail to be were this event's data is */ tail = write - len; if (w == tail) { /* Nothing interrupted this between A and C */ [D] local_set(write_stamp, ts); barrier(); [E] save_before = local_read(before_stamp); if (!abs) { /* This did not interrupt a time update */ delta = ts - after; } else { delta = ts; /* The full time stamp will be in use */ } if (ts != save_before) { /* slow path - Was interrupted between C and E */ /* The update to write_stamp could have overwritten the update to * it by the interrupting event, but before and after should be * the same for all completed top events */ after = local_read(write_stamp); if (save_before > after) local_cmpxchg(write_stamp, after, save_before); } } else { /* slow path - Interrupted between A and C */ after = local_read(write_stamp); temp_ts = clock(); barrier(); [F] if (write == local_read(write_tail) && after < temp_ts) { /* This was not interrupted since C and F * The last write_stamp is still valid for the previous event * in the buffer. */ delta = temp_ts - after; /* OK to keep this new time stamp */ ts = temp_ts; } else { /* Interrupted between C and F * Well, there's no use to try to know what the time stamp * is for the previous event. Just set delta to zero and * be the same time as that event that interrupted us before * the reservation of the buffer. */ delta = 0; } /* No need to use full timestamps here */ abs = 0; } Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.orgReviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
If a process has the trace_pipe open on a trace_array, the current tracer for that trace array should not be changed. This was original enforced by a global lock, but when instances were introduced, it was moved to the current_trace. But this structure is shared by all instances, and a trace_pipe is for a single instance. There's no reason that a process that has trace_pipe open on one instance should prevent another instance from changing its current tracer. Move the reference counter to the trace_array instead. This is marked as "Fixes" but is more of a clean up than a true fix. Backport if you want, but its not critical. Fixes: cf6ab6d9 ("tracing: Add ref count to tracer for when they are being read by pipe") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Wei Yang authored
After the previous cleanup, DEFINE_EVENT_PRINT's definition has no relationship with DEFINE_EVENT. So After we re-define DEFINE_EVENT, it is not necessary to define DEFINE_EVENT_PRINT to be empty again. Link: http://lkml.kernel.org/r/20200612092844.56107-5-richard.weiyang@linux.alibaba.comSigned-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Wei Yang authored
Current definition define DEFINE_EVENT_PRINT to be DEFINE_EVENT. Actually, at this point DEFINE_EVENT is already an empty macro. Let's cut the relationship between DEFINE_EVENT_PRINT and DEFINE_EVENT. Link: http://lkml.kernel.org/r/20200612092844.56107-4-richard.weiyang@linux.alibaba.comSigned-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Wei Yang authored
The definition of DEFINE_EVENT_PRINT is not changed after previous one, so not necessary to re-define is as the same form. Link: http://lkml.kernel.org/r/20200612092844.56107-3-richard.weiyang@linux.alibaba.comSigned-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Wei Yang authored
After un-define DEFINE_EVENT in Stage 2, DEFINE_EVENT is not defined to a specific form. It is not necessary to un-define it again. Let's skip this. Link: http://lkml.kernel.org/r/20200612092844.56107-2-richard.weiyang@linux.alibaba.comSigned-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 29 Jun, 2020 4 commits
-
-
Steven Rostedt (VMware) authored
When creating a trampoline based on the ftrace_regs_caller code, nop out the jnz test that would jmup to the code that would return to a direct caller (stored in the ORIG_RAX field) and not back to the function that called it. Link: http://lkml.kernel.org/r/20200422162750.638839749@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
If a direct hook is attached to a function that ftrace also has a function attached to it, then it is required that the ftrace_ops_list_func() is used to iterate over the registered ftrace callbacks. This will also include the direct ftrace_ops helper, that tells ftrace_regs_caller where to return to (the direct callback and not the function that called it). As this direct helper is only to handle the case of ftrace callbacks attached to the same function as the direct callback, the ftrace callback allocated trampolines (used to only call them), should never be used to return back to a direct callback. Only copy the portion of the ftrace_regs_caller that will return back to what called it, and not the portion that returns back to the direct caller. The direct ftrace_ops must then pick the ftrace_regs_caller builtin function as its own trampoline to ensure that it will never have one allocated for it (which would not include the handling of direct callbacks). Link: http://lkml.kernel.org/r/20200422162750.495903799@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
If a direct function is hooked along with one of the ftrace registered functions, then the ftrace_regs_caller is attached to the function that shares the direct hook as well as the ftrace hook. The ftrace_regs_caller will call ftrace_ops_list_func() that iterates through all the registered ftrace callbacks, and if there's a direct callback attached to that function, the direct ftrace_ops callback is called to notify that ftrace_regs_caller to return to the direct caller instead of going back to the function that called it. But this is a very uncommon case. Currently, the code has it as the default case. Modify ftrace_regs_caller to make the default case (the non jump) to just return normally, and have the jump to the handling of the direct caller. Link: http://lkml.kernel.org/r/20200422162750.350373278@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
Steven Rostedt (VMware) authored
To prevent default "trace_printks()" from spamming the top level tracing ring buffer, only allow trace instances to use trace_array_printk() (which can be used without the trace_printk() start up warning). Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-
- 28 Jun, 2020 1 commit
-
-
Linus Torvalds authored
-