1. 16 Aug, 2021 14 commits
  2. 12 Aug, 2021 6 commits
    • Steven Rostedt (VMware)'s avatar
      tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name · 5acce0bf
      Steven Rostedt (VMware) authored
      The following commands:
      
       # echo 'read_max u64 size;' > synthetic_events
       # echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)' > events/syscalls/sys_enter_read/trigger
      
      Causes:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU: 4 PID: 1763 Comm: bash Not tainted 5.14.0-rc2-test+ #155
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
      v03.03 07/14/2016
       RIP: 0010:strcmp+0xc/0x20
       Code: 75 f7 31 c0 0f b6 0c 06 88 0c 02 48 83 c0 01 84 c9 75 f1 4c 89 c0
      c3 0f 1f 80 00 00 00 00 31 c0 eb 08 48 83 c0 01 84 d2 74 0f <0f> b6 14 07
      3a 14 06 74 ef 19 c0 83 c8 01 c3 31 c0 c3 66 90 48 89
       RSP: 0018:ffffb5fdc0963ca8 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffffffffb3a4e040 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffff9714c0d0b640 RDI: 0000000000000000
       RBP: 0000000000000000 R08: 00000022986b7cde R09: ffffffffb3a4dff8
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9714c50603c8
       R13: 0000000000000000 R14: ffff97143fdf9e48 R15: ffff9714c01a2210
       FS:  00007f1fa6785740(0000) GS:ffff9714da400000(0000)
      knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 000000002d863004 CR4: 00000000001706e0
       Call Trace:
        __find_event_file+0x4e/0x80
        action_create+0x6b7/0xeb0
        ? kstrdup+0x44/0x60
        event_hist_trigger_func+0x1a07/0x2130
        trigger_process_regex+0xbd/0x110
        event_trigger_write+0x71/0xd0
        vfs_write+0xe9/0x310
        ksys_write+0x68/0xe0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f1fa6879e87
      
      The problem was the "trace(read_max,count)" where the "count" should be
      "$count" as "onmax()" only handles variables (although it really should be
      able to figure out that "count" is a field of sys_enter_read). But there's
      a path that does not find the variable and ends up passing a NULL for the
      event, which ends up getting passed to "strcmp()".
      
      Add a check for NULL to return and error on the command with:
      
       # cat error_log
        hist:syscalls:sys_enter_read: error: Couldn't create or find variable
        Command: hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)
                                      ^
      Link: https://lkml.kernel.org/r/20210808003011.4037f8d0@oasis.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 50450603 tracing: Add 'onmax' hist trigger action support
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      5acce0bf
    • Masami Hiramatsu's avatar
      init: Suppress wrong warning for bootconfig cmdline parameter · d0ac5fba
      Masami Hiramatsu authored
      Since the 'bootconfig' command line parameter is handled before
      parsing the command line, it doesn't use early_param(). But in
      this case, kernel shows a wrong warning message about it.
      
      [    0.013714] Kernel command line: ro console=ttyS0  bootconfig console=tty0
      [    0.013741] Unknown command line parameters: bootconfig
      
      To suppress this message, add a dummy handler for 'bootconfig'.
      
      Link: https://lkml.kernel.org/r/162812945097.77369.1849780946468010448.stgit@devnote2
      
      Fixes: 86d1919a ("init: print out unknown kernel parameters")
      Reviewed-by: default avatarAndrew Halaney <ahalaney@redhat.com>
      Signed-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      d0ac5fba
    • Lukas Bulwahn's avatar
      tracing: define needed config DYNAMIC_FTRACE_WITH_ARGS · 12f9951d
      Lukas Bulwahn authored
      Commit 2860cd8a ("livepatch: Use the default ftrace_ops instead of
      REGS when ARGS is available") intends to enable config LIVEPATCH when
      ftrace with ARGS is available. However, the chain of configs to enable
      LIVEPATCH is incomplete, as HAVE_DYNAMIC_FTRACE_WITH_ARGS is available,
      but the definition of DYNAMIC_FTRACE_WITH_ARGS, combining DYNAMIC_FTRACE
      and HAVE_DYNAMIC_FTRACE_WITH_ARGS, needed to enable LIVEPATCH, is missing
      in the commit.
      
      Fortunately, ./scripts/checkkconfigsymbols.py detects this and warns:
      
      DYNAMIC_FTRACE_WITH_ARGS
      Referencing files: kernel/livepatch/Kconfig
      
      So, define the config DYNAMIC_FTRACE_WITH_ARGS analogously to the already
      existing similar configs, DYNAMIC_FTRACE_WITH_REGS and
      DYNAMIC_FTRACE_WITH_DIRECT_CALLS, in ./kernel/trace/Kconfig to connect the
      chain of configs.
      
      Link: https://lore.kernel.org/kernel-janitors/CAKXUXMwT2zS9fgyQHKUUiqo8ynZBdx2UEUu1WnV_q0OCmknqhw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20210806195027.16808-1-lukas.bulwahn@gmail.com
      
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: stable@vger.kernel.org
      Fixes: 2860cd8a ("livepatch: Use the default ftrace_ops instead of REGS when ARGS is available")
      Signed-off-by: default avatarLukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      12f9951d
    • Daniel Bristot de Oliveira's avatar
      trace/osnoise: Print a stop tracing message · 0e05ba49
      Daniel Bristot de Oliveira authored
      When using osnoise/timerlat with stop tracing, sometimes it is
      not clear in which CPU the stop condition was hit, mainly
      when using some extra events.
      
      Print a message informing in which CPU the trace stopped, like
      in the example below:
      
                <idle>-0       [006] d.h.  2932.676616: #1672599 context    irq timer_latency     34689 ns
                <idle>-0       [006] dNh.  2932.676618: irq_noise: local_timer:236 start 2932.676615639 duration 2391 ns
                <idle>-0       [006] dNh.  2932.676620: irq_noise: virtio0-output.0:47 start 2932.676620180 duration 86 ns
                <idle>-0       [003] d.h.  2932.676621: #1673374 context    irq timer_latency      1200 ns
                <idle>-0       [006] d...  2932.676623: thread_noise: swapper/6:0 start 2932.676615964 duration 4339 ns
                <idle>-0       [003] dNh.  2932.676623: irq_noise: local_timer:236 start 2932.676620597 duration 1881 ns
                <idle>-0       [006] d...  2932.676623: sched_switch: prev_comm=swapper/6 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/6 next_pid=852 next_prio=4
            timerlat/6-852     [006] ....  2932.676623: #1672599 context thread timer_latency     41931 ns
                <idle>-0       [003] d...  2932.676623: thread_noise: swapper/3:0 start 2932.676620854 duration 880 ns
                <idle>-0       [003] d...  2932.676624: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/3 next_pid=849 next_prio=4
            timerlat/6-852     [006] ....  2932.676624: timerlat_main: stop tracing hit on cpu 6
            timerlat/3-849     [003] ....  2932.676624: #1673374 context thread timer_latency      4310 ns
      
      Link: https://lkml.kernel.org/r/b30a0d7542adba019185f44ee648e60e14923b11.1626598844.git.bristot@kernel.org
      
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      0e05ba49
    • Daniel Bristot de Oliveira's avatar
      trace/timerlat: Add a header with PREEMPT_RT additional fields · e1c4ad4a
      Daniel Bristot de Oliveira authored
      Some extra flags are printed to the trace header when using the
      PREEMPT_RT config. The extra flags are: need-resched-lazy,
      preempt-lazy-depth, and migrate-disable.
      
      Without printing these fields, the timerlat specific fields are
      shifted by three positions, for example:
      
       # tracer: timerlat
       #
       #                                _-----=> irqs-off
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth
       #                              || /
       #                              ||||             ACTIVATION
       #           TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY
       #              | |         |   ||||      |         |                  |                       |
                 <idle>-0       [000] d..h...  3279.798871: #1     context    irq timer_latency       830 ns
                  <...>-807     [000] .......  3279.798881: #1     context thread timer_latency     11301 ns
      
      Add a new header for timerlat with the missing fields, to be used
      when the PREEMPT_RT is enabled.
      
      Link: https://lkml.kernel.org/r/babb83529a3211bd0805be0b8c21608230202c55.1626598844.git.bristot@kernel.org
      
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      e1c4ad4a
    • Daniel Bristot de Oliveira's avatar
      trace/osnoise: Add a header with PREEMPT_RT additional fields · d03721a6
      Daniel Bristot de Oliveira authored
      Some extra flags are printed to the trace header when using the
      PREEMPT_RT config. The extra flags are: need-resched-lazy,
      preempt-lazy-depth, and migrate-disable.
      
      Without printing these fields, the osnoise specific fields are
      shifted by three positions, for example:
      
       # tracer: osnoise
       #
       #                                _-----=> irqs-off
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth                            MAX
       #                              || /                                             SINGLE      Interference counters:
       #                              ||||               RUNTIME      NOISE  %% OF CPU  NOISE    +-----------------------------+
       #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
       #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
                  <...>-741     [000] .......  1105.690909: 1000000        234  99.97660      36     21      0   1001     22      3
                  <...>-742     [001] .......  1105.691923: 1000000        281  99.97190     197      7      0   1012     35     14
                  <...>-743     [002] .......  1105.691958: 1000000       1324  99.86760     118     11      0   1016    155    143
                  <...>-744     [003] .......  1105.691998: 1000000        109  99.98910      21      4      0   1004     33      7
                  <...>-745     [004] .......  1105.692015: 1000000       2023  99.79770      97     37      0   1023     52     18
      
      Add a new header for osnoise with the missing fields, to be used
      when the PREEMPT_RT is enabled.
      
      Link: https://lkml.kernel.org/r/1f03289d2a51fde5a58c2e7def063dc630820ad1.1626598844.git.bristot@kernel.org
      
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      d03721a6
  3. 06 Aug, 2021 1 commit
  4. 05 Aug, 2021 3 commits
  5. 04 Aug, 2021 4 commits
    • Hui Su's avatar
      scripts/tracing: fix the bug that can't parse raw_trace_func · 1c0cec64
      Hui Su authored
      Since commit 77271ce4 ("tracing: Add irq, preempt-count and need resched info
      to default trace output"), the default trace output format has been changed to:
                <idle>-0       [009] d.h. 22420.068695: _raw_spin_lock_irqsave <-hrtimer_interrupt
                <idle>-0       [000] ..s. 22420.068695: _nohz_idle_balance <-run_rebalance_domains
                <idle>-0       [011] d.h. 22420.068695: account_process_tick <-update_process_times
      
      origin trace output format:(before v3.2.0)
           # tracer: nop
           #
           #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
           #              | |       |          |         |
                migration/0-6     [000]    50.025810: rcu_note_context_switch <-__schedule
                migration/0-6     [000]    50.025812: trace_rcu_utilization <-rcu_note_context_switch
                migration/0-6     [000]    50.025813: rcu_sched_qs <-rcu_note_context_switch
                migration/0-6     [000]    50.025815: rcu_preempt_qs <-rcu_note_context_switch
                migration/0-6     [000]    50.025817: trace_rcu_utilization <-rcu_note_context_switch
                migration/0-6     [000]    50.025818: debug_lockdep_rcu_enabled <-__schedule
                migration/0-6     [000]    50.025820: debug_lockdep_rcu_enabled <-__schedule
      
      The draw_functrace.py(introduced in v2.6.28) can't parse the new version format trace_func,
      So we need modify draw_functrace.py to adapt the new version trace output format.
      
      Link: https://lkml.kernel.org/r/20210611022107.608787-1-suhui@zeku.com
      
      Cc: stable@vger.kernel.org
      Fixes: 77271ce4 tracing: Add irq, preempt-count and need resched info to default trace output
      Signed-off-by: default avatarHui Su <suhui@zeku.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      1c0cec64
    • Nathan Chancellor's avatar
      scripts/recordmcount.pl: Remove check_objcopy() and $can_use_local · b18b851b
      Nathan Chancellor authored
      When building ARCH=riscv allmodconfig with llvm-objcopy, the objcopy
      version warning from this script appears:
      
      WARNING: could not find objcopy version or version is less than 2.17.
              Local function references are disabled.
      
      The check_objcopy() function in scripts/recordmcount.pl is set up to
      parse GNU objcopy's version string, not llvm-objcopy's, which triggers
      the warning.
      
      Commit 799c4341 ("kbuild: thin archives make default for all archs")
      made binutils 2.20 mandatory and commit ba64beb1 ("kbuild: check the
      minimum assembler version in Kconfig") enforces this at configuration
      time so just remove check_objcopy() and $can_use_local instead, assuming
      --globalize-symbol is always available.
      
      llvm-objcopy has supported --globalize-symbol since LLVM 7.0.0 in 2018
      and the minimum version for building the kernel with LLVM is 10.0.1 so
      there is no issue introduced:
      
      Link: https://github.com/llvm/llvm-project/commit/ee5be798dae30d5f9414b01f76ff807edbc881aa
      Link: https://lkml.kernel.org/r/20210802210307.3202472-1-nathan@kernel.orgReviewed-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      b18b851b
    • Masami Hiramatsu's avatar
      tracing: Reject string operand in the histogram expression · a9d10ca4
      Masami Hiramatsu authored
      Since the string type can not be the target of the addition / subtraction
      operation, it must be rejected. Without this fix, the string type silently
      converted to digits.
      
      Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2
      
      Cc: stable@vger.kernel.org
      Fixes: 100719dc ("tracing: Add simple expression support to hist triggers")
      Signed-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      a9d10ca4
    • Steven Rostedt (VMware)'s avatar
      tracing / histogram: Give calculation hist_fields a size · 2c05caa7
      Steven Rostedt (VMware) authored
      When working on my user space applications, I found a bug in the synthetic
      event code where the automated synthetic event field was not matching the
      event field calculation it was attached to. Looking deeper into it, it was
      because the calculation hist_field was not given a size.
      
      The synthetic event fields are matched to their hist_fields either by
      having the field have an identical string type, or if that does not match,
      then the size and signed values are used to match the fields.
      
      The problem arose when I tried to match a calculation where the fields
      were "unsigned int". My tool created a synthetic event of type "u32". But
      it failed to match. The string was:
      
        diff=field1-field2:onmatch(event).trace(synth,$diff)
      
      Adding debugging into the kernel, I found that the size of "diff" was 0.
      And since it was given "unsigned int" as a type, the histogram fallback
      code used size and signed. The signed matched, but the size of u32 (4) did
      not match zero, and the event failed to be created.
      
      This can be worse if the field you want to match is not one of the
      acceptable fields for a synthetic event. As event fields can have any type
      that is supported in Linux, this can cause an issue. For example, if a
      type is an enum. Then there's no way to use that with any calculations.
      
      Have the calculation field simply take on the size of what it is
      calculating.
      
      Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home
      
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 100719dc ("tracing: Add simple expression support to hist triggers")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      2c05caa7
  6. 30 Jul, 2021 1 commit
  7. 23 Jul, 2021 6 commits
    • Steven Rostedt (VMware)'s avatar
      tracepoints: Update static_call before tp_funcs when adding a tracepoint · 352384d5
      Steven Rostedt (VMware) authored
      Because of the significant overhead that retpolines pose on indirect
      calls, the tracepoint code was updated to use the new "static_calls" that
      can modify the running code to directly call a function instead of using
      an indirect caller, and this function can be changed at runtime.
      
      In the tracepoint code that calls all the registered callbacks that are
      attached to a tracepoint, the following is done:
      
      	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
      	if (it_func_ptr) {
      		__data = (it_func_ptr)->data;
      		static_call(tp_func_##name)(__data, args);
      	}
      
      If there's just a single callback, the static_call is updated to just call
      that callback directly. Once another handler is added, then the static
      caller is updated to call the iterator, that simply loops over all the
      funcs in the array and calls each of the callbacks like the old method
      using indirect calling.
      
      The issue was discovered with a race between updating the funcs array and
      updating the static_call. The funcs array was updated first and then the
      static_call was updated. This is not an issue as long as the first element
      in the old array is the same as the first element in the new array. But
      that assumption is incorrect, because callbacks also have a priority
      field, and if there's a callback added that has a higher priority than the
      callback on the old array, then it will become the first callback in the
      new array. This means that it is possible to call the old callback with
      the new callback data element, which can cause a kernel panic.
      
      	static_call = callback1()
      	funcs[] = {callback1,data1};
      	callback2 has higher priority than callback1
      
      	CPU 1				CPU 2
      	-----				-----
      
         new_funcs = {callback2,data2},
                     {callback1,data1}
      
         rcu_assign_pointer(tp->funcs, new_funcs);
      
        /*
         * Now tp->funcs has the new array
         * but the static_call still calls callback1
         */
      
      				it_func_ptr = tp->funcs [ new_funcs ]
      				data = it_func_ptr->data [ data2 ]
      				static_call(callback1, data);
      
      				/* Now callback1 is called with
      				 * callback2's data */
      
      				[ KERNEL PANIC ]
      
         update_static_call(iterator);
      
      To prevent this from happening, always switch the static_call to the
      iterator before assigning the tp->funcs to the new array. The iterator will
      always properly match the callback with its data.
      
      To trigger this bug:
      
        In one terminal:
      
          while :; do hackbench 50; done
      
        In another terminal
      
          echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
          while :; do
              echo 1 > /sys/kernel/tracing/set_event_pid;
              sleep 0.5
              echo 0 > /sys/kernel/tracing/set_event_pid;
              sleep 0.5
         done
      
      And it doesn't take long to crash. This is because the set_event_pid adds
      a callback to the sched_waking tracepoint with a high priority, which will
      be called before the sched_waking trace event callback is called.
      
      Note, the removal to a single callback updates the array first, before
      changing the static_call to single callback, which is the proper order as
      the first element in the array is the same as what the static_call is
      being changed to.
      
      Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
      
      Cc: stable@vger.kernel.org
      Fixes: d25e37d8 ("tracepoint: Optimize using static_call()")
      Reported-by: default avatarStefan Metzmacher <metze@samba.org>
      tested-by: default avatarStefan Metzmacher <metze@samba.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      352384d5
    • Colin Ian King's avatar
      ftrace: Remove redundant initialization of variable ret · 3b1a8f45
      Colin Ian King authored
      The variable ret is being initialized with a value that is never
      read, it is being updated later on. The assignment is redundant and
      can be removed.
      
      Link: https://lkml.kernel.org/r/20210721120915.122278-1-colin.king@canonical.com
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      3b1a8f45
    • Nicolas Saenz Julienne's avatar
      ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary · 68e83498
      Nicolas Saenz Julienne authored
      synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on
      all CPUs. It is a costly operation and, when targeting nohz_full CPUs,
      very disrupting (hence the name). So avoid calling it when 'old_hash'
      doesn't need to be freed.
      
      Link: https://lkml.kernel.org/r/20210721114726.1545103-1-nsaenzju@redhat.comSigned-off-by: default avatarNicolas Saenz Julienne <nsaenzju@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      68e83498
    • Steven Rostedt (VMware)'s avatar
      tracing: Clean up alloc_synth_event() · 9528c195
      Steven Rostedt (VMware) authored
      alloc_synth_event() currently has the following code to initialize the
      event fields and dynamic_fields:
      
      	for (i = 0, j = 0; i < n_fields; i++) {
      		event->fields[i] = fields[i];
      
      		if (fields[i]->is_dynamic) {
      			event->dynamic_fields[j] = fields[i];
      			event->dynamic_fields[j]->field_pos = i;
      			event->dynamic_fields[j++] = fields[i];
      			event->n_dynamic_fields++;
      		}
      	}
      
      1) It would make more sense to have all fields keep track of their
         field_pos.
      
      2) event->dynmaic_fields[j] is assigned twice for no reason.
      
      3) We can move updating event->n_dynamic_fields outside the loop, and just
         assign it to j.
      
      This combination makes the code much cleaner.
      
      Link: https://lkml.kernel.org/r/20210721195341.29bb0f77@oasis.local.homeSigned-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      9528c195
    • Steven Rostedt (VMware)'s avatar
      tracing/histogram: Rename "cpu" to "common_cpu" · 1e3bac71
      Steven Rostedt (VMware) authored
      Currently the histogram logic allows the user to write "cpu" in as an
      event field, and it will record the CPU that the event happened on.
      
      The problem with this is that there's a lot of events that have "cpu"
      as a real field, and using "cpu" as the CPU it ran on, makes it
      impossible to run histograms on the "cpu" field of events.
      
      For example, if I want to have a histogram on the count of the
      workqueue_queue_work event on its cpu field, running:
      
       ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
      
      Gives a misleading and wrong result.
      
      Change the command to "common_cpu" as no event should have "common_*"
      fields as that's a reserved name for fields used by all events. And
      this makes sense here as common_cpu would be a field used by all events.
      
      Now we can even do:
      
       ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
       ># cat events/workqueue/workqueue_queue_work/hist
       # event histogram
       #
       # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
       #
      
       { common_cpu:          0, cpu:          2 } hitcount:          1
       { common_cpu:          0, cpu:          4 } hitcount:          1
       { common_cpu:          7, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          1 } hitcount:          1
       { common_cpu:          0, cpu:          6 } hitcount:          2
       { common_cpu:          0, cpu:          5 } hitcount:          2
       { common_cpu:          1, cpu:          1 } hitcount:          4
       { common_cpu:          6, cpu:          6 } hitcount:          4
       { common_cpu:          5, cpu:          5 } hitcount:         14
       { common_cpu:          4, cpu:          4 } hitcount:         26
       { common_cpu:          0, cpu:          0 } hitcount:         39
       { common_cpu:          2, cpu:          2 } hitcount:        184
      
      Now for backward compatibility, I added a trick. If "cpu" is used, and
      the field is not found, it will fall back to "common_cpu" and work as
      it did before. This way, it will still work for old programs that use
      "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
      will get that event's "cpu" field, which is probably what it wants
      anyway.
      
      I updated the tracefs/README to include documentation about both the
      common_timestamp and the common_cpu. This way, if that text is present in
      the README, then an application can know that common_cpu is supported over
      just plain "cpu".
      
      Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 8b7622bf ("tracing: Add cpu field for hist triggers")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Reviewed-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      1e3bac71
    • Steven Rostedt (VMware)'s avatar
      tracing: Synthetic event field_pos is an index not a boolean · 3b13911a
      Steven Rostedt (VMware) authored
      Performing the following:
      
       ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
       ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
       ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
            > events/sched/sched_switch/trigger
       ># echo 1 > events/synthetic/enable
      
      Crashed the kernel:
      
       BUG: kernel NULL pointer dereference, address: 000000000000001b
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:strlen+0x0/0x20
       Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc
        20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10
        48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
       RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
       RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
       RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
       RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
       R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
       R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
       FS:  0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
       Call Trace:
        trace_event_raw_event_synth+0x90/0x1d0
        action_trace+0x5b/0x70
        event_hist_trigger+0x4bd/0x4e0
        ? cpumask_next_and+0x20/0x30
        ? update_sd_lb_stats.constprop.0+0xf6/0x840
        ? __lock_acquire.constprop.0+0x125/0x550
        ? find_held_lock+0x32/0x90
        ? sched_clock_cpu+0xe/0xd0
        ? lock_release+0x155/0x440
        ? update_load_avg+0x8c/0x6f0
        ? enqueue_entity+0x18a/0x920
        ? __rb_reserve_next+0xe5/0x460
        ? ring_buffer_lock_reserve+0x12a/0x3f0
        event_triggers_call+0x52/0xe0
        trace_event_buffer_commit+0x1ae/0x240
        trace_event_raw_event_sched_switch+0x114/0x170
        __traceiter_sched_switch+0x39/0x50
        __schedule+0x431/0xb00
        schedule_idle+0x28/0x40
        do_idle+0x198/0x2e0
        cpu_startup_entry+0x19/0x20
        secondary_startup_64_no_verify+0xc2/0xcb
      
      The reason is that the dynamic events array keeps track of the field
      position of the fields array, via the field_pos variable in the
      synth_field structure. Unfortunately, that field is a boolean for some
      reason, which means any field_pos greater than 1 will be a bug (in this
      case it was 2).
      
      Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: bd82631d ("tracing: Add support for dynamic strings to synthetic events")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      3b13911a
  8. 22 Jul, 2021 1 commit
    • Haoran Luo's avatar
      tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop. · 67f0d6d9
      Haoran Luo authored
      The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
      "head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
      the same buffer page, whose "buffer_data_page" is empty and "read" field
      is non-zero.
      
      An error scenario could be constructed as followed (kernel perspective):
      
      1. All pages in the buffer has been accessed by reader(s) so that all of
      them will have non-zero "read" field.
      
      2. Read and clear all buffer pages so that "rb_num_of_entries()" will
      return 0 rendering there's no more data to read. It is also required
      that the "read_page", "commit_page" and "tail_page" points to the same
      page, while "head_page" is the next page of them.
      
      3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
      so that it shot pass the end of current tail buffer page. Now the
      "head_page", "commit_page" and "tail_page" points to the same page.
      
      4. Discard current event with "ring_buffer_discard_commit()", so that
      "head_page", "commit_page" and "tail_page" points to a page whose buffer
      data page is now empty.
      
      When the error scenario has been constructed, "tracing_read_pipe" will
      be trapped inside a deadloop: "trace_empty()" returns 0 since
      "rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
      constructed ring buffer. Then "trace_find_next_entry_inc()" always
      return NULL since "rb_num_of_entries()" reports there's no more entry
      to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
      "tracing_read_pipe" back to the start of the "waitagain" loop.
      
      I've also written a proof-of-concept script to construct the scenario
      and trigger the bug automatically, you can use it to trace and validate
      my reasoning above:
      
        https://github.com/aegistudio/RingBufferDetonator.git
      
      Tests has been carried out on linux kernel 5.14-rc2
      (2734d6c1), my fixed version
      of kernel (for testing whether my update fixes the bug) and
      some older kernels (for range of affected kernels). Test result is
      also attached to the proof-of-concept repository.
      
      Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
      Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
      
      Cc: stable@vger.kernel.org
      Fixes: bf41a158 ("ring-buffer: make reentrant")
      Suggested-by: default avatarLinus Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: default avatarHaoran Luo <www@aegistudio.net>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      67f0d6d9
  9. 18 Jul, 2021 4 commits
    • Linus Torvalds's avatar
      Linux 5.14-rc2 · 2734d6c1
      Linus Torvalds authored
      2734d6c1
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-fixes-for-v5.14-2021-07-18' of... · 8c25c447
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.14-2021-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Skip invalid hybrid PMU on hybrid systems when the atom (little) CPUs
         are offlined.
      
       - Fix 'perf test' problems related to the recently added hybrid
         (BIG/little) code.
      
       - Split ARM's coresight (hw tracing) decode by aux records to avoid
         fatal decoding errors.
      
       - Fix add event failure in 'perf probe' when running 32-bit perf in a
         64-bit kernel.
      
       - Fix 'perf sched record' failure when CONFIG_SCHEDSTATS is not set.
      
       - Fix memory and refcount leaks detected by ASAn when running 'perf
         test', should be clean of warnings now.
      
       - Remove broken definition of __LITTLE_ENDIAN from tools'
         linux/kconfig.h, which was breaking the build in some systems.
      
       - Cast PTHREAD_STACK_MIN to int as it may turn into 'long
         sysconf(__SC_THREAD_STACK_MIN_VALUE), breaking the build in some
         systems.
      
       - Fix libperf build error with LIBPFM4=1.
      
       - Sync UAPI files changed by the memfd_secret new syscall.
      
      * tag 'perf-tools-fixes-for-v5.14-2021-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (35 commits)
        perf sched: Fix record failure when CONFIG_SCHEDSTATS is not set
        perf probe: Fix add event failure when running 32-bit perf in a 64-bit kernel
        perf data: Close all files in close_dir()
        perf probe-file: Delete namelist in del_events() on the error path
        perf test bpf: Free obj_buf
        perf trace: Free strings in trace__parse_events_option()
        perf trace: Free syscall tp fields in evsel->priv
        perf trace: Free syscall->arg_fmt
        perf trace: Free malloc'd trace fields on exit
        perf lzma: Close lzma stream on exit
        perf script: Fix memory 'threads' and 'cpus' leaks on exit
        perf script: Release zstd data
        perf session: Cleanup trace_event
        perf inject: Close inject.output on exit
        perf report: Free generated help strings for sort option
        perf env: Fix memory leak of cpu_pmu_caps
        perf test maps__merge_in: Fix memory leak of maps
        perf dso: Fix memory leak in dso__new_map()
        perf test event_update: Fix memory leak of unit
        perf test event_update: Fix memory leak of evlist
        ...
      8c25c447
    • Linus Torvalds's avatar
      Merge tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · f0eb870a
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
       "A few fixes for issues in the new online shrink code, additional
        corrections for my recent bug-hunt w.r.t. extent size hints on
        realtime, and improved input checking of the GROWFSRT ioctl.
      
        IOW, the usual 'I somehow got bored during the merge window and
        resumed auditing the farther reaches of xfs':
      
         - Fix shrink eligibility checking when sparse inode clusters enabled
      
         - Reset '..' directory entries when unlinking directories to prevent
           verifier errors if fs is shrinked later
      
         - Don't report unusable extent size hints to FSGETXATTR
      
         - Don't warn when extent size hints are unusable because the sysadmin
           configured them that way
      
         - Fix insufficient parameter validation in GROWFSRT ioctl
      
         - Fix integer overflow when adding rt volumes to filesystem"
      
      * tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: detect misaligned rtinherit directory extent size hints
        xfs: fix an integer overflow error in xfs_growfs_rt
        xfs: improve FSGROWFSRT precondition checking
        xfs: don't expose misaligned extszinherit hints to userspace
        xfs: correct the narrative around misaligned rtinherit/extszinherit dirs
        xfs: reset child dir '..' entry when unlinking child
        xfs: check for sparse inode clusters that cross new EOAG when shrinking
      f0eb870a
    • Linus Torvalds's avatar
      Merge tag 'iomap-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · fbf1bddc
      Linus Torvalds authored
      Pull iomap fixes from Darrick Wong:
       "A handful of bugfixes for the iomap code.
      
        There's nothing especially exciting here, just fixes for UBSAN (not
        KASAN as I erroneously wrote in the tag message) warnings about
        undefined behavior in the SEEK_DATA/SEEK_HOLE code, and some
        reshuffling of per-page block state info to fix some problems with
        gfs2.
      
         - Fix KASAN warnings due to integer overflow in SEEK_DATA/SEEK_HOLE
      
         - Fix assertion errors when using inlinedata files on gfs2"
      
      * tag 'iomap-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor
        iomap: Don't create iomap_page objects for inline files
        iomap: Permit pages without an iop to enter writeback
        iomap: remove the length variable in iomap_seek_hole
        iomap: remove the length variable in iomap_seek_data
      fbf1bddc