Commit af9c191a authored by Linus Torvalds

Merge tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:

 - tracing/ring-buffer: persistent buffer across reboots

   This allows for the tracing instance ring buffer to stay persistent
   across reboots. The way this is done is by adding to the kernel
   command line:

     trace_instance=boot_map@0x285400000:12M

   This will reserve 12 megabytes at the address 0x285400000, and then
   map the tracing instance "boot_map" ring buffer to that memory. This
   will appear as a normal instance in the tracefs system:

     /sys/kernel/tracing/instances/boot_map

   A user could enable tracing in that instance, and on reboot or kernel
   crash, if the memory is not wiped by the firmware, it will recreate
   the trace in that instance. For example, if one were debugging the
   shutdown path of a kernel reboot:

     # cd /sys/kernel/tracing
     # echo function > instances/boot_map/current_tracer
     # reboot
     [..]
     # cd /sys/kernel/tracing
     # tail instances/boot_map/trace
           swapper/0-1       [000] d..1.   164.549800: restore_boot_irq_mode <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549802: disconnect_bsp_APIC <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549811: hpet_disable <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549812: iommu_shutdown_noop <-native_machine_restart
           swapper/0-1       [000] d..1.   164.549813: native_machine_emergency_restart <-__do_sys_reboot
           swapper/0-1       [000] d..1.   164.549813: tboot_shutdown <-native_machine_emergency_restart
           swapper/0-1       [000] d..1.   164.549820: acpi_reboot <-native_machine_emergency_restart
           swapper/0-1       [000] d..1.   164.549821: acpi_reset <-acpi_reboot
           swapper/0-1       [000] d..1.   164.549822: acpi_os_write_port <-acpi_reboot

   On reboot, the buffer is examined to make sure it is valid. The
   validation check even steps through every event to make sure the meta
   data of the event is correct. If any test fails, it will simply reset
   the buffer, and the buffer will be empty on boot.

 - Allow the tracing persistent boot buffer to use the "reserve_mem"
   option

   Instead of having the admin find a physical address to store the
   persistent buffer, which can be very tedious if they have to
   administer several different machines, allow them to use the
   "reserve_mem" option, which will find a location for them. It is not
   as reliable, because KASLR may load the kernel at different
   locations, which can cause the reserved memory to end up at a
   different address each boot. Booting with "nokaslr" can make
   reserve_mem more reliable.

 - Have function graph tracer handle offsets from a previous boot.

   The ring buffer output from a previous boot may have different
   addresses due to KASLR. Have the function graph tracer handle these
   by applying the delta between the previous boot's address space and
   the current one.
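
   The delta handling, roughly (a minimal sketch; "old_text" stands for
   the meta->text_addr saved by the previous boot, "new_text" for the
   same anchor in the current kernel; this is not the exact fgraph
   code):

     /* Sketch: translate a function address recorded by the last boot */
     static unsigned long rb_adjust_addr(unsigned long old_addr,
                                         unsigned long old_text,
                                         unsigned long new_text)
     {
             long delta = new_text - old_text; /* buffer->last_text_delta */

             return old_addr + delta;          /* valid in this boot's layout */
     }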

 - Only reset the saved meta offset when the buffer is started or reset

   The persistent memory meta data holds the address space information
   of the previous boot, so that the delta needed for function tracing
   can be calculated. This used to be updated to the new address space
   as soon as it was read. But if the buffer was not used during that
   boot, then on the next reboot the delta would be calculated against
   that intermediate boot instead of the boot that actually recorded
   the data in the ring buffer, and the functions would not be shown.
   Do not save the address space information of the current kernel
   until the buffer is actually being recorded to.

 - Add a magic variable to test the valid meta data

   Add a magic variable in the meta data that can also be used for
   validation. The validator of the previous buffer doesn't need this
   magic data, but it catches the case where a newer kernel changes the
   meta data in a way that still passes the validator yet is used
   differently. The magic number can also serve as a "version" of the
   meta data.

 - Align user space mapped ring buffer sub buffers to improve TLB
   entries

   Linus mentioned that the mapped ring buffer sub-buffers were
   misaligned between the meta page and the sub-buffers, so that if the
   sub-buffers were bigger than PAGE_SIZE, the TLB could not use bigger
   entries.

 - Add new kernel command line "traceoff" to disable tracing on boot for
   instances

   If tracing is enabled for a boot instance, there needs to be a way
   to disable it at boot so that new events do not get written into the
   ring buffer and mixed with events from a previous boot, as that can
   be confusing.
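
   For example (the '^' flag syntax follows the documentation added in
   this series):

     reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff@trace

   Tracing can then be re-enabled after boot:

     echo 1 > /sys/kernel/tracing/instances/boot_map/tracing_on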

 - Allow trace_printk() to go to other instances

   Currently, trace_printk() can only go to the top level instance. When
   debugging with a persistent buffer, it is really useful to be able to
   direct trace_printk() to that buffer, so that the messages are still
   accessible after a crash.
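
   For example, after boot (using the option added by this series):

     echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest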

 - Do not use "bin_printk()" for traces to a boot instance

   The bin_printk() saves only a pointer to the printk format in the
   ring buffer, as the reader of the buffer can still have access to it.
   But this is not the case if the buffer is from a previous boot. If
   the trace_printk() is going to a "persistent" buffer, it will use the
   slower version that writes the printk format into the buffer.
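
   In the code this boils down to one check (the printk_binsafe()
   helper that appears later in this diff):

     static __always_inline bool printk_binsafe(struct trace_array *tr)
     {
             /*
              * The binary format of traceprintk can cause a crash if used
              * by a buffer from another boot. Force the use of the
              * non binary version of trace_printk if the trace_printk
              * buffer is a boot mapped ring buffer.
              */
             return !(tr->flags & TRACE_ARRAY_FL_BOOT);
     }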

 - Add command line option to allow trace_printk() to go to an instance

   Allow the kernel command line to define which instance the
   trace_printk() goes to, instead of forcing the admin to set it for
   every boot via the tracefs options.
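
   For example (flag syntax from the documentation added in this
   series):

     reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace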

 - Start a document that explains how to use tracefs to debug the kernel

 - Add some more kernel selftests to test user mapped ring buffer

* tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits)
  selftests/ring-buffer: Handle meta-page bigger than the system
  selftests/ring-buffer: Verify the entire meta-page padding
  tracing/Documentation: Start a document on how to debug with tracing
  tracing: Add option to set an instance to be the trace_printk destination
  tracing: Have trace_printk not use binary prints if boot buffer
  tracing: Allow trace_printk() to go to other instance buffers
  tracing: Add "traceoff" flag to boot time tracing instances
  ring-buffer: Align meta-page to sub-buffers for improved TLB usage
  ring-buffer: Add magic and struct size to boot up meta data
  ring-buffer: Don't reset persistent ring-buffer meta saved addresses
  tracing/fgraph: Have fgraph handle previous boot function addresses
  tracing: Allow boot instances to use reserve_mem boot memory
  tracing: Fix ifdef of snapshots to not prevent last_boot_info file
  ring-buffer: Use vma_pages() helper function
  tracing: Fix NULL vs IS_ERR() check in enable_instances()
  tracing: Add last boot delta offset for stack traces
  tracing: Update function tracing output for previous boot buffer
  tracing: Handle old buffer mappings for event strings and functions
  tracing/ring-buffer: Add last_boot_info file to boot instance
  ring-buffer: Save text and data locations in mapped meta data
  ...
parents dd609b8a 75d7ff9a
......@@ -6808,6 +6808,51 @@
the same thing would happen if it was left off). The irq_handler_entry
event, and all events under the "initcall" system.
Flags can be added to the instance to modify its behavior when it is
created. The flags are separated by '^'.
The available flags are:
traceoff - Have the tracing instance tracing disabled after it is created.
traceprintk - Have trace_printk() write into this trace instance
(note, "printk" and "trace_printk" can also be used)
trace_instance=foo^traceoff^traceprintk,sched,irq
The flags must come before the defined events.
If memory has been reserved (see memmap for x86), the instance
can use that memory:
memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M
The above will create a "boot_map" instance that uses the physical
memory at 0x284500000 that is 12Megs. The per CPU buffers of that
instance will be split up accordingly.
Alternatively, the memory can be reserved by the reserve_mem option:
reserve_mem=12M:4096:trace trace_instance=boot_map@trace
This will reserve 12 megabytes at boot up with a 4096 byte alignment
and place the ring buffer in this memory. Note that due to KASLR, the
memory may not be the same location each time, which will not preserve
the buffer content.
Also note that the layout of the ring buffer data may change between
kernel versions where the validator will fail and reset the ring buffer
if the layout is not the same as the previous kernel.
If the ring buffer is used for persistent bootups and has events enabled,
it is recommended to disable tracing so that events from a previous boot do not
mix with events of the current boot (unless you are debugging a random crash
at boot up).
reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff^traceprintk@trace,sched,irq
See also Documentation/trace/debugging.rst
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
......
==============================
Using the tracer for debugging
==============================
Copyright 2024 Google LLC.
:Author: Steven Rostedt <rostedt@goodmis.org>
:License: The GNU Free Documentation License, Version 1.2
(dual licensed under the GPL v2)
- Written for: 6.12
Introduction
------------
The tracing infrastructure can be very useful for debugging the Linux
kernel. This document is a place to add various methods of using the tracer
for debugging.
First, make sure that the tracefs file system is mounted::
$ sudo mount -t tracefs tracefs /sys/kernel/tracing
Using trace_printk()
--------------------
trace_printk() is a very lightweight utility that can be used in any context
inside the kernel, with the exception of "noinstr" sections. It can be used
in normal, softirq, interrupt and even NMI context. The trace data is
written to the tracing ring buffer in a lockless way. To make it even
lighter weight, when possible, it will only record the pointer to the format
string, and save the raw arguments into the buffer. The format and the
arguments will be post processed when the ring buffer is read. This way the
trace_printk() format conversions are not done during the hot path, where
the trace is being recorded.
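A quick way to use it is to sprinkle calls into the code being
investigated (illustrative only; the function and structure names here
are made up)::

  static void my_driver_handle(struct my_req *req)
  {
          trace_printk("handling req=%p len=%d\n", req, req->len);

          if (process_req(req) < 0)
                  trace_printk("process_req failed for %p\n", req);
  }

The output then shows up in the "trace" file of the tracing instance
(/sys/kernel/tracing/trace by default).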
trace_printk() is meant only for debugging, and should never be added into
a subsystem of the kernel. If you need debugging traces, add trace events
instead. If a trace_printk() is found in the kernel, the following will
appear in the dmesg::
**********************************************************
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
** **
** trace_printk() being used. Allocating extra memory. **
** **
** This means that this is a DEBUG kernel and it is **
** unsafe for production use. **
** **
** If you see this message and you are not debugging **
** the kernel, report this immediately to your vendor! **
** **
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
**********************************************************
Debugging kernel crashes
------------------------
There are various methods of acquiring the state of the system when a kernel
crash occurs. This could be from the oops message in printk, or one could
use kexec/kdump. But these just show what happened at the time of the crash.
It can be very useful to know what happened leading up to the crash.
The tracing ring buffer, by default, is a circular buffer that will
overwrite older events with newer ones. When a crash happens, the content of
the ring buffer will be all the events that led up to the crash.
There are several kernel command line parameters that can be used to help in
this. The first is "ftrace_dump_on_oops". This will dump the tracing ring
buffer to the console when an oops occurs. This can be useful if the console
is being logged somewhere. If a serial console is used, it may be prudent to
make sure the ring buffer is relatively small, otherwise the dumping of the
ring buffer may take several minutes to hours to finish. Here's an example
of the kernel command line::
ftrace_dump_on_oops trace_buf_size=50K
Note, the tracing buffer is made up of per CPU buffers, where each of these
buffers is broken up into sub-buffers that are by default PAGE_SIZE. The
trace_buf_size option above sets each of the per CPU buffers to 50K, so on a
machine with 8 CPUs, that's actually 400K total.
Persistent buffers across boots
-------------------------------
If the system memory allows it, the tracing ring buffer can be placed at a
specific location in memory. If that location is the same across boots and
the memory is not modified, the tracing buffer can be retrieved on the
following boot. There are two ways to reserve memory for the use of the ring
buffer.
The more reliable way (on x86) is to reserve memory with the "memmap" kernel
command line option and then use that memory for the trace_instance. This
requires a bit of knowledge of the physical memory layout of the system. The
advantage of using this method is that the memory for the ring buffer will
always be at the same location::
memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M
The memmap above reserves 12 megabytes of memory at the physical memory
location 0x284500000. Then the trace_instance option will create a trace
instance "boot_map" at that same location with the same amount of memory
reserved. As the ring buffer is broken up into per CPU buffers, the 12
megabytes will be split evenly between those CPUs. If you have 8 CPUs, each
per CPU ring buffer will be 1.5 megabytes in size. Note that this also
includes meta data, so the amount of memory actually usable by the ring
buffer will be slightly smaller.
Another more generic but less robust way to allocate a ring buffer mapping
at boot is with the "reserve_mem" option::
reserve_mem=12M:4096:trace trace_instance=boot_map@trace
The reserve_mem option above will find 12 megabytes that are available at
boot up, and align it by 4096 bytes. It will label this memory as "trace"
that can be used by later command line options.
The trace_instance option creates a "boot_map" instance and will use the
memory reserved by reserve_mem that was labeled as "trace". This method is
more generic but may not be as reliable. Due to KASLR, the memory reserved
by reserve_mem may not be located at the same location. If this happens,
then the ring buffer will not be from the previous boot and will be reset.
Sometimes a larger alignment can keep KASLR from moving things around in
such a way that the location of the reserve_mem changes. With a larger
alignment, you may find that the buffer is placed more consistently at the
same location::
reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace
On boot up, the memory reserved for the ring buffer is validated. It will go
through a series of tests to make sure that the ring buffer contains valid
data. If it is, it will then set it up to be available to read from the
instance. If it fails any of the tests, it will clear the entire ring buffer
and initialize it as new.
The layout of this mapped memory may not be consistent from kernel to
kernel, so only the same kernel is guaranteed to work if the mapping is
preserved. Switching to a different kernel version may find a different
layout and mark the buffer as invalid.
Using trace_printk() in the boot instance
-----------------------------------------
By default, the content of trace_printk() goes into the top level tracing
instance. But this instance is never preserved across boots. To have the
trace_printk() content, as well as some other internal tracing (like stack
dumps), go to the preserved buffer, either set the instance to be the
trace_printk() destination from the kernel command line, or set it after
boot up via the trace_printk_dest option.
After boot up::
echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest
From the kernel command line::
reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace
If setting it from the kernel command line, it is recommended to also
disable tracing with the "traceoff" flag, and enable tracing after boot up.
Otherwise the trace from the most recent boot will be mixed with the trace
from the previous boot, and may make it confusing to read.
......@@ -1186,6 +1186,18 @@ Here are the available options:
trace_printk
Can disable trace_printk() from writing into the buffer.
trace_printk_dest
Set to have trace_printk() and similar internal tracing functions
write into this instance. Note, only one trace instance can have
this set. Setting this flag clears the trace_printk_dest flag of the
instance that previously had it set. By default, the top level trace
has this flag set, and it will get it set again if another instance
that has it set then clears it.
This flag cannot be cleared directly on the top level instance, as it
is the default destination. The only way the top level instance has
this flag cleared is by the flag being set in another instance.
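For example (a usage sketch; "foo" is a made-up instance name)::

  # Send trace_printk() output to the "foo" instance
  echo 1 > /sys/kernel/tracing/instances/foo/options/trace_printk_dest

  # Clearing it here hands the destination back to the top level
  echo 0 > /sys/kernel/tracing/instances/foo/options/trace_printk_dest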
annotate
It is sometimes confusing when the CPU buffers are full
and one CPU buffer had a lot of events recently, thus
......
......@@ -89,6 +89,14 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
struct trace_buffer *
__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long range_size,
struct lock_class_key *key);
bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
long *data);
/*
* Because the ring buffer is generic, if other users of the ring buffer get
* traced by ftrace, it can produce lockdep warnings. We need to keep each
......@@ -100,6 +108,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key); \
})
/*
* Because the ring buffer is generic, if other users of the ring buffer get
* traced by ftrace, it can produce lockdep warnings. We need to keep each
* ring buffer's lock class separate.
*/
#define ring_buffer_alloc_range(size, flags, order, start, range_size) \
({ \
static struct lock_class_key __key; \
__ring_buffer_alloc_range((size), (flags), (order), (start), \
(range_size), &__key); \
})
typedef bool (*ring_buffer_cond_fn)(void *data);
int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
ring_buffer_cond_fn cond, void *data);
......
......@@ -32,6 +32,8 @@
#include <asm/local64.h>
#include <asm/local.h>
#include "trace.h"
/*
* The "absolute" timestamp in the buffer is only 59 bits.
* If a clock has the 5 MSBs set, it needs to be saved and
......@@ -42,6 +44,21 @@
static void update_pages_handler(struct work_struct *work);
#define RING_BUFFER_META_MAGIC 0xBADFEED
struct ring_buffer_meta {
int magic;
int struct_size;
unsigned long text_addr;
unsigned long data_addr;
unsigned long first_buffer;
unsigned long head_buffer;
unsigned long commit_buffer;
__u32 subbuf_size;
__u32 nr_subbufs;
int buffers[];
};
/*
* The ring buffer header is special. We must manually up keep it.
*/
......@@ -342,7 +359,8 @@ struct buffer_page {
local_t entries; /* entries on this page */
unsigned long real_end; /* real end of data */
unsigned order; /* order of the page */
u32 id; /* ID for external mapping */
u32 id:30; /* ID for external mapping */
u32 range:1; /* Mapped via a range */
struct buffer_data_page *page; /* Actual data page */
};
......@@ -373,6 +391,8 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
static void free_buffer_page(struct buffer_page *bpage)
{
/* Range pages are not to be freed */
if (!bpage->range)
free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
}
......@@ -491,9 +511,11 @@ struct ring_buffer_per_cpu {
unsigned long pages_removed;
unsigned int mapped;
unsigned int user_mapped; /* user space mapping */
struct mutex mapping_lock;
unsigned long *subbuf_ids; /* ID to subbuf VA */
struct trace_buffer_meta *meta_page;
struct ring_buffer_meta *ring_meta;
/* ring buffer pages to update, > 0 to add, < 0 to remove */
long nr_pages_to_update;
......@@ -523,6 +545,12 @@ struct trace_buffer {
struct rb_irq_work irq_work;
bool time_stamp_abs;
unsigned long range_addr_start;
unsigned long range_addr_end;
long last_text_delta;
long last_data_delta;
unsigned int subbuf_size;
unsigned int subbuf_order;
unsigned int max_data_size;
......@@ -1239,6 +1267,11 @@ static void rb_head_page_activate(struct ring_buffer_per_cpu *cpu_buffer)
* Set the previous list pointer to have the HEAD flag.
*/
rb_set_list_to_head(head->list.prev);
if (cpu_buffer->ring_meta) {
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
meta->head_buffer = (unsigned long)head->page;
}
}
static void rb_list_head_clear(struct list_head *list)
......@@ -1478,9 +1511,484 @@ static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
}
}
/*
* Take an address, add the meta data size as well as the array of
* array subbuffer indexes, then align it to a subbuffer size.
*
* This is used to help find the next per cpu subbuffer within a mapped range.
*/
static unsigned long
rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs)
{
addr += sizeof(struct ring_buffer_meta) +
sizeof(int) * nr_subbufs;
return ALIGN(addr, subbuf_size);
}
/*
* Return the ring_buffer_meta for a given @cpu.
*/
static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu)
{
int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
unsigned long ptr = buffer->range_addr_start;
struct ring_buffer_meta *meta;
int nr_subbufs;
if (!ptr)
return NULL;
/* When nr_pages passed in is zero, the first meta has already been initialized */
if (!nr_pages) {
meta = (struct ring_buffer_meta *)ptr;
nr_subbufs = meta->nr_subbufs;
} else {
meta = NULL;
/* Include the reader page */
nr_subbufs = nr_pages + 1;
}
/*
* The first chunk may not be subbuffer aligned, where as
* the rest of the chunks are.
*/
if (cpu) {
ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
ptr += subbuf_size * nr_subbufs;
/* We can use multiplication to find chunks greater than 1 */
if (cpu > 1) {
unsigned long size;
unsigned long p;
/* Save the beginning of this CPU chunk */
p = ptr;
ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
ptr += subbuf_size * nr_subbufs;
/* Now all chunks after this are the same size */
size = ptr - p;
ptr += size * (cpu - 2);
}
}
return (void *)ptr;
}
/* Return the start of subbufs given the meta pointer */
static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta)
{
int subbuf_size = meta->subbuf_size;
unsigned long ptr;
ptr = (unsigned long)meta;
ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs);
return (void *)ptr;
}
/*
* Return a specific sub-buffer for a given @cpu defined by @idx.
*/
static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx)
{
struct ring_buffer_meta *meta;
unsigned long ptr;
int subbuf_size;
meta = rb_range_meta(cpu_buffer->buffer, 0, cpu_buffer->cpu);
if (!meta)
return NULL;
if (WARN_ON_ONCE(idx >= meta->nr_subbufs))
return NULL;
subbuf_size = meta->subbuf_size;
/* Map this buffer to the order that's in meta->buffers[] */
idx = meta->buffers[idx];
ptr = (unsigned long)rb_subbufs_from_meta(meta);
ptr += subbuf_size * idx;
if (ptr + subbuf_size > cpu_buffer->buffer->range_addr_end)
return NULL;
return (void *)ptr;
}
/*
* See if the existing memory contains valid ring buffer data.
* As the previous kernel must be the same as this kernel, all
* the calculations (size of buffers and number of buffers)
* must be the same.
*/
static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
struct trace_buffer *buffer, int nr_pages)
{
int subbuf_size = PAGE_SIZE;
struct buffer_data_page *subbuf;
unsigned long buffers_start;
unsigned long buffers_end;
int i;
/* Check the meta magic and meta struct size */
if (meta->magic != RING_BUFFER_META_MAGIC ||
meta->struct_size != sizeof(*meta)) {
pr_info("Ring buffer boot meta[%d] mismatch of magic or struct size\n", cpu);
return false;
}
/* The subbuffer's size and number of subbuffers must match */
if (meta->subbuf_size != subbuf_size ||
meta->nr_subbufs != nr_pages + 1) {
pr_info("Ring buffer boot meta [%d] mismatch of subbuf_size/nr_pages\n", cpu);
return false;
}
buffers_start = meta->first_buffer;
buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
/* Are the head and commit buffers within the range of buffers? */
if (meta->head_buffer < buffers_start ||
meta->head_buffer >= buffers_end) {
pr_info("Ring buffer boot meta [%d] head buffer out of range\n", cpu);
return false;
}
if (meta->commit_buffer < buffers_start ||
meta->commit_buffer >= buffers_end) {
pr_info("Ring buffer boot meta [%d] commit buffer out of range\n", cpu);
return false;
}
subbuf = rb_subbufs_from_meta(meta);
/* Do the meta buffers and the subbufs themselves have correct data? */
for (i = 0; i < meta->nr_subbufs; i++) {
if (meta->buffers[i] < 0 ||
meta->buffers[i] >= meta->nr_subbufs) {
pr_info("Ring buffer boot meta [%d] array out of range\n", cpu);
return false;
}
if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
return false;
}
subbuf = (void *)subbuf + subbuf_size;
}
return true;
}
static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu,
unsigned long long *timestamp, u64 *delta_ptr)
{
struct ring_buffer_event *event;
u64 ts, delta;
int events = 0;
int e;
*delta_ptr = 0;
*timestamp = 0;
ts = dpage->time_stamp;
for (e = 0; e < tail; e += rb_event_length(event)) {
event = (struct ring_buffer_event *)(dpage->data + e);
switch (event->type_len) {
case RINGBUF_TYPE_TIME_EXTEND:
delta = rb_event_time_stamp(event);
ts += delta;
break;
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
delta = rb_fix_abs_ts(delta, ts);
if (delta < ts) {
*delta_ptr = delta;
*timestamp = ts;
return -1;
}
ts = delta;
break;
case RINGBUF_TYPE_PADDING:
if (event->time_delta == 1)
break;
fallthrough;
case RINGBUF_TYPE_DATA:
events++;
ts += event->time_delta;
break;
default:
return -1;
}
}
*timestamp = ts;
return events;
}
static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
{
unsigned long long ts;
u64 delta;
int tail;
tail = local_read(&dpage->commit);
return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
}
/* If the meta data has been validated, now validate the events */
static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
{
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
struct buffer_page *head_page;
unsigned long entry_bytes = 0;
unsigned long entries = 0;
int ret;
int i;
if (!meta || !meta->head_buffer)
return;
/* Do the reader page first */
ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
if (ret < 0) {
pr_info("Ring buffer reader page is invalid\n");
goto invalid;
}
entries += ret;
entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
local_set(&cpu_buffer->reader_page->entries, ret);
head_page = cpu_buffer->head_page;
/* If both the head and commit are on the reader_page then we are done. */
if (head_page == cpu_buffer->reader_page &&
head_page == cpu_buffer->commit_page)
goto done;
/* Iterate until finding the commit page */
for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
/* Reader page has already been done */
if (head_page == cpu_buffer->reader_page)
continue;
ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
if (ret < 0) {
pr_info("Ring buffer meta [%d] invalid buffer page\n",
cpu_buffer->cpu);
goto invalid;
}
entries += ret;
entry_bytes += local_read(&head_page->page->commit);
local_set(&cpu_buffer->head_page->entries, ret);
if (head_page == cpu_buffer->commit_page)
break;
}
if (head_page != cpu_buffer->commit_page) {
pr_info("Ring buffer meta [%d] commit page not found\n",
cpu_buffer->cpu);
goto invalid;
}
done:
local_set(&cpu_buffer->entries, entries);
local_set(&cpu_buffer->entries_bytes, entry_bytes);
pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
return;
invalid:
/* The content of the buffers are invalid, reset the meta data */
meta->head_buffer = 0;
meta->commit_buffer = 0;
/* Reset the reader page */
local_set(&cpu_buffer->reader_page->entries, 0);
local_set(&cpu_buffer->reader_page->page->commit, 0);
/* Reset all the subbuffers */
for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
local_set(&head_page->entries, 0);
local_set(&head_page->page->commit, 0);
}
}
/* Used to calculate data delta */
static char rb_data_ptr[] = "";
#define THIS_TEXT_PTR ((unsigned long)rb_meta_init_text_addr)
#define THIS_DATA_PTR ((unsigned long)rb_data_ptr)
static void rb_meta_init_text_addr(struct ring_buffer_meta *meta)
{
meta->text_addr = THIS_TEXT_PTR;
meta->data_addr = THIS_DATA_PTR;
}
static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
{
struct ring_buffer_meta *meta;
unsigned long delta;
void *subbuf;
int cpu;
int i;
for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
void *next_meta;
meta = rb_range_meta(buffer, nr_pages, cpu);
if (rb_meta_valid(meta, cpu, buffer, nr_pages)) {
/* Make the mappings match the current address */
subbuf = rb_subbufs_from_meta(meta);
delta = (unsigned long)subbuf - meta->first_buffer;
meta->first_buffer += delta;
meta->head_buffer += delta;
meta->commit_buffer += delta;
buffer->last_text_delta = THIS_TEXT_PTR - meta->text_addr;
buffer->last_data_delta = THIS_DATA_PTR - meta->data_addr;
continue;
}
if (cpu < nr_cpu_ids - 1)
next_meta = rb_range_meta(buffer, nr_pages, cpu + 1);
else
next_meta = (void *)buffer->range_addr_end;
memset(meta, 0, next_meta - (void *)meta);
meta->magic = RING_BUFFER_META_MAGIC;
meta->struct_size = sizeof(*meta);
meta->nr_subbufs = nr_pages + 1;
meta->subbuf_size = PAGE_SIZE;
subbuf = rb_subbufs_from_meta(meta);
meta->first_buffer = (unsigned long)subbuf;
rb_meta_init_text_addr(meta);
/*
* The buffers[] array holds the order of the sub-buffers
* that are after the meta data. The sub-buffers may
* be swapped out when read and inserted into a different
* location of the ring buffer. Although their addresses
* remain the same, the buffers[] array contains the
* index into the sub-buffers holding their actual order.
*/
for (i = 0; i < meta->nr_subbufs; i++) {
meta->buffers[i] = i;
rb_init_page(subbuf);
subbuf += meta->subbuf_size;
}
}
}
static void *rbm_start(struct seq_file *m, loff_t *pos)
{
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
unsigned long val;
if (!meta)
return NULL;
if (*pos > meta->nr_subbufs)
return NULL;
val = *pos;
val++;
return (void *)val;
}
static void *rbm_next(struct seq_file *m, void *v, loff_t *pos)
{
(*pos)++;
return rbm_start(m, pos);
}
static int rbm_show(struct seq_file *m, void *v)
{
struct ring_buffer_per_cpu *cpu_buffer = m->private;
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
unsigned long val = (unsigned long)v;
if (val == 1) {
seq_printf(m, "head_buffer: %d\n",
rb_meta_subbuf_idx(meta, (void *)meta->head_buffer));
seq_printf(m, "commit_buffer: %d\n",
rb_meta_subbuf_idx(meta, (void *)meta->commit_buffer));
seq_printf(m, "subbuf_size: %d\n", meta->subbuf_size);
seq_printf(m, "nr_subbufs: %d\n", meta->nr_subbufs);
return 0;
}
val -= 2;
seq_printf(m, "buffer[%ld]: %d\n", val, meta->buffers[val]);
return 0;
}
static void rbm_stop(struct seq_file *m, void *p)
{
}
static const struct seq_operations rb_meta_seq_ops = {
.start = rbm_start,
.next = rbm_next,
.show = rbm_show,
.stop = rbm_stop,
};
int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, int cpu)
{
struct seq_file *m;
int ret;
ret = seq_open(file, &rb_meta_seq_ops);
if (ret)
return ret;
m = file->private_data;
m->private = buffer->buffers[cpu];
return 0;
}
/* Map the buffer_pages to the previous head and commit pages */
static void rb_meta_buffer_update(struct ring_buffer_per_cpu *cpu_buffer,
struct buffer_page *bpage)
{
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
if (meta->head_buffer == (unsigned long)bpage->page)
cpu_buffer->head_page = bpage;
if (meta->commit_buffer == (unsigned long)bpage->page) {
cpu_buffer->commit_page = bpage;
cpu_buffer->tail_page = bpage;
}
}
static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
long nr_pages, struct list_head *pages)
{
struct trace_buffer *buffer = cpu_buffer->buffer;
struct ring_buffer_meta *meta = NULL;
struct buffer_page *bpage, *tmp;
bool user_thread = current->mm != NULL;
gfp_t mflags;
......@@ -1515,6 +2023,10 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
*/
if (user_thread)
set_current_oom_origin();
if (buffer->range_addr_start)
meta = rb_range_meta(buffer, nr_pages, cpu_buffer->cpu);
for (i = 0; i < nr_pages; i++) {
struct page *page;
......@@ -1525,16 +2037,32 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
rb_check_bpage(cpu_buffer, bpage);
list_add(&bpage->list, pages);
/*
* Append the pages as for mapped buffers we want to keep
* the order
*/
list_add_tail(&bpage->list, pages);
if (meta) {
/* A range was given. Use that for the buffer page */
bpage->page = rb_range_buffer(cpu_buffer, i + 1);
if (!bpage->page)
goto free_pages;
/* If this is valid from a previous boot */
if (meta->head_buffer)
rb_meta_buffer_update(cpu_buffer, bpage);
bpage->range = 1;
bpage->id = i + 1;
} else {
page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
mflags | __GFP_COMP | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
bpage->page = page_address(page);
bpage->order = cpu_buffer->buffer->subbuf_order;
rb_init_page(bpage->page);
}
bpage->order = cpu_buffer->buffer->subbuf_order;
if (user_thread && fatal_signal_pending(current))
goto free_pages;
......@@ -1584,6 +2112,7 @@ static struct ring_buffer_per_cpu *
rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;
struct ring_buffer_meta *meta;
struct buffer_page *bpage;
struct page *page;
int ret;
......@@ -1614,12 +2143,28 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
cpu_buffer->reader_page = bpage;
page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
if (buffer->range_addr_start) {
/*
* Range mapped buffers have the same restrictions as memory
* mapped ones do.
*/
cpu_buffer->mapped = 1;
cpu_buffer->ring_meta = rb_range_meta(buffer, nr_pages, cpu);
bpage->page = rb_range_buffer(cpu_buffer, 0);
if (!bpage->page)
goto fail_free_reader;
if (cpu_buffer->ring_meta->head_buffer)
rb_meta_buffer_update(cpu_buffer, bpage);
bpage->range = 1;
} else {
page = alloc_pages_node(cpu_to_node(cpu),
GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
rb_init_page(bpage->page);
}
INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
INIT_LIST_HEAD(&cpu_buffer->new_pages);
......@@ -1628,12 +2173,36 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
if (ret < 0)
goto fail_free_reader;
rb_meta_validate_events(cpu_buffer);
/* If the boot meta was valid then this has already been updated */
meta = cpu_buffer->ring_meta;
if (!meta || !meta->head_buffer ||
!cpu_buffer->head_page || !cpu_buffer->commit_page || !cpu_buffer->tail_page) {
if (meta && meta->head_buffer &&
(cpu_buffer->head_page || cpu_buffer->commit_page || cpu_buffer->tail_page)) {
pr_warn("Ring buffer meta buffers not all mapped\n");
if (!cpu_buffer->head_page)
pr_warn(" Missing head_page\n");
if (!cpu_buffer->commit_page)
pr_warn(" Missing commit_page\n");
if (!cpu_buffer->tail_page)
pr_warn(" Missing tail_page\n");
}
cpu_buffer->head_page
= list_entry(cpu_buffer->pages, struct buffer_page, list);
cpu_buffer->tail_page = cpu_buffer->commit_page = cpu_buffer->head_page;
rb_head_page_activate(cpu_buffer);
if (cpu_buffer->ring_meta)
meta->commit_buffer = meta->head_buffer;
} else {
/* The valid meta buffer still needs to activate the head page */
rb_head_page_activate(cpu_buffer);
}
return cpu_buffer;
fail_free_reader:
......@@ -1669,22 +2238,14 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
/**
* __ring_buffer_alloc - allocate a new ring_buffer
* @size: the size in bytes per cpu that is needed.
* @flags: attributes to set for the ring buffer.
* @key: ring buffer reader_lock_key.
*
* Currently the only flag that is available is the RB_FL_OVERWRITE
* flag. This flag means that the buffer will overwrite old data
* when the buffer wraps. If this flag is not set, the buffer will
* drop data when the tail hits the head.
*/
struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long end,
struct lock_class_key *key)
{
struct trace_buffer *buffer;
long nr_pages;
int subbuf_size;
int bsize;
int cpu;
int ret;
......@@ -1698,14 +2259,13 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
if (!zalloc_cpumask_var(&buffer->cpumask, GFP_KERNEL))
goto fail_free_buffer;
/* Default buffer page size - one system page */
buffer->subbuf_order = 0;
buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
buffer->subbuf_order = order;
subbuf_size = (PAGE_SIZE << order);
buffer->subbuf_size = subbuf_size - BUF_PAGE_HDR_SIZE;
/* Max payload is buffer page size - header (8bytes) */
buffer->max_data_size = buffer->subbuf_size - (sizeof(u32) * 2);
nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
......@@ -1713,10 +2273,6 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&buffer->irq_work.waiters);
/* need at least two pages */
if (nr_pages < 2)
nr_pages = 2;
buffer->cpus = nr_cpu_ids;
bsize = sizeof(void *) * nr_cpu_ids;
......@@ -1725,6 +2281,56 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
if (!buffer->buffers)
goto fail_free_cpumask;
/* If start/end are specified, then that overrides size */
if (start && end) {
unsigned long ptr;
int n;
size = end - start;
size = size / nr_cpu_ids;
/*
* The number of sub-buffers (nr_pages) is determined by the
* total size allocated minus the meta data size.
* Then that is divided by the number of per CPU buffers
* needed, plus account for the integer array index that
* will be appended to the meta data.
*/
nr_pages = (size - sizeof(struct ring_buffer_meta)) /
(subbuf_size + sizeof(int));
/* Need at least two pages plus the reader page */
if (nr_pages < 3)
goto fail_free_buffers;
again:
/* Make sure that the size fits aligned */
for (n = 0, ptr = start; n < nr_cpu_ids; n++) {
ptr += sizeof(struct ring_buffer_meta) +
sizeof(int) * nr_pages;
ptr = ALIGN(ptr, subbuf_size);
ptr += subbuf_size * nr_pages;
}
if (ptr > end) {
if (nr_pages <= 3)
goto fail_free_buffers;
nr_pages--;
goto again;
}
/* nr_pages should not count the reader page */
nr_pages--;
buffer->range_addr_start = start;
buffer->range_addr_end = end;
rb_range_meta_init(buffer, nr_pages);
} else {
/* need at least two pages */
nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
if (nr_pages < 2)
nr_pages = 2;
}
cpu = raw_smp_processor_id();
cpumask_set_cpu(cpu, buffer->cpumask);
buffer->buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
......@@ -1753,8 +2359,72 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
kfree(buffer);
return NULL;
}
/**
* __ring_buffer_alloc - allocate a new ring_buffer
* @size: the size in bytes per cpu that is needed.
* @flags: attributes to set for the ring buffer.
* @key: ring buffer reader_lock_key.
*
* Currently the only flag that is available is the RB_FL_OVERWRITE
* flag. This flag means that the buffer will overwrite old data
* when the buffer wraps. If this flag is not set, the buffer will
* drop data when the tail hits the head.
*/
struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
struct lock_class_key *key)
{
/* Default buffer page size - one system page */
return alloc_buffer(size, flags, 0, 0, 0,key);
}
EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
/**
* __ring_buffer_alloc_range - allocate a new ring_buffer from existing memory
* @size: the size in bytes per cpu that is needed.
* @flags: attributes to set for the ring buffer.
* @start: start of allocated range
* @range_size: size of allocated range
* @order: sub-buffer order
* @key: ring buffer reader_lock_key.
*
* Currently the only flag that is available is the RB_FL_OVERWRITE
* flag. This flag means that the buffer will overwrite old data
* when the buffer wraps. If this flag is not set, the buffer will
* drop data when the tail hits the head.
*/
struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags,
int order, unsigned long start,
unsigned long range_size,
struct lock_class_key *key)
{
return alloc_buffer(size, flags, order, start, start + range_size, key);
}
/**
* ring_buffer_last_boot_delta - return the delta offset from last boot
* @buffer: The buffer to return the delta from
* @text: Return text delta
* @data: Return data delta
*
* Returns: True if the delta is non-zero
*/
bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
long *data)
{
if (!buffer)
return false;
if (!buffer->last_text_delta)
return false;
*text = buffer->last_text_delta;
*data = buffer->last_data_delta;
return true;
}
/**
* ring_buffer_free - free a ring buffer.
* @buffer: the buffer to free.
......@@ -2364,6 +3034,52 @@ static void rb_inc_iter(struct ring_buffer_iter *iter)
iter->next_event = 0;
}
/* Return the index into the sub-buffers for a given sub-buffer */
static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf)
{
void *subbuf_array;
subbuf_array = (void *)meta + sizeof(int) * meta->nr_subbufs;
subbuf_array = (void *)ALIGN((unsigned long)subbuf_array, meta->subbuf_size);
return (subbuf - subbuf_array) / meta->subbuf_size;
}
static void rb_update_meta_head(struct ring_buffer_per_cpu *cpu_buffer,
struct buffer_page *next_page)
{
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
unsigned long old_head = (unsigned long)next_page->page;
unsigned long new_head;
rb_inc_page(&next_page);
new_head = (unsigned long)next_page->page;
/*
* Only move it forward once, if something else came in and
* moved it forward, then we don't want to touch it.
*/
(void)cmpxchg(&meta->head_buffer, old_head, new_head);
}
static void rb_update_meta_reader(struct ring_buffer_per_cpu *cpu_buffer,
struct buffer_page *reader)
{
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
void *old_reader = cpu_buffer->reader_page->page;
void *new_reader = reader->page;
int id;
id = reader->id;
cpu_buffer->reader_page->id = id;
reader->id = 0;
meta->buffers[0] = rb_meta_subbuf_idx(meta, new_reader);
meta->buffers[id] = rb_meta_subbuf_idx(meta, old_reader);
/* The head pointer is the one after the reader */
rb_update_meta_head(cpu_buffer, reader);
}
/*
* rb_handle_head_page - writer hit the head page
*
......@@ -2413,6 +3129,8 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes);
local_inc(&cpu_buffer->pages_lost);
if (cpu_buffer->ring_meta)
rb_update_meta_head(cpu_buffer, next_page);
/*
* The entries will be zeroed out when we move the
* tail page.
......@@ -2974,6 +3692,10 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
local_set(&cpu_buffer->commit_page->page->commit,
rb_page_write(cpu_buffer->commit_page));
rb_inc_page(&cpu_buffer->commit_page);
if (cpu_buffer->ring_meta) {
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
meta->commit_buffer = (unsigned long)cpu_buffer->commit_page->page;
}
/* add barrier to keep gcc from optimizing too much */
barrier();
}
......@@ -3420,11 +4142,10 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
struct rb_event_info *info,
unsigned long tail)
{
struct ring_buffer_event *event;
struct buffer_data_page *bpage;
u64 ts, delta;
bool full = false;
int e;
int ret;
bpage = info->tail_page->page;
......@@ -3450,39 +4171,12 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
goto out;
ts = bpage->time_stamp;
for (e = 0; e < tail; e += rb_event_length(event)) {
event = (struct ring_buffer_event *)(bpage->data + e);
switch (event->type_len) {
case RINGBUF_TYPE_TIME_EXTEND:
delta = rb_event_time_stamp(event);
ts += delta;
break;
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
delta = rb_fix_abs_ts(delta, ts);
ret = rb_read_data_buffer(bpage, tail, cpu_buffer->cpu, &ts, &delta);
if (ret < 0) {
if (delta < ts) {
buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT BACKWARDS: last ts: %lld absolute ts: %lld\n",
cpu_buffer->cpu, ts, delta);
}
ts = delta;
break;
case RINGBUF_TYPE_PADDING:
if (event->time_delta == 1)
break;
fallthrough;
case RINGBUF_TYPE_DATA:
ts += event->time_delta;
break;
default:
RB_WARN_ON(cpu_buffer, 1);
goto out;
}
}
if ((full && ts > info->ts) ||
......@@ -4591,6 +5285,9 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
if (!ret)
goto spin;
if (cpu_buffer->ring_meta)
rb_update_meta_reader(cpu_buffer, reader);
/*
* Yay! We succeeded in replacing the page.
*
......@@ -5212,6 +5909,9 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
{
struct trace_buffer_meta *meta = cpu_buffer->meta_page;
if (!meta)
return;
meta->reader.read = cpu_buffer->reader_page->read;
meta->reader.id = cpu_buffer->reader_page->id;
meta->reader.lost_events = cpu_buffer->lost_events;
......@@ -5268,11 +5968,16 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->lost_events = 0;
cpu_buffer->last_overrun = 0;
if (cpu_buffer->mapped)
rb_update_meta_page(cpu_buffer);
rb_head_page_activate(cpu_buffer);
cpu_buffer->pages_removed = 0;
if (cpu_buffer->mapped) {
rb_update_meta_page(cpu_buffer);
if (cpu_buffer->ring_meta) {
struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
meta->commit_buffer = meta->head_buffer;
}
}
}
/* Must have disabled the cpu buffer then done a synchronize_rcu */
......@@ -5303,6 +6008,7 @@ static void reset_disabled_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
struct ring_buffer_meta *meta;
if (!cpumask_test_cpu(cpu, buffer->cpumask))
return;
......@@ -5321,6 +6027,11 @@ void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu)
atomic_dec(&cpu_buffer->record_disabled);
atomic_dec(&cpu_buffer->resize_disabled);
/* Make sure persistent meta now uses this buffer's addresses */
meta = rb_range_meta(buffer, 0, cpu_buffer->cpu);
if (meta)
rb_meta_init_text_addr(meta);
mutex_unlock(&buffer->mutex);
}
EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu);
......@@ -5335,6 +6046,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu);
void ring_buffer_reset_online_cpus(struct trace_buffer *buffer)
{
struct ring_buffer_per_cpu *cpu_buffer;
struct ring_buffer_meta *meta;
int cpu;
/* prevent another thread from changing buffer sizes */
......@@ -5362,6 +6074,11 @@ void ring_buffer_reset_online_cpus(struct trace_buffer *buffer)
reset_disabled_cpu_buffer(cpu_buffer);
/* Make sure persistent meta now uses this buffer's addresses */
meta = rb_range_meta(buffer, 0, cpu_buffer->cpu);
if (meta)
rb_meta_init_text_addr(meta);
atomic_dec(&cpu_buffer->record_disabled);
atomic_sub(RESET_BIT, &cpu_buffer->resize_disabled);
}
......@@ -6135,10 +6852,10 @@ static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
/* install subbuf ID to kern VA translation */
cpu_buffer->subbuf_ids = subbuf_ids;
meta->meta_page_size = PAGE_SIZE;
meta->meta_struct_len = sizeof(*meta);
meta->nr_subbufs = nr_subbufs;
meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
meta->meta_page_size = meta->subbuf_size;
rb_update_meta_page(cpu_buffer);
}
......@@ -6155,7 +6872,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
mutex_lock(&cpu_buffer->mapping_lock);
if (!cpu_buffer->mapped) {
if (!cpu_buffer->user_mapped) {
mutex_unlock(&cpu_buffer->mapping_lock);
return ERR_PTR(-ENODEV);
}
......@@ -6179,19 +6896,26 @@ static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
lockdep_assert_held(&cpu_buffer->mapping_lock);
/* mapped is always greater or equal to user_mapped */
if (WARN_ON(cpu_buffer->mapped < cpu_buffer->user_mapped))
return -EINVAL;
if (inc && cpu_buffer->mapped == UINT_MAX)
return -EBUSY;
if (WARN_ON(!inc && cpu_buffer->mapped == 0))
if (WARN_ON(!inc && cpu_buffer->user_mapped == 0))
return -EINVAL;
mutex_lock(&cpu_buffer->buffer->mutex);
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
if (inc)
if (inc) {
cpu_buffer->user_mapped++;
cpu_buffer->mapped++;
else
} else {
cpu_buffer->user_mapped--;
cpu_buffer->mapped--;
}
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
mutex_unlock(&cpu_buffer->buffer->mutex);
......@@ -6214,7 +6938,7 @@ static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
struct vm_area_struct *vma)
{
unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
unsigned long nr_subbufs, nr_pages, nr_vma_pages, pgoff = vma->vm_pgoff;
unsigned int subbuf_pages, subbuf_order;
struct page **pages;
int p = 0, s = 0;
......@@ -6225,6 +6949,12 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
!(vma->vm_flags & VM_MAYSHARE))
return -EPERM;
subbuf_order = cpu_buffer->buffer->subbuf_order;
subbuf_pages = 1 << subbuf_order;
if (subbuf_order && pgoff % subbuf_pages)
return -EINVAL;
/*
* Make sure the mapping cannot become writable later. Also tell the VM
* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND).
......@@ -6234,37 +6964,38 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
lockdep_assert_held(&cpu_buffer->mapping_lock);
subbuf_order = cpu_buffer->buffer->subbuf_order;
subbuf_pages = 1 << subbuf_order;
nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff; /* + meta-page */
vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
if (!vma_pages || vma_pages > nr_pages)
nr_vma_pages = vma_pages(vma);
if (!nr_vma_pages || nr_vma_pages > nr_pages)
return -EINVAL;
nr_pages = vma_pages;
nr_pages = nr_vma_pages;
pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
if (!pages)
return -ENOMEM;
if (!pgoff) {
unsigned long meta_page_padding;
pages[p++] = virt_to_page(cpu_buffer->meta_page);
/*
* TODO: Align sub-buffers on their size, once
* vm_insert_pages() supports the zero-page.
* Pad with the zero-page to align the meta-page with the
* sub-buffers.
*/
} else {
/* Skip the meta-page */
pgoff--;
meta_page_padding = subbuf_pages - 1;
while (meta_page_padding-- && p < nr_pages) {
unsigned long __maybe_unused zero_addr =
vma->vm_start + (PAGE_SIZE * p);
if (pgoff % subbuf_pages) {
err = -EINVAL;
goto out;
pages[p++] = ZERO_PAGE(zero_addr);
}
} else {
/* Skip the meta-page */
pgoff -= subbuf_pages;
s += pgoff / subbuf_pages;
}
......@@ -6316,7 +7047,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
mutex_lock(&cpu_buffer->mapping_lock);
if (cpu_buffer->mapped) {
if (cpu_buffer->user_mapped) {
err = __rb_map_vma(cpu_buffer, vma);
if (!err)
err = __rb_inc_dec_mapped(cpu_buffer, true);
......@@ -6347,12 +7078,15 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
*/
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
err = __rb_map_vma(cpu_buffer, vma);
if (!err) {
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
cpu_buffer->mapped = 1;
/* This is the first time it is mapped by user */
cpu_buffer->mapped++;
cpu_buffer->user_mapped = 1;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
} else {
kfree(cpu_buffer->subbuf_ids);
......@@ -6380,10 +7114,10 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
mutex_lock(&cpu_buffer->mapping_lock);
if (!cpu_buffer->mapped) {
if (!cpu_buffer->user_mapped) {
err = -ENODEV;
goto out;
} else if (cpu_buffer->mapped > 1) {
} else if (cpu_buffer->user_mapped > 1) {
__rb_inc_dec_mapped(cpu_buffer, false);
goto out;
}
......@@ -6391,7 +7125,10 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
mutex_lock(&buffer->mutex);
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
cpu_buffer->mapped = 0;
/* This is the last user space mapping */
if (!WARN_ON_ONCE(cpu_buffer->mapped < cpu_buffer->user_mapped))
cpu_buffer->mapped--;
cpu_buffer->user_mapped = 0;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
......
......@@ -482,7 +482,7 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | \
TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE | \
TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS | \
TRACE_ITER_HASH_PTR)
TRACE_ITER_HASH_PTR | TRACE_ITER_TRACE_PRINTK)
/* trace_options that are only supported by global_trace */
#define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER_PRINTK | \
......@@ -490,7 +490,7 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
/* trace_flags that are default zero for instances */
#define ZEROED_TRACE_FLAGS \
(TRACE_ITER_EVENT_FORK | TRACE_ITER_FUNC_FORK)
(TRACE_ITER_EVENT_FORK | TRACE_ITER_FUNC_FORK | TRACE_ITER_TRACE_PRINTK)
/*
* The global_trace is the descriptor that holds the top-level tracing
......@@ -500,6 +500,29 @@ static struct trace_array global_trace = {
.trace_flags = TRACE_DEFAULT_FLAGS,
};
static struct trace_array *printk_trace = &global_trace;
static __always_inline bool printk_binsafe(struct trace_array *tr)
{
/*
* The binary format of traceprintk can cause a crash if used
* by a buffer from another boot. Force the use of the
* non binary version of trace_printk if the trace_printk
* buffer is a boot mapped ring buffer.
*/
return !(tr->flags & TRACE_ARRAY_FL_BOOT);
}
static void update_printk_trace(struct trace_array *tr)
{
if (printk_trace == tr)
return;
printk_trace->trace_flags &= ~TRACE_ITER_TRACE_PRINTK;
printk_trace = tr;
tr->trace_flags |= TRACE_ITER_TRACE_PRINTK;
}
void trace_set_ring_buffer_expanded(struct trace_array *tr)
{
if (!tr)
......@@ -1117,7 +1140,7 @@ EXPORT_SYMBOL_GPL(__trace_array_puts);
*/
int __trace_puts(unsigned long ip, const char *str, int size)
{
return __trace_array_puts(&global_trace, ip, str, size);
return __trace_array_puts(printk_trace, ip, str, size);
}
EXPORT_SYMBOL_GPL(__trace_puts);
......@@ -1128,6 +1151,7 @@ EXPORT_SYMBOL_GPL(__trace_puts);
*/
int __trace_bputs(unsigned long ip, const char *str)
{
struct trace_array *tr = READ_ONCE(printk_trace);
struct ring_buffer_event *event;
struct trace_buffer *buffer;
struct bputs_entry *entry;
......@@ -1135,14 +1159,17 @@ int __trace_bputs(unsigned long ip, const char *str)
int size = sizeof(struct bputs_entry);
int ret = 0;
if (!(global_trace.trace_flags & TRACE_ITER_PRINTK))
if (!printk_binsafe(tr))
return __trace_puts(ip, str, strlen(str));
if (!(tr->trace_flags & TRACE_ITER_PRINTK))
return 0;
if (unlikely(tracing_selftest_running || tracing_disabled))
return 0;
trace_ctx = tracing_gen_ctx();
buffer = global_trace.array_buffer.buffer;
buffer = tr->array_buffer.buffer;
ring_buffer_nest_start(buffer);
event = __trace_buffer_lock_reserve(buffer, TRACE_BPUTS, size,
......@@ -1155,7 +1182,7 @@ int __trace_bputs(unsigned long ip, const char *str)
entry->str = str;
__buffer_unlock_commit(buffer, event);
ftrace_trace_stack(&global_trace, buffer, trace_ctx, 4, NULL);
ftrace_trace_stack(tr, buffer, trace_ctx, 4, NULL);
ret = 1;
out:
......@@ -3021,7 +3048,7 @@ void trace_dump_stack(int skip)
/* Skip 1 to skip this function. */
skip++;
#endif
__ftrace_trace_stack(global_trace.array_buffer.buffer,
__ftrace_trace_stack(printk_trace->array_buffer.buffer,
tracing_gen_ctx(), skip, NULL);
}
EXPORT_SYMBOL_GPL(trace_dump_stack);
......@@ -3240,12 +3267,15 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
struct trace_event_call *call = &event_bprint;
struct ring_buffer_event *event;
struct trace_buffer *buffer;
struct trace_array *tr = &global_trace;
struct trace_array *tr = READ_ONCE(printk_trace);
struct bprint_entry *entry;
unsigned int trace_ctx;
char *tbuffer;
int len = 0, size;
if (!printk_binsafe(tr))
return trace_vprintk(ip, fmt, args);
if (unlikely(tracing_selftest_running || tracing_disabled))
return 0;
@@ -3338,7 +3368,7 @@ __trace_array_vprintk(struct trace_buffer *buffer,
memcpy(&entry->buf, tbuffer, len + 1);
if (!call_filter_check_discard(call, entry, buffer, event)) {
__buffer_unlock_commit(buffer, event);
ftrace_trace_stack(&global_trace, buffer, trace_ctx, 6, NULL);
ftrace_trace_stack(printk_trace, buffer, trace_ctx, 6, NULL);
}
out:
@@ -3434,7 +3464,7 @@ int trace_array_printk_buf(struct trace_buffer *buffer,
int ret;
va_list ap;
if (!(global_trace.trace_flags & TRACE_ITER_PRINTK))
if (!(printk_trace->trace_flags & TRACE_ITER_PRINTK))
return 0;
va_start(ap, fmt);
@@ -3446,7 +3476,7 @@ int trace_array_printk_buf(struct trace_buffer *buffer,
__printf(2, 0)
int trace_vprintk(unsigned long ip, const char *fmt, va_list args)
{
return trace_array_vprintk(&global_trace, ip, fmt, args);
return trace_array_vprintk(printk_trace, ip, fmt, args);
}
EXPORT_SYMBOL_GPL(trace_vprintk);
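With both __trace_puts() and trace_vprintk() now routed through printk_trace, existing trace_printk() callers follow the redirection transparently; a minimal sketch of such a caller (the variables are made up for illustration):

  /* hypothetical caller: output lands in whichever buffer owns trace_printk_dest */
  trace_printk("handled req %d in %llu ns\n", req_id, delta_ns);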
@@ -3667,8 +3697,11 @@ static void test_can_verify(void)
void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
va_list ap)
{
long text_delta = iter->tr->text_delta;
long data_delta = iter->tr->data_delta;
const char *p = fmt;
const char *str;
bool good;
int i, j;
if (WARN_ON_ONCE(!fmt))
@@ -3687,7 +3720,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
j = 0;
/* We only care about %s and variants */
/*
* We only care about %s and variants
* as well as %p[sS] if delta is non-zero
*/
for (i = 0; p[i]; i++) {
if (i + 1 >= iter->fmt_size) {
/*
@@ -3716,6 +3752,11 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
}
if (p[i+j] == 's')
break;
if (text_delta && p[i+1] == 'p' &&
((p[i+2] == 's' || p[i+2] == 'S')))
break;
star = false;
}
j = 0;
@@ -3729,6 +3770,24 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
iter->fmt[i] = '\0';
trace_seq_vprintf(&iter->seq, iter->fmt, ap);
/* Add delta to %pS pointers */
if (p[i+1] == 'p') {
unsigned long addr;
char fmt[4];
fmt[0] = '%';
fmt[1] = 'p';
fmt[2] = p[i+2]; /* Either %ps or %pS */
fmt[3] = '\0';
addr = va_arg(ap, unsigned long);
addr += text_delta;
trace_seq_printf(&iter->seq, fmt, (void *)addr);
p += i + 3;
continue;
}
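/*
 * Editorial illustration, not part of this patch: text_delta is the shift
 * of the kernel text between the boot that recorded the event and the
 * current boot, so rebasing a recorded %ps/%pS argument amounts to
 *
 *     current_addr = recorded_addr + text_delta;
 *
 * which is what the addr += text_delta above computes before the pointer
 * is handed to trace_seq_printf() for symbol resolution in this boot.
 */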
/*
* If iter->seq is full, the above call no longer guarantees
* that ap is in sync with fmt processing, and further calls
@@ -3747,6 +3806,14 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
/* The ap now points to the string data of the %s */
str = va_arg(ap, const char *);
good = trace_safe_str(iter, str, star, len);
/* Could be from the last boot */
if (data_delta && !good) {
str += data_delta;
good = trace_safe_str(iter, str, star, len);
}
/*
* If you hit this warning, it is likely that the
* trace event in question used %s on a string that
@@ -3756,8 +3823,7 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
* instead. See samples/trace_events/trace-events-sample.h
* for reference.
*/
if (WARN_ONCE(!trace_safe_str(iter, str, star, len),
"fmt: '%s' current_buffer: '%s'",
if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
@@ -4919,6 +4985,11 @@ static int tracing_open(struct inode *inode, struct file *file)
static bool
trace_ok_for_array(struct tracer *t, struct trace_array *tr)
{
#ifdef CONFIG_TRACER_SNAPSHOT
/* arrays with mapped buffer range do not have snapshots */
if (tr->range_addr_start && t->use_max_tr)
return false;
#endif
return (tr->flags & TRACE_ARRAY_FL_GLOBAL) || t->allow_instances;
}
@@ -5011,7 +5082,7 @@ static int show_traces_open(struct inode *inode, struct file *file)
return 0;
}
static int show_traces_release(struct inode *inode, struct file *file)
static int tracing_seq_release(struct inode *inode, struct file *file)
{
struct trace_array *tr = inode->i_private;
@@ -5052,7 +5123,7 @@ static const struct file_operations show_traces_fops = {
.open = show_traces_open,
.read = seq_read,
.llseek = seq_lseek,
.release = show_traces_release,
.release = tracing_seq_release,
};
static ssize_t
@@ -5237,7 +5308,8 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
{
if ((mask == TRACE_ITER_RECORD_TGID) ||
(mask == TRACE_ITER_RECORD_CMD))
(mask == TRACE_ITER_RECORD_CMD) ||
(mask == TRACE_ITER_TRACE_PRINTK))
lockdep_assert_held(&event_mutex);
/* do nothing if flag is already set */
@@ -5249,6 +5321,25 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
if (tr->current_trace->flag_changed(tr, mask, !!enabled))
return -EINVAL;
if (mask == TRACE_ITER_TRACE_PRINTK) {
if (enabled) {
update_printk_trace(tr);
} else {
/*
* The global_trace cannot clear this.
* Its flag only gets cleared when another instance sets it.
*/
if (printk_trace == &global_trace)
return -EINVAL;
/*
* An instance must always have it set;
* by default, that's the global_trace instance.
*/
if (printk_trace == tr)
update_printk_trace(&global_trace);
}
}
if (enabled)
tr->trace_flags |= mask;
else
@@ -6034,6 +6125,18 @@ ssize_t tracing_resize_ring_buffer(struct trace_array *tr,
return ret;
}
static void update_last_data(struct trace_array *tr)
{
if (!tr->text_delta && !tr->data_delta)
return;
/* Clear old data */
tracing_reset_online_cpus(&tr->array_buffer);
/* Using current data now */
tr->text_delta = 0;
tr->data_delta = 0;
}
/**
* tracing_update_buffers - used by tracing facility to expand ring buffers
@@ -6051,6 +6154,9 @@ int tracing_update_buffers(struct trace_array *tr)
int ret = 0;
mutex_lock(&trace_types_lock);
update_last_data(tr);
if (!tr->ring_buffer_expanded)
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6106,6 +6212,8 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
mutex_lock(&trace_types_lock);
update_last_data(tr);
if (!tr->ring_buffer_expanded) {
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6853,6 +6961,37 @@ tracing_total_entries_read(struct file *filp, char __user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
}
static ssize_t
tracing_last_boot_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
{
struct trace_array *tr = filp->private_data;
struct seq_buf seq;
char buf[64];
seq_buf_init(&seq, buf, 64);
seq_buf_printf(&seq, "text delta:\t%ld\n", tr->text_delta);
seq_buf_printf(&seq, "data delta:\t%ld\n", tr->data_delta);
return simple_read_from_buffer(ubuf, cnt, ppos, buf, seq_buf_used(&seq));
}
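This read callback backs the last_boot_info file that is created for boot-mapped instances further down in this diff; a sketch of what reading it looks like (the instance name is assumed, and the values are the signed offsets added to last boot's addresses to map them into this boot's layout):

  # cat /sys/kernel/tracing/instances/boot_map/last_boot_info
  text delta:	<signed offset applied to last boot's text addresses>
  data delta:	<signed offset applied to last boot's data addresses>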
static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
{
struct trace_array *tr = inode->i_private;
int cpu = tracing_get_cpu(inode);
int ret;
ret = tracing_check_open_get_tr(tr);
if (ret)
return ret;
ret = ring_buffer_meta_seq_init(filp, tr->array_buffer.buffer, cpu);
if (ret < 0)
__trace_array_put(tr);
return ret;
}
static ssize_t
tracing_free_buffer_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -7429,6 +7568,13 @@ static const struct file_operations tracing_entries_fops = {
.release = tracing_release_generic_tr,
};
static const struct file_operations tracing_buffer_meta_fops = {
.open = tracing_buffer_meta_open,
.read = seq_read,
.llseek = seq_lseek,
.release = tracing_seq_release,
};
static const struct file_operations tracing_total_entries_fops = {
.open = tracing_open_generic_tr,
.read = tracing_total_entries_read,
@@ -7469,6 +7615,13 @@ static const struct file_operations trace_time_stamp_mode_fops = {
.release = tracing_single_release_tr,
};
static const struct file_operations last_boot_fops = {
.open = tracing_open_generic_tr,
.read = tracing_last_boot_read,
.llseek = generic_file_llseek,
.release = tracing_release_generic_tr,
};
#ifdef CONFIG_TRACER_SNAPSHOT
static const struct file_operations snapshot_fops = {
.open = tracing_snapshot_open,
@@ -8661,12 +8814,17 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu)
trace_create_cpu_file("buffer_size_kb", TRACE_MODE_READ, d_cpu,
tr, cpu, &tracing_entries_fops);
if (tr->range_addr_start)
trace_create_cpu_file("buffer_meta", TRACE_MODE_READ, d_cpu,
tr, cpu, &tracing_buffer_meta_fops);
#ifdef CONFIG_TRACER_SNAPSHOT
if (!tr->range_addr_start) {
trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
tr, cpu, &snapshot_fops);
trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
tr, cpu, &snapshot_raw_fops);
}
#endif
}
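For boot-mapped instances the per-CPU directory therefore gains a buffer_meta file while the snapshot files are omitted; an illustrative path only (instance name and CPU assumed):

  # cat /sys/kernel/tracing/instances/boot_map/per_cpu/cpu0/buffer_meta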
@@ -9203,7 +9361,21 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size
buf->tr = tr;
if (tr->range_addr_start && tr->range_addr_size) {
buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
tr->range_addr_start,
tr->range_addr_size);
ring_buffer_last_boot_delta(buf->buffer,
&tr->text_delta, &tr->data_delta);
/*
* This is basically the same as a mapped buffer,
* with the same restrictions.
*/
tr->mapped++;
} else {
buf->buffer = ring_buffer_alloc(size, rb_flags);
}
if (!buf->buffer)
return -ENOMEM;
@@ -9240,6 +9412,10 @@ static int allocate_trace_buffers(struct trace_array *tr, int size)
return ret;
#ifdef CONFIG_TRACER_MAX_TRACE
/* Fixed (boot mapped) buffer trace arrays do not have snapshot buffers */
if (tr->range_addr_start)
return 0;
ret = allocate_trace_buffer(tr, &tr->max_buffer,
allocate_snapshot ? size : 1);
if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
@@ -9340,7 +9516,9 @@ static int trace_array_create_dir(struct trace_array *tr)
}
static struct trace_array *
trace_array_create_systems(const char *name, const char *systems)
trace_array_create_systems(const char *name, const char *systems,
unsigned long range_addr_start,
unsigned long range_addr_size)
{
struct trace_array *tr;
int ret;
@@ -9366,6 +9544,10 @@ trace_array_create_systems(const char *name, const char *systems)
goto out_free_tr;
}
/* Only for boot up memory mapped ring buffers */
tr->range_addr_start = range_addr_start;
tr->range_addr_size = range_addr_size;
tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS;
cpumask_copy(tr->tracing_cpumask, cpu_all_mask);
@@ -9423,7 +9605,7 @@ trace_array_create_systems(const char *name, const char *systems)
static struct trace_array *trace_array_create(const char *name)
{
return trace_array_create_systems(name, NULL);
return trace_array_create_systems(name, NULL, 0, 0);
}
static int instance_mkdir(const char *name)
@@ -9448,6 +9630,31 @@ static int instance_mkdir(const char *name)
return ret;
}
static u64 map_pages(u64 start, u64 size)
{
struct page **pages;
phys_addr_t page_start;
unsigned int page_count;
unsigned int i;
void *vaddr;
page_count = DIV_ROUND_UP(size, PAGE_SIZE);
page_start = start;
pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
if (!pages)
return 0;
for (i = 0; i < page_count; i++) {
phys_addr_t addr = page_start + i * PAGE_SIZE;
pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
}
vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
kfree(pages);
return (u64)(unsigned long)vaddr;
}
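/*
 * Editorial note, not part of the patch: the helper above assumes the
 * reserved physical range is page aligned and that every PFN in it has a
 * valid struct page (i.e. it lies inside memory covered by the kernel's
 * memmap), since pfn_to_page() is used directly on each PFN; vmap() then
 * provides a contiguous kernel virtual mapping over those pages.
 */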
/**
* trace_array_get_by_name - Create/Lookup a trace array, given its name.
* @name: The name of the trace array to be looked up/created.
@@ -9477,7 +9684,7 @@ struct trace_array *trace_array_get_by_name(const char *name, const char *system
goto out_unlock;
}
tr = trace_array_create_systems(name, systems);
tr = trace_array_create_systems(name, systems, 0, 0);
if (IS_ERR(tr))
tr = NULL;
@@ -9507,6 +9714,9 @@ static int __remove_instance(struct trace_array *tr)
set_tracer_flag(tr, 1 << i, 0);
}
if (printk_trace == tr)
update_printk_trace(&global_trace);
tracing_set_nop(tr);
clear_ftrace_function_probes(tr);
event_trace_del_tracer(tr);
@@ -9669,10 +9879,15 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
if (ftrace_create_function_files(tr, d_tracer))
MEM_FAIL(1, "Could not allocate function filter files");
if (tr->range_addr_start) {
trace_create_file("last_boot_info", TRACE_MODE_READ, d_tracer,
tr, &last_boot_fops);
#ifdef CONFIG_TRACER_SNAPSHOT
} else {
trace_create_file("snapshot", TRACE_MODE_WRITE, d_tracer,
tr, &snapshot_fops);
#endif
}
trace_create_file("error_log", TRACE_MODE_WRITE, d_tracer,
tr, &tracing_err_log_fops);
@@ -10292,6 +10507,7 @@ __init static void enable_instances(void)
{
struct trace_array *tr;
char *curr_str;
char *name;
char *str;
char *tok;
@@ -10300,18 +10516,106 @@ __init static void enable_instances(void)
str = boot_instance_info;
while ((curr_str = strsep(&str, "\t"))) {
phys_addr_t start = 0;
phys_addr_t size = 0;
unsigned long addr = 0;
bool traceprintk = false;
bool traceoff = false;
char *flag_delim;
char *addr_delim;
tok = strsep(&curr_str, ",");
flag_delim = strchr(tok, '^');
addr_delim = strchr(tok, '@');
if (addr_delim)
*addr_delim++ = '\0';
if (flag_delim)
*flag_delim++ = '\0';
name = tok;
if (flag_delim) {
char *flag;
while ((flag = strsep(&flag_delim, "^"))) {
if (strcmp(flag, "traceoff") == 0) {
traceoff = true;
} else if ((strcmp(flag, "printk") == 0) ||
(strcmp(flag, "traceprintk") == 0) ||
(strcmp(flag, "trace_printk") == 0)) {
traceprintk = true;
} else {
pr_info("Tracing: Invalid instance flag '%s' for %s\n",
flag, name);
}
}
}
tok = addr_delim;
if (tok && isdigit(*tok)) {
start = memparse(tok, &tok);
if (!start) {
pr_warn("Tracing: Invalid boot instance address for %s\n",
name);
continue;
}
if (*tok != ':') {
pr_warn("Tracing: No size specified for instance %s\n", name);
continue;
}
tok++;
size = memparse(tok, &tok);
if (!size) {
pr_warn("Tracing: Invalid boot instance size for %s\n",
name);
continue;
}
} else if (tok) {
if (!reserve_mem_find_by_name(tok, &start, &size)) {
start = 0;
pr_warn("Failed to map boot instance %s to %s\n", name, tok);
continue;
}
}
if (start) {
addr = map_pages(start, size);
if (addr) {
pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
name, &start, (unsigned long)size);
} else {
pr_warn("Tracing: Failed to map boot instance %s\n", name);
continue;
}
} else {
/* Only non-mapped buffers have snapshot buffers */
if (IS_ENABLED(CONFIG_TRACER_MAX_TRACE))
do_allocate_snapshot(tok);
do_allocate_snapshot(name);
}
tr = trace_array_get_by_name(tok, NULL);
if (!tr) {
pr_warn("Failed to create instance buffer %s\n", curr_str);
tr = trace_array_create_systems(name, NULL, addr, size);
if (IS_ERR(tr)) {
pr_warn("Tracing: Failed to create instance buffer %s\n", curr_str);
continue;
}
/* Allow user space to delete it */
if (traceoff)
tracer_tracing_off(tr);
if (traceprintk)
update_printk_trace(tr);
/*
* If start is set, then this is a mapped buffer, and
* cannot be deleted by user space, so keep the reference
* to it.
*/
if (start)
tr->flags |= TRACE_ARRAY_FL_BOOT;
else
trace_array_put(tr);
while ((tok = strsep(&curr_str, ","))) {
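Putting the parsing above together, each trace_instance= entry is an instance name, then optional '^'-separated flags, then an optional '@' section that is either a physical start:size pair or a reserve_mem label. An illustrative command line — the instance names, address, size and region label are all made up:

  trace_instance=my_map^traceoff@0x98000000:12M
  trace_instance=other_map^traceprintk@my_region

where my_region would be a region previously set aside by name via the reserve_mem boot option (looked up here with reserve_mem_find_by_name()).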
@@ -336,7 +336,6 @@ struct trace_array {
bool allocated_snapshot;
spinlock_t snapshot_trigger_lock;
unsigned int snapshot;
unsigned int mapped;
unsigned long max_latency;
#ifdef CONFIG_FSNOTIFY
struct dentry *d_max_latency;
@@ -344,6 +343,13 @@ struct trace_array {
struct irq_work fsnotify_irqwork;
#endif
#endif
/* The below is for memory mapped ring buffer */
unsigned int mapped;
unsigned long range_addr_start;
unsigned long range_addr_size;
long text_delta;
long data_delta;
struct trace_pid_list __rcu *filtered_pids;
struct trace_pid_list __rcu *filtered_no_pids;
/*
@@ -423,7 +429,8 @@ struct trace_array {
};
enum {
TRACE_ARRAY_FL_GLOBAL = (1 << 0)
TRACE_ARRAY_FL_GLOBAL = BIT(0),
TRACE_ARRAY_FL_BOOT = BIT(1),
};
extern struct list_head ftrace_trace_arrays;
@@ -644,6 +651,8 @@ trace_buffer_lock_reserve(struct trace_buffer *buffer,
unsigned long len,
unsigned int trace_ctx);
int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, int cpu);
struct trace_entry *tracing_get_trace_entry(struct trace_array *tr,
struct trace_array_cpu *data);
@@ -1312,6 +1321,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
C(IRQ_INFO, "irq-info"), \
C(MARKERS, "markers"), \
C(EVENT_FORK, "event-fork"), \
C(TRACE_PRINTK, "trace_printk_dest"), \
C(PAUSE_ON_TRACE, "pause-on-trace"), \
C(HASH_PTR, "hash-ptr"), /* Print hashed pointer */ \
FUNCTION_FLAGS \
@@ -544,6 +544,8 @@ print_graph_irq(struct trace_iterator *iter, unsigned long addr,
struct trace_seq *s = &iter->seq;
struct trace_entry *ent = iter->ent;
addr += iter->tr->text_delta;
if (addr < (unsigned long)__irqentry_text_start ||
addr >= (unsigned long)__irqentry_text_end)
return;
@@ -710,6 +712,7 @@ print_graph_entry_leaf(struct trace_iterator *iter,
struct ftrace_graph_ret *graph_ret;
struct ftrace_graph_ent *call;
unsigned long long duration;
unsigned long func;
int cpu = iter->cpu;
int i;
@@ -717,6 +720,8 @@ print_graph_entry_leaf(struct trace_iterator *iter,
call = &entry->graph_ent;
duration = graph_ret->rettime - graph_ret->calltime;
func = call->func + iter->tr->text_delta;
if (data) {
struct fgraph_cpu_data *cpu_data;
@@ -747,10 +752,10 @@ print_graph_entry_leaf(struct trace_iterator *iter,
* enabled.
*/
if (flags & __TRACE_GRAPH_PRINT_RETVAL)
print_graph_retval(s, graph_ret->retval, true, (void *)call->func,
print_graph_retval(s, graph_ret->retval, true, (void *)func,
!!(flags & TRACE_GRAPH_PRINT_RETVAL_HEX));
else
trace_seq_printf(s, "%ps();\n", (void *)call->func);
trace_seq_printf(s, "%ps();\n", (void *)func);
print_graph_irq(iter, graph_ret->func, TRACE_GRAPH_RET,
cpu, iter->ent->pid, flags);
@@ -766,6 +771,7 @@ print_graph_entry_nested(struct trace_iterator *iter,
struct ftrace_graph_ent *call = &entry->graph_ent;
struct fgraph_data *data = iter->private;
struct trace_array *tr = iter->tr;
unsigned long func;
int i;
if (data) {
@@ -788,7 +794,9 @@ print_graph_entry_nested(struct trace_iterator *iter,
for (i = 0; i < call->depth * TRACE_GRAPH_INDENT; i++)
trace_seq_putc(s, ' ');
trace_seq_printf(s, "%ps() {\n", (void *)call->func);
func = call->func + iter->tr->text_delta;
trace_seq_printf(s, "%ps() {\n", (void *)func);
if (trace_seq_has_overflowed(s))
return TRACE_TYPE_PARTIAL_LINE;
@@ -863,6 +871,8 @@ check_irq_entry(struct trace_iterator *iter, u32 flags,
int *depth_irq;
struct fgraph_data *data = iter->private;
addr += iter->tr->text_delta;
/*
* If we are either displaying irqs, or we got called as
* a graph event and private data does not exist,
@@ -990,11 +1000,14 @@ print_graph_return(struct ftrace_graph_ret *trace, struct trace_seq *s,
unsigned long long duration = trace->rettime - trace->calltime;
struct fgraph_data *data = iter->private;
struct trace_array *tr = iter->tr;
unsigned long func;
pid_t pid = ent->pid;
int cpu = iter->cpu;
int func_match = 1;
int i;
func = trace->func + iter->tr->text_delta;
if (check_irq_return(iter, flags, trace->depth))
return TRACE_TYPE_HANDLED;
@@ -1033,7 +1046,7 @@ print_graph_return(struct ftrace_graph_ret *trace, struct trace_seq *s,
* function-retval option is enabled.
*/
if (flags & __TRACE_GRAPH_PRINT_RETVAL) {
print_graph_retval(s, trace->retval, false, (void *)trace->func,
print_graph_retval(s, trace->retval, false, (void *)func,
!!(flags & TRACE_GRAPH_PRINT_RETVAL_HEX));
} else {
/*
@@ -1046,7 +1059,7 @@ print_graph_return(struct ftrace_graph_ret *trace, struct trace_seq *s,
if (func_match && !(flags & TRACE_GRAPH_PRINT_TAIL))
trace_seq_puts(s, "}\n");
else
trace_seq_printf(s, "} /* %ps */\n", (void *)trace->func);
trace_seq_printf(s, "} /* %ps */\n", (void *)func);
}
/* Overrun */
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags,
}
static void print_fn_trace(struct trace_seq *s, unsigned long ip,
unsigned long parent_ip, int flags)
unsigned long parent_ip, long delta, int flags)
{
ip += delta;
parent_ip += delta;
seq_print_ip_sym(s, ip, flags);
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags,
trace_assign_type(field, iter->ent);
print_fn_trace(s, field->ip, field->parent_ip, flags);
print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags);
trace_seq_putc(s, '\n');
return trace_handle_return(s);
@@ -1230,6 +1233,7 @@ static enum print_line_t trace_stack_print(struct trace_iterator *iter,
struct trace_seq *s = &iter->seq;
unsigned long *p;
unsigned long *end;
long delta = iter->tr->text_delta;
trace_assign_type(field, iter->ent);
end = (unsigned long *)((long)iter->ent + iter->ent_size);
@@ -1242,7 +1246,7 @@ static enum print_line_t trace_stack_print(struct trace_iterator *iter,
break;
trace_seq_puts(s, " => ");
seq_print_ip_sym(s, *p, flags);
seq_print_ip_sym(s, (*p) + delta, flags);
trace_seq_putc(s, '\n');
}
@@ -1587,10 +1591,13 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
{
struct print_entry *field;
struct trace_seq *s = &iter->seq;
unsigned long ip;
trace_assign_type(field, iter->ent);
seq_print_ip_sym(s, field->ip, flags);
ip = field->ip + iter->tr->text_delta;
seq_print_ip_sym(s, ip, flags);
trace_seq_printf(s, ": %s", field->buf);
return trace_handle_return(s);
@@ -1674,7 +1681,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags,
trace_assign_type(field, iter->ent);
print_fn_trace(s, field->ip, field->parent_ip, flags);
print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
@@ -92,12 +92,22 @@ int tracefs_cpu_map(struct tracefs_cpu_map_desc *desc, int cpu)
if (desc->cpu_fd < 0)
return -ENODEV;
again:
map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, desc->cpu_fd, 0);
if (map == MAP_FAILED)
return -errno;
desc->meta = (struct trace_buffer_meta *)map;
/* the meta-page is bigger than the original mapping */
if (page_size < desc->meta->meta_struct_len) {
int meta_page_size = desc->meta->meta_page_size;
munmap(desc->meta, page_size);
page_size = meta_page_size;
goto again;
}
return 0;
}
@@ -228,6 +238,20 @@ TEST_F(map, data_mmap)
data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
desc->cpu_fd, meta_len);
ASSERT_EQ(data, MAP_FAILED);
/* Verify meta-page padding */
if (desc->meta->meta_page_size > getpagesize()) {
data_len = desc->meta->meta_page_size;
data = mmap(NULL, data_len,
PROT_READ, MAP_SHARED, desc->cpu_fd, 0);
ASSERT_NE(data, MAP_FAILED);
for (int i = desc->meta->meta_struct_len;
i < desc->meta->meta_page_size; i += sizeof(int))
ASSERT_EQ(*(int *)(data + i), 0);
munmap(data, data_len);
}
}
FIXTURE(snapshot) {