Commit ee9f8fce authored by Josh Poimboeuf's avatar Josh Poimboeuf Committed by Ingo Molnar

x86/unwind: Add the ORC unwinder

Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
It plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt - but the short version is
that it's a simplified, fundamentally more robust debugninfo
data structure, which also allows up to two orders of magnitude
faster lookups than the DWARF unwinder - which matters to
profiling workloads like perf.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.
Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
[ Extended the changelog. ]
Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
parent 1ee6f00d
ORC unwinder
============
Overview
--------
The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder. The difference is that the
format of the ORC data is much simpler than DWARF, which in turn allows
the ORC unwinder to be much simpler and faster.
The ORC data consists of unwind tables which are generated by objtool.
They contain out-of-band data which is used by the in-kernel ORC
unwinder. Objtool generates the ORC data by first doing compile-time
stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing
all the code paths of a .o file, it determines information about the
stack state at each instruction address in the file and outputs that
information to the .orc_unwind and .orc_unwind_ip sections.
The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time. The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.
ORC vs frame pointers
---------------------
With frame pointers enabled, GCC adds instrumentation code to every
function in the kernel. The kernel's .text size increases by about
3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel
Gorman [1] have shown a slowdown of 5-10% for some workloads.
In contrast, the ORC unwinder has no effect on text size or runtime
performance, because the debuginfo is out of band. So if you disable
frame pointers and enable the ORC unwinder, you get a nice performance
improvement across the board, and still have reliable stack traces.
Ingo Molnar says:
"Note that it's not just a performance improvement, but also an
instruction cache locality improvement: 3.2% .text savings almost
directly transform into a similarly sized reduction in cache
footprint. That can transform to even higher speedups for workloads
whose cache locality is borderline."
Another benefit of ORC compared to frame pointers is that it can
reliably unwind across interrupts and exceptions. Frame pointer based
unwinds can sometimes skip the caller of the interrupted function, if it
was a leaf function or if the interrupt hit before the frame pointer was
saved.
The main disadvantage of the ORC unwinder compared to frame pointers is
that it needs more memory to store the ORC unwind tables: roughly 2-4MB
depending on the kernel config.
ORC vs DWARF
------------
ORC debuginfo's advantage over DWARF itself is that it's much simpler.
It gets rid of the complex DWARF CFI state machine and also gets rid of
the tracking of unnecessary registers. This allows the unwinder to be
much simpler, meaning fewer bugs, which is especially important for
mission critical oops code.
The simpler debuginfo format also enables the unwinder to be much faster
than DWARF, which is important for perf and lockdep. In a basic
performance test by Jiri Slaby [2], the ORC unwinder was about 20x
faster than an out-of-tree DWARF unwinder. (Note: That measurement was
taken before some performance tweaks were added, which doubled
performance, so the speedup over DWARF may be closer to 40x.)
The ORC data format does have a few downsides compared to DWARF. ORC
unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
than DWARF-based eh_frame tables.
Another potential downside is that, as GCC evolves, it's conceivable
that the ORC data may end up being *too* simple to describe the state of
the stack for certain optimizations. But IMO this is unlikely because
GCC saves the frame pointer for any unusual stack adjustments it does,
so I suspect we'll really only ever need to keep track of the stack
pointer and the frame pointer between call frames. But even if we do
end up having to track all the registers DWARF tracks, at least we will
still be able to control the format, e.g. no complex state machines.
ORC unwind table generation
---------------------------
The ORC data is generated by objtool. With the existing compile-time
stack metadata validation feature, objtool already follows all code
paths, and so it already has all the information it needs to be able to
generate ORC data from scratch. So it's an easy step to go from stack
validation to ORC data generation.
It should be possible to instead generate the ORC data with a simple
tool which converts DWARF to ORC data. However, such a solution would
be incomplete due to the kernel's extensive use of asm, inline asm, and
special sections like exception tables.
That could be rectified by manually annotating those special code paths
using GNU assembler .cfi annotations in .S files, and homegrown
annotations for inline asm in .c files. But asm annotations were tried
in the past and were found to be unmaintainable. They were often
incorrect/incomplete and made the code harder to read and keep updated.
And based on looking at glibc code, annotating inline asm in .c files
might be even worse.
Objtool still needs a few annotations, but only in code which does
unusual things to the stack like entry code. And even then, far fewer
annotations are needed than what DWARF would need, so they're much more
maintainable than DWARF CFI annotations.
So the advantages of using objtool to generate ORC data are that it
gives more accurate debuginfo, with very few annotations. It also
insulates the kernel from toolchain bugs which can be very painful to
deal with in the kernel since we often have to workaround issues in
older versions of the toolchain for years.
The downside is that the unwinder now becomes dependent on objtool's
ability to reverse engineer GCC code flow. If GCC optimizations become
too complicated for objtool to follow, the ORC data generation might
stop working or become incomplete. (It's worth noting that livepatch
already has such a dependency on objtool's ability to follow GCC code
flow.)
If newer versions of GCC come up with some optimizations which break
objtool, we may need to revisit the current implementation. Some
possible solutions would be asking GCC to make the optimizations more
palatable, or having objtool use DWARF as an additional input, or
creating a GCC plugin to assist objtool with its analysis. But for now,
objtool follows GCC code quite well.
Unwinder implementation details
-------------------------------
Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
tools/objtool/Documentation/stack-validation.txt. After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
respectively.
The ORC data is split into the two arrays for performance reasons, to
make the searchable part of the data (.orc_unwind_ip) more compact. The
arrays are sorted in parallel at boot time.
Performance is further improved by the use of a fast lookup table which
is created at runtime. The fast lookup table associates a given address
with a range of indices for the .orc_unwind table, so that only a small
subset of the table needs to be searched.
Etymology
---------
Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
enemies. Similarly, the ORC unwinder was created in opposition to the
complexity and slowness of DWARF.
"Although Orcs rarely consider multiple solutions to a problem, they do
excel at getting things done because they are creatures of action, not
thought." [3] Similarly, unlike the esoteric DWARF unwinder, the
veracious ORC unwinder wastes no time or siloconic effort decoding
variable-length zero-extended unsigned-integer byte-coded
state-machine-based debug information entries.
Similar to how Orcs frequently unravel the well-intentioned plans of
their adversaries, the ORC unwinder frequently unravels stacks with
brutal, unyielding efficiency.
ORC stands for Oops Rewind Capability.
[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
[3] http://dustin.wikidot.com/half-orcs-and-orcs
#ifndef _ASM_UML_UNWIND_H
#define _ASM_UML_UNWIND_H
static inline void
unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size) {}
#endif /* _ASM_UML_UNWIND_H */
...@@ -157,6 +157,7 @@ config X86 ...@@ -157,6 +157,7 @@ config X86
select HAVE_MEMBLOCK select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI select HAVE_NMI
select HAVE_OPROFILE select HAVE_OPROFILE
select HAVE_OPTPROBES select HAVE_OPTPROBES
......
...@@ -355,4 +355,29 @@ config PUNIT_ATOM_DEBUG ...@@ -355,4 +355,29 @@ config PUNIT_ATOM_DEBUG
The current power state can be read from The current power state can be read from
/sys/kernel/debug/punit_atom/dev_power_state /sys/kernel/debug/punit_atom/dev_power_state
config ORC_UNWINDER
bool "ORC unwinder"
depends on X86_64
select STACK_VALIDATION
---help---
This option enables the ORC (Oops Rewind Capability) unwinder for
unwinding kernel stack traces. It uses a custom data format which is
a simplified version of the DWARF Call Frame Information standard.
This unwinder is more accurate across interrupt entry frames than the
frame pointer unwinder. It can also enable a 5-10% performance
improvement across the entire kernel if CONFIG_FRAME_POINTER is
disabled.
Enabling this option will increase the kernel's runtime memory usage
by roughly 2-4MB, depending on your kernel config.
config FRAME_POINTER_UNWINDER
def_bool y
depends on !ORC_UNWINDER && FRAME_POINTER
config GUESS_UNWINDER
def_bool y
depends on !ORC_UNWINDER && !FRAME_POINTER
endmenu endmenu
...@@ -2,6 +2,15 @@ ...@@ -2,6 +2,15 @@
#define _ASM_X86_MODULE_H #define _ASM_X86_MODULE_H
#include <asm-generic/module.h> #include <asm-generic/module.h>
#include <asm/orc_types.h>
struct mod_arch_specific {
#ifdef CONFIG_ORC_UNWINDER
unsigned int num_orcs;
int *orc_unwind_ip;
struct orc_entry *orc_unwind;
#endif
};
#ifdef CONFIG_X86_64 #ifdef CONFIG_X86_64
/* X86_64 does not define MODULE_PROC_FAMILY */ /* X86_64 does not define MODULE_PROC_FAMILY */
......
/*
* Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, see <http://www.gnu.org/licenses/>.
*/
#ifndef _ORC_LOOKUP_H
#define _ORC_LOOKUP_H
/*
* This is a lookup table for speeding up access to the .orc_unwind table.
* Given an input address offset, the corresponding lookup table entry
* specifies a subset of the .orc_unwind table to search.
*
* Each block represents the end of the previous range and the start of the
* next range. An extra block is added to give the last range an end.
*
* The block size should be a power of 2 to avoid a costly 'div' instruction.
*
* A block size of 256 was chosen because it roughly doubles unwinder
* performance while only adding ~5% to the ORC data footprint.
*/
#define LOOKUP_BLOCK_ORDER 8
#define LOOKUP_BLOCK_SIZE (1 << LOOKUP_BLOCK_ORDER)
#ifndef LINKER_SCRIPT
extern unsigned int orc_lookup[];
extern unsigned int orc_lookup_end[];
#define LOOKUP_START_IP (unsigned long)_stext
#define LOOKUP_STOP_IP (unsigned long)_etext
#endif /* LINKER_SCRIPT */
#endif /* _ORC_LOOKUP_H */
...@@ -88,7 +88,7 @@ struct orc_entry { ...@@ -88,7 +88,7 @@ struct orc_entry {
unsigned sp_reg:4; unsigned sp_reg:4;
unsigned bp_reg:4; unsigned bp_reg:4;
unsigned type:2; unsigned type:2;
}; } __packed;
/* /*
* This struct is used by asm and inline asm code to manually annotate the * This struct is used by asm and inline asm code to manually annotate the
......
...@@ -12,11 +12,14 @@ struct unwind_state { ...@@ -12,11 +12,14 @@ struct unwind_state {
struct task_struct *task; struct task_struct *task;
int graph_idx; int graph_idx;
bool error; bool error;
#ifdef CONFIG_FRAME_POINTER #if defined(CONFIG_ORC_UNWINDER)
bool signal, full_regs;
unsigned long sp, bp, ip;
struct pt_regs *regs;
#elif defined(CONFIG_FRAME_POINTER)
bool got_irq; bool got_irq;
unsigned long *bp, *orig_sp; unsigned long *bp, *orig_sp, ip;
struct pt_regs *regs; struct pt_regs *regs;
unsigned long ip;
#else #else
unsigned long *sp; unsigned long *sp;
#endif #endif
...@@ -24,41 +27,30 @@ struct unwind_state { ...@@ -24,41 +27,30 @@ struct unwind_state {
void __unwind_start(struct unwind_state *state, struct task_struct *task, void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame); struct pt_regs *regs, unsigned long *first_frame);
bool unwind_next_frame(struct unwind_state *state); bool unwind_next_frame(struct unwind_state *state);
unsigned long unwind_get_return_address(struct unwind_state *state); unsigned long unwind_get_return_address(struct unwind_state *state);
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state);
static inline bool unwind_done(struct unwind_state *state) static inline bool unwind_done(struct unwind_state *state)
{ {
return state->stack_info.type == STACK_TYPE_UNKNOWN; return state->stack_info.type == STACK_TYPE_UNKNOWN;
} }
static inline
void unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
{
first_frame = first_frame ? : get_stack_pointer(task, regs);
__unwind_start(state, task, regs, first_frame);
}
static inline bool unwind_error(struct unwind_state *state) static inline bool unwind_error(struct unwind_state *state)
{ {
return state->error; return state->error;
} }
#ifdef CONFIG_FRAME_POINTER
static inline static inline
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state) void unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
{ {
if (unwind_done(state)) first_frame = first_frame ? : get_stack_pointer(task, regs);
return NULL;
return state->regs ? &state->regs->ip : state->bp + 1; __unwind_start(state, task, regs, first_frame);
} }
#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER)
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state) static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
{ {
if (unwind_done(state)) if (unwind_done(state))
...@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state) ...@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
return state->regs; return state->regs;
} }
#else
#else /* !CONFIG_FRAME_POINTER */ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
static inline
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
{ {
return NULL; return NULL;
} }
#endif
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state) #ifdef CONFIG_ORC_UNWINDER
void unwind_init(void);
void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size);
#else
static inline void unwind_init(void) {}
static inline
void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size) {}
#endif
/*
* This disables KASAN checking when reading a value from another task's stack,
* since the other task could be running on another CPU and could have poisoned
* the stack in the meantime.
*/
#define READ_ONCE_TASK_STACK(task, x) \
({ \
unsigned long val; \
if (task == current) \
val = READ_ONCE(x); \
else \
val = READ_ONCE_NOCHECK(x); \
val; \
})
static inline bool task_on_another_cpu(struct task_struct *task)
{ {
return NULL; #ifdef CONFIG_SMP
return task != current && task->on_cpu;
#else
return false;
#endif
} }
#endif /* CONFIG_FRAME_POINTER */
#endif /* _ASM_X86_UNWIND_H */ #endif /* _ASM_X86_UNWIND_H */
...@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS) += perf_regs.o ...@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS) += perf_regs.o
obj-$(CONFIG_TRACING) += tracepoint.o obj-$(CONFIG_TRACING) += tracepoint.o
obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o
ifdef CONFIG_FRAME_POINTER obj-$(CONFIG_ORC_UNWINDER) += unwind_orc.o
obj-y += unwind_frame.o obj-$(CONFIG_FRAME_POINTER_UNWINDER) += unwind_frame.o
else obj-$(CONFIG_GUESS_UNWINDER) += unwind_guess.o
obj-y += unwind_guess.o
endif
### ###
# 64 bit specific files # 64 bit specific files
......
...@@ -35,6 +35,7 @@ ...@@ -35,6 +35,7 @@
#include <asm/page.h> #include <asm/page.h>
#include <asm/pgtable.h> #include <asm/pgtable.h>
#include <asm/setup.h> #include <asm/setup.h>
#include <asm/unwind.h>
#if 0 #if 0
#define DEBUGP(fmt, ...) \ #define DEBUGP(fmt, ...) \
...@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr, ...@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr,
struct module *me) struct module *me)
{ {
const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL, const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
*para = NULL; *para = NULL, *orc = NULL, *orc_ip = NULL;
char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset; char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
...@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr, ...@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr,
locks = s; locks = s;
if (!strcmp(".parainstructions", secstrings + s->sh_name)) if (!strcmp(".parainstructions", secstrings + s->sh_name))
para = s; para = s;
if (!strcmp(".orc_unwind", secstrings + s->sh_name))
orc = s;
if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
orc_ip = s;
} }
if (alt) { if (alt) {
...@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr, ...@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr,
/* make jump label nops */ /* make jump label nops */
jump_label_apply_nops(me); jump_label_apply_nops(me);
if (orc && orc_ip)
unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
(void *)orc->sh_addr, orc->sh_size);
return 0; return 0;
} }
......
...@@ -115,6 +115,7 @@ ...@@ -115,6 +115,7 @@
#include <asm/microcode.h> #include <asm/microcode.h>
#include <asm/mmu_context.h> #include <asm/mmu_context.h>
#include <asm/kaslr.h> #include <asm/kaslr.h>
#include <asm/unwind.h>
/* /*
* max_low_pfn_mapped: highest direct mapped pfn under 4GB * max_low_pfn_mapped: highest direct mapped pfn under 4GB
...@@ -1310,6 +1311,8 @@ void __init setup_arch(char **cmdline_p) ...@@ -1310,6 +1311,8 @@ void __init setup_arch(char **cmdline_p)
if (efi_enabled(EFI_BOOT)) if (efi_enabled(EFI_BOOT))
efi_apply_memmap_quirks(); efi_apply_memmap_quirks();
#endif #endif
unwind_init();
} }
#ifdef CONFIG_X86_32 #ifdef CONFIG_X86_32
......
...@@ -10,20 +10,22 @@ ...@@ -10,20 +10,22 @@
#define FRAME_HEADER_SIZE (sizeof(long) * 2) #define FRAME_HEADER_SIZE (sizeof(long) * 2)
/* unsigned long unwind_get_return_address(struct unwind_state *state)
* This disables KASAN checking when reading a value from another task's stack, {
* since the other task could be running on another CPU and could have poisoned if (unwind_done(state))
* the stack in the meantime. return 0;
*/
#define READ_ONCE_TASK_STACK(task, x) \ return __kernel_text_address(state->ip) ? state->ip : 0;
({ \ }
unsigned long val; \ EXPORT_SYMBOL_GPL(unwind_get_return_address);
if (task == current) \
val = READ_ONCE(x); \ unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
else \ {
val = READ_ONCE_NOCHECK(x); \ if (unwind_done(state))
val; \ return NULL;
})
return state->regs ? &state->regs->ip : state->bp + 1;
}
static void unwind_dump(struct unwind_state *state) static void unwind_dump(struct unwind_state *state)
{ {
...@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state) ...@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state)
} }
} }
unsigned long unwind_get_return_address(struct unwind_state *state)
{
if (unwind_done(state))
return 0;
return __kernel_text_address(state->ip) ? state->ip : 0;
}
EXPORT_SYMBOL_GPL(unwind_get_return_address);
static size_t regs_size(struct pt_regs *regs) static size_t regs_size(struct pt_regs *regs)
{ {
/* x86_32 regs from kernel mode are two words shorter: */ /* x86_32 regs from kernel mode are two words shorter: */
......
...@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state) ...@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
} }
EXPORT_SYMBOL_GPL(unwind_get_return_address); EXPORT_SYMBOL_GPL(unwind_get_return_address);
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
{
return NULL;
}
bool unwind_next_frame(struct unwind_state *state) bool unwind_next_frame(struct unwind_state *state)
{ {
struct stack_info *info = &state->stack_info; struct stack_info *info = &state->stack_info;
......
#include <linux/module.h>
#include <linux/sort.h>
#include <asm/ptrace.h>
#include <asm/stacktrace.h>
#include <asm/unwind.h>
#include <asm/orc_types.h>
#include <asm/orc_lookup.h>
#include <asm/sections.h>
#define orc_warn(fmt, ...) \
printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
extern int __start_orc_unwind_ip[];
extern int __stop_orc_unwind_ip[];
extern struct orc_entry __start_orc_unwind[];
extern struct orc_entry __stop_orc_unwind[];
static DEFINE_MUTEX(sort_mutex);
int *cur_orc_ip_table = __start_orc_unwind_ip;
struct orc_entry *cur_orc_table = __start_orc_unwind;
unsigned int lookup_num_blocks;
bool orc_init;
static inline unsigned long orc_ip(const int *ip)
{
return (unsigned long)ip + *ip;
}
static struct orc_entry *__orc_find(int *ip_table, struct orc_entry *u_table,
unsigned int num_entries, unsigned long ip)
{
int *first = ip_table;
int *last = ip_table + num_entries - 1;
int *mid = first, *found = first;
if (!num_entries)
return NULL;
/*
* Do a binary range search to find the rightmost duplicate of a given
* starting address. Some entries are section terminators which are
* "weak" entries for ensuring there are no gaps. They should be
* ignored when they conflict with a real entry.
*/
while (first <= last) {
mid = first + ((last - first) / 2);
if (orc_ip(mid) <= ip) {
found = mid;
first = mid + 1;
} else
last = mid - 1;
}
return u_table + (found - ip_table);
}
#ifdef CONFIG_MODULES
static struct orc_entry *orc_module_find(unsigned long ip)
{
struct module *mod;
mod = __module_address(ip);
if (!mod || !mod->arch.orc_unwind || !mod->arch.orc_unwind_ip)
return NULL;
return __orc_find(mod->arch.orc_unwind_ip, mod->arch.orc_unwind,
mod->arch.num_orcs, ip);
}
#else
static struct orc_entry *orc_module_find(unsigned long ip)
{
return NULL;
}
#endif
static struct orc_entry *orc_find(unsigned long ip)
{
if (!orc_init)
return NULL;
/* For non-init vmlinux addresses, use the fast lookup table: */
if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) {
unsigned int idx, start, stop;
idx = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
if (unlikely((idx >= lookup_num_blocks-1))) {
orc_warn("WARNING: bad lookup idx: idx=%u num=%u ip=%lx\n",
idx, lookup_num_blocks, ip);
return NULL;
}
start = orc_lookup[idx];
stop = orc_lookup[idx + 1] + 1;
if (unlikely((__start_orc_unwind + start >= __stop_orc_unwind) ||
(__start_orc_unwind + stop > __stop_orc_unwind))) {
orc_warn("WARNING: bad lookup value: idx=%u num=%u start=%u stop=%u ip=%lx\n",
idx, lookup_num_blocks, start, stop, ip);
return NULL;
}
return __orc_find(__start_orc_unwind_ip + start,
__start_orc_unwind + start, stop - start, ip);
}
/* vmlinux .init slow lookup: */
if (ip >= (unsigned long)_sinittext && ip < (unsigned long)_einittext)
return __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
__stop_orc_unwind_ip - __start_orc_unwind_ip, ip);
/* Module lookup: */
return orc_module_find(ip);
}
static void orc_sort_swap(void *_a, void *_b, int size)
{
struct orc_entry *orc_a, *orc_b;
struct orc_entry orc_tmp;
int *a = _a, *b = _b, tmp;
int delta = _b - _a;
/* Swap the .orc_unwind_ip entries: */
tmp = *a;
*a = *b + delta;
*b = tmp - delta;
/* Swap the corresponding .orc_unwind entries: */
orc_a = cur_orc_table + (a - cur_orc_ip_table);
orc_b = cur_orc_table + (b - cur_orc_ip_table);
orc_tmp = *orc_a;
*orc_a = *orc_b;
*orc_b = orc_tmp;
}
static int orc_sort_cmp(const void *_a, const void *_b)
{
struct orc_entry *orc_a;
const int *a = _a, *b = _b;
unsigned long a_val = orc_ip(a);
unsigned long b_val = orc_ip(b);
if (a_val > b_val)
return 1;
if (a_val < b_val)
return -1;
/*
* The "weak" section terminator entries need to always be on the left
* to ensure the lookup code skips them in favor of real entries.
* These terminator entries exist to handle any gaps created by
* whitelisted .o files which didn't get objtool generation.
*/
orc_a = cur_orc_table + (a - cur_orc_ip_table);
return orc_a->sp_reg == ORC_REG_UNDEFINED ? -1 : 1;
}
#ifdef CONFIG_MODULES
void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
void *_orc, size_t orc_size)
{
int *orc_ip = _orc_ip;
struct orc_entry *orc = _orc;
unsigned int num_entries = orc_ip_size / sizeof(int);
WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 ||
orc_size % sizeof(*orc) != 0 ||
num_entries != orc_size / sizeof(*orc));
/*
* The 'cur_orc_*' globals allow the orc_sort_swap() callback to
* associate an .orc_unwind_ip table entry with its corresponding
* .orc_unwind entry so they can both be swapped.
*/
mutex_lock(&sort_mutex);
cur_orc_ip_table = orc_ip;
cur_orc_table = orc;
sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap);
mutex_unlock(&sort_mutex);
mod->arch.orc_unwind_ip = orc_ip;
mod->arch.orc_unwind = orc;
mod->arch.num_orcs = num_entries;
}
#endif
void __init unwind_init(void)
{
size_t orc_ip_size = (void *)__stop_orc_unwind_ip - (void *)__start_orc_unwind_ip;
size_t orc_size = (void *)__stop_orc_unwind - (void *)__start_orc_unwind;
size_t num_entries = orc_ip_size / sizeof(int);
struct orc_entry *orc;
int i;
if (!num_entries || orc_ip_size % sizeof(int) != 0 ||
orc_size % sizeof(struct orc_entry) != 0 ||
num_entries != orc_size / sizeof(struct orc_entry)) {
orc_warn("WARNING: Bad or missing .orc_unwind table. Disabling unwinder.\n");
return;
}
/* Sort the .orc_unwind and .orc_unwind_ip tables: */
sort(__start_orc_unwind_ip, num_entries, sizeof(int), orc_sort_cmp,
orc_sort_swap);
/* Initialize the fast lookup table: */
lookup_num_blocks = orc_lookup_end - orc_lookup;
for (i = 0; i < lookup_num_blocks-1; i++) {
orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
num_entries,
LOOKUP_START_IP + (LOOKUP_BLOCK_SIZE * i));
if (!orc) {
orc_warn("WARNING: Corrupt .orc_unwind table. Disabling unwinder.\n");
return;
}
orc_lookup[i] = orc - __start_orc_unwind;
}
/* Initialize the ending block: */
orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind, num_entries,
LOOKUP_STOP_IP);
if (!orc) {
orc_warn("WARNING: Corrupt .orc_unwind table. Disabling unwinder.\n");
return;
}
orc_lookup[lookup_num_blocks-1] = orc - __start_orc_unwind;
orc_init = true;
}
unsigned long unwind_get_return_address(struct unwind_state *state)
{
if (unwind_done(state))
return 0;
return __kernel_text_address(state->ip) ? state->ip : 0;
}
EXPORT_SYMBOL_GPL(unwind_get_return_address);
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
{
if (unwind_done(state))
return NULL;
if (state->regs)
return &state->regs->ip;
if (state->sp)
return (unsigned long *)state->sp - 1;
return NULL;
}
static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
size_t len)
{
struct stack_info *info = &state->stack_info;
/*
* If the address isn't on the current stack, switch to the next one.
*
* We may have to traverse multiple stacks to deal with the possibility
* that info->next_sp could point to an empty stack and the address
* could be on a subsequent stack.
*/
while (!on_stack(info, (void *)addr, len))
if (get_stack_info(info->next_sp, state->task, info,
&state->stack_mask))
return false;
return true;
}
static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
unsigned long *val)
{
if (!stack_access_ok(state, addr, sizeof(long)))
return false;
*val = READ_ONCE_TASK_STACK(state->task, *(unsigned long *)addr);
return true;
}
#define REGS_SIZE (sizeof(struct pt_regs))
#define SP_OFFSET (offsetof(struct pt_regs, sp))
#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
unsigned long *ip, unsigned long *sp, bool full)
{
size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
if (IS_ENABLED(CONFIG_X86_64)) {
if (!stack_access_ok(state, addr, regs_size))
return false;
*ip = regs->ip;
*sp = regs->sp;
return true;
}
if (!stack_access_ok(state, addr, sp_offset))
return false;
*ip = regs->ip;
if (user_mode(regs)) {
if (!stack_access_ok(state, addr + sp_offset,
REGS_SIZE - SP_OFFSET))
return false;
*sp = regs->sp;
} else
*sp = (unsigned long)&regs->sp;
return true;
}
bool unwind_next_frame(struct unwind_state *state)
{
unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
enum stack_type prev_type = state->stack_info.type;
struct orc_entry *orc;
struct pt_regs *ptregs;
bool indirect = false;
if (unwind_done(state))
return false;
/* Don't let modules unload while we're reading their ORC data. */
preempt_disable();
/* Have we reached the end? */
if (state->regs && user_mode(state->regs))
goto done;
/*
* Find the orc_entry associated with the text address.
*
* Decrement call return addresses by one so they work for sibling
* calls and calls to noreturn functions.
*/
orc = orc_find(state->signal ? state->ip : state->ip - 1);
if (!orc || orc->sp_reg == ORC_REG_UNDEFINED)
goto done;
orig_ip = state->ip;
/* Find the previous frame's stack: */
switch (orc->sp_reg) {
case ORC_REG_SP:
sp = state->sp + orc->sp_offset;
break;
case ORC_REG_BP:
sp = state->bp + orc->sp_offset;
break;
case ORC_REG_SP_INDIRECT:
sp = state->sp + orc->sp_offset;
indirect = true;
break;
case ORC_REG_BP_INDIRECT:
sp = state->bp + orc->sp_offset;
indirect = true;
break;
case ORC_REG_R10:
if (!state->regs || !state->full_regs) {
orc_warn("missing regs for base reg R10 at ip %p\n",
(void *)state->ip);
goto done;
}
sp = state->regs->r10;
break;
case ORC_REG_R13:
if (!state->regs || !state->full_regs) {
orc_warn("missing regs for base reg R13 at ip %p\n",
(void *)state->ip);
goto done;
}
sp = state->regs->r13;
break;
case ORC_REG_DI:
if (!state->regs || !state->full_regs) {
orc_warn("missing regs for base reg DI at ip %p\n",
(void *)state->ip);
goto done;
}
sp = state->regs->di;
break;
case ORC_REG_DX:
if (!state->regs || !state->full_regs) {
orc_warn("missing regs for base reg DX at ip %p\n",
(void *)state->ip);
goto done;
}
sp = state->regs->dx;
break;
default:
orc_warn("unknown SP base reg %d for ip %p\n",
orc->sp_reg, (void *)state->ip);
goto done;
}
if (indirect) {
if (!deref_stack_reg(state, sp, &sp))
goto done;
}
/* Find IP, SP and possibly regs: */
switch (orc->type) {
case ORC_TYPE_CALL:
ip_p = sp - sizeof(long);
if (!deref_stack_reg(state, ip_p, &state->ip))
goto done;
state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
state->ip, (void *)ip_p);
state->sp = sp;
state->regs = NULL;
state->signal = false;
break;
case ORC_TYPE_REGS:
if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
orc_warn("can't dereference registers at %p for ip %p\n",
(void *)sp, (void *)orig_ip);
goto done;
}
state->regs = (struct pt_regs *)sp;
state->full_regs = true;
state->signal = true;
break;
case ORC_TYPE_REGS_IRET:
if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
orc_warn("can't dereference iret registers at %p for ip %p\n",
(void *)sp, (void *)orig_ip);
goto done;
}
ptregs = container_of((void *)sp, struct pt_regs, ip);
if ((unsigned long)ptregs >= prev_sp &&
on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
state->regs = ptregs;
state->full_regs = false;
} else
state->regs = NULL;
state->signal = true;
break;
default:
orc_warn("unknown .orc_unwind entry type %d\n", orc->type);
break;
}
/* Find BP: */
switch (orc->bp_reg) {
case ORC_REG_UNDEFINED:
if (state->regs && state->full_regs)
state->bp = state->regs->bp;
break;
case ORC_REG_PREV_SP:
if (!deref_stack_reg(state, sp + orc->bp_offset, &state->bp))
goto done;
break;
case ORC_REG_BP:
if (!deref_stack_reg(state, state->bp + orc->bp_offset, &state->bp))
goto done;
break;
default:
orc_warn("unknown BP base reg %d for ip %p\n",
orc->bp_reg, (void *)orig_ip);
goto done;
}
/* Prevent a recursive loop due to bad ORC data: */
if (state->stack_info.type == prev_type &&
on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
state->sp <= prev_sp) {
orc_warn("stack going in the wrong direction? ip=%p\n",
(void *)orig_ip);
goto done;
}
preempt_enable();
return true;
done:
preempt_enable();
state->stack_info.type = STACK_TYPE_UNKNOWN;
return false;
}
EXPORT_SYMBOL_GPL(unwind_next_frame);
void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
{
memset(state, 0, sizeof(*state));
state->task = task;
/*
* Refuse to unwind the stack of a task while it's executing on another
* CPU. This check is racy, but that's ok: the unwinder has other
* checks to prevent it from going off the rails.
*/
if (task_on_another_cpu(task))
goto done;
if (regs) {
if (user_mode(regs))
goto done;
state->ip = regs->ip;
state->sp = kernel_stack_pointer(regs);
state->bp = regs->bp;
state->regs = regs;
state->full_regs = true;
state->signal = true;
} else if (task == current) {
asm volatile("lea (%%rip), %0\n\t"
"mov %%rsp, %1\n\t"
"mov %%rbp, %2\n\t"
: "=r" (state->ip), "=r" (state->sp),
"=r" (state->bp));
} else {
struct inactive_task_frame *frame = (void *)task->thread.sp;
state->sp = task->thread.sp;
state->bp = READ_ONCE_NOCHECK(frame->bp);
state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
}
if (get_stack_info((unsigned long *)state->sp, state->task,
&state->stack_info, &state->stack_mask))
return;
/*
* The caller can provide the address of the first frame directly
* (first_frame) or indirectly (regs->sp) to indicate which stack frame
* to start unwinding at. Skip ahead until we reach it.
*/
/* When starting from regs, skip the regs frame: */
if (regs) {
unwind_next_frame(state);
return;
}
/* Otherwise, skip ahead to the user-specified starting frame: */
while (!unwind_done(state) &&
(!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
state->sp <= (unsigned long)first_frame))
unwind_next_frame(state);
return;
done:
state->stack_info.type = STACK_TYPE_UNKNOWN;
return;
}
EXPORT_SYMBOL_GPL(__unwind_start);
...@@ -24,6 +24,7 @@ ...@@ -24,6 +24,7 @@
#include <asm/asm-offsets.h> #include <asm/asm-offsets.h>
#include <asm/thread_info.h> #include <asm/thread_info.h>
#include <asm/page_types.h> #include <asm/page_types.h>
#include <asm/orc_lookup.h>
#include <asm/cache.h> #include <asm/cache.h>
#include <asm/boot.h> #include <asm/boot.h>
...@@ -148,6 +149,8 @@ SECTIONS ...@@ -148,6 +149,8 @@ SECTIONS
BUG_TABLE BUG_TABLE
ORC_UNWIND_TABLE
. = ALIGN(PAGE_SIZE); . = ALIGN(PAGE_SIZE);
__vvar_page = .; __vvar_page = .;
......
...@@ -680,6 +680,31 @@ ...@@ -680,6 +680,31 @@
#define BUG_TABLE #define BUG_TABLE
#endif #endif
#ifdef CONFIG_ORC_UNWINDER
#define ORC_UNWIND_TABLE \
. = ALIGN(4); \
.orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_orc_unwind_ip) = .; \
KEEP(*(.orc_unwind_ip)) \
VMLINUX_SYMBOL(__stop_orc_unwind_ip) = .; \
} \
. = ALIGN(6); \
.orc_unwind : AT(ADDR(.orc_unwind) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_orc_unwind) = .; \
KEEP(*(.orc_unwind)) \
VMLINUX_SYMBOL(__stop_orc_unwind) = .; \
} \
. = ALIGN(4); \
.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(orc_lookup) = .; \
. += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) / \
LOOKUP_BLOCK_SIZE) + 1) * 4; \
VMLINUX_SYMBOL(orc_lookup_end) = .; \
}
#else
#define ORC_UNWIND_TABLE
#endif
#ifdef CONFIG_PM_TRACE #ifdef CONFIG_PM_TRACE
#define TRACEDATA \ #define TRACEDATA \
. = ALIGN(4); \ . = ALIGN(4); \
...@@ -866,7 +891,7 @@ ...@@ -866,7 +891,7 @@
DATA_DATA \ DATA_DATA \
CONSTRUCTORS \ CONSTRUCTORS \
} \ } \
BUG_TABLE BUG_TABLE \
#define INIT_TEXT_SECTION(inittext_align) \ #define INIT_TEXT_SECTION(inittext_align) \
. = ALIGN(inittext_align); \ . = ALIGN(inittext_align); \
......
...@@ -374,6 +374,9 @@ config STACK_VALIDATION ...@@ -374,6 +374,9 @@ config STACK_VALIDATION
pointers (if CONFIG_FRAME_POINTER is enabled). This helps ensure pointers (if CONFIG_FRAME_POINTER is enabled). This helps ensure
that runtime stack traces are more reliable. that runtime stack traces are more reliable.
This is also a prerequisite for generation of ORC unwind data, which
is needed for CONFIG_ORC_UNWINDER.
For more information, see For more information, see
tools/objtool/Documentation/stack-validation.txt. tools/objtool/Documentation/stack-validation.txt.
......
...@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1) ...@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1)
__objtool_obj := $(objtree)/tools/objtool/objtool __objtool_obj := $(objtree)/tools/objtool/objtool
objtool_args = check objtool_args = $(if $(CONFIG_ORC_UNWINDER),orc generate,check)
ifndef CONFIG_FRAME_POINTER ifndef CONFIG_FRAME_POINTER
objtool_args += --no-fp objtool_args += --no-fp
endif endif
...@@ -279,6 +280,11 @@ objtool_obj = $(if $(patsubst y%,, \ ...@@ -279,6 +280,11 @@ objtool_obj = $(if $(patsubst y%,, \
endif # SKIP_STACK_VALIDATION endif # SKIP_STACK_VALIDATION
endif # CONFIG_STACK_VALIDATION endif # CONFIG_STACK_VALIDATION
# Rebuild all objects when objtool changes, or is enabled/disabled.
objtool_dep = $(objtool_obj) \
$(wildcard include/config/orc/unwinder.h \
include/config/stack/validation.h)
define rule_cc_o_c define rule_cc_o_c
$(call echo-cmd,checksrc) $(cmd_checksrc) \ $(call echo-cmd,checksrc) $(cmd_checksrc) \
$(call cmd_and_fixdep,cc_o_c) \ $(call cmd_and_fixdep,cc_o_c) \
...@@ -301,13 +307,13 @@ cmd_undef_syms = echo ...@@ -301,13 +307,13 @@ cmd_undef_syms = echo
endif endif
# Built-in and composite module parts # Built-in and composite module parts
$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
$(call cmd,force_checksrc) $(call cmd,force_checksrc)
$(call if_changed_rule,cc_o_c) $(call if_changed_rule,cc_o_c)
# Single-part modules are special since we need to mark them in $(MODVERDIR) # Single-part modules are special since we need to mark them in $(MODVERDIR)
$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE $(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
$(call cmd,force_checksrc) $(call cmd,force_checksrc)
$(call if_changed_rule,cc_o_c) $(call if_changed_rule,cc_o_c)
@{ echo $(@:.o=.ko); echo $@; \ @{ echo $(@:.o=.ko); echo $@; \
...@@ -402,7 +408,7 @@ cmd_modversions_S = \ ...@@ -402,7 +408,7 @@ cmd_modversions_S = \
endif endif
endif endif
$(obj)/%.o: $(src)/%.S $(objtool_obj) FORCE $(obj)/%.o: $(src)/%.S $(objtool_dep) FORCE
$(call if_changed_rule,as_o_S) $(call if_changed_rule,as_o_S)
targets += $(real-objs-y) $(real-objs-m) $(lib-y) targets += $(real-objs-y) $(real-objs-m) $(lib-y)
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment