implement custom C++ unwinder

Commit acc52c0e authored May 22, 2015 by Michael Arntzenius
8 changed files
--- a/docs/EXCEPTION-SAFETY.md
+++ b/docs/EXCEPTION-SAFETY.md
+# Using exceptions safely in Pyston
+In addition to following general best practices for writing exception-safe C++, when writing Pyston there are a few special rules (because it has a custom unwinder):
+1. **Only throw `ExcInfo` values.** All Pyston exceptions are of type `ExcInfo`, which represents a Python exception. In fact, usually you should never `throw`; instead, call `raiseRaw`, `raiseExc`, `raise3`, or similar.
+2. **Always catch by value.** That is, always write:
+   ```c++
+   try { ... } catch (ExcInfo e) { ... } // Do this!
+   ```
+   And **never** write:
+   ```c++
+   try { ... } catch (ExcInfo& e) { ... } // DO NOT DO THIS!
+   ```
+   The reason for this has to do with the way exceptions are stored in thread-local storage in Pyston; see `docs/UNWINDING.md` for the gory details.
+3. **Never rethrow with bare `throw;`.** Instead, write `throw e;`, where `e` is the exception you caught previously.
+4. **Never invoke the GC from a destructor.** The GC is not currently aware of the place the exception-currently-being-unwound is stored. Invoking the GC from a destructor might collect the exception, producing a use-after-free bug!
+5. **Never throw an exception inside a destructor.** This is a general rule in C++ anyways, but worth reiterating here. In fact, don't even invoke code that *throws an exception but handles it*! This, again, has to do with the way exceptions are stored.
+6. **Don't throw exceptions inside signal handlers.** It should be okay if you throw an exception and *always* catch it inside the handler, but I haven't tested this. In theory the exception should just unwind through the signal frame, and libunwind will take care of resetting the signal mask. However, as this codepath hasn't been tested, it's best avoided.
+Most of these restrictions could be eliminated in principle. See `docs/UNWINDING.md` for the gory details.
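Rules 2 and 3 can be illustrated with a minimal sketch in plain C++. The names `ExcInfoSketch`, `advance`, `next_or_default`, and `propagated_type` are hypothetical stand-ins, not Pyston APIs; the shape loosely follows the `next()` change in `builtins.cpp` from this commit.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for Pyston's ExcInfo: a small, copyable value type.
struct ExcInfoSketch {
    std::string type;
    std::string value;
    bool matches(const std::string& t) const { return type == t; }
};

// Hypothetical callee that raises a Python-style exception.
inline int advance(bool exhausted, bool broken) {
    if (broken)
        throw ExcInfoSketch{"ValueError", "broken iterator"};
    if (exhausted)
        throw ExcInfoSketch{"StopIteration", ""};
    return 42;
}

inline int next_or_default(bool exhausted, bool broken, int dflt) {
    try {
        return advance(exhausted, broken);
    } catch (ExcInfoSketch e) { // rule 2: catch by value
        if (e.matches("StopIteration"))
            return dflt;
        throw e; // rule 3: rethrow the caught copy, never a bare `throw;`
    }
}

// Helper for checking that a non-matching exception still propagates.
inline std::string propagated_type() {
    try {
        (void)next_or_default(false, true, 0);
    } catch (ExcInfoSketch e) {
        return e.type;
    }
    return "";
}
```

Under the standard C++ unwinder both `throw;` and `throw e;` would work here; the sketch only shows the form the rules above require.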
--- a/docs/UNWINDING.md
+++ b/docs/UNWINDING.md
+# The Pyston Unwinder
+Pyston uses a custom exception unwinder, replacing the general-purpose C++ unwinder provided by `libstdc++` and `libgcc`. We do this for two reasons:
+1. **Efficiency**. The default clang/gcc C++ unwinder is slow, because it needs to support features we don't need (such as two-phase unwinding and multiple exception types) and because it isn't optimized for speed (C++ assumes exceptions are uncommon).
+2. **Customizability**. For example, Python handles backtraces differently than C++ does; with a custom unwinder, we can support Python-style backtraces more easily.
+The custom unwinder is in `src/runtime/cxx_unwind.cpp`.
+### Useful references on C++ exception handling
+- <https://monoinfinito.wordpress.com/series/exception-handling-in-c/>: Good overview of C++ exceptions.
+- <http://www.airs.com/blog/archives/460>: Covers dirty details of `.eh_frame`.
+- <http://www.airs.com/blog/archives/464>: Covers dirty details of the personality function and the LSDA.
+# How normal C++ unwinding works
+The big picture is that when an exception is thrown, we walk the stack *twice*:
+1. In the first phase, we look for a `catch`-block whose type matches the thrown exception. If we don't find one, we terminate the process.
+2. In the second phase, we unwind up to the `catch`-block we found; along the way we run any intervening `finally` blocks or RAII destructors.
+The purpose of the two-phase search is to make sure that *exceptions that won't be caught terminate the process immediately with a full stack-trace*. In Pyston we don't care about this --- stack traces work differently for us anyway.
+## How normal C++ unwinding works, in detail
+### Throwing
+C++ `throw` statements are translated into a pair of function calls:
+1. A call to `void *__cxxabiv1::__cxa_allocate_exception(size_t)` allocates space for an exception of the given size.
+2. A call to `void __cxxabiv1::__cxa_throw(void *exc_obj, std::type_info *type_info, void (*dtor)(void*))` invokes the stack unwinder. `exc_obj` is the exception to be thrown; `type_info` is the RTTI for the exception's class, and `dtor` is a callback that (I think) is called to destroy the exception object.
+These functions (and others in the `__cxxabiv1` namespace) are defined in `libstdc++`. `__cxa_throw` invokes the generic (non-C++-specific) unwinder by calling `_Unwind_RaiseException()`. This function (like the others prefixed with `_Unwind`) is defined in `libgcc`. The details of the libgcc unwinder's interface are less important, and I omit them here.
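The lowering described above can be reproduced by hand. This is a sketch only, assuming an Itanium-ABI toolchain (GCC or clang on Linux) where `<cxxabi.h>` declares `abi::__cxa_allocate_exception` and `abi::__cxa_throw`. `Payload`, `manual_throw`, and `catch_code` are hypothetical names, and a real compiler emits extra code (for example, to free the allocation if constructing the exception object throws).

```cpp
#include <cassert>
#include <cxxabi.h>  // abi::__cxa_allocate_exception, abi::__cxa_throw
#include <new>       // placement new
#include <typeinfo>  // typeid

// Hypothetical exception payload; trivially destructible, so no dtor callback is needed.
struct Payload {
    int code;
};

// Roughly what `throw Payload{code};` turns into.
[[noreturn]] inline void manual_throw(int code) {
    // Step 1: allocate space for the exception object.
    void* mem = abi::__cxa_allocate_exception(sizeof(Payload));
    // Construct the exception object in that space.
    new (mem) Payload{code};
    // Step 2: hand the object, its RTTI, and a (null) destructor callback to the unwinder.
    abi::__cxa_throw(mem, const_cast<std::type_info*>(&typeid(Payload)), nullptr);
}

// An ordinary catch block receives the hand-thrown object like any other exception.
inline int catch_code(int code) {
    try {
        manual_throw(code);
    } catch (Payload p) {
        return p.code;
    }
    return -1; // unreachable: manual_throw never returns
}
```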
+### Unwinding and .eh_frame
+The libgcc unwinder walks the call frame stack, looking up debug information about each function it unwinds through. It finds the debug information by searching for the instruction pointer that would be returned-to in a list of tables; one table for each loaded object (in the linker-and-loader sense of "object", i.e. executable file or shared library). For a given object, the debug info is in a section called `.eh_frame`. See [this blog post](http://www.airs.com/blog/archives/460) for more on the format of `.eh_frame`.
+In particular, the unwinder checks whether the function has an associated "personality function", and calls it if it does. If there's no personality function, unwinding continues as normal. C functions do not have personality functions. C++ functions have the personality function `__gxx_personality_v0`, or (if they don't involve exceptions or RAII at all) no personality function.
+The job of the personality function is to:
+1. Determine what action, if any, needs to happen when unwinding this exception through this frame.
+2. If we are in Phase 1, or if there is no action to be taken, report this information to the caller.
+3. If we are in Phase 2, actually take the relevant action: jump into the relevant cleanup code, `finally`, or `catch` block. In this case, the personality function does not return.
+### The LSDA, landing pads and switch values: how the personality function works
+The personality function determines what to do by comparing the instruction pointer being unwound through against C++-specific unwinding information. This is contained in the LSDA (Language-Specific Data Area), which the function's `.eh_frame` entry points to (gcc and clang emit the LSDA itself in the `.gcc_except_table` section). See [this blog post](http://www.airs.com/blog/archives/464) for a detailed run-down.
+If the personality function finds a "special" action to perform when unwinding, it is associated with two values:
+- The *landing pad*, a code address, determined by the instruction pointer value.
+- The *switch value*, an `int64_t`. This is *zero* if we're running cleanup code (RAII destructors or a `finally` block); otherwise it is an index that indicates *which* `catch` block we've matched (since there may be several `catch` blocks covering the code region we're unwinding through).
+If we're in phase 2, the personality function then jumps to the landing pad, after (a) restoring execution state for this call frame and (b) storing the exception object pointer and the switch value in specific registers (`RAX` and `RDX` respectively). The code at the landing pad is emitted by the C++ compiler as part of the function being unwound through, and it dispatches on the switch value to determine what code to actually run.
+It dispatches to code in one of two flavors: *cleanup code* (`finally` blocks and RAII destructors), or *handler code* (`catch` blocks).
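As a rough model of the lookup described in this section, the sketch below maps an instruction pointer to one of the possible outcomes. It is a simplification under stated assumptions: `CallSiteSketch`, `Action`, and `classify` are hypothetical names, and the switch value is stored directly in each record, whereas the real LSDA derives it from a separate action table.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified call-site record.
struct CallSiteSketch {
    uintptr_t start;       // first instruction address covered
    uintptr_t length;      // number of bytes covered
    uintptr_t landing_pad; // 0 means "no landing pad"
    int64_t switch_value;  // 0 = cleanup, nonzero = which catch block matched
};

enum class Action { Terminate, KeepUnwinding, RunCleanup, RunHandler };

// `table` is assumed to be sorted by `start`, as the call-site table is.
inline Action classify(const std::vector<CallSiteSketch>& table, uintptr_t ip) {
    for (const CallSiteSketch& cs : table) {
        if (ip < cs.start)
            break; // sorted table: no later entry can cover ip
        if (ip < cs.start + cs.length) {
            if (cs.landing_pad == 0)
                return Action::KeepUnwinding; // exception expected here, nothing to run
            return cs.switch_value == 0 ? Action::RunCleanup : Action::RunHandler;
        }
    }
    // No entry covers ip: an exception was not expected at this call site.
    return Action::Terminate;
}
```

In phase 2 the real personality function would then jump to `landing_pad` with the exception pointer in `RAX` and the switch value in `RDX`; the model stops at the decision.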
+#### Cleanup code (`finally`/RAII)
+Cleanup code does what you'd expect: calls the appropriate destructors and/or runs the code in the appropriate `finally` block. It may also call `__cxa_end_catch()`, if we are unwinding out of a catch block - think of `__cxa_begin_catch()` and `__cxa_end_catch()` as like RAII constructor/destructor pairs; the latter is guaranteed to get called when leaving a catch block, whether normally or by exception.
+After this is done, it calls `_Unwind_Resume()` to resume unwinding, passing it the exception object pointer that it received in `RAX` when the personality function jumped to the landing pad.
+#### Handler code (`catch`)
+Handler code, first of all, may *also* call RAII destructors or other cleanup code if necessary. After that, it *may* call `__cxa_get_exception_ptr` with the exception object pointer. I'm not sure why it does this, but it expects `__cxa_get_exception_ptr` to also *return* a pointer to the exception object, so it's effectively a no-op. (I think in a normal C++ unwinder maybe there's an exception *header* as well, and some pointer arithmetic going on, so that the pointer passed in `RAX` to the landing pad and the exception object itself are different?)
+After this, it calls `__cxa_begin_catch()` with the exception object pointer. Again, `__cxa_begin_catch()` is expected to return the exception object pointer, so in Pyston this is basically a no-op. (Again, maybe there's some funky pointer arithmetic going on in regular C++ unwinding - I'm not sure.)
+Then, *if* the exception is caught by-value (`catch (ExcInfo e)`) rather than by-reference (`catch (ExcInfo& e)`) - and Pyston must *always* catch by value - it copies the exception object onto the stack.
+Then it runs the code inside the catch block, like you'd expect.
+Finally, it calls `__cxa_end_catch()` (which takes no arguments). In regular C++ this destroys the current exception if appropriate. (It grabs the exception out of some thread-specific data structure that I don't fully understand.)
+# How our unwinder is different
+We use `libunwind` to deal with a lot of the tedious gruntwork (restoring register state, etc.) of unwinding.
+First, we dispense with two-phase unwinding. It's slow and Python tracebacks work differently anyway. (Currently we grab tracebacks before we start unwinding; in the future, we ought to generate them incrementally *as* we unwind.)
+Second, we allocate exceptions using a thread-local variable, rather than `malloc()`. By ensuring that only one exception is ever active on a given thread at a given time, this lets us be more efficient. However, we have not measured the performance improvement here; it may be negligible.
+Third, when unwinding, we only check whether a function *has* a personality function. If it does, we assert that it is `__gxx_personality_v0`, but we *do not call it*. Instead, we run our own custom dispatch code. We do this because:
+1. One argument to the personality function is the current unwind context, in a `libgcc`-specific format. libunwind uses a different format, so we *can't* call it.
+2. It avoids an unnecessary indirect call.
+3. The personality function checks the exception's type against `catch`-block types. All Pyston exceptions have the same type, so this is unnecessary.
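The thread-local allocation scheme is also why handlers must catch by value. The sketch below models only the storage, not the control flow of unwinding. `ExcSketch`, `ferry`, and `raise_into_ferry` are hypothetical names, loosely modelled on the `exception_ferry` variable in `src/runtime/cxx_unwind.cpp`.

```cpp
#include <cassert>

// Hypothetical stand-in for the exception payload.
struct ExcSketch {
    int type;
    int value;
};

// One slot per thread: at most one exception is in flight at a time.
inline ExcSketch& ferry() {
    thread_local ExcSketch slot{0, 0};
    return slot;
}

// Models "allocate and throw": overwrite the slot and hand back its address.
inline ExcSketch* raise_into_ferry(int type, int value) {
    ferry() = ExcSketch{type, value};
    return &ferry();
}

// catch (ExcInfo e): the handler gets its own copy, so a later raise cannot disturb it.
inline bool by_value_copy_survives() {
    ExcSketch* in_flight = raise_into_ferry(1, 10);
    ExcSketch copy = *in_flight;
    raise_into_ferry(2, 20); // a later raise reuses the same slot
    return copy.type == 1 && copy.value == 10;
}

// catch (ExcInfo& e): the handler aliases the slot, so a later raise clobbers it.
inline bool by_reference_is_clobbered() {
    ExcSketch* in_flight = raise_into_ferry(1, 10);
    ExcSketch& ref = *in_flight;
    raise_into_ferry(2, 20);
    return ref.type == 2 && ref.value == 20;
}
```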
+## Functions we override
+- `std::terminate`
+- `__gxx_personality_v0`: stubbed out, should never be called
+- `_Unwind_Resume`
+- `__cxxabiv1::__cxa_allocate_exception`
+- `__cxxabiv1::__cxa_begin_catch`
+- `__cxxabiv1::__cxa_end_catch`
+- `__cxxabiv1::__cxa_throw`
+- `__cxxabiv1::__cxa_rethrow`: stubbed out, we never rethrow directly
+- `__cxxabiv1::__cxa_get_exception_ptr`
+# Future work
+## Incremental traceback generation
+Python tracebacks include only the area of the stack between where the exception was originally raised and where it gets caught. Currently we generate tracebacks (via `getTraceback`) using `unwindPythonStack()` in `src/codegen/unwinding.cpp`, which unwinds the whole stack at once.
+Instead we ought to generate them *as we unwind*. This should be a straightforward matter of taking the code in `unwindPythonStack` and integrating it into `unwind_loop` (in `src/runtime/cxx_unwind.cpp`), so that we keep a "current traceback" object that we update as we unwind the stack and discover Python frames.
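A toy model of the incremental approach, assuming a simplified stack representation: `FrameSketch` and `unwind_collecting` are hypothetical and do not correspond to existing Pyston code. The traceback grows as each frame is unwound and stops at the frame that handles the exception.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one stack frame.
struct FrameSketch {
    std::string name;
    bool is_python;   // only Python frames appear in a traceback
    bool has_handler; // this frame catches the exception
};

// stack[0] is the innermost frame (where the exception was raised).
// Walk outward, recording Python frames, and stop at the first handler.
inline std::vector<std::string> unwind_collecting(const std::vector<FrameSketch>& stack) {
    std::vector<std::string> traceback;
    for (const FrameSketch& frame : stack) {
        if (frame.is_python)
            traceback.push_back(frame.name);
        if (frame.has_handler)
            break; // frames beyond the handler are never visited
    }
    return traceback;
}
```

Frames outside the handler are never touched, which is the saving over walking the whole stack up front.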
+## Binary search in libunwind
+Libunwind, like libgcc, keeps a linked list of objects (executables, shared libraries) to search for debug info. Since it's a linked list, if it's very long we can't find debug info efficiently; a better way would be to keep an array sorted by the start address of the object (since objects are non-overlapping). This comes up in practice because LLVM JITs each function as a separate object.
+libunwind's linked list is updated in `_U_dyn_register` (in `libunwind/src/mi/dyn-register.c`) and scanned in `local_find_proc_info` (in `libunwind/src/mi/Gfind_dynamic_proc_info.c`) (and possibly elsewhere).
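The sorted-array idea could look roughly like the following. This is a standalone sketch, not libunwind code: `ObjectRange` and `find_object` are hypothetical, and `id` stands in for whatever pointer to unwind info the real table would hold.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record for one loaded object (executable, shared library, JITted function).
struct ObjectRange {
    uintptr_t start; // inclusive
    uintptr_t end;   // exclusive
    int id;          // stand-in for a pointer to the object's unwind info
};

// Requires `sorted` to be ordered by `start` with non-overlapping ranges.
// Returns the object containing `ip`, or nullptr if none does. O(log n).
inline const ObjectRange* find_object(const std::vector<ObjectRange>& sorted, uintptr_t ip) {
    // First range whose start is strictly greater than ip.
    auto it = std::upper_bound(sorted.begin(), sorted.end(), ip,
                               [](uintptr_t addr, const ObjectRange& r) { return addr < r.start; });
    if (it == sorted.begin())
        return nullptr; // ip lies before every object
    --it;               // last range starting at or before ip
    return ip < it->end ? &*it : nullptr;
}
```

Registration becomes more expensive than a list prepend, since the array must stay sorted, so whether this is a net win depends on how lookups compare to registrations in practice.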
+## GC awareness
+Currently we store exceptions-being-unwound in a thread-local variable, `pyston::exception_ferry` (in `src/runtime/cxx_unwind.cpp`). This is invisible to the GC. This *should* be fine, since this variable is only relevant during unwinding, and unwinding *should not* trigger the GC. `catch`-block code might, but as long as we catch by-value (`catch (ExcInfo e)` rather than `catch (ExcInfo& e)`), the relevant pointers will be copied to our stack (thus GC-visible) before any catch-block code is run. The only other problem is if *destructors* can cause GC, since destructors *are* called during unwinding and there's nothing we can do about that. So don't do that!
+It wouldn't be too hard to make the GC aware of `pyston::exception_ferry`. We could either:
+- add code to the GC that regards `pyston::exception_ferry` as a source of roots, OR
+- store the exception ferry in `cur_thread_state` instead of its own variable, and update `ThreadStateInternal::accept`
+HOWEVER, there's a problem: if we do this, we need to *zero out* the exception ferry at the appropriate time (to avoid keeping an exception alive after it ought to be garbage), and this is harder than it seems. We can't zero it out in `__cxa_begin_catch`, because it's only *after* `__cxa_begin_catch` returns that the exception is copied to the stack. We can't zero it in `__cxa_end_catch`, because `__cxa_end_catch` is called *even if exiting a catch block due to an exception*, so we'd wipe an exception that we actually wanted to propagate!
+So this is tricky.
+## Decrementing IC counts when unwinding through ICs
+To do this, we need some way to tell when we're unwinding through an IC. Keeping a global map from instruction-ranges to IC information should suffice. Then we just check and update this map inside of `unwind_loop`. This might slow us down a bit, but it's probably negligible; worth measuring, though.
+Alternatively, there might be some way to use the existing cleanup-code support in the unwinder to do this. That would involve generating EH-frames on the fly, but we already do this! So probably we'd just need to generate more complicated EH frames.
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -84,6 +84,7 @@ add_library(PYSTON_OBJECTS OBJECT ${OPTIONAL_SRCS}
 		runtime/code.cpp
 		runtime/complex.cpp
 		runtime/ctxswitching.S
+		runtime/cxx_unwind.cpp
 		runtime/descr.cpp
 		runtime/dict.cpp
 		runtime/file.cpp

--- a/src/runtime/builtin_modules/builtins.cpp
+++ b/src/runtime/builtin_modules/builtins.cpp
@@ -213,7 +213,7 @@ extern "C" Box* next(Box* iterator, Box* _default) {
    } catch (ExcInfo e) {
        if (_default && e.matches(StopIteration))
            return _default;
-        throw;
+        throw e;
    }
 }

--- a/src/runtime/cxx_unwind.cpp
+++ b/src/runtime/cxx_unwind.cpp
+// Copyright (c) 2014-2015 Dropbox, Inc.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//    http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+#include <cstdlib>
+#include <dlfcn.h> // dladdr
+#include <stddef.h>
+#include <stdint.h>
+#include <unwind.h>
+#include "llvm/Support/LEB128.h" // for {U,S}LEB128 decoding
+#include "codegen/ast_interpreter.h" // interpreter_instr_addr
+#include "codegen/unwinding.h"       // getCFForAddress
+#include "core/stats.h"              // StatCounter
+#include "core/types.h"              // for ExcInfo
+#include "core/util.h"               // Timer
+#include "runtime/generator.h"       // generatorEntry
+#define UNW_LOCAL_ONLY
+#include <libunwind.h>
+#define PYSTON_CUSTOM_UNWINDER 1 // set to 0 to use C++ unwinder
+#define NORETURN __attribute__((__noreturn__))
+// canary used in ExcData in debug mode to catch exception-value corruption.
+#define CANARY_VALUE 0xdeadbeef
+// An action of 0 in the LSDA action table indicates cleanup.
+#define CLEANUP_ACTION 0
+// Dwarf encoding modes.
+#define DW_EH_PE_absptr 0x00
+#define DW_EH_PE_omit 0xff
+#define DW_EH_PE_uleb128 0x01
+#define DW_EH_PE_udata2 0x02
+#define DW_EH_PE_udata4 0x03
+#define DW_EH_PE_udata8 0x04
+#define DW_EH_PE_sleb128 0x09
+#define DW_EH_PE_sdata2 0x0A
+#define DW_EH_PE_sdata4 0x0B
+#define DW_EH_PE_sdata8 0x0C
+#define DW_EH_PE_signed 0x08
+#define DW_EH_PE_pcrel 0x10
+#define DW_EH_PE_textrel 0x20
+#define DW_EH_PE_datarel 0x30
+#define DW_EH_PE_funcrel 0x40
+#define DW_EH_PE_aligned 0x50
+#define DW_EH_PE_indirect 0x80
+// end dwarf encoding modes
+extern "C" void __gxx_personality_v0(); // wrong type signature, but that's ok, it's extern "C"
+// check(EXPR) is like assert((EXPR) == 0), but evaluates EXPR even in release mode (when NDEBUG disables assert).
+template <typename T> static inline void check(T x) {
+    assert(x == 0);
+}
+namespace pyston {
+struct ExcData;
+extern thread_local ExcData exception_ferry;
+struct ExcData {
+    ExcInfo exc;
+#ifndef NDEBUG
+    unsigned canary = CANARY_VALUE;
+#endif
+    ExcData() : exc(nullptr, nullptr, nullptr) {}
+    ExcData(ExcInfo e) : exc(e) {}
+    ExcData(Box* type, Box* value, Box* traceback) : exc(type, value, traceback) {}
+    void check() const {
+        assert(this);
+        assert(canary == CANARY_VALUE);
+        assert(exc.type && exc.value && exc.traceback);
+        assert(gc::isValidGCObject(exc.type) && gc::isValidGCObject(exc.value) && gc::isValidGCObject(exc.traceback));
+        assert(this == &exception_ferry);
+    }
+};
+thread_local ExcData exception_ferry;
+static_assert(offsetof(ExcData, exc) == 0, "wrong offset");
+// Timer that auto-logs.
+struct LogTimer {
+    StatCounter& counter;
+    Timer timer;
+    LogTimer(const char* desc, StatCounter& ctr, long min_usec = -1) : counter(ctr), timer(desc, min_usec) {}
+    ~LogTimer() { counter.log(timer.end()); }
+};
+static StatCounter us_unwind_loop("us_unwind_loop");
+static StatCounter us_unwind_resume_catch("us_unwind_resume_catch");
+static StatCounter us_unwind_cleanup("us_unwind_cleanup");
+static StatCounter us_unwind_get_proc_info("us_unwind_get_proc_info");
+static StatCounter us_unwind_step("us_unwind_step");
+static StatCounter us_unwind_find_call_site_entry("us_unwind_find_call_site_entry");
+// do these need to be separate timers? might as well
+static thread_local Timer per_thread_resume_catch_timer(-1);
+static thread_local Timer per_thread_cleanup_timer(-1);
+#ifndef NDEBUG
+static __thread bool in_cleanup_code = false;
+#endif
+extern "C" {
+static NORETURN void panic(void) {
+    RELEASE_ASSERT(0, "pyston::panic() called!");
+}
+// Highly useful resource: http://www.airs.com/blog/archives/464
+// talks about DWARF LSDA parsing with respect to C++ exception-handling
+struct lsda_info_t {
+    // base which landing pad offsets are relative to
+    const uint8_t* landing_pad_base;
+    const uint8_t* type_table;
+    const uint8_t* call_site_table;
+    const uint8_t* action_table;
+    uint8_t type_table_entry_encoding;      // a DW_EH_PE_xxx value
+    uint8_t call_site_table_entry_encoding; // a DW_EH_PE_xxx value
+};
+struct call_site_entry_t {
+    const uint8_t* instrs_start;
+    size_t instrs_len_bytes;
+    const uint8_t* landing_pad; // may be NULL if no landing pad
+    // "plus one" so that 0 can mean "no action". offset is in bytes.
+    size_t action_offset_plus_one;
+};
+// ---------- Parsing stuff ----------
+static inline void parse_lsda_header(const unw_proc_info_t* pip, lsda_info_t* info) {
+    const uint8_t* ptr = (const uint8_t*)pip->lsda;
+    // 1. Read the landing pad base pointer.
+    uint8_t landing_pad_base_encoding = *ptr++;
+    if (landing_pad_base_encoding == DW_EH_PE_omit) {
+        // The common case is to omit. Then the landing pad base is _Unwind_GetRegionStart(context), which is the
+        // start of the function.
+        info->landing_pad_base = (const uint8_t*)pip->start_ip;
+    } else {
+        RELEASE_ASSERT(0, "we only support omitting the landing pad base");
+    }
+    // 2. Read the type table encoding & base pointer.
+    info->type_table_entry_encoding = *ptr++;
+    if (info->type_table_entry_encoding != DW_EH_PE_omit) {
+        // read ULEB128-formatted byte offset from THIS FIELD to the start of the types table.
+        unsigned uleb_size;
+        uint64_t offset = llvm::decodeULEB128(ptr, &uleb_size);
+        // We don't use the type table, and I'm not sure this calculation is correct - it might be an offset from a
+        // different base, I should use gdb to check it against libgcc. So I've set it to nullptr instead.
+        info->type_table = nullptr;
+        // info->type_table = ptr + offset; // <- The calculation I'm not sure of.
+        ptr += uleb_size;
+    } else { // type table omitted
+        info->type_table = nullptr;
+    }
+    // 3. Read the call-site encoding & base pointer.
+    info->call_site_table_entry_encoding = *ptr++;
+    unsigned uleb_size;
+    size_t call_site_table_nbytes = llvm::decodeULEB128(ptr, &uleb_size);
+    ptr += uleb_size;
+    // The call site table follows immediately after the header.
+    info->call_site_table = ptr;
+    // The action table follows immediately after the call site table.
+    info->action_table = ptr + call_site_table_nbytes;
+    assert(info->landing_pad_base);
+    assert(info->call_site_table);
+    assert(info->action_table);
+}
+static inline const uint8_t* parse_call_site_entry(const uint8_t* ptr, const lsda_info_t* info,
+                                                   call_site_entry_t* entry) {
+    size_t instrs_start_offset, instrs_len_bytes, landing_pad_offset, action_offset_plus_one;
+    // clang++ recently changed from always doing udata4 here to using uleb128, so we support both
+    unsigned uleb_size;
+    if (DW_EH_PE_uleb128 == info->call_site_table_entry_encoding) {
+        instrs_start_offset = llvm::decodeULEB128(ptr, &uleb_size);
+        ptr += uleb_size;
+        instrs_len_bytes = llvm::decodeULEB128(ptr, &uleb_size);
+        ptr += uleb_size;
+        landing_pad_offset = llvm::decodeULEB128(ptr, &uleb_size);
+        ptr += uleb_size;
+    } else if (DW_EH_PE_udata4 == info->call_site_table_entry_encoding) {
+        // offsets are from landing pad base
+        instrs_start_offset = (size_t) * (const uint32_t*)ptr;
+        instrs_len_bytes = (size_t) * (const uint32_t*)(ptr + 4);
+        landing_pad_offset = (size_t) * (const uint32_t*)(ptr + 8);
+        ptr += 12;
+    } else {
+        RELEASE_ASSERT(0, "expected call site table entries to use DW_EH_PE_udata4 or DW_EH_PE_uleb128");
+    }
+    // action offset (plus one) is always a ULEB128
+    action_offset_plus_one = llvm::decodeULEB128(ptr, &uleb_size);
+    ptr += uleb_size;
+    entry->instrs_start = info->landing_pad_base + instrs_start_offset;
+    entry->instrs_len_bytes = instrs_len_bytes;
+    if (0 == landing_pad_offset) {
+        // An offset of 0 is special and indicates "no landing pad", i.e. this call site does not handle exceptions or
+        // perform any cleanup. (The call site entry is still necessary to indicate that it is *expected* that an
+        // exception could be thrown here, and that unwinding should proceed; if the entry were absent, we'd call
+        // std::terminate().)
+        entry->landing_pad = nullptr;
+    } else {
+        entry->landing_pad = info->landing_pad_base + landing_pad_offset;
+    }
+    entry->action_offset_plus_one = action_offset_plus_one;
+    return ptr;
+}
+static inline const uint8_t* first_action(const lsda_info_t* info, const call_site_entry_t* entry) {
+    if (!entry->action_offset_plus_one)
+        return nullptr;
+    return info->action_table + entry->action_offset_plus_one - 1;
+}
+// Returns pointer to next action, or NULL if no next action.
+// Stores type filter into `*type_filter', stores number of bytes read into `*num_bytes', unless it is null.
+static inline const uint8_t* next_action(const uint8_t* action_ptr, int64_t* type_filter,
+                                         unsigned* num_bytes = nullptr) {
+    assert(type_filter);
+    unsigned leb_size, total_size;
+    *type_filter = llvm::decodeSLEB128(action_ptr, &leb_size);
+    action_ptr += leb_size;
+    total_size = leb_size;
+    intptr_t offset_to_next_entry = llvm::decodeSLEB128(action_ptr, &leb_size);
+    total_size += leb_size;
+    if (num_bytes) {
+        *num_bytes = total_size;
+    }
+    // an offset of 0 ends the action-chain.
+    return offset_to_next_entry ? action_ptr + offset_to_next_entry : nullptr;
+}
+// ---------- Printing things for debugging purposes ----------
+static void print_lsda(const lsda_info_t* info) {
+    size_t action_table_min_len_bytes = 0;
+    // Print call site table.
+    printf("Call site table:\n");
+    const uint8_t* p = info->call_site_table;
+    assert(p);
+    while (p < info->action_table) { // the call site table ends where the action table begins
+        call_site_entry_t entry;
+        p = parse_call_site_entry(p, info, &entry);
+        printf("  start %p end %p landingpad %p action-plus-one %lx\n", entry.instrs_start,
+               entry.instrs_start + entry.instrs_len_bytes, entry.landing_pad, entry.action_offset_plus_one);
+        // Follow the action chain.
+        for (const uint8_t* action_ptr = first_action(info, &entry); action_ptr;) {
+            RELEASE_ASSERT(action_ptr >= info->action_table, "malformed LSDA");
+            ptrdiff_t offset = action_ptr - info->action_table;
+            // add one to indicate that there is an entry here. (consider the case of an empty table, for example.)
+            // would be nicer to set action_table_min_len_bytes to the end of the entry, but that involves uleb-size
+            // arithmetic.
+            if (offset + 1 > action_table_min_len_bytes)
+                action_table_min_len_bytes = offset + 1;
+            int64_t type_filter;
+            action_ptr = next_action(action_ptr, &type_filter);
+            if (action_ptr)
+                printf("    %ld: filter %ld  next %ld\n", offset, type_filter, action_ptr - info->action_table);
+            else
+                printf("    %ld: filter %ld  end\n", offset, type_filter);
+        }
+    }
+    // Print the action table.
+    printf("Action table:\n");
+    RELEASE_ASSERT(p == info->action_table, "malformed LSDA");
+    while (p < info->action_table + action_table_min_len_bytes) {
+        assert(p);
+        ptrdiff_t offset = p - info->action_table;
+        unsigned num_bytes;
+        int64_t type_filter;
+        const uint8_t* next = next_action(p, &type_filter, &num_bytes);
+        p += num_bytes;
+        if (next)
+            printf("  %ld: filter %ld  next %ld\n", offset, type_filter, next - info->action_table);
+        else
+            printf("  %ld: filter %ld  end\n", offset, type_filter);
+    }
+}
+// FIXME: duplicated from unwinding.cpp
+static unw_word_t getFunctionEnd(unw_word_t ip) {
+    unw_proc_info_t pip;
+    // where is the documentation for unw_get_proc_info_by_ip, anyway?
+    int ret = unw_get_proc_info_by_ip(unw_local_addr_space, ip, &pip, NULL);
+    RELEASE_ASSERT(ret == 0 && pip.end_ip, "");
+    return pip.end_ip;
+}
+static void print_frame(unw_cursor_t* cursor, const unw_proc_info_t* pip) {
+    // FIXME: code duplication with PythonFrameIter::incr
+    static unw_word_t interpreter_instr_end = getFunctionEnd((unw_word_t)interpreter_instr_addr);
+    static unw_word_t generator_entry_end = getFunctionEnd((unw_word_t)generatorEntry);
+    unw_word_t ip, bp;
+    check(unw_get_reg(cursor, UNW_REG_IP, &ip));
+    check(unw_get_reg(cursor, UNW_TDEP_BP, &bp));
+    // NB. unw_get_proc_name is MUCH slower than dladdr for getting the names of functions, but it gets the names of
+    // more functions. However, it also has a bug that pops up when used on JITted functions, so we use dladdr for now.
+    // (I've put an assert in libunwind that'll catch, but not fix, the bug.) - rntz
+    // {
+    //     char name[500];
+    //     unw_word_t off;
+    //     int err = unw_get_proc_name(cursor, name, 500, &off);
+    //     // ENOMEM means name didn't fit in buffer, so it was truncated. We're okay with that.
+    //     RELEASE_ASSERT(!err || err == -UNW_ENOMEM || err == -UNW_ENOINFO, "unw_get_proc_name errored");
+    //     if (err != -UNW_ENOINFO) {
+    //         printf(strnlen(name, 500) < 50 ? "  %-50s" : "  %s\n", name);
+    //     } else {
+    //         printf("  %-50s", "? (no info)");
+    //     }
+    // }
+    {
+        Dl_info dl_info;
+        if (dladdr((void*)ip, &dl_info)) { // returns non-zero on success, zero on failure
+            if (!dl_info.dli_sname || strlen(dl_info.dli_sname) < 50)
+                printf("  %-50s", dl_info.dli_sname ? dl_info.dli_sname : "(unnamed)");
+            else
+                printf("  %s\n", dl_info.dli_sname);
+        } else {
+            printf("  %-50s", "? (no dl info)");
+        }
+    }
+    CompiledFunction* cf = getCFForAddress(ip);
+    AST_stmt* cur_stmt = nullptr;
+    enum { COMPILED, INTERPRETED, GENERATOR, OTHER } frame_type;
+    if (cf) {
+        // compiled frame
+        frame_type = COMPILED;
+        printf("      ip %12lx  bp %lx    JITTED\n", ip, bp);
+        // TODO: get current statement
+    } else if ((unw_word_t)interpreter_instr_addr <= ip && ip < interpreter_instr_end) {
+        // interpreted frame
+        frame_type = INTERPRETED;
+        printf("      ip %12lx  bp %lx    interpreted\n", ip, bp);
+        // sometimes this assert()s!
+        // cf = getCFForInterpretedFrame((void*)bp);
+        // cur_stmt = getCurrentStatementForInterpretedFrame((void*) bp);
+    } else if ((unw_word_t)generatorEntry <= ip && ip < generator_entry_end) {
+        // generator return frame
+        frame_type = GENERATOR;
+        printf("      ip %12lx  bp %lx    generator\n", ip, bp);
+    } else {
+        // generic frame, probably C/C++
+        frame_type = OTHER;
+        printf("      ip %12lx  bp %lx\n", ip, bp);
+    }
+    if (frame_type == INTERPRETED && cf && cur_stmt) {
+        auto source = cf->clfunc->source.get();
+        // FIXME: dup'ed from lineInfoForFrame
+        LineInfo line(cur_stmt->lineno, cur_stmt->col_offset, source->fn, source->getName());
+        printf("      File \"%s\", line %d, in %s\n", line.file.c_str(), line.line, line.func.c_str());
+    }
+}
+// ---------- Helpers for unwind_loop ----------
+static inline bool find_call_site_entry(const lsda_info_t* info, const uint8_t* ip, call_site_entry_t* entry) {
+    const uint8_t* p = info->call_site_table;
+    while (p < info->action_table) { // The call site table ends where the action table begins.
+        p = parse_call_site_entry(p, info, entry);
+        if (VERBOSITY("cxx_unwind") >= 3) {
+            printf("    start %p end %p landingpad %p action %lx\n", entry->instrs_start,
+                   entry->instrs_start + entry->instrs_len_bytes, entry->landing_pad, entry->action_offset_plus_one);
+        }
+        // If our IP is in the given range, we found the right entry!
+        if (entry->instrs_start <= ip && ip < entry->instrs_start + entry->instrs_len_bytes)
+            return true;
+        // The call-site table is sorted by start IP. If this entry starts past our IP, so does every later entry,
+        // so we won't find a match.
+        if (ip < entry->instrs_start)
+            break;
+    }
+    // If p actually overran *into* info.action_table, we have a malformed LSDA.
+    ASSERT(!(p > info->action_table), "Malformed LSDA; call site entry overlaps action table!");
+    return false;
+}
+static inline NORETURN void resume(unw_cursor_t* cursor, const uint8_t* landing_pad, int64_t switch_value,
+                                   const ExcData* exc_data) {
+    exc_data->check();
+    assert(landing_pad);
+    if (VERBOSITY("cxx_unwind") >= 2)
+        printf("  * RESUMED: ip %p  switch_value %ld\n", (const void*)landing_pad, (long)switch_value);
+    if (0 != switch_value) {
+        // The exception handler will call __cxa_begin_catch, which stops this timer and logs it.
+        per_thread_resume_catch_timer.restart("resume_catch", 20);
+    } else {
+        // The cleanup code will call _Unwind_Resume, which will stop this timer and log it.
+        // TODO: am I sure cleanup code can't raise exceptions? maybe have an assert!
+        per_thread_cleanup_timer.restart("cleanup", 20);
+#ifndef NDEBUG
+        in_cleanup_code = true;
+#endif
+    }
+    // set rax to pointer to exception object
+    // set rdx to the switch_value (0 for cleanup, otherwise an index indicating which exception handler to use)
+    //
+    // NB. assumes x86-64. maybe I should use __builtin_eh_return_data_regno() here?
+    // but then, need to translate into UNW_* values somehow. not clear how.
+    check(unw_set_reg(cursor, UNW_X86_64_RAX, (unw_word_t)exc_data));
+    check(unw_set_reg(cursor, UNW_X86_64_RDX, switch_value));
+    // resume!
+    check(unw_set_reg(cursor, UNW_REG_IP, (unw_word_t)landing_pad));
+    unw_resume(cursor);
+    RELEASE_ASSERT(0, "unw_resume returned!");
+}
+// Determines whether to dispatch to cleanup code or an exception handler based on the action table.
+// Doesn't need exception info b/c in Pyston we assume all handlers catch all exceptions.
+//
+// Returns the switch value to be passed into the landing pad, which selects which handler gets run in the case of
+// multiple `catch' blocks, or is 0 to run cleanup code.
+static inline int64_t determine_action(const lsda_info_t* info, const call_site_entry_t* entry) {
+    // No action means there are destructors/cleanup to run, but no exception handlers.
+    const uint8_t* p = first_action(info, entry);
+    if (!p)
+        return CLEANUP_ACTION;
+    // Read a chain of actions.
+    if (VERBOSITY("cxx_unwind") >= 3) {
+        printf("      reading action chain\n");
+    }
+    // When we see a cleanup action, we *don't* immediately take it. Rather, we remember that we should clean up if none
+    // of the other actions matched.
+    bool saw_cleanup = false;
+    do {
+        ASSERT(p >= info->action_table, "malformed LSDA");
+        ptrdiff_t offset = p - info->action_table;
+        int64_t type_filter;
+        p = next_action(p, &type_filter);
+        if (VERBOSITY("cxx_unwind") >= 3) {
+            if (p)
+                printf("      %ld: filter %ld  next %ld\n", offset, type_filter, p - info->action_table);
+            else
+                printf("      %ld: filter %ld  end\n", offset, type_filter);
+        }
+        if (0 == type_filter) {
+            // A type_filter of 0 indicates a cleanup.
+            saw_cleanup = true;
+        } else {
+            // Otherwise, the type_filter is supposed to be interpreted by looking up information in the types table and
+            // comparing it against the type of the exception thrown. In Pyston, however, every exception handler
+            // handles all exceptions, so we ignore the type information entirely and just run the handler.
+            //
+            // I don't fully understand negative type filters. For now we don't implement them. See
+            // http://www.airs.com/blog/archives/464 for some information.
+            RELEASE_ASSERT(type_filter > 0, "negative type filters unimplemented");
+            return type_filter;
+        }
+    } while (p);
+    if (saw_cleanup)
+        return CLEANUP_ACTION;
+    // We ran through the whole action chain and none applied, *and* there was no cleanup indicated. What do we do?
+    // This can't happen currently, but I think the answer is probably panic().
+    RELEASE_ASSERT(0, "action chain exhausted and no cleanup indicated");
+}
+static inline int step(unw_cursor_t* cp) {
+    LogTimer t("unw_step", us_unwind_step, 5);
+    return unw_step(cp);
+}
+// The stack-unwinding loop.
+// TODO: integrate incremental traceback generation into this function
+static inline void unwind_loop(const ExcData* exc_data) {
+    Timer t("unwind_loop", 50);
+    // NB. https://monoinfinito.wordpress.com/series/exception-handling-in-c/ is a very useful resource
+    // as are http://www.airs.com/blog/archives/460 and http://www.airs.com/blog/archives/464
+    unw_cursor_t cursor;
+    unw_context_t uc; // exists only to initialize cursor
+#ifndef NDEBUG
+    // poison stack memory. have had problems with these structures being insufficiently initialized.
+    memset(&uc, 0xef, sizeof uc);
+    memset(&cursor, 0xef, sizeof cursor);
+#endif
+    unw_getcontext(&uc);
+    unw_init_local(&cursor, &uc);
+    while (step(&cursor) > 0) {
+        unw_proc_info_t pip;
+        {
+            // NB. unw_get_proc_info is slow; a significant chunk of all time spent unwinding is spent here.
+            LogTimer t_procinfo("get_proc_info", us_unwind_get_proc_info, 10);
+            check(unw_get_proc_info(&cursor, &pip));
+        }
+        assert((pip.lsda == 0) == (pip.handler == 0));
+        assert(pip.flags == 0);
+        if (VERBOSITY("cxx_unwind") >= 2) {
+            print_frame(&cursor, &pip);
+        }
+        // Skip frames without handlers
+        if (pip.handler == 0) {
+            continue;
+        }
+        RELEASE_ASSERT(pip.handler == (uintptr_t)__gxx_personality_v0,
+                       "personality function other than __gxx_personality_v0; "
+                       "don't know how to unwind through non-C++ functions");
+        // Don't call __gxx_personality_v0; we perform dispatch ourselves.
+        // 1. parse LSDA header
+        lsda_info_t info;
+        parse_lsda_header(&pip, &info);
+        call_site_entry_t entry;
+        {
+            LogTimer t_call_site("find_call_site_entry", us_unwind_find_call_site_entry, 10);
+            // 2. Find our current IP in the call site table.
+            unw_word_t ip;
+            unw_get_reg(&cursor, UNW_REG_IP, &ip);
+            // ip points to the instruction *after* the instruction that caused the error - which is generally (always?)
+            // a call instruction - UNLESS we're in a signal frame, in which case it points at the instruction that
+            // caused the error. For now, we assume we're never in a signal frame. So, we decrement it by one.
+            //
+            // TODO: double-check that we never hit a signal frame.
+            --ip;
+            bool found = find_call_site_entry(&info, (const uint8_t*)ip, &entry);
+            // If we didn't find an entry, an exception happened somewhere exceptions should never happen; terminate
+            // immediately.
+            if (!found) {
+                panic();
+            }
+        }
+        // 3. Figure out what to do based on the call site entry.
+        if (!entry.landing_pad) {
+            // No landing pad means no exception handling or cleanup; keep unwinding!
+            continue;
+        }
+        // After this point we are guaranteed to resume something rather than unwinding further.
+        if (VERBOSITY("cxx_unwind") >= 3) {
+            print_lsda(&info);
+        }
+        int64_t switch_value = determine_action(&info, &entry);
+        us_unwind_loop.log(t.end());
+        resume(&cursor, entry.landing_pad, switch_value, exc_data);
+    }
+    us_unwind_loop.log(t.end());
+    // Hit end of stack! return & let unwindException determine what to do.
+}
+// The unwinder entry-point.
+static void unwind(const ExcData* exc) {
+    exc->check();
+    if (exc->exc.value->hasattr("magic_break")) {
+        (void)(0 == 0);
+    }
+    unwind_loop(exc);
+    // unwind_loop returned, couldn't find any handler. ruh-roh.
+    panic();
+}
+} // extern "C"
+} // namespace pyston
+// Standard library / runtime functions we override
+#if PYSTON_CUSTOM_UNWINDER
+void std::terminate() noexcept {
+    // The default std::terminate assumes things about the C++ exception state which aren't true for our custom
+    // unwinder.
+    RELEASE_ASSERT(0, "std::terminate() called!");
+}
+// wrong type signature, but that's okay, it's extern "C"
+extern "C" void __gxx_personality_v0() {
+    RELEASE_ASSERT(0, "__gxx_personality_v0 should never get called");
+}
+extern "C" void _Unwind_Resume(struct _Unwind_Exception* _exc) {
+    assert(pyston::in_cleanup_code);
+#ifndef NDEBUG
+    pyston::in_cleanup_code = false;
+#endif
+    pyston::us_unwind_cleanup.log(pyston::per_thread_cleanup_timer.end());
+    if (VERBOSITY("cxx_unwind"))
+        printf("***** _Unwind_Resume() *****\n");
+    // we give `_exc' type `struct _Unwind_Exception*' because unwind.h demands it; it's not actually accurate
+    const pyston::ExcData* data = (const pyston::ExcData*)_exc;
+    pyston::unwind(data);
+}
+// C++ ABI functionality
+namespace __cxxabiv1 {
+extern "C" void* __cxa_allocate_exception(size_t size) noexcept {
+    // we should only ever be throwing ExcInfos
+    ASSERT(size == sizeof(pyston::ExcInfo), "allocating exception whose size doesn't match ExcInfo");
+    // Instead of allocating memory for this exception, we return a pointer to a pre-allocated thread-local variable.
+    //
+    // This variable, pyston::exception_ferry, is used only while we are unwinding, and should not be used outside of
+    // the unwinder. Since it's a thread-local variable, we *cannot* throw any exceptions while it is live, otherwise we
+    // would clobber it and forget our old exception.
+    //
+    // Q: Why can't we just use cur_thread_state.curexc_{type,value,traceback}?
+    //
+    // A: Because that conflates the space used to store exceptions during C++ unwinding with the space used to store
+    // them during C-API return-code based unwinding! This actually comes up in practice - the original version *did*
+    // use curexc_{type,value,traceback}, and it had a bug.
+    //
+    // In particular, we need to unset the C API exception at an appropriate point so as not to make C-API functions
+    // *think* an exception is being thrown when one isn't. The natural place is __cxa_begin_catch, BUT we need some way
+    // to communicate the exception info to the inside of the catch block - and all we get is the return value of
+    // __cxa_begin_catch, which is a single pointer, when we need three!
+    //
+    // You might think we could get away with only unsetting the C-API information in __cxa_end_catch, but you'd be
+    // wrong! Firstly, this would prohibit calling C-API functions inside a catch-block. Secondly, __cxa_end_catch is
+    // always called when leaving a catch block, even if we're leaving it by re-raising the exception. So if we store
+    // our exception info in curexc_*, and then unset these in __cxa_end_catch, then we'll wipe our exception info
+    // during unwinding!
+    return (void*)&pyston::exception_ferry;
+}
+// Takes the value that resume() sent us in RAX, and returns a pointer to the exception object actually thrown. In our
+// case, these are the same, and should always be &pyston::exception_ferry.
+extern "C" void* __cxa_begin_catch(void* exc_obj_in) noexcept {
+    assert(exc_obj_in);
+    pyston::us_unwind_resume_catch.log(pyston::per_thread_resume_catch_timer.end());
+    if (VERBOSITY("cxx_unwind"))
+        printf("***** __cxa_begin_catch() *****\n");
+    pyston::ExcData* e = (pyston::ExcData*)exc_obj_in;
+    e->check();
+    return (void*)&e->exc;
+}
+extern "C" void __cxa_end_catch() {
+    if (VERBOSITY("cxx_unwind"))
+        printf("***** __cxa_end_catch() *****\n");
+    // See comment in __cxa_allocate_exception for why we don't clear the exception ferry here.
+}
+extern "C" void __cxa_throw(void* exc_obj, std::type_info* tinfo, void (*dtor)(void*)) {
+    assert(!pyston::in_cleanup_code);
+    assert(exc_obj);
+    if (VERBOSITY("cxx_unwind"))
+        printf("***** __cxa_throw() *****\n");
+    pyston::unwind((const pyston::ExcData*)exc_obj);
+}
+extern "C" void* __cxa_get_exception_ptr(void* exc_obj_in) noexcept {
+    assert(exc_obj_in);
+    pyston::ExcData* e = (pyston::ExcData*)exc_obj_in;
+    e->check();
+    return (void*)&e->exc;
+}
+// We deliberately don't implement rethrowing because we can't implement it correctly with our current strategy for
+// storing the exception info. Don't use bare `throw' from inside an exception handler! Instead, do:
+//
+//     try { ... }
+//     catch(ExcInfo e) {   // copies the exception info received to the stack
+//         ...
+//         throw e;
+//     }
+//
+extern "C" void __cxa_rethrow() {
+    RELEASE_ASSERT(0, "__cxa_rethrow() unimplemented; please don't use bare `throw' in Pyston!");
+}
+}
+#endif // PYSTON_CUSTOM_UNWINDER
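The lookup done by `find_call_site_entry` in this file can be illustrated on its own. The following is a minimal sketch, not the code from this commit: `CallSite` and `findCallSite` are hypothetical names, and the table is assumed to be already decoded into a vector, whereas the real table is parsed entry by entry out of the LSDA.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, already-decoded call-site record (an assumption for illustration).
struct CallSite {
    uintptr_t start;       // first instruction address covered by this entry
    uintptr_t len;         // length of the covered range in bytes
    uintptr_t landing_pad; // 0 means "no handler or cleanup for this range"
};

// Returns the entry whose [start, start + len) range contains ip, or nullptr.
// Assumes the table is sorted by start address, which allows the early exit.
const CallSite* findCallSite(const std::vector<CallSite>& table, uintptr_t ip) {
    for (const CallSite& cs : table) {
        if (cs.start <= ip && ip < cs.start + cs.len)
            return &cs;
        if (ip < cs.start) // every later entry starts even further past ip
            break;
    }
    return nullptr;
}
```

Note the role of the `--ip` adjustment in `unwind_loop`: a return address sits just after the call instruction, so a call that is the last instruction of a range yields a return address equal to `start + len`, which only matches once it has been decremented.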
--- a/src/runtime/descr.cpp
+++ b/src/runtime/descr.cpp
@@ -42,7 +42,7 @@ static void propertyDocCopy(BoxedProperty* prop, Box* fget) {
        get_doc = getattrInternal(fget, "__doc__", NULL);
    } catch (ExcInfo e) {
        if (!e.matches(Exception)) {
-            throw;
+            throw e;
        }
        get_doc = NULL;
    }
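The `throw;` to `throw e;` change in this hunk follows the rule stated in `docs/EXCEPTION-SAFETY.md`: catch by value, then rethrow the caught copy. A minimal sketch of that shape, assuming a simplified stand-in `ExcInfo` (the real type carries type, value and traceback, and its `matches` checks subclasses rather than string equality) and a hypothetical helper `callSwallowing`:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for pyston::ExcInfo, reduced to a type name.
struct ExcInfo {
    std::string type;
    bool matches(const std::string& t) const { return type == t; }
};

// Runs f. Swallows exceptions of the given type and propagates all others.
// Returns true if f completed, false if a matching exception was swallowed.
template <typename F>
bool callSwallowing(const std::string& swallowed_type, F&& f) {
    try {
        f();
        return true;
    } catch (ExcInfo e) { // catch by value: copies the exception onto the stack
        if (!e.matches(swallowed_type))
            throw e; // rethrow the copy; a bare `throw;` would call __cxa_rethrow
        return false;
    }
}
```

With the standard runtime both forms behave the same here; under the custom unwinder only `throw e;` works, since `__cxa_rethrow` is deliberately left unimplemented.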

--- a/src/runtime/objmodel.cpp
+++ b/src/runtime/objmodel.cpp
@@ -4968,7 +4968,7 @@ extern "C" Box* boxedLocalsGet(Box* boxedLocals, const char* attr, Box* globals)
            // If it throws a KeyError, then the variable doesn't exist so move on
            // and check the globals (below); otherwise, just propogate the exception.
            if (!isSubclass(e.value->cls, KeyError)) {
-                throw;
+                throw e;
            }
        }
    }

--- a/src/runtime/stacktrace.cpp
+++ b/src/runtime/stacktrace.cpp
@@ -45,55 +45,6 @@ void showBacktrace() {
    }
 }
-// Currently-unused libunwind-based unwinding:
-void unwindExc(Box* exc_obj) __attribute__((noreturn));
-void unwindExc(Box* exc_obj) {
-    unw_cursor_t cursor;
-    unw_context_t uc;
-    unw_word_t ip, sp;
-    unw_getcontext(&uc);
-    unw_init_local(&cursor, &uc);
-    int code;
-    unw_proc_info_t pip;
-    while (unw_step(&cursor) > 0) {
-        unw_get_reg(&cursor, UNW_REG_IP, &ip);
-        unw_get_reg(&cursor, UNW_REG_SP, &sp);
-        printf("ip = %lx, sp = %lx\n", (long)ip, (long)sp);
-        code = unw_get_proc_info(&cursor, &pip);
-        RELEASE_ASSERT(code == 0, "");
-        // printf("%lx %lx %lx %lx %lx %lx %d %d %p\n", pip.start_ip, pip.end_ip, pip.lsda, pip.handler, pip.gp,
-        // pip.flags, pip.format, pip.unwind_info_size, pip.unwind_info);
-        assert((pip.lsda == 0) == (pip.handler == 0));
-        assert(pip.flags == 0);
-        if (pip.handler == 0) {
-            if (VERBOSITY())
-                printf("Skipping frame without handler\n");
-            continue;
-        }
-        printf("%lx %lx %lx\n", pip.lsda, pip.handler, pip.flags);
-        // assert(pip.handler == (uintptr_t)__gxx_personality_v0 || pip.handler == (uintptr_t)__py_personality_v0);
-        // auto handler_fn = (int (*)(int, int, uint64_t, void*, void*))pip.handler;
-        ////handler_fn(1, 1 /* _UA_SEARCH_PHASE */, 0 /* exc_class */, NULL, NULL);
-        // handler_fn(2, 2 /* _UA_SEARCH_PHASE */, 0 /* exc_class */, NULL, NULL);
-        unw_set_reg(&cursor, UNW_REG_IP, 1);
-        // TODO testing:
-        // unw_resume(&cursor);
-    }
-    abort();
-}
 void raiseRaw(const ExcInfo& e) __attribute__((__noreturn__));
 void raiseRaw(const ExcInfo& e) {
    STAT_TIMER(t0, "us_timer_raiseraw");
@@ -105,11 +56,17 @@ void raiseRaw(const ExcInfo& e) {
    assert(gc::isValidGCObject(e.value));
    assert(gc::isValidGCObject(e.traceback));
-    // Using libgcc:
-    throw e;
-    // Using libunwind
-    // unwindExc(exc_obj);
+    if (VERBOSITY("stacktrace")) {
+        try {
+            std::string st = str(e.type)->s.str();
+            std::string sv = str(e.value)->s.str();
+            printf("---- raiseRaw() called with %s: %s\n", st.c_str(), sv.c_str());
+        } catch (ExcInfo e) {
+            printf("---- raiseRaw() called and WTFed\n");
+        }
+    }
+    throw e;
 }
 void raiseExc(Box* exc_obj) {