Merge pull request #936 from kennyyu/kennyyu-deadlock-detector

tools: add tool to detect potential deadlocks in running programs

Commit 0bd4e585 authored Feb 06, 2017 by 4ast Committed by GitHub Feb 06, 2017
5 changed files
--- a/README.md
+++ b/README.md
@@ -89,6 +89,7 @@ Examples:
 - tools/[cpuunclaimed](tools/cpuunclaimed.py): Sample CPU run queues and calculate unclaimed idle CPU. [Examples](tools/cpuunclaimed_example.txt)
 - tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
 - tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
+- tools/[deadlock_detector](tools/deadlock_detector.py): Detect potential deadlocks in a running process. [Examples](tools/deadlock_detector_example.txt).
 - tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
 - tools/[ext4dist](tools/ext4dist.py): Summarize ext4 operation latency distribution as a histogram. [Examples](tools/ext4dist_example.txt).
 - tools/[ext4slower](tools/ext4slower.py): Trace slow ext4 operations. [Examples](tools/ext4slower_example.txt).

--- a/man/man8/deadlock_detector.8
+++ b/man/man8/deadlock_detector.8
+.TH deadlock_detector 8  "2017-02-01" "USER COMMANDS"
+.SH NAME
+deadlock_detector \- Find potential deadlocks (lock order inversions)
+in a running program.
+.SH SYNOPSIS
+.B deadlock_detector [\-h] [\--binary BINARY] [\--dump-graph DUMP_GRAPH]
+.B                  [\--verbose] [\--lock-symbols LOCK_SYMBOLS]
+.B                  [\--unlock-symbols UNLOCK_SYMBOLS]
+.B                  pid
+.SH DESCRIPTION
+deadlock_detector finds potential deadlocks in a running process. The program
+attaches uprobes on `pthread_mutex_lock` and `pthread_mutex_unlock` by default
+to build a mutex wait directed graph, and then looks for a cycle in this graph.
+This graph has the following properties:
+
+- Nodes in the graph represent mutexes.
+
+- Edge (A, B) exists if some thread T called lock(B) while holding A, that is,
+after calling lock(A) and before calling unlock(A).
+
+If there is a cycle in this graph, this indicates that there is a lock order
+inversion (potential deadlock). If the program finds a lock order inversion, the
+program will dump the cycle of mutexes, dump the stack traces where each mutex
+was acquired, and then exit.
+
+This program can only find potential deadlocks that occur while the program is
+tracing the process. It cannot find deadlocks that may have occurred before the
+program was attached to the process.
+
+This tool does not work for shared mutexes or recursive mutexes.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc
+.SH OPTIONS
+.TP
+\-h, --help
+show this help message and exit
+.TP
+\--binary BINARY
+If set, trace the mutexes from the binary at this path. For
+statically-linked binaries, this argument is not required.
+For dynamically-linked binaries, this argument is required and should be the
+path of the pthread library the binary is using.
+Example: /lib/x86_64-linux-gnu/libpthread.so.0
+.TP
+\--dump-graph DUMP_GRAPH
+If set, this will dump the mutex graph to the specified file.
+.TP
+\--verbose
+Print statistics about the mutex wait graph.
+.TP
+\--lock-symbols LOCK_SYMBOLS
+Comma-separated list of lock symbols to trace. Default is pthread_mutex_lock.
+These symbols cannot be inlined in the binary.
+.TP
+\--unlock-symbols UNLOCK_SYMBOLS
+Comma-separated list of unlock symbols to trace. Default is
+pthread_mutex_unlock. These symbols cannot be inlined in the binary.
+.TP
+pid
+Pid to trace
+.SH EXAMPLES
+.TP
+Find potential deadlocks in PID 181. The --binary argument is not needed for \
+statically-linked binaries.
+#
+.B deadlock_detector 181
+.TP
+Find potential deadlocks in PID 181. If the process was created from a \
+dynamically-linked executable, the --binary argument is required and must be \
+the path of the pthread library:
+#
+.B deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
+.TP
+Find potential deadlocks in PID 181. If the process was created from a \
+statically-linked executable, optionally pass the location of the binary. \
+On older kernels without https://lkml.org/lkml/2017/1/13/585, binaries that \
+contain `:` in the path cannot be attached with uprobes. As a workaround, we \
+can create a symlink to the binary, and provide the symlink name instead with \
+the `--binary` option:
+#
+.B deadlock_detector 181 --binary /usr/local/bin/lockinversion
+.TP
+Find potential deadlocks in PID 181 and dump the mutex wait graph to a file:
+#
+.B deadlock_detector 181 --dump-graph graph.json
+.TP
+Find potential deadlocks in PID 181 and print mutex wait graph statistics:
+#
+.B deadlock_detector 181 --verbose
+.TP
+Find potential deadlocks in PID 181 with custom mutexes:
+#
+.B deadlock_detector 181
+.B      --lock-symbols custom_mutex1_lock,custom_mutex2_lock
+.B      --unlock-symbols custom_mutex1_unlock,custom_mutex2_unlock
+.SH OUTPUT
+This program does not output any fields. Rather, it will keep running until
+it finds a potential deadlock, or the user hits Ctrl-C. If the program finds
+a potential deadlock, it will output the stack traces and lock order inversion
+in the following format and exit:
+.TP
+Potential Deadlock Detected!
+.TP
+Cycle in lock order graph: Mutex M0 => Mutex M1 => Mutex M0
+.TP
+Mutex M1 acquired here while holding Mutex M0 in Thread T:
+.B [stack trace]
+.TP
+Mutex M0 previously acquired by the same Thread T here:
+.B [stack trace]
+.TP
+Mutex M0 acquired here while holding Mutex M1 in Thread S:
+.B [stack trace]
+.TP
+Mutex M1 previously acquired by the same Thread S here:
+.B [stack trace]
+.TP
+Thread T created by Thread R here:
+.B [stack trace]
+.TP
+Thread S created by Thread Q here:
+.B [stack trace]
+.SH OVERHEAD
+This traces all mutex lock and unlock events and all thread creation events
+on the traced process. The overhead of this can be high if the process has many
+threads and mutexes. You should only run this on a process where the slowdown
+is acceptable.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Kenny Yu
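The lock order graph described in the man page above can be illustrated with a small, self-contained Python sketch. It is not taken from deadlock_detector.py: it replays a hypothetical list of `(thread, op, mutex)` events instead of uprobe data, and it finds a cycle with a plain depth-first search rather than the Johnson's-algorithm code the tool adapts from networkx. The helper names `build_lock_order_graph` and `find_lock_cycle` are illustrative only.

```python
from collections import defaultdict


def build_lock_order_graph(events):
    """Build a lock order graph from hypothetical (thread, op, mutex) events.

    An edge (a, b) is added when a thread locks b while still holding a.
    Locking a mutex the thread already holds is ignored, so the graph has
    no self edges (the tool does the same for recursive mutexes).
    """
    held = defaultdict(list)  # thread -> mutexes it currently holds
    graph = defaultdict(set)  # mutex -> mutexes locked while holding it
    for thread, op, mutex in events:
        if op == "lock":
            if mutex in held[thread]:
                continue
            for other in held[thread]:
                graph[other].add(mutex)
            held[thread].append(mutex)
        elif op == "unlock" and mutex in held[thread]:
            held[thread].remove(mutex)
    return graph


def find_lock_cycle(graph):
    """Return one cycle as [m0, m1, ..., m0] using depth-first search, or []."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = defaultdict(int)
    parent = {}

    def dfs(node):
        color[node] = GRAY
        for nxt in sorted(graph.get(node, ())):
            if color[nxt] == GRAY:
                # Back edge node -> nxt closes a cycle; walk parents to nxt.
                cycle = [node]
                while cycle[-1] != nxt:
                    cycle.append(parent[cycle[-1]])
                cycle.reverse()
                return cycle + [nxt]
            if color[nxt] == WHITE:
                parent[nxt] = node
                found = dfs(nxt)
                if found:
                    return found
        color[node] = BLACK
        return []

    for start in sorted(graph):
        if color[start] == WHITE:
            found = dfs(start)
            if found:
                return found
    return []


# Thread T locks M0 then M1; later, thread S locks M1 then M0.
events = [
    ("T", "lock", "M0"), ("T", "lock", "M1"),
    ("T", "unlock", "M1"), ("T", "unlock", "M0"),
    ("S", "lock", "M1"), ("S", "lock", "M0"),
    ("S", "unlock", "M0"), ("S", "unlock", "M1"),
]
print(find_lock_cycle(build_lock_order_graph(events)))  # ['M0', 'M1', 'M0']
```

The replayed trace never actually deadlocks, because T releases both mutexes before S starts. The inversion is still reported, which is exactly what "potential deadlock" means for this tool.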
--- a/tools/deadlock_detector.c
+++ b/tools/deadlock_detector.c
+/*
+ * deadlock_detector.c  Detects potential deadlocks in a running process.
+ *                      For Linux, uses BCC, eBPF. See .py file.
+ *
+ * Copyright 2017 Facebook, Inc.
+ * Licensed under the Apache License, Version 2.0 (the "License")
+ *
+ * 01-Feb-2017  Kenny Yu   Created this.
+ */
+
+#include <linux/sched.h>
+#include <uapi/linux/ptrace.h>
+
+// Maximum number of mutexes a single thread can hold at once.
+// If the number is too big, the unrolled loops will make the stack too
+// large, and the BPF verifier will reject the program.
+#define MAX_HELD_MUTEXES 16
+
+// Info about held mutexes. `mutex` will be 0 if not held.
+struct held_mutex_t {
+  u64 mutex;
+  u64 stack_id;
+};
+
+// List of mutexes that a thread is holding. Whenever we loop over this array,
+// we need to force the compiler to unroll the loop; otherwise the BPF verifier
+// will reject the program because the loop creates a back edge.
+struct thread_to_held_mutex_leaf_t {
+  struct held_mutex_t held_mutexes[MAX_HELD_MUTEXES];
+};
+
+// Map of thread ID -> array of (mutex addresses, stack id)
+BPF_TABLE("hash", u32, struct thread_to_held_mutex_leaf_t,
+          thread_to_held_mutexes, 2097152);
+
+// Key type for edges. Represents an edge from mutex1 => mutex2.
+struct edges_key_t {
+  u64 mutex1;
+  u64 mutex2;
+};
+
+// Leaf type for edges. Holds information about where each mutex was acquired.
+struct edges_leaf_t {
+  u64 mutex1_stack_id;
+  u64 mutex2_stack_id;
+  u32 thread_pid;
+  char comm[TASK_COMM_LEN];
+};
+
+// Represents all edges currently in the mutex wait graph.
+BPF_TABLE("hash", struct edges_key_t, struct edges_leaf_t, edges, 2097152);
+
+// Info about parent thread when a child thread is created.
+struct thread_created_leaf_t {
+  u64 stack_id;
+  u32 parent_pid;
+  char comm[TASK_COMM_LEN];
+};
+
+// Map of child thread pid -> info about parent thread.
+BPF_TABLE("hash", u32, struct thread_created_leaf_t, thread_to_parent, 10240);
+
+// Stack traces when threads are created and when mutexes are locked/unlocked.
+BPF_STACK_TRACE(stack_traces, 655360);
+
+// The first argument to the user space function we are tracing
+// is a pointer to the mutex M held by thread T.
+//
+// For all mutexes N held by mutexes_held[T]
+//   add edge N => M (held by T)
+// mutexes_held[T].add(M)
+int trace_mutex_acquire(struct pt_regs *ctx, void *mutex_addr) {
+  // The upper 32 bits are the process ID (tgid) and the lower 32 bits are the
+  // thread ID; assigning the result to a u32 keeps only the thread ID.
+  u32 pid = bpf_get_current_pid_tgid();
+  u64 mutex = (u64)mutex_addr;
+
+  struct thread_to_held_mutex_leaf_t empty_leaf = {};
+  struct thread_to_held_mutex_leaf_t *leaf =
+      thread_to_held_mutexes.lookup_or_init(&pid, &empty_leaf);
+  if (!leaf) {
+    bpf_trace_printk(
+        "could not add thread_to_held_mutex key, thread: %d, mutex: %p\n", pid,
+        mutex);
+    return 1; // Could not insert, no more memory
+  }
+
+  // Recursive mutexes lock the same mutex multiple times. We cannot tell if
+  // the mutex is recursive after the mutex is already created. To avoid noisy
+  // reports, disallow self edges. Do one pass to check if we are already
+  // holding the mutex, and if we are, do nothing.
+  #pragma unroll
+  for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
+    if (leaf->held_mutexes[i].mutex == mutex) {
+      return 1; // Disallow self edges
+    }
+  }
+
+  u64 stack_id =
+      stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
+
+  int added_mutex = 0;
+  #pragma unroll
+  for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
+    // If this is a free slot, see if we can insert.
+    if (!leaf->held_mutexes[i].mutex) {
+      if (!added_mutex) {
+        leaf->held_mutexes[i].mutex = mutex;
+        leaf->held_mutexes[i].stack_id = stack_id;
+        added_mutex = 1;
+      }
+      continue; // Nothing to do for a free slot
+    }
+
+    // Add edges from held mutex => current mutex
+    struct edges_key_t edge_key = {};
+    edge_key.mutex1 = leaf->held_mutexes[i].mutex;
+    edge_key.mutex2 = mutex;
+
+    struct edges_leaf_t edge_leaf = {};
+    edge_leaf.mutex1_stack_id = leaf->held_mutexes[i].stack_id;
+    edge_leaf.mutex2_stack_id = stack_id;
+    edge_leaf.thread_pid = pid;
+    bpf_get_current_comm(&edge_leaf.comm, sizeof(edge_leaf.comm));
+
+    // Returns non-zero on error
+    int result = edges.update(&edge_key, &edge_leaf);
+    if (result) {
+      bpf_trace_printk("could not add edge key %p, %p, error: %d\n",
+                       edge_key.mutex1, edge_key.mutex2, result);
+      continue; // Could not insert, no more memory
+    }
+  }
+
+  // There were no free slots for this mutex.
+  if (!added_mutex) {
+    bpf_trace_printk("could not add mutex %p, added_mutex: %d\n", mutex,
+                     added_mutex);
+    return 1;
+  }
+  return 0;
+}
+
+// The first argument to the user space function we are tracing
+// is a pointer to the mutex M held by thread T.
+//
+// mutexes_held[T].remove(M)
+int trace_mutex_release(struct pt_regs *ctx, void *mutex_addr) {
+  // The upper 32 bits are the process ID (tgid) and the lower 32 bits are the
+  // thread ID; assigning the result to a u32 keeps only the thread ID.
+  u32 pid = bpf_get_current_pid_tgid();
+  u64 mutex = (u64)mutex_addr;
+
+  struct thread_to_held_mutex_leaf_t *leaf =
+      thread_to_held_mutexes.lookup(&pid);
+  if (!leaf) {
+    // If the leaf does not exist for the pid, then it means we either missed
+    // the acquire event, or we had no more memory and could not add it.
+    bpf_trace_printk(
+        "could not find thread_to_held_mutex, thread: %d, mutex: %p\n", pid,
+        mutex);
+    return 1;
+  }
+
+  // For older kernels without "Bpf: allow access into map value arrays"
+  // (https://lkml.org/lkml/2016/8/30/287) the bpf verifier will fail with an
+  // invalid memory access on `leaf->held_mutexes[i]` below. On newer kernels,
+  // we can avoid making this extra copy in `value` and use `leaf` directly.
+  struct thread_to_held_mutex_leaf_t value = {};
+  bpf_probe_read(&value, sizeof(struct thread_to_held_mutex_leaf_t), leaf);
+
+  #pragma unroll
+  for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
+    // Find the current mutex (if it exists), and clear it.
+    // Note: Can't use `leaf->` in this if condition, see comment above.
+    if (value.held_mutexes[i].mutex == mutex) {
+      leaf->held_mutexes[i].mutex = 0;
+      leaf->held_mutexes[i].stack_id = 0;
+    }
+  }
+
+  return 0;
+}
+
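+// Trace return from clone() syscall in the parent thread. On success, clone()
+// returns the new thread's ID (> 0) to the parent; on failure it returns a
+// negative errno.
+int trace_clone(struct pt_regs *ctx, unsigned long flags, void *child_stack,
+                void *ptid, void *ctid, struct pt_regs *regs) {
+  // Read the return value as signed so that error codes are not mistaken for
+  // valid thread IDs (a u32 can never be <= 0 for a negative errno).
+  long ret = PT_REGS_RC(ctx);
+  if (ret <= 0) {
+    return 1;
+  }
+  u32 child_pid = ret;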
+
+  struct thread_created_leaf_t thread_created_leaf = {};
+  thread_created_leaf.parent_pid = bpf_get_current_pid_tgid();
+  thread_created_leaf.stack_id =
+      stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
+  bpf_get_current_comm(&thread_created_leaf.comm,
+                       sizeof(thread_created_leaf.comm));
+
+  struct thread_created_leaf_t *insert_result =
+      thread_to_parent.lookup_or_init(&child_pid, &thread_created_leaf);
+  if (!insert_result) {
+    bpf_trace_printk(
+        "could not add thread_created_key, child: %d, parent: %d\n", child_pid,
+        thread_created_leaf.parent_pid);
+    return 1; // Could not insert, no more memory
+  }
+  return 0;
+}
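The slot bookkeeping in trace_mutex_acquire() and trace_mutex_release() can be modeled in user space to see how edges are recorded and when a mutex is dropped. This is a sketch under assumptions: Python dicts stand in for the BPF hash maps, the caller supplies stack IDs, and `HeldMutexModel` and `thread_id` are hypothetical names. As in the BPF code, 0 marks a free slot, re-locking a mutex the thread already holds adds nothing, and a thread that already holds MAX_HELD_MUTEXES mutexes still gets its edges recorded while the new mutex itself is not remembered.

```python
MAX_HELD_MUTEXES = 16  # same bound as the BPF program


def thread_id(pid_tgid):
    """Return the thread ID: the lower 32 bits of a pid_tgid value.

    Assigning bpf_get_current_pid_tgid() to a u32, as the BPF code does,
    keeps exactly these bits. The upper 32 bits are the process ID (tgid).
    """
    return pid_tgid & 0xFFFFFFFF


class HeldMutexModel(object):
    """Hypothetical user-space model of thread_to_held_mutexes and edges."""

    def __init__(self):
        # thread ID -> MAX_HELD_MUTEXES slots of [mutex, stack_id]; 0 = free
        self.held = {}
        # (mutex1, mutex2) -> (mutex1_stack_id, mutex2_stack_id, thread ID)
        self.edges = {}

    def acquire(self, pid_tgid, mutex, stack_id):
        """Mirror trace_mutex_acquire(); True if the mutex was recorded."""
        tid = thread_id(pid_tgid)
        slots = self.held.setdefault(
            tid, [[0, 0] for _ in range(MAX_HELD_MUTEXES)])
        # First pass: a mutex this thread already holds adds nothing.
        if any(slot[0] == mutex for slot in slots):
            return False
        added = False
        for slot in slots:
            if slot[0] == 0:
                if not added:  # only the first free slot is used
                    slot[0], slot[1] = mutex, stack_id
                    added = True
                continue
            # Edge from every other held mutex to the new one.
            self.edges[(slot[0], mutex)] = (slot[1], stack_id, tid)
        return added

    def release(self, pid_tgid, mutex):
        """Mirror trace_mutex_release(); False if the thread is unknown."""
        slots = self.held.get(thread_id(pid_tgid))
        if slots is None:
            return False  # acquire was missed or could not be stored
        for slot in slots:
            if slot[0] == mutex:
                slot[0] = slot[1] = 0
        return True


model = HeldMutexModel()
pid_tgid = (1234 << 32) | 1240  # tgid 1234, thread ID 1240
model.acquire(pid_tgid, 0xA0, stack_id=1)
model.acquire(pid_tgid, 0xB0, stack_id=2)
model.acquire(pid_tgid, 0xA0, stack_id=3)  # re-lock: no self edge
print(model.edges)  # {(160, 176): (1, 2, 1240)}
```

The fixed-size slot array is what keeps the BPF loops boundable and unrollable; the cost is that a thread holding more than MAX_HELD_MUTEXES mutexes at once silently loses track of the extra ones.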
--- a/tools/deadlock_detector.py
+++ b/tools/deadlock_detector.py
+#!/usr/bin/env python
+#
+# deadlock_detector  Detects potential deadlocks (lock order inversions)
+#                    on a running process. For Linux, uses BCC, eBPF.
+#
+# USAGE: deadlock_detector.py [-h] [--binary BINARY] [--dump-graph DUMP_GRAPH]
+#                             [--verbose] [--lock-symbols LOCK_SYMBOLS]
+#                             [--unlock-symbols UNLOCK_SYMBOLS]
+#                             pid
+#
+# This traces pthread mutex lock and unlock calls to build a directed graph
+# representing the mutex wait graph:
+#
+# - Nodes in the graph represent mutexes.
+# - Edge (A, B) exists if some thread T called lock(B) while holding A, that
+#   is, after calling lock(A) and before calling unlock(A).
+#
+# If the program finds a potential lock order inversion, the program will dump
+# the cycle of mutexes and the stack traces where each mutex was acquired, and
+# then exit.
+#
+# This program can only find potential deadlocks that occur while the program
+# is tracing the process. It cannot find deadlocks that may have occurred
+# before the program was attached to the process.
+#
+# Since this traces all mutex lock and unlock events and all thread creation
+# events on the traced process, the overhead of this bpf program can be very
+# high if the process has many threads and mutexes. You should only run this on
+# a process where the slowdown is acceptable.
+#
+# Note: This tool does not work for shared mutexes or recursive mutexes.
+#
+# For shared (read-write) mutexes, a deadlock requires a cycle in the wait
+# graph where at least one of the mutexes in the cycle is acquiring exclusive
+# (write) ownership.
+#
+# For recursive mutexes, lock() is called multiple times on the same mutex.
+# However, there is no way to determine if a mutex is a recursive mutex
+# after the mutex has been created. As a result, this tool will not find
+# potential deadlocks that involve only one mutex.
+#
+# Copyright 2017 Facebook, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 01-Feb-2017   Kenny Yu   Created this.
+
+from __future__ import (
+    absolute_import, division, unicode_literals, print_function
+)
+from bcc import BPF
+from collections import defaultdict
+import argparse
+import json
+import os
+import subprocess
+import sys
+import time
+
+
+class DiGraph(object):
+    '''
+    Adapted from networkx: http://networkx.github.io/
+    Represents a directed graph. Edges can store (key, value) attributes.
+    '''
+
+    def __init__(self):
+        # Map of node -> set of nodes
+        self.adjacency_map = {}
+        # Map of (node1, node2) -> dict of attribute name -> arbitrary value.
+        # This is not copied by subgraph().
+        self.attributes_map = {}
+
+    def neighbors(self, node):
+        return self.adjacency_map.get(node, set())
+
+    def edges(self):
+        edges = []
+        for node, neighbors in self.adjacency_map.items():
+            for neighbor in neighbors:
+                edges.append((node, neighbor))
+        return edges
+
+    def nodes(self):
+        return self.adjacency_map.keys()
+
+    def attributes(self, node1, node2):
+        return self.attributes_map[(node1, node2)]
+
+    def add_edge(self, node1, node2, **kwargs):
+        if node1 not in self.adjacency_map:
+            self.adjacency_map[node1] = set()
+        if node2 not in self.adjacency_map:
+            self.adjacency_map[node2] = set()
+        self.adjacency_map[node1].add(node2)
+        self.attributes_map[(node1, node2)] = kwargs
+
+    def remove_node(self, node):
+        self.adjacency_map.pop(node, None)
+        for _, neighbors in self.adjacency_map.items():
+            neighbors.discard(node)
+
+    def subgraph(self, nodes):
+        graph = DiGraph()
+        for node in nodes:
+            for neighbor in self.neighbors(node):
+                if neighbor in nodes:
+                    graph.add_edge(node, neighbor)
+        return graph
+
+    def node_link_data(self):
+        '''
+        Returns the graph as a dictionary in a format that can be
+        serialized.
+        '''
+        data = {
+            'directed': True,
+            'multigraph': False,
+            'graph': {},
+            'links': [],
+            'nodes': [],
+        }
+
+        # Do one pass to build a map of node -> position in nodes
+        node_to_number = {}
+        for node in self.adjacency_map.keys():
+            node_to_number[node] = len(data['nodes'])
+            data['nodes'].append({'id': node})
+
+        # Do another pass to build the link information
+        for node, neighbors in self.adjacency_map.items():
+            for neighbor in neighbors:
+                link = self.attributes_map[(node, neighbor)].copy()
+                link['source'] = node_to_number[node]
+                link['target'] = node_to_number[neighbor]
+                data['links'].append(link)
+        return data
+
+
+def strongly_connected_components(G):
+    '''
+    Adapted from networkx: http://networkx.github.io/
+    Parameters
+    ----------
+    G : DiGraph
+    Returns
+    -------
+    comp : generator of sets
+        A generator of sets of nodes, one for each strongly connected
+        component of G.
+    '''
+    preorder = {}
+    lowlink = {}
+    scc_found = {}
+    scc_queue = []
+    i = 0  # Preorder counter
+    for source in G.nodes():
+        if source not in scc_found:
+            queue = [source]
+            while queue:
+                v = queue[-1]
+                if v not in preorder:
+                    i = i + 1
+                    preorder[v] = i
+                done = 1
+                v_nbrs = G.neighbors(v)
+                for w in v_nbrs:
+                    if w not in preorder:
+                        queue.append(w)
+                        done = 0
+                        break
+                if done == 1:
+                    lowlink[v] = preorder[v]
+                    for w in v_nbrs:
+                        if w not in scc_found:
+                            if preorder[w] > preorder[v]:
+                                lowlink[v] = min([lowlink[v], lowlink[w]])
+                            else:
+                                lowlink[v] = min([lowlink[v], preorder[w]])
+                    queue.pop()
+                    if lowlink[v] == preorder[v]:
+                        scc_found[v] = True
+                        scc = {v}
+                        while (
+                            scc_queue and preorder[scc_queue[-1]] > preorder[v]
+                        ):
+                            k = scc_queue.pop()
+                            scc_found[k] = True
+                            scc.add(k)
+                        yield scc
+                    else:
+                        scc_queue.append(v)
+
+
+def simple_cycles(G):
+    '''
+    Adapted from networkx: http://networkx.github.io/
+    Parameters
+    ----------
+    G : DiGraph
+    Returns
+    -------
+    cycle_generator: generator
+       A generator that produces elementary cycles of the graph.
+       Each cycle is represented by a list of nodes along the cycle.
+    '''
+
+    def _unblock(thisnode, blocked, B):
+        stack = set([thisnode])
+        while stack:
+            node = stack.pop()
+            if node in blocked:
+                blocked.remove(node)
+                stack.update(B[node])
+                B[node].clear()
+
+    # Johnson's algorithm requires some ordering of the nodes.
+    # We assign the arbitrary ordering given by the strongly connected
+    # components. There is no need to track the ordering, because each node
+    # is removed once it has been processed.
+    # Copy the graph so we can mutate it here. subgraph() copies only the
+    # edges, not the edge attributes.
+    subG = G.subgraph(G.nodes())
+    sccs = list(strongly_connected_components(subG))
+    while sccs:
+        scc = sccs.pop()
+        # order of scc determines ordering of nodes
+        startnode = scc.pop()
+        # Processing node runs 'circuit' routine from recursive version
+        path = [startnode]
+        blocked = set()  # vertex: blocked from search?
+        closed = set()  # nodes involved in a cycle
+        blocked.add(startnode)
+        B = defaultdict(set)  # graph portions that yield no elementary circuit
+        stack = [(startnode, list(subG.neighbors(startnode)))]
+        while stack:
+            thisnode, nbrs = stack[-1]
+            if nbrs:
+                nextnode = nbrs.pop()
+                if nextnode == startnode:
+                    yield path[:]
+                    closed.update(path)
+                elif nextnode not in blocked:
+                    path.append(nextnode)
+                    stack.append((nextnode, list(subG.neighbors(nextnode))))
+                    closed.discard(nextnode)
+                    blocked.add(nextnode)
+                    continue
+            # done with nextnode... look for more neighbors
+            if not nbrs:  # no more nbrs
+                if thisnode in closed:
+                    _unblock(thisnode, blocked, B)
+                else:
+                    for nbr in subG.neighbors(thisnode):
+                        if thisnode not in B[nbr]:
+                            B[nbr].add(thisnode)
+                stack.pop()
+                path.pop()
+        # done processing this node
+        subG.remove_node(startnode)
+        H = subG.subgraph(scc)  # make smaller to avoid work in SCC routine
+        sccs.extend(list(strongly_connected_components(H)))
+
+
+def find_cycle(graph):
+    '''
+    Looks for a cycle in the graph. If found, returns the first cycle.
+    If nodes a1, a2, ..., an are in a cycle, then this returns:
+        [(a1,a2), (a2,a3), ... (an-1,an), (an, a1)]
+    Otherwise returns an empty list.
+    '''
+    # Only the first cycle is needed, so stop the generator after one result
+    # instead of enumerating every simple cycle, which can take exponential
+    # time on large graphs.
+    nodes = next(simple_cycles(graph), None)
+    if not nodes:
+        return []
+    nodes.append(nodes[0])
+    return list(zip(nodes[:-1], nodes[1:]))
+
+
+def print_cycle(binary, graph, edges, thread_info, print_stack_trace_fn):
+    '''
+    Prints the cycle in the mutex graph in the following format:
+
+    Potential Deadlock Detected!
+
+    Cycle in lock order graph: M0 => M1 => M2 => M0
+
+    for (m, n) in cycle:
+        Mutex n acquired here while holding Mutex m in thread T:
+            [ stack trace ]
+
+        Mutex m previously acquired by thread T here:
+            [ stack trace ]
+
+    for T in all threads:
+        Thread T was created here:
+            [ stack trace ]
+    '''
+
+    # List of mutexes in the cycle, first and last repeated
+    nodes_in_order = []
+    # Map mutex address -> readable alias
+    node_addr_to_name = {}
+    for counter, (m, n) in enumerate(edges):
+        nodes_in_order.append(m)
+        # For global or static variables, try to symbolize the mutex address.
+        symbol = symbolize_with_objdump(binary, m)
+        if symbol:
+            symbol += ' '
+        node_addr_to_name[m] = 'Mutex M%d (%s0x%016x)' % (counter, symbol, m)
+    nodes_in_order.append(nodes_in_order[0])
+
+    print('----------------\nPotential Deadlock Detected!\n')
+    print(
+        'Cycle in lock order graph: %s\n' %
+        (' => '.join([node_addr_to_name[n] for n in nodes_in_order]))
+    )
+
+    # Set of threads involved in the lock inversion
+    thread_pids = set()
+
+    # For each edge in the cycle, print where the two mutexes were held
+    for (m, n) in edges:
+        thread_pid = graph.attributes(m, n)['thread_pid']
+        thread_comm = graph.attributes(m, n)['thread_comm']
+        first_mutex_stack_id = graph.attributes(m, n)['first_mutex_stack_id']
+        second_mutex_stack_id = graph.attributes(m, n)['second_mutex_stack_id']
+        thread_pids.add(thread_pid)
+        print(
+            '%s acquired here while holding %s in Thread %d (%s):' % (
+                node_addr_to_name[n], node_addr_to_name[m], thread_pid,
+                thread_comm
+            )
+        )
+        print_stack_trace_fn(second_mutex_stack_id)
+        print('')
+        print(
+            '%s previously acquired by the same Thread %d (%s) here:' %
+            (node_addr_to_name[m], thread_pid, thread_comm)
+        )
+        print_stack_trace_fn(first_mutex_stack_id)
+        print('')
+
+    # Print where the threads were created, if available
+    for thread_pid in thread_pids:
+        parent_pid, stack_id, parent_comm = thread_info.get(
+            thread_pid, (None, None, None)
+        )
+        if parent_pid:
+            print(
+                'Thread %d created by Thread %d (%s) here:' %
+                (thread_pid, parent_pid, parent_comm)
+            )
+            print_stack_trace_fn(stack_id)
+        else:
+            print(
+                'Could not find stack trace where Thread %d was created' %
+                thread_pid
+            )
+        print('')
+
+
+def symbolize_with_objdump(binary, addr):
+    '''
+    Searches the binary for the address using objdump. Returns the symbol if
+    it is found, otherwise returns empty string.
+    '''
+    try:
+        command = (
+            'objdump -tT %s | grep %x | awk {\'print $NF\'} | c++filt' %
+            (binary, addr)
+        )
+        output = subprocess.check_output(command, shell=True)
+        return output.decode('utf-8').strip()
+    except subprocess.CalledProcessError:
+        return ''
+
+
+def strlist(s):
+    '''Given a comma-separated string, returns a list of substrings'''
+    return s.strip().split(',')
+
+
+def main():
+    examples = '''Examples:
+    deadlock_detector 181        # Analyze PID 181
+
+    deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
+                                 # Analyze PID 181 and locks from this binary.
+                                 # If tracing a process that is running from
+                                 # a dynamically-linked binary, this argument
+                                 # is required and should be the path to the
+                                 # pthread library.
+
+    deadlock_detector 181 --verbose
+                                 # Analyze PID 181 and print statistics about
+                                 # the mutex wait graph.
+
+    deadlock_detector 181 --lock-symbols my_mutex_lock1,my_mutex_lock2 \\
+        --unlock-symbols my_mutex_unlock1,my_mutex_unlock2
+                                 # Analyze PID 181 and trace custom mutex
+                                 # symbols instead of pthread mutexes.
+
+    deadlock_detector 181 --dump-graph graph.json
+                                 # Analyze PID 181 and dump the mutex wait
+                                 # graph to graph.json.
+    '''
+    parser = argparse.ArgumentParser(
+        description=(
+            'Detect potential deadlocks (lock inversions) in a running binary.'
+            '\nMust be run as root.'
+        ),
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=examples,
+    )
+    parser.add_argument('pid', type=int, help='Pid to trace')
+    # Binaries with `:` in the path will fail to attach uprobes on kernels
+    # running without this patch: https://lkml.org/lkml/2017/1/13/585.
+    # Symlinks to the binary without `:` in the path can get around this issue.
+    parser.add_argument(
+        '--binary',
+        type=str,
+        default='',
+        help='If set, trace the mutexes from the binary at this path. '
+        'For statically-linked binaries, this argument is not required. '
+        'For dynamically-linked binaries, this argument is required and '
+        'should be the path of the pthread library the binary is using. '
+        'Example: /lib/x86_64-linux-gnu/libpthread.so.0',
+    )
+    parser.add_argument(
+        '--dump-graph',
+        type=str,
+        default='',
+        help='If set, this will dump the mutex graph to the specified file.',
+    )
+    parser.add_argument(
+        '--verbose',
+        action='store_true',
+        help='Print statistics about the mutex wait graph.',
+    )
+    parser.add_argument(
+        '--lock-symbols',
+        type=strlist,
+        default=['pthread_mutex_lock'],
+        help='Comma-separated list of lock symbols to trace. Default is '
+        'pthread_mutex_lock. These symbols cannot be inlined in the binary.',
+    )
+    parser.add_argument(
+        '--unlock-symbols',
+        type=strlist,
+        default=['pthread_mutex_unlock'],
+        help='Comma-separated list of unlock symbols to trace. Default is '
+        'pthread_mutex_unlock. These symbols cannot be inlined in the binary.',
+    )
+    args = parser.parse_args()
+    if not args.binary:
+        try:
+            args.binary = os.readlink('/proc/%d/exe' % args.pid)
+        except OSError as e:
+            print('%s. Is the process (pid=%d) running?' % (str(e), args.pid))
+            sys.exit(1)
+
+    bpf = BPF(src_file='deadlock_detector.c')
+
+    # Trace where threads are created
+    bpf.attach_kretprobe(
+        event='sys_clone', fn_name='trace_clone', pid=args.pid
+    )
+
+    # We must attach the unlock probes first. Otherwise, between attaching the
+    # probe on lock() and attaching the probe on unlock(), a thread can
+    # acquire and release mutexes whose release events are never traced,
+    # resulting in noisy reports.
+    for symbol in args.unlock_symbols:
+        try:
+            bpf.attach_uprobe(
+                name=args.binary,
+                sym=symbol,
+                fn_name='trace_mutex_release',
+                pid=args.pid,
+            )
+        except Exception as e:
+            print('%s. Failed to attach to symbol: %s' % (str(e), symbol))
+            sys.exit(1)
+    for symbol in args.lock_symbols:
+        try:
+            bpf.attach_uprobe(
+                name=args.binary,
+                sym=symbol,
+                fn_name='trace_mutex_acquire',
+                pid=args.pid,
+            )
+        except Exception as e:
+            print('%s. Failed to attach to symbol: %s' % (str(e), symbol))
+            sys.exit(1)
+
+    def print_stack_trace(stack_id):
+        '''Closure that prints the symbolized stack trace.'''
+        for addr in bpf.get_table('stack_traces').walk(stack_id):
+            line = bpf.sym(addr, args.pid)
+            # Try to symbolize with objdump if we cannot with bpf.
+            if line == '[unknown]':
+                symbol = symbolize_with_objdump(args.binary, addr)
+                if symbol:
+                    line = symbol
+            print('@ %016x %s' % (addr, line))
+
+    print('Tracing... Hit Ctrl-C to end.')
+    while True:
+        try:
+            # Map of child thread pid -> parent info
+            thread_info = {
+                child.value: (parent.parent_pid, parent.stack_id, parent.comm)
+                for child, parent in bpf.get_table('thread_to_parent').items()
+            }
+
+            # Mutex wait directed graph. Nodes are mutexes. Edge (A,B) exists
+            # if there exists some thread T where lock(A) was called and
+            # lock(B) was called before unlock(A) was called.
+            graph = DiGraph()
+            for key, leaf in bpf.get_table('edges').items():
+                graph.add_edge(
+                    key.mutex1,
+                    key.mutex2,
+                    thread_pid=leaf.thread_pid,
+                    thread_comm=leaf.comm.decode('utf-8'),
+                    first_mutex_stack_id=leaf.mutex1_stack_id,
+                    second_mutex_stack_id=leaf.mutex2_stack_id,
+                )
+            if args.verbose:
+                print(
+                    'Mutexes: %d, Edges: %d' %
+                    (len(graph.nodes()), len(graph.edges()))
+                )
+            if args.dump_graph:
+                with open(args.dump_graph, 'w') as f:
+                    data = graph.node_link_data()
+                    f.write(json.dumps(data, indent=2))
+
+            cycle = find_cycle(graph)
+            if cycle:
+                print_cycle(
+                    args.binary, graph, cycle, thread_info, print_stack_trace
+                )
+                sys.exit(1)
+
+            time.sleep(1)
+        except KeyboardInterrupt:
+            break
+
+
+if __name__ == '__main__':
+    main()
--- a/tools/deadlock_detector_example.txt
+++ b/tools/deadlock_detector_example.txt
+Demonstrations of deadlock_detector.
+
+This program detects potential deadlocks in a running process. The program
+attaches uprobes on `pthread_mutex_lock` and `pthread_mutex_unlock` to build
+a mutex wait directed graph, and then looks for a cycle in this graph. This
+graph has the following properties:
+
+- Nodes in the graph represent mutexes.
+- Edge (A, B) exists if there exists some thread T where lock(A) was called
+  and lock(B) was called before unlock(A) was called.
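The edge rule above can be illustrated with a short user-space Python sketch. It replays a hypothetical list of `(thread, op, mutex)` events; the real tool does this bookkeeping in BPF from uprobe events, and `build_wait_graph` is an illustrative name, not part of the tool. Skipping self edges is a choice made in this sketch, consistent with the recursive-mutex limitation described below.

```python
from collections import defaultdict


def build_wait_graph(events):
    """Return the set of (held, acquired) mutex edges for an event stream.

    `events` is an iterable of (thread_id, op, mutex) tuples where op is
    'lock' or 'unlock'. Edge (A, B) is added when a thread calls lock(B)
    while it still holds A.
    """
    held = defaultdict(list)  # thread id -> mutexes it currently holds
    edges = set()
    for thread, op, mutex in events:
        if op == 'lock':
            for prior in held[thread]:
                # Skip self edges so re-locking a held (e.g. recursive)
                # mutex is not reported as a one-mutex cycle.
                if prior != mutex:
                    edges.add((prior, mutex))
            held[thread].append(mutex)
        elif op == 'unlock' and mutex in held[thread]:
            held[thread].remove(mutex)
    return edges


# Hypothetical events: thread 1 locks global_mutex1, then global_mutex2.
events = [
    (1, 'lock', 'global_mutex1'),
    (1, 'lock', 'global_mutex2'),
    (1, 'unlock', 'global_mutex2'),
    (1, 'unlock', 'global_mutex1'),
]
print(build_wait_graph(events))  # {('global_mutex1', 'global_mutex2')}
```

Edges depend only on what each thread holds at the moment it locks, so two threads never have to run concurrently for an edge to be recorded.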
+
+A cycle in this graph indicates a lock order inversion (a potential deadlock).
+When the program finds one, it prints the cycle of mutexes, the stack traces
+where each mutex was acquired, and the stack traces where the threads involved
+were created, and then exits.
+
+This program can only find potential deadlocks that occur while the program
+is tracing the process. It cannot find deadlocks that may have occurred
+before the program was attached to the process.
+
+Because this program traces every mutex lock and unlock event and every thread
+creation event in the traced process, the overhead of its BPF program can be
+very high if the process has many threads and mutexes. Only run it on a process
+where the slowdown is acceptable.
+
+Note: This tool does not work for shared mutexes or recursive mutexes.
+
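+For shared (read-write) mutexes, a deadlock requires a cycle in the wait
+graph where at least one of the mutexes in the cycle is acquired with exclusive
+(write) ownership. The tool records only that a lock was acquired, not in which
+mode, so it cannot apply this rule.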
+
+For recursive mutexes, lock() is called multiple times on the same mutex.
+However, there is no way to determine if a mutex is a recursive mutex
+after the mutex has been created. As a result, this tool will not find
+potential deadlocks that involve only one mutex.
+
+
+# ./deadlock_detector.py 181
+Tracing... Hit Ctrl-C to end.
+----------------
+Potential Deadlock Detected!
+
+Cycle in lock order graph: Mutex M0 (main::static_mutex3 0x0000000000473c60) => Mutex M1 (0x00007fff6d738400) => Mutex M2 (global_mutex1 0x0000000000473be0) => Mutex M3 (global_mutex2 0x0000000000473c20) => Mutex M0 (main::static_mutex3 0x0000000000473c60)
+
+Mutex M1 (0x00007fff6d738400) acquired here while holding Mutex M0 (main::static_mutex3 0x0000000000473c60) in Thread 357250 (lockinversion):
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402e38 main::{lambda()#3}::operator()() const
+@ 0000000000406ba8 void std::_Bind_simple<main::{lambda()#3} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 0000000000406951 std::_Bind_simple<main::{lambda()#3} ()>::operator()()
+@ 000000000040673a std::thread::_Impl<std::_Bind_simple<main::{lambda()#3} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Mutex M0 (main::static_mutex3 0x0000000000473c60) previously acquired by the same Thread 357250 (lockinversion) here:
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402e22 main::{lambda()#3}::operator()() const
+@ 0000000000406ba8 void std::_Bind_simple<main::{lambda()#3} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 0000000000406951 std::_Bind_simple<main::{lambda()#3} ()>::operator()()
+@ 000000000040673a std::thread::_Impl<std::_Bind_simple<main::{lambda()#3} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Mutex M2 (global_mutex1 0x0000000000473be0) acquired here while holding Mutex M1 (0x00007fff6d738400) in Thread 357251 (lockinversion):
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402ea8 main::{lambda()#4}::operator()() const
+@ 0000000000406b46 void std::_Bind_simple<main::{lambda()#4} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 000000000040692d std::_Bind_simple<main::{lambda()#4} ()>::operator()()
+@ 000000000040671c std::thread::_Impl<std::_Bind_simple<main::{lambda()#4} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Mutex M1 (0x00007fff6d738400) previously acquired by the same Thread 357251 (lockinversion) here:
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402e97 main::{lambda()#4}::operator()() const
+@ 0000000000406b46 void std::_Bind_simple<main::{lambda()#4} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 000000000040692d std::_Bind_simple<main::{lambda()#4} ()>::operator()()
+@ 000000000040671c std::thread::_Impl<std::_Bind_simple<main::{lambda()#4} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Mutex M3 (global_mutex2 0x0000000000473c20) acquired here while holding Mutex M2 (global_mutex1 0x0000000000473be0) in Thread 357247 (lockinversion):
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402d5f main::{lambda()#1}::operator()() const
+@ 0000000000406c6c void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 0000000000406999 std::_Bind_simple<main::{lambda()#1} ()>::operator()()
+@ 0000000000406776 std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Mutex M2 (global_mutex1 0x0000000000473be0) previously acquired by the same Thread 357247 (lockinversion) here:
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402d4e main::{lambda()#1}::operator()() const
+@ 0000000000406c6c void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 0000000000406999 std::_Bind_simple<main::{lambda()#1} ()>::operator()()
+@ 0000000000406776 std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Mutex M0 (main::static_mutex3 0x0000000000473c60) acquired here while holding Mutex M3 (global_mutex2 0x0000000000473c20) in Thread 357248 (lockinversion):
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402dc9 main::{lambda()#2}::operator()() const
+@ 0000000000406c0a void std::_Bind_simple<main::{lambda()#2} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 0000000000406975 std::_Bind_simple<main::{lambda()#2} ()>::operator()()
+@ 0000000000406758 std::thread::_Impl<std::_Bind_simple<main::{lambda()#2} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Mutex M3 (global_mutex2 0x0000000000473c20) previously acquired by the same Thread 357248 (lockinversion) here:
+@ 00000000004024d0 pthread_mutex_lock
+@ 0000000000406dd0 std::mutex::lock()
+@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
+@ 0000000000402db8 main::{lambda()#2}::operator()() const
+@ 0000000000406c0a void std::_Bind_simple<main::{lambda()#2} ()>::_M_invoke<>(std::_Index_tuple<>)
+@ 0000000000406975 std::_Bind_simple<main::{lambda()#2} ()>::operator()()
+@ 0000000000406758 std::thread::_Impl<std::_Bind_simple<main::{lambda()#2} ()> >::_M_run()
+@ 00007fd4496564e1 execute_native_thread_routine
+@ 00007fd449dd57f1 start_thread
+@ 00007fd44909746d __clone
+
+Thread 357248 created by Thread 350692 (lockinversion) here:
+@ 00007fd449097431 __clone
+@ 00007fd449dd5ef5 pthread_create
+@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
+@ 00000000004033ac std::thread::thread<main::{lambda()#2}>(main::{lambda()#2}&&)
+@ 000000000040308f main
+@ 00007fd448faa0f6 __libc_start_main
+@ 0000000000402ad8 [unknown]
+
+Thread 357250 created by Thread 350692 (lockinversion) here:
+@ 00007fd449097431 __clone
+@ 00007fd449dd5ef5 pthread_create
+@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
+@ 00000000004034b2 std::thread::thread<main::{lambda()#3}>(main::{lambda()#3}&&)
+@ 00000000004030b9 main
+@ 00007fd448faa0f6 __libc_start_main
+@ 0000000000402ad8 [unknown]
+
+Thread 357251 created by Thread 350692 (lockinversion) here:
+@ 00007fd449097431 __clone
+@ 00007fd449dd5ef5 pthread_create
+@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
+@ 00000000004035b8 std::thread::thread<main::{lambda()#4}>(main::{lambda()#4}&&)
+@ 00000000004030e6 main
+@ 00007fd448faa0f6 __libc_start_main
+@ 0000000000402ad8 [unknown]
+
+Thread 357247 created by Thread 350692 (lockinversion) here:
+@ 00007fd449097431 __clone
+@ 00007fd449dd5ef5 pthread_create
+@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
+@ 00000000004032a6 std::thread::thread<main::{lambda()#1}>(main::{lambda()#1}&&)
+@ 0000000000403070 main
+@ 00007fd448faa0f6 __libc_start_main
+@ 0000000000402ad8 [unknown]
+
+This is output from a process that has a potential deadlock involving 4 mutexes
+and 4 threads:
+
+- Thread 357250 acquired M1 while holding M0 (edge M0 -> M1)
+- Thread 357251 acquired M2 while holding M1 (edge M1 -> M2)
+- Thread 357247 acquired M3 while holding M2 (edge M2 -> M3)
+- Thread 357248 acquired M0 while holding M3 (edge M3 -> M0)
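Given the four edges above, detecting the inversion amounts to finding a cycle in a directed graph. The sketch below uses a depth-first search with the usual three-color marking; `find_lock_cycle` is a hypothetical helper, and the tool's own `find_cycle` may be implemented differently.

```python
from collections import defaultdict


def find_lock_cycle(edges):
    """Return one cycle as a list of mutexes (first == last), or None."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS path, finished
    color = defaultdict(int)      # every node starts WHITE
    path = []                     # nodes on the current DFS path

    def dfs(node):
        color[node] = GRAY
        path.append(node)
        for nxt in adjacency[node]:
            if color[nxt] == GRAY:
                # Back edge: the path from nxt to node, plus this edge,
                # closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(adjacency):
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None


# The four edges from the example output.
edges = [('M0', 'M1'), ('M1', 'M2'), ('M2', 'M3'), ('M3', 'M0')]
print(find_lock_cycle(edges))  # ['M0', 'M1', 'M2', 'M3', 'M0']
```

The recursive search keeps the sketch short; for very deep graphs an iterative version avoids Python's recursion limit.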
+
+This is the C++ program that generated the output above:
+
+```c++
+#include <chrono>
+#include <iostream>
+#include <mutex>
+#include <thread>
+
+std::mutex global_mutex1;
+std::mutex global_mutex2;
+
+int main(void) {
+  static std::mutex static_mutex3;
+  std::mutex local_mutex4;
+
+  std::cout << "sleeping for a bit to allow trace to attach..." << std::endl;
+  std::this_thread::sleep_for(std::chrono::seconds(10));
+  std::cout << "starting program..." << std::endl;
+
+  auto t1 = std::thread([] {
+    std::lock_guard<std::mutex> g1(global_mutex1);
+    std::lock_guard<std::mutex> g2(global_mutex2);
+  });
+  t1.join();
+
+  auto t2 = std::thread([] {
+    std::lock_guard<std::mutex> g2(global_mutex2);
+    std::lock_guard<std::mutex> g3(static_mutex3);
+  });
+  t2.join();
+
+  auto t3 = std::thread([&local_mutex4] {
+    std::lock_guard<std::mutex> g3(static_mutex3);
+    std::lock_guard<std::mutex> g4(local_mutex4);
+  });
+  t3.join();
+
+  auto t4 = std::thread([&local_mutex4] {
+    std::lock_guard<std::mutex> g4(local_mutex4);
+    std::lock_guard<std::mutex> g1(global_mutex1);
+  });
+  t4.join();
+
+  std::cout << "sleeping to allow trace to collect data..." << std::endl;
+  std::this_thread::sleep_for(std::chrono::seconds(5));
+  std::cout << "done!" << std::endl;
+}
+```
+
+Note that no actual deadlock occurred: each thread is joined before the next
+one starts, so the threads never run concurrently. The lock ordering
+nevertheless makes a deadlock possible, and the report is a hint to the
+programmer to reconsider it. If the mutexes are global or static and debug
+symbols are available, the output includes the mutex symbol name. The output
+format is similar to that of ThreadSanitizer's deadlock detector
+(https://github.com/google/sanitizers/wiki/ThreadSanitizerDeadlockDetector).
+
+
+# ./deadlock_detector.py 181 --binary /usr/local/bin/lockinversion
+
+Tracing... Hit Ctrl-C to end.
+^C
+
+If the traced process was started from a statically-linked executable, the
+`--binary` argument is optional: when it is omitted, the program reads the
+executable path from /proc/PID/exe. However, on older kernels without the patch
+"uprobe: Find last occurrence of ':' when parsing uprobe PATH:OFFSET"
+(https://lkml.org/lkml/2017/1/13/585), uprobes cannot be attached to binaries
+whose path contains `:`. As a workaround, create a symlink to the binary and
+pass the symlink path to the `--binary` option.
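The symlink workaround can be scripted. The sketch below is a hypothetical helper, not part of the tool; it assumes the temporary directory path itself contains no `:`, and it relies on the kernel resolving the symlink to the real binary when the uprobe is attached, as the workaround above implies.

```python
import os
import tempfile


def make_colon_free_symlink(binary_path):
    """Return a new symlink to `binary_path` whose own path has no ':'.

    The link is created in a fresh directory from tempfile.mkdtemp().
    ASSUMPTION: the temp directory path contains no ':'.
    Delete the link and its directory when you are done tracing.
    """
    link_dir = tempfile.mkdtemp(prefix='deadlock_detector_')
    # Keep the original file name but replace ':' so the kernel's
    # PATH:OFFSET parsing is not confused.
    link_name = os.path.basename(binary_path).replace(':', '_')
    link_path = os.path.join(link_dir, link_name)
    os.symlink(os.path.abspath(binary_path), link_path)
    return link_path
```

For a hypothetical binary at `/opt/app:v2/bin/server`, the returned path (ending in `server`) is what you would pass as `--binary`.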
+
+
+# ./deadlock_detector.py 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
+
+Tracing... Hit Ctrl-C to end.
+^C
+
+If the traced process was started from a dynamically-linked executable, the
+`--binary` argument is required and must be the path to the pthread shared
+library that the executable uses.
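One way to find that path is to check which libpthread file the process has mapped, as listed in `/proc/PID/maps`. The sketch below is a hypothetical helper (not part of the tool) and assumes the library's file name contains `libpthread`. It returns the resolved file (for example a versioned `libpthread-*.so`), which should refer to the same library as the `libpthread.so.0` symlink.

```python
import os


def find_pthread_library(pid, maps_text=None):
    """Return the path of a mapped libpthread file for `pid`, or None.

    ASSUMPTION: the library's file name contains 'libpthread'.
    `maps_text` lets callers pass /proc/PID/maps contents directly.
    """
    if maps_text is None:
        with open('/proc/%d/maps' % pid) as f:
            maps_text = f.read()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and 'libpthread' in os.path.basename(fields[5]):
            return fields[5]
    return None
```

For example, `find_pthread_library(181)` inspects the process traced above; `None` means no mapping matched the assumed name.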
+
+
+# ./deadlock_detector.py 181 --dump-graph graph.json --verbose
+
+Tracing... Hit Ctrl-C to end.
+Mutexes: 0, Edges: 0
+Mutexes: 532, Edges: 411
+Mutexes: 735, Edges: 675
+Mutexes: 1118, Edges: 1278
+Mutexes: 1666, Edges: 2185
+Mutexes: 2056, Edges: 2694
+Mutexes: 2245, Edges: 2906
+Mutexes: 2656, Edges: 3479
+Mutexes: 2813, Edges: 3785
+^C
+
+If the program does not find a deadlock, it keeps running until you hit
+Ctrl-C. With the `--verbose` flag, it also prints the number of mutexes and
+edges in the mutex wait graph on every polling pass (roughly once per second).
+To save the graph for later analysis, pass `--dump-graph FILE`; the program
+serializes the graph as JSON and rewrites the file on each pass.
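The exact layout of the dump depends on the `node_link_data()` method of the tool's `DiGraph` class, which is not shown here. The sketch below assumes a node-link layout: a `nodes` list of `{"id": ...}` objects and a `links` list whose `source`/`target` fields index into `nodes`, with the remaining keys holding the edge attributes passed to `add_edge` (such as `thread_pid` and `thread_comm`). Check this against a real dump before relying on it; both function names are hypothetical.

```python
import json


def parse_mutex_graph(data):
    """Turn a node-link dict into (mutex ids, [(src, dst, attrs), ...]).

    ASSUMPTION: each link's 'source' and 'target' are indices into
    data['nodes'].
    """
    ids = [node['id'] for node in data['nodes']]
    edges = []
    for link in data['links']:
        # Everything except the endpoints is treated as an edge attribute.
        attrs = {k: v for k, v in link.items() if k not in ('source', 'target')}
        edges.append((ids[link['source']], ids[link['target']], attrs))
    return ids, edges


def load_mutex_graph(path):
    """Read a file written by --dump-graph and parse it."""
    with open(path) as f:
        return parse_mutex_graph(json.load(f))
```

For example, `load_mutex_graph('graph.json')` on the dump from the run above would return the mutex addresses and one tuple per edge, ready for offline cycle searches or per-thread counts.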
+
+
+# ./deadlock_detector.py 181 --lock-symbols custom_mutex1_lock,custom_mutex2_lock --unlock-symbols custom_mutex1_unlock,custom_mutex2_unlock --verbose
+
+Tracing... Hit Ctrl-C to end.
+Mutexes: 0, Edges: 0
+Mutexes: 532, Edges: 411
+Mutexes: 735, Edges: 675
+Mutexes: 1118, Edges: 1278
+Mutexes: 1666, Edges: 2185
+Mutexes: 2056, Edges: 2694
+Mutexes: 2245, Edges: 2906
+Mutexes: 2656, Edges: 3479
+Mutexes: 2813, Edges: 3785
+^C
+
+If your program uses custom mutexes instead of pthread mutexes, use the
+`--lock-symbols` and `--unlock-symbols` flags to specify which symbols to
+trace. Each flag takes a comma-separated list of symbol names. If any of these
+functions are inlined in the binary, the uprobes will not fire at the inlined
+call sites, so events are missed and the program can report false positives.
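The comma-separated flags map naturally onto an argparse `type` callable. The sketch below uses a hypothetical `comma_list` as a stand-in for the tool's `strlist` (whose body is not shown here); unlike a plain `split(',')`, it also trims whitespace and drops empty entries.

```python
import argparse


def comma_list(value):
    """argparse type: split 'a,b,c' into ['a', 'b', 'c'], dropping blanks."""
    return [item.strip() for item in value.split(',') if item.strip()]


parser = argparse.ArgumentParser()
parser.add_argument('--lock-symbols', type=comma_list,
                    default=['pthread_mutex_lock'])
parser.add_argument('--unlock-symbols', type=comma_list,
                    default=['pthread_mutex_unlock'])

args = parser.parse_args(['--lock-symbols', 'custom_mutex1_lock,custom_mutex2_lock'])
print(args.lock_symbols)    # ['custom_mutex1_lock', 'custom_mutex2_lock']
print(args.unlock_symbols)  # ['pthread_mutex_unlock']
```

Because the defaults are already lists rather than strings, argparse does not pass them through `comma_list`, so an omitted flag keeps the pthread symbol.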
+
+
+USAGE message:
+
+# ./deadlock_detector.py -h
+
+usage: deadlock_detector.py [-h] [--binary BINARY] [--dump-graph DUMP_GRAPH]
+                            [--verbose] [--lock-symbols LOCK_SYMBOLS]
+                            [--unlock-symbols UNLOCK_SYMBOLS]
+                            pid
+
+Detect potential deadlocks (lock inversions) in a running binary.
+Must be run as root.
+
+positional arguments:
+  pid                   Pid to trace
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --binary BINARY       If set, trace the mutexes from the binary at this
+                        path. For statically-linked binaries, this argument is
+                        not required. For dynamically-linked binaries, this
+                        argument is required and should be the path of the
+                        pthread library the binary is using. Example:
+                        /lib/x86_64-linux-gnu/libpthread.so.0
+  --dump-graph DUMP_GRAPH
+                        If set, this will dump the mutex graph to the
+                        specified file.
+  --verbose             Print statistics about the mutex wait graph.
+  --lock-symbols LOCK_SYMBOLS
+                        Comma-separated list of lock symbols to trace. Default
+                        is pthread_mutex_lock. These symbols cannot be inlined
+                        in the binary.
+  --unlock-symbols UNLOCK_SYMBOLS
+                        Comma-separated list of unlock symbols to trace.
+                        Default is pthread_mutex_unlock. These symbols cannot
+                        be inlined in the binary.
+
+Examples:
+    deadlock_detector 181        # Analyze PID 181
+
+    deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
+                                 # Analyze PID 181 and locks from this binary.
+                                 # If tracing a process that is running from
+                                 # a dynamically-linked binary, this argument
+                                 # is required and should be the path to the
+                                 # pthread library.
+
+    deadlock_detector 181 --verbose
+                                 # Analyze PID 181 and print statistics about
+                                 # the mutex wait graph.
+
+    deadlock_detector 181 --lock-symbols my_mutex_lock1,my_mutex_lock2 \
+        --unlock-symbols my_mutex_unlock1,my_mutex_unlock2
+                                 # Analyze PID 181 and trace custom mutex
+                                 # symbols instead of pthread mutexes.
+
+    deadlock_detector 181 --dump-graph graph.json
+                                 # Analyze PID 181 and dump the mutex wait
+                                 # graph to graph.json.