30 Aug, 2022 (14 commits)
• perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets · ecdfb889
      Marco Elver authored
      We can still see that a majority of the time is spent hashing task pointers:
      
          ...
          16.98%  [kernel]       [k] rhashtable_jhash2
          ...
      
Doing the bookkeeping in toggle_bp_slot() is currently O(#cpus),
      calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
      CPU-independent. The reason for this is to update the per-CPU
      'tsk_pinned' histogram.
      
      To optimize the CPU-independent case to O(1), keep a separate
      CPU-independent 'tsk_pinned_all' histogram.
      
The major source of complexity is transitions between "all
      CPU-independent task breakpoints" and "mixed CPU-independent and
      CPU-dependent task breakpoints". The code comments list all cases that
      require handling.
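
For illustration, here is a minimal sketch of the CPU-independent fast path
(assuming the 'tsk_pinned_all' histogram above and helpers such as
task_bp_pinned(), get_bp_info() and bp_slots_histogram_add() from this
series; the transition handling mentioned above is omitted):

    /*
     * Sketch only: account a CPU-independent task breakpoint once in the
     * shared 'tsk_pinned_all' histogram instead of once per CPU;
     * CPU-dependent task breakpoints still update the per-CPU histogram.
     */
    static void toggle_task_bp_slot(struct perf_event *bp, bool enable,
                                    enum bp_type_idx type, int weight)
    {
            int old = task_bp_pinned(bp->cpu, bp, type);

            if (!enable)
                    weight = -weight;

            if (bp->cpu < 0)        /* CPU-independent task target: O(1) */
                    bp_slots_histogram_add(&tsk_pinned_all[type], old, weight);
            else                    /* CPU-dependent: only that CPU's histogram */
                    bp_slots_histogram_add(&get_bp_info(bp->cpu, type)->tsk_pinned,
                                           old, weight);
    }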
      
      After this optimization:
      
       | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
       |      Total time: 1.758 [sec]
       |
       |       34.336621 usecs/op
       |     4395.087500 usecs/op/cpu
      
          38.08%  [kernel]       [k] queued_spin_lock_slowpath
          10.81%  [kernel]       [k] smp_cfm_core_cond
           3.01%  [kernel]       [k] update_sg_lb_stats
           2.58%  [kernel]       [k] osq_lock
           2.57%  [kernel]       [k] llist_reverse_order
           1.45%  [kernel]       [k] find_next_bit
           1.21%  [kernel]       [k] flush_tlb_func_common
           1.01%  [kernel]       [k] arch_install_hw_breakpoint
      
This shows that the time spent hashing keys has become insignificant.
      
      With the given benchmark parameters, that's an improvement of 12%
      compared with the old O(#cpus) version.
      
      And finally, using the less aggressive parameters from the preceding
      changes, we now observe:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.067 [sec]
       |
       |       35.292187 usecs/op
       |     2258.700000 usecs/op/cpu
      
This is an improvement of 12% compared to without the histogram
      optimizations (baseline is 40 usecs/op). This is now on par with the
      theoretical ideal (constraints disabled), and only 12% slower than no
      breakpoints at all.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-15-elver@google.com
      ecdfb889
• perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets · 9b1933b8
      Marco Elver authored
Running the perf benchmark with (note: more aggressive parameters vs.
the preceding changes, but the same 256-CPU host):
      
       | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
       |      Total time: 1.989 [sec]
       |
       |       38.854160 usecs/op
       |     4973.332500 usecs/op/cpu
      
          20.43%  [kernel]       [k] queued_spin_lock_slowpath
          18.75%  [kernel]       [k] osq_lock
          16.98%  [kernel]       [k] rhashtable_jhash2
           8.34%  [kernel]       [k] task_bp_pinned
           4.23%  [kernel]       [k] smp_cfm_core_cond
           3.65%  [kernel]       [k] bcmp
           2.83%  [kernel]       [k] toggle_bp_slot
           1.87%  [kernel]       [k] find_next_bit
           1.49%  [kernel]       [k] __reserve_bp_slot
      
      We can see that a majority of the time is now spent hashing task
      pointers to index into task_bps_ht in task_bp_pinned().
      
Obtaining max_bp_pinned_slots() for CPU-independent task targets is
currently O(#cpus): it calls task_bp_pinned() for each CPU, even if the
result of task_bp_pinned() is CPU-independent.
      
The loop in max_bp_pinned_slots() computes the maximum number of pinned
slots across all CPUs. If task_bp_pinned() is CPU-independent, we can
instead take the maximum number of CPU-pinned slots across all CPUs and
add task_bp_pinned() once.
      
      To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
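
A hedged sketch of the resulting O(1) fast path in max_bp_pinned_slots()
(cpu_pinned[] is the new histogram of CPU-pinned slots; the helper names
approximate those used in this series):

    if (bp->hw.target && bp->cpu < 0) {
            int task_pinned = task_bp_pinned(-1 /* any CPU */, bp, type);

            if (task_pinned >= 0) {
                    /*
                     * task_bp_pinned() is CPU-independent here and returns the
                     * same value for every CPU, so the maximum over all CPUs is
                     * the worst-case CPU-pinned count plus the task's count.
                     */
                    return bp_slots_histogram_max(&cpu_pinned[type], type) +
                           task_pinned;
            }
    }
    /* ... otherwise fall back to the O(#cpus) per-CPU loop ... */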
      
      After this optimization:
      
       | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
       |      Total time: 1.930 [sec]
       |
       |       37.697832 usecs/op
       |     4825.322500 usecs/op/cpu
      
          19.13%  [kernel]       [k] queued_spin_lock_slowpath
          18.21%  [kernel]       [k] rhashtable_jhash2
          15.46%  [kernel]       [k] osq_lock
           6.27%  [kernel]       [k] toggle_bp_slot
           5.91%  [kernel]       [k] task_bp_pinned
           5.05%  [kernel]       [k] smp_cfm_core_cond
           1.78%  [kernel]       [k] update_sg_lb_stats
           1.36%  [kernel]       [k] llist_reverse_order
           1.34%  [kernel]       [k] find_next_bit
           1.19%  [kernel]       [k] bcmp
      
This suggests that the time spent in task_bp_pinned() has been reduced.
      However, we're still hashing too much, which will be addressed in the
      subsequent change.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com
      9b1933b8
• perf/hw_breakpoint: Introduce bp_slots_histogram · 16db2839
      Marco Elver authored
      Factor out the existing `atomic_t count[N]` into its own struct called
      'bp_slots_histogram', to generalize and make its intent clearer in
preparation for reusing it elsewhere. The basic idea of bucketing "total
      uses of N slots" resembles a histogram, so calling it such seems most
      intuitive.
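
A minimal sketch of the factored-out type and its core update helper
(simplified; the real struct also special-cases a constant number of slots):

    struct bp_slots_histogram {
            atomic_t *count;        /* count[i]: #targets using exactly i+1 slots */
    };

    static void bp_slots_histogram_add(struct bp_slots_histogram *hist,
                                       int old, int val)
    {
            /* Move one target from the "old" bucket to the "old + val" bucket. */
            if (old > 0)
                    atomic_dec(&hist->count[old - 1]);
            if (old + val > 0)
                    atomic_inc(&hist->count[old + val - 1]);
    }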
      
      No functional change.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-13-elver@google.com
      16db2839
• perf/hw_breakpoint: Reduce contention with large number of tasks · 0912037f
      Marco Elver authored
      While optimizing task_bp_pinned()'s runtime complexity to O(1) on
      average helps reduce time spent in the critical section, we still suffer
      due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows
      that now contention is the biggest issue:
      
          95.93%  [kernel]       [k] osq_lock
           0.70%  [kernel]       [k] mutex_spin_on_owner
           0.22%  [kernel]       [k] smp_cfm_core_cond
           0.18%  [kernel]       [k] task_bp_pinned
           0.18%  [kernel]       [k] rhashtable_jhash2
           0.15%  [kernel]       [k] queued_spin_lock_slowpath
      
when running the breakpoint benchmark (on a system with 256 CPUs) with:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.207 [sec]
       |
       |      108.267188 usecs/op
       |     6929.100000 usecs/op/cpu
      
      The main concern for synchronizing the breakpoint constraints data is
      that a consistent snapshot of the per-CPU and per-task data is observed.
      
      The access pattern is as follows:
      
       1. If the target is a task: the task's pinned breakpoints are counted,
          checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
          is used to check for conflicts with CPU-only breakpoints;
          bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
          unused.
      
       2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along
          with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
          incremented. No per-task breakpoints are checked.
      
      Since rhltable safely synchronizes insertions/deletions, we can allow
      concurrency as follows:
      
       1. If the target is a task: independent tasks may update and check the
          constraints concurrently, but same-task target calls need to be
          serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
          checked, these modifications can happen concurrently by switching
          tsk_pinned to atomic_t.
      
       2. If the target is a CPU: access to the per-CPU constraints needs to
          be serialized with other CPU-target and task-target callers (to
          stabilize the bp_cpuinfo::tsk_pinned snapshot).
      
We can allow the above concurrency by introducing a reader-writer lock
for the per-CPU constraints data (bp_cpuinfo_sem) and per-task mutexes
(reusing task_struct::perf_event_mutex), as sketched after the list
below:
      
        1. If the target is a task: acquires perf_event_mutex, and acquires
           bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes
           contention in the presence of many read-lock but few write-lock
           acquisitions: we assume many orders of magnitude more task target
           breakpoints creations/destructions than CPU target breakpoints.
      
        2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.
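
A hedged sketch of the resulting lock/unlock helpers (names follow the
description above; details of the real code may differ):

    static DEFINE_STATIC_PERCPU_RWSEM(bp_cpuinfo_sem);

    static struct mutex *bp_constraints_lock(struct perf_event *bp)
    {
            struct mutex *tsk_mtx = bp->hw.target ?
                                    &bp->hw.target->perf_event_mutex : NULL;

            if (tsk_mtx) {
                    /* Task target: serialize same-task callers, allow others. */
                    mutex_lock(tsk_mtx);
                    percpu_down_read(&bp_cpuinfo_sem);
            } else {
                    /* CPU target: exclude all other updaters. */
                    percpu_down_write(&bp_cpuinfo_sem);
            }
            return tsk_mtx;
    }

    static void bp_constraints_unlock(struct mutex *tsk_mtx)
    {
            if (tsk_mtx) {
                    percpu_up_read(&bp_cpuinfo_sem);
                    mutex_unlock(tsk_mtx);
            } else {
                    percpu_up_write(&bp_cpuinfo_sem);
            }
    }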
      
      With these changes, contention with thousands of tasks is reduced to the
      point where waiting on locking no longer dominates the profile:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.077 [sec]
       |
       |       40.201563 usecs/op
       |     2572.900000 usecs/op/cpu
      
          21.54%  [kernel]       [k] task_bp_pinned
          20.18%  [kernel]       [k] rhashtable_jhash2
           6.81%  [kernel]       [k] toggle_bp_slot
           5.47%  [kernel]       [k] queued_spin_lock_slowpath
           3.75%  [kernel]       [k] smp_cfm_core_cond
           3.48%  [kernel]       [k] bcmp
      
      On this particular setup that's a speedup of 2.7x.
      
      We're also getting closer to the theoretical ideal performance through
      optimizations in hw_breakpoint.c -- constraints accounting disabled:
      
       | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.067 [sec]
       |
       |       35.286458 usecs/op
       |     2258.333333 usecs/op/cpu
      
This means the current implementation is ~12% slower than the
      theoretical ideal.
      
      For reference, performance without any breakpoints:
      
| $> perf bench -r 30 breakpoint thread -b 0 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 0 breakpoints and 64 parallelism
       |      Total time: 0.060 [sec]
       |
       |       31.365625 usecs/op
       |     2007.400000 usecs/op/cpu
      
      On a system with 256 CPUs, the theoretical ideal is only ~12% slower
      than no breakpoints at all; the current implementation is ~28% slower.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com
      0912037f
• locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked() · 01fe8a3f
      Marco Elver authored
      Implement simple accessors to probe percpu-rwsem's locked state:
      percpu_is_write_locked(), percpu_is_read_locked().
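
A hedged usage sketch, building on the bp_cpuinfo_sem scheme described in
the entry above: an assertion helper that checks the constraints data is
locked one way or the other (the helper itself is illustrative):

    static void assert_bp_constraints_lock_held(struct perf_event *bp)
    {
            struct mutex *tsk_mtx = bp->hw.target ?
                                    &bp->hw.target->perf_event_mutex : NULL;

            if (tsk_mtx)
                    lockdep_assert_held(tsk_mtx);
            /* Either read-locked (task target) or write-locked (CPU target). */
            WARN_ON_ONCE(!percpu_is_write_locked(&bp_cpuinfo_sem) &&
                         !percpu_is_read_locked(&bp_cpuinfo_sem));
    }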
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-11-elver@google.com
      01fe8a3f
• powerpc/hw_breakpoint: Avoid relying on caller synchronization · f95e5a3d
      Marco Elver authored
      Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint
      implementation have relied on nr_bp_mutex serializing access to them.
      
      Before overhauling synchronization of kernel/events/hw_breakpoint.c,
introduce two spinlocks to synchronize cpu_bps and task_bps respectively,
thus removing the reliance on callers to serialize access to powerpc's
hw_breakpoint state.
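
An illustrative sketch of the pattern (the lock names follow the commit
description; the helper and its internals are hypothetical):

    static DEFINE_SPINLOCK(cpu_bps_lock);  /* protects cpu_bps */
    static DEFINE_SPINLOCK(task_bps_lock); /* protects task_bps */

    /* Hypothetical helper: updates to task_bps no longer rely on nr_bp_mutex. */
    static void task_bps_insert(struct breakpoint *tmp)
    {
            spin_lock(&task_bps_lock);
            list_add(&tmp->list, &task_bps);
            spin_unlock(&task_bps_lock);
    }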
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-10-elver@google.com
      f95e5a3d
• perf/hw_breakpoint: Remove useless code related to flexible breakpoints · 24198ad3
      Marco Elver authored
      Flexible breakpoints have never been implemented, with
bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4
bytes in each bp_cpuinfo and bp_busy_slots, and the max flexible count is
still computed in fetch_bp_busy_slots().

This also results in suboptimal code generation, since we always know
that `!!slots.flexible` will be 0.
      
      Just get rid of the flexible "placeholder" and remove all real code
related to it. Make a note in the comment describing the constraints
algorithm, but keep flexible breakpoints in the algorithm description, so
that if they ever need supporting, it should be trivial to revive them
(along with reverting this change).
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-9-elver@google.com
      24198ad3
• perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable · 9caf87be
      Marco Elver authored
Because hw_breakpoint_weight() is a __weak function, the compiler always
emits a call to it. This generates unnecessarily bad code (register
spills etc.) for no good reason; in fact it shows up in profiles of
`perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
      
          ...
          0.70%  [kernel]       [k] hw_breakpoint_weight
          ...
      
While this is only a small percentage, no architecture defines its own
hw_breakpoint_weight(), nor are there users outside hw_breakpoint.c,
which makes its current __weak definition a poor choice.
      
      Change hw_breakpoint_weight()'s definition to follow a similar protocol
      to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
      hw_breakpoint_weight(), we'll use it instead.
      
      The result is that it is inlined and no longer shows up in profiles.
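
The resulting pattern in kernel/events/hw_breakpoint.c looks roughly like
this (a sketch; the generic fallback keeps returning a weight of 1):

    /* Use the architecture's definition from <asm/hw_breakpoint.h> if present. */
    #ifndef hw_breakpoint_weight
    static inline int hw_breakpoint_weight(struct perf_event *bp)
    {
            return 1;
    }
    #endif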
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-8-elver@google.com
      9caf87be
• perf/hw_breakpoint: Optimize constant number of breakpoint slots · be3f1525
      Marco Elver authored
Optimize the internal hw_breakpoint state if the architecture's number
of breakpoint slots is constant. This avoids several kmalloc() calls,
along with the otherwise possible (and unnecessary) allocation failures,
and subtly improves code generation and cache locality.
      
      The protocol is that if an architecture defines hw_breakpoint_slots via
      the preprocessor, it must be constant and the same for all types.
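
A sketch of what that protocol enables (helper names are approximations;
TYPE_INST/TYPE_DATA/TYPE_MAX are the existing breakpoint type indices):

    #ifdef hw_breakpoint_slots
    /* Constant and identical for all types: no boot-time query, no kmalloc(). */
    static_assert(hw_breakpoint_slots(TYPE_INST) == hw_breakpoint_slots(TYPE_DATA));
    static inline int hw_breakpoint_slots_cached(int type)
    {
            return hw_breakpoint_slots(type);
    }
    #else
    /* Otherwise query once at init and cache the per-type slot counts. */
    static int __nr_bp_slots[TYPE_MAX];
    static inline int hw_breakpoint_slots_cached(int type)
    {
            return __nr_bp_slots[type];
    }
    #endif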
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-7-elver@google.com
      be3f1525
• perf/hw_breakpoint: Mark data __ro_after_init · db5f6f85
      Marco Elver authored
      Mark read-only data after initialization as __ro_after_init.
      
      While we are here, turn 'constraints_initialized' into a bool.
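
For example (names taken from the existing code):

    /* Set once during init_hw_breakpoint(), read-only afterwards. */
    static bool constraints_initialized __ro_after_init;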
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-6-elver@google.com
      db5f6f85
• perf/hw_breakpoint: Optimize list of per-task breakpoints · 0370dc31
      Marco Elver authored
      On a machine with 256 CPUs, running the recently added perf breakpoint
      benchmark results in:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 236.418 [sec]
       |
       |   123134.794271 usecs/op
       |  7880626.833333 usecs/op/cpu
      
      The benchmark tests inherited breakpoint perf events across many
      threads.
      
      Looking at a perf profile, we can see that the majority of the time is
      spent in various hw_breakpoint.c functions, which execute within the
'nr_bp_mutex' critical sections, which in turn results in contention on
the mutex itself:
      
          37.27%  [kernel]       [k] osq_lock
          34.92%  [kernel]       [k] mutex_spin_on_owner
          12.15%  [kernel]       [k] toggle_bp_slot
          11.90%  [kernel]       [k] __reserve_bp_slot
      
      The culprit here is task_bp_pinned(), which has a runtime complexity of
      O(#tasks) due to storing all task breakpoints in the same list and
      iterating through that list looking for a matching task. Clearly, this
      does not scale to thousands of tasks.
      
      Instead, make use of the "rhashtable" variant "rhltable" which stores
      multiple items with the same key in a list. This results in average
      runtime complexity of O(1) for task_bp_pinned().
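
A sketch of the rhltable-based lookup (close to, but not necessarily
verbatim, the resulting code; it assumes an rhlist_head 'bp_list' in
struct hw_perf_event, keyed by the target task pointer):

    static struct rhltable task_bps_ht;
    static const struct rhashtable_params task_bps_ht_params = {
            .head_offset = offsetof(struct hw_perf_event, bp_list),
            .key_offset = offsetof(struct hw_perf_event, target),
            .key_len = sizeof_field(struct hw_perf_event, target),
            .automatic_shrinking = true,
    };

    /* Count pinned breakpoint slots of @bp's target task on @cpu. */
    static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
    {
            struct rhlist_head *head, *pos;
            struct perf_event *iter;
            int count = 0;

            rcu_read_lock();
            head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
            if (!head)
                    goto out;

            rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
                    if (find_slot_idx(iter->attr.bp_type) == type &&
                        (iter->cpu < 0 || cpu == iter->cpu))
                            count += hw_breakpoint_weight(iter);
            }
    out:
            rcu_read_unlock();
            return count;
    }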
      
      With the optimization, the benchmark shows:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.208 [sec]
       |
       |      108.422396 usecs/op
       |     6939.033333 usecs/op/cpu
      
      On this particular setup that's a speedup of ~1135x.
      
While one option would be to embed a breakpoint list node directly in
task_struct, this would only further bloat task_struct for infrequently
used data. Furthermore, after all optimizations in this series, there's
no evidence it would result in better performance: later optimizations
make the time spent looking up entries in the hash table negligible
(we'll reach the theoretical ideal performance, i.e. as if there were no
constraints).
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
      0370dc31
• perf/hw_breakpoint: Clean up headers · 089cdcb0
      Marco Elver authored
      Clean up headers:
      
       - Remove unused <linux/kallsyms.h>
      
       - Remove unused <linux/kprobes.h>
      
       - Remove unused <linux/module.h>
      
       - Remove unused <linux/smp.h>
      
       - Add <linux/export.h> for EXPORT_SYMBOL_GPL().
      
       - Add <linux/mutex.h> for mutex.
      
       - Sort alphabetically.
      
- Move <linux/hw_breakpoint.h> to the top to verify that it compiles on its own.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-4-elver@google.com
      089cdcb0
• perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test · c5b81449
      Marco Elver authored
      Provide hw_breakpoint_is_used() to check if breakpoints are in use on
      the system.
      
      Use it in the KUnit test to verify the global state before and after a
      test case.
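
A hedged sketch of that usage pattern as KUnit suite hooks (suite and hook
names here are illustrative, not the test's actual names):

    #include <kunit/test.h>
    #include <linux/hw_breakpoint.h>

    static int test_init(struct kunit *test)
    {
            /* No stale breakpoints may be in use before the test case runs. */
            KUNIT_ASSERT_FALSE(test, hw_breakpoint_is_used());
            return 0;
    }

    static void test_exit(struct kunit *test)
    {
            /* Every breakpoint registered by the test case must be gone again. */
            KUNIT_EXPECT_FALSE(test, hw_breakpoint_is_used());
    }

    static struct kunit_suite hw_breakpoint_test_suite = {
            .name = "hw_breakpoint_example",
            .init = test_init,
            .exit = test_exit,
            /* .test_cases would list the constraint-accounting test cases. */
    };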
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-3-elver@google.com
      c5b81449
• perf/hw_breakpoint: Add KUnit test for constraints accounting · 724c299c
      Marco Elver authored
Add a KUnit test for hw_breakpoint constraints accounting, with various
interesting mixes of breakpoint targets (some care was taken to catch
corner cases via bug injection).
      
      The test cannot be built as a module because it requires access to
      hw_breakpoint_slots(), which is not inlinable or exported on all
      architectures.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-2-elver@google.com
      724c299c