1. 25 Jun, 2024 27 commits
  2. 24 Jun, 2024 13 commits
    • Jakub Kicinski's avatar
      Merge branch 'locking-introduce-nested-bh-locking' · bf2468f9
      Jakub Kicinski authored
      Sebastian Andrzej Siewior says:
      
      ====================
      locking: Introduce nested-BH locking.
      
      Disabling bottoms halves acts as per-CPU BKL. On PREEMPT_RT code within
      local_bh_disable() section remains preemtible. As a result high prior
      tasks (or threaded interrupts) will be blocked by lower-prio task (or
      threaded interrupts) which are long running which includes softirq
      sections.
      
      The proposed way out is to introduce explicit per-CPU locks for
      resources which are protected by local_bh_disable() and use those only
      on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
      
      The series introduces the infrastructure and converts large parts of
      networking which is largest stake holder here. Once this done the
      per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
      
      Performance testing. Baseline is net-next as of commit 93bda330
      ("Merge branch'net-constify-ctl_table-arguments-of-utility-functions'")
      plus v6.10-rc1. A 10GiG link is used between two hosts. The command
         xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e
      
      was invoked on the receiving side with a ixgbe. The sending side uses
      pktgen_sample03_burst_single_flow.sh on i40e.
      
      Baseline:
      | eth1->?                 9,018,604 rx/s                  0 err,drop/s
      |   receive total         9,018,604 pkt/s                 0 drop/s                0 error/s
      |     cpu:7               9,018,604 pkt/s                 0 drop/s                0 error/s
      |   enqueue to cpu 3      9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
      |     cpu:7->3            9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
      |   kthread total         9,018,606 pkt/s                 0 drop/s          214,698 sched
      |     cpu:3               9,018,606 pkt/s                 0 drop/s          214,698 sched
      |     xdp_stats                   0 pass/s        9,018,606 drop/s                0 redir/s
      |       cpu:3                     0 pass/s        9,018,606 drop/s                0 redir/s
      |   redirect_err                  0 error/s
      |   xdp_exception                 0 hit/s
      
      perf top --sort cpu,symbol --no-children:
      |   18.14%  007  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
      |   13.29%  007  [k] ixgbe_poll
      |   12.66%  003  [k] cpu_map_kthread_run
      |    7.23%  003  [k] page_frag_free
      |    6.76%  007  [k] xdp_do_redirect
      |    3.76%  007  [k] cpu_map_redirect
      |    3.13%  007  [k] bq_flush_to_queue
      |    2.51%  003  [k] xdp_return_frame
      |    1.93%  007  [k] try_to_wake_up
      |    1.78%  007  [k] _raw_spin_lock
      |    1.74%  007  [k] cpu_map_enqueue
      |    1.56%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop
      
      With this series applied:
      | eth1->?                10,329,340 rx/s                  0 err,drop/s
      |   receive total        10,329,340 pkt/s                 0 drop/s                0 error/s
      |     cpu:6              10,329,340 pkt/s                 0 drop/s                0 error/s
      |   enqueue to cpu 3     10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
      |     cpu:6->3           10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
      |   kthread total        10,329,321 pkt/s                 0 drop/s           96,297 sched
      |     cpu:3              10,329,321 pkt/s                 0 drop/s           96,297 sched
      |     xdp_stats                   0 pass/s       10,329,321 drop/s                0 redir/s
      |       cpu:3                     0 pass/s       10,329,321 drop/s                0 redir/s
      |   redirect_err                  0 error/s
      |   xdp_exception                 0 hit/s
      
      perf top --sort cpu,symbol --no-children:
      |   20.90%  006  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
      |   12.62%  006  [k] ixgbe_poll
      |    9.82%  003  [k] page_frag_free
      |    8.73%  003  [k] cpu_map_bpf_prog_run_xdp
      |    6.63%  006  [k] xdp_do_redirect
      |    4.94%  003  [k] cpu_map_kthread_run
      |    4.28%  006  [k] cpu_map_redirect
      |    4.03%  006  [k] bq_flush_to_queue
      |    3.01%  003  [k] xdp_return_frame
      |    1.95%  006  [k] _raw_spin_lock
      |    1.94%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop
      
      This diff appears to be noise.
      
      v8: https://lore.kernel.org/all/20240619072253.504963-1-bigeasy@linutronix.de
      v7: https://lore.kernel.org/all/20240618072526.379909-1-bigeasy@linutronix.de
      v6: https://lore.kernel.org/all/20240612170303.3896084-1-bigeasy@linutronix.de
      v5: https://lore.kernel.org/all/20240607070427.1379327-1-bigeasy@linutronix.de
      v4: https://lore.kernel.org/all/20240604154425.878636-1-bigeasy@linutronix.de
      v3: https://lore.kernel.org/all/20240529162927.403425-1-bigeasy@linutronix.de
      v2: https://lore.kernel.org/all/20240503182957.1042122-1-bigeasy@linutronix.de
      v1: https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de
      ====================
      
      Link: https://patch.msgid.link/20240620132727.660738-1-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      bf2468f9
    • Sebastian Andrzej Siewior's avatar
      net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT. · 3f9fe37d
      Sebastian Andrzej Siewior authored
      The per-CPU flush lists, which are accessed from within the NAPI callback
      (xdp_do_flush() for instance), are per-CPU. There are subject to the
      same problem as struct bpf_redirect_info.
      
      Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
      xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the
      access. The lists initialized on first usage (similar to
      bpf_net_ctx_get_ri()).
      
      Cc: "Björn Töpel" <bjorn@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: default avatarJesper Dangaard Brouer <hawk@kernel.org>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-16-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3f9fe37d
    • Sebastian Andrzej Siewior's avatar
      net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. · 401cb7da
      Sebastian Andrzej Siewior authored
      The XDP redirect process is two staged:
      - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
        packet and makes decisions. While doing that, the per-CPU variable
        bpf_redirect_info is used.
      
      - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
        and it may also access other per-CPU variables like xskmap_flush_list.
      
      At the very end of the NAPI callback, xdp_do_flush() is invoked which
      does not access bpf_redirect_info but will touch the individual per-CPU
      lists.
      
      The per-CPU variables are only used in the NAPI callback hence disabling
      bottom halves is the only protection mechanism. Users from preemptible
      context (like cpu_map_kthread_run()) explicitly disable bottom halves
      for protections reasons.
      Without locking in local_bh_disable() on PREEMPT_RT this data structure
      requires explicit locking.
      
      PREEMPT_RT has forced-threaded interrupts enabled and every
      NAPI-callback runs in a thread. If each thread has its own data
      structure then locking can be avoided.
      
      Create a struct bpf_net_context which contains struct bpf_redirect_info.
      Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
      it, bpf_net_ctx_clear() removes it again.
      The bpf_net_ctx_set() may nest. For instance a function can be used from
      within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
      NET_TX_SOFTIRQ which does not. Therefore only the first invocations
      updates the pointer.
      Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
      bpf_redirect_info. The returned data structure is zero initialized to
      ensure nothing is leaked from stack. This is done on first usage of the
      struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
      note that initialisation is required. First invocation of
      bpf_net_ctx_get_ri() will memset() the data structure and update
      bpf_redirect_info::kern_flags.
      bpf_redirect_info::nh is excluded from memset because it is only used
      once BPF_F_NEIGH is set which also sets the nh member. The kern_flags is
      moved past nh to exclude it from memset.
      
      The pointer to bpf_net_context is saved task's task_struct. Using
      always the bpf_net_context approach has the advantage that there is
      almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
      
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarJesper Dangaard Brouer <hawk@kernel.org>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      401cb7da
    • Sebastian Andrzej Siewior's avatar
      net: Use nested-BH locking for bpf_scratchpad. · 78f520b7
      Sebastian Andrzej Siewior authored
      bpf_scratchpad is a per-CPU variable and relies on disabled BH for its
      locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
      this data structure requires explicit locking.
      
      Add a local_lock_t to the data structure and use local_lock_nested_bh()
      for locking. This change adds only lockdep coverage and does not alter
      the functional behaviour for !PREEMPT_RT.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-14-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      78f520b7
    • Sebastian Andrzej Siewior's avatar
      seg6: Use nested-BH locking for seg6_bpf_srh_states. · d1542d4a
      Sebastian Andrzej Siewior authored
      The access to seg6_bpf_srh_states is protected by disabling preemption.
      Based on the code, the entry point is input_action_end_bpf() and
      every other function (the bpf helper functions bpf_lwt_seg6_*()), that
      is accessing seg6_bpf_srh_states, should be called from within
      input_action_end_bpf().
      
      input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of
      the function and then disables preemption. This looks wrong because if
      preemption needs to be disabled as part of the locking mechanism then
      the variable shouldn't be accessed beforehand.
      
      Looking at how it is used via test_lwt_seg6local.sh then
      input_action_end_bpf() is always invoked from softirq context. If this
      is always the case then the preempt_disable() statement is superfluous.
      If this is not always invoked from softirq then disabling only
      preemption is not sufficient.
      
      Replace the preempt_disable() statement with nested-BH locking. This is
      not an equivalent replacement as it assumes that the invocation of
      input_action_end_bpf() always occurs in softirq context and thus the
      preempt_disable() is superfluous.
      Add a local_lock_t the data structure and use local_lock_nested_bh() for
      locking. Add lockdep_assert_held() to ensure the lock is held while the
      per-CPU variable is referenced in the helper functions.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-13-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d1542d4a
    • Sebastian Andrzej Siewior's avatar
      lwt: Don't disable migration prio invoking BPF. · 3414adbd
      Sebastian Andrzej Siewior authored
      There is no need to explicitly disable migration if bottom halves are
      also disabled. Disabling BH implies disabling migration.
      
      Remove migrate_disable() and rely solely on disabling BH to remain on
      the same CPU.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-12-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3414adbd
    • Sebastian Andrzej Siewior's avatar
      dev: Use nested-BH locking for softnet_data.process_queue. · b22800f9
      Sebastian Andrzej Siewior authored
      softnet_data::process_queue is a per-CPU variable and relies on disabled
      BH for its locking. Without per-CPU locking in local_bh_disable() on
      PREEMPT_RT this data structure requires explicit locking.
      
      softnet_data::input_queue_head can be updated lockless. This is fine
      because this value is only update CPU local by the local backlog_napi
      thread.
      
      Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
      of process_queue. This change adds only lockdep coverage and does not
      alter the functional behaviour for !PREEMPT_RT.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-11-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b22800f9
    • Sebastian Andrzej Siewior's avatar
      dev: Remove PREEMPT_RT ifdefs from backlog_lock.*(). · a8760d0d
      Sebastian Andrzej Siewior authored
      The backlog_napi locking (previously RPS) relies on explicit locking if
      either RPS or backlog NAPI is enabled. If both are disabled then locking
      was achieved by disabling interrupts except on PREEMPT_RT. PREEMPT_RT
      was excluded because the needed synchronisation was already provided
      local_bh_disable().
      
      Since the introduction of backlog NAPI and making it mandatory for
      PREEMPT_RT the ifdef within backlog_lock.*() is obsolete and can be
      removed.
      
      Remove the ifdefs in backlog_lock.*().
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-10-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      a8760d0d
    • Sebastian Andrzej Siewior's avatar
      net: softnet_data: Make xmit per task. · ecefbc09
      Sebastian Andrzej Siewior authored
      Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
      local_bh_disable() there is no guarantee that only one device is
      transmitting at a time.
      With preemption and multiple senders it is possible that the per-CPU
      `recursion' counter gets incremented by different threads and exceeds
      XMIT_RECURSION_LIMIT leading to a false positive recursion alert.
      The `more' member is subject to similar problems if set by one thread
      for one driver and wrongly used by another driver within another thread.
      
      Instead of adding a lock to protect the per-CPU variable it is simpler
      to make xmit per-task. Sending and receiving skbs happens always
      in thread context anyway.
      
      Having a lock to protected the per-CPU counter would block/ serialize two
      sending threads needlessly. It would also require a recursive lock to
      ensure that the owner can increment the counter further.
      
      Make the softnet_data.xmit a task_struct member on PREEMPT_RT. Add
      needed wrapper.
      
      Cc: Ben Segall <bsegall@google.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-9-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ecefbc09
    • Sebastian Andrzej Siewior's avatar
      netfilter: br_netfilter: Use nested-BH locking for brnf_frag_data_storage. · c67ef53a
      Sebastian Andrzej Siewior authored
      brnf_frag_data_storage is a per-CPU variable and relies on disabled BH
      for its locking. Without per-CPU locking in local_bh_disable() on
      PREEMPT_RT this data structure requires explicit locking.
      
      Add a local_lock_t to the data structure and use local_lock_nested_bh()
      for locking. This change adds only lockdep coverage and does not alter
      the functional behaviour for !PREEMPT_RT.
      
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Nikolay Aleksandrov <razor@blackwall.org>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Cc: bridge@lists.linux.dev
      Cc: coreteam@netfilter.org
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-8-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      c67ef53a
    • Sebastian Andrzej Siewior's avatar
      net/ipv4: Use nested-BH locking for ipv4_tcp_sk. · ebad6d03
      Sebastian Andrzej Siewior authored
      ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
      locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
      this data structure requires explicit locking.
      
      Make a struct with a sock member (original ipv4_tcp_sk) and a
      local_lock_t and use local_lock_nested_bh() for locking. This change
      adds only lockdep coverage and does not alter the functional behaviour
      for !PREEMPT_RT.
      
      Cc: David Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-7-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ebad6d03
    • Sebastian Andrzej Siewior's avatar
      net/tcp_sigpool: Use nested-BH locking for sigpool_scratch. · 585aa621
      Sebastian Andrzej Siewior authored
      sigpool_scratch is a per-CPU variable and relies on disabled BH for its
      locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
      this data structure requires explicit locking.
      
      Make a struct with a pad member (original sigpool_scratch) and a
      local_lock_t and use local_lock_nested_bh() for locking. This change
      adds only lockdep coverage and does not alter the functional behaviour
      for !PREEMPT_RT.
      
      Cc: David Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-6-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      585aa621
    • Sebastian Andrzej Siewior's avatar
      net: Use nested-BH locking for napi_alloc_cache. · bdacf3e3
      Sebastian Andrzej Siewior authored
      napi_alloc_cache is a per-CPU variable and relies on disabled BH for its
      locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
      this data structure requires explicit locking.
      
      Add a local_lock_t to the data structure and use local_lock_nested_bh()
      for locking. This change adds only lockdep coverage and does not alter
      the functional behaviour for !PREEMPT_RT.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-5-bigeasy@linutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      bdacf3e3