Commit fbce7497 authored by Paul E. McKenney's avatar Paul E. McKenney

rcu: Parallelize and economize NOCB kthread wakeups

An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
Reported-by: default avatarRik van Riel <riel@redhat.com>
Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
parent 4a81e832
...@@ -2802,6 +2802,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted. ...@@ -2802,6 +2802,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
quiescent states. Units are jiffies, minimum quiescent states. Units are jiffies, minimum
value is one, and maximum value is HZ. value is one, and maximum value is HZ.
rcutree.rcu_nocb_leader_stride= [KNL]
Set the number of NOCB kthread groups, which
defaults to the square root of the number of
CPUs. Larger numbers reduces the wakeup overhead
on the per-CPU grace-period kthreads, but increases
that same overhead on each group's leader.
rcutree.qhimark= [KNL] rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which Set threshold of queued RCU callbacks beyond which
batch limiting is disabled. batch limiting is disabled.
......
...@@ -334,11 +334,29 @@ struct rcu_data { ...@@ -334,11 +334,29 @@ struct rcu_data {
struct rcu_head **nocb_tail; struct rcu_head **nocb_tail;
atomic_long_t nocb_q_count; /* # CBs waiting for kthread */ atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
atomic_long_t nocb_q_count_lazy; /* (approximate). */ atomic_long_t nocb_q_count_lazy; /* (approximate). */
struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
struct rcu_head **nocb_follower_tail;
atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
atomic_long_t nocb_follower_count_lazy; /* (approximate). */
int nocb_p_count; /* # CBs being invoked by kthread */ int nocb_p_count; /* # CBs being invoked by kthread */
int nocb_p_count_lazy; /* (approximate). */ int nocb_p_count_lazy; /* (approximate). */
wait_queue_head_t nocb_wq; /* For nocb kthreads to sleep on. */ wait_queue_head_t nocb_wq; /* For nocb kthreads to sleep on. */
struct task_struct *nocb_kthread; struct task_struct *nocb_kthread;
bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ bool nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */
/* The following fields are used by the leader, hence own cacheline. */
struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp;
/* CBs waiting for GP. */
struct rcu_head **nocb_gp_tail;
long nocb_gp_count;
long nocb_gp_count_lazy;
bool nocb_leader_wake; /* Is the nocb leader thread awake? */
struct rcu_data *nocb_next_follower;
/* Next follower in wakeup chain. */
/* The following fields are used by the follower, hence new cachline. */
struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp;
/* Leader CPU takes GP-end wakeups. */
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
/* 8) RCU CPU stall data. */ /* 8) RCU CPU stall data. */
...@@ -587,8 +605,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp); ...@@ -587,8 +605,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
/* Sum up queue lengths for tracing. */ /* Sum up queue lengths for tracing. */
static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll) static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
{ {
*ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count; *ql = atomic_long_read(&rdp->nocb_q_count) +
*qll = atomic_long_read(&rdp->nocb_q_count_lazy) + rdp->nocb_p_count_lazy; rdp->nocb_p_count +
atomic_long_read(&rdp->nocb_follower_count) +
rdp->nocb_p_count + rdp->nocb_gp_count;
*qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
rdp->nocb_p_count_lazy +
atomic_long_read(&rdp->nocb_follower_count_lazy) +
rdp->nocb_p_count_lazy + rdp->nocb_gp_count_lazy;
} }
#else /* #ifdef CONFIG_RCU_NOCB_CPU */ #else /* #ifdef CONFIG_RCU_NOCB_CPU */
static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll) static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
......
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment