Commit 3cd9992b authored by Yosry Ahmed, committed by Andrew Morton

memcg: replace stats_flush_lock with an atomic

As Johannes notes in [1], stats_flush_lock is currently used to:
(a) Protect updates to stats_flush_threshold.
(b) Protect updates to flush_next_time.
(c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.

However:

1. stats_flush_threshold is already an atomic

2. flush_next_time is not atomic. The writer is locked, but the reader
   is lockless. If the reader races with a flush, you could see this:

                                        if (time_after(jiffies, flush_next_time))
        spin_trylock()
        flush_next_time = now + delay
        flush()
        spin_unlock()
                                        spin_trylock()
                                        flush_next_time = now + delay
                                        flush()
                                        spin_unlock()

   which means we can already get flushes at a higher frequency than
   FLUSH_TIME during races. But this isn't really a problem.

   The reader could also see garbled partial updates if the compiler
   decides to split the write, so it needs at least READ_ONCE and
   WRITE_ONCE protection (see the sketch after this list).

3. Serializing cgroup_rstat_flush() calls against the ratelimit
   factors is currently broken because of the race in 2. But the race
   is actually harmless, all we might get is the occasional earlier
   flush. If there is no delta, the flush won't do much. And if there
   is, the flush is justified.
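
To make point 2 concrete, here is a minimal userspace C sketch of the
lockless-timestamp pattern, with C11 relaxed atomics standing in for
the kernel's READ_ONCE()/WRITE_ONCE(); the function names are
illustrative, not from the patch. Relaxed accesses only forbid the
compiler from tearing the 64-bit load/store into halves; no stronger
ordering is needed because, as noted above, the race itself is
tolerated:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for flush_next_time. */
static _Atomic uint64_t flush_next_time;

/* Writer: the flusher that wins rearms the deadline, under no lock. */
static void arm_next_flush(uint64_t now, uint64_t flush_period)
{
        atomic_store_explicit(&flush_next_time, now + 2 * flush_period,
                              memory_order_relaxed);
}

/*
 * Lockless reader: may race with the writer, but can never observe a
 * torn, half-written value, which is all the guarantee needed here.
 * (The kernel compares with time_after64() to also handle jiffies
 * wrap-around; a plain '>' suffices for this sketch.)
 */
static bool flush_is_due(uint64_t now)
{
        return now > atomic_load_explicit(&flush_next_time,
                                          memory_order_relaxed);
}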

So the lock can be removed altogether. However, the lock also served
the purpose of preventing a thundering herd problem for concurrent
flushers, see [2]. Use an atomic instead to serve the purpose of
unifying concurrent flushers.
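
Concretely, the patch gates the flush with an atomic_read() followed
by an atomic_xchg(). Here is a minimal userspace C analogue of that
gate, using C11 atomics (atomic_exchange() in place of the kernel's
atomic_xchg(); try_begin_flush()/end_flush() are hypothetical names
for illustration):

#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for stats_flush_ongoing. */
static atomic_int flush_ongoing;

/* Returns true iff this caller won the right to do the flush. */
static bool try_begin_flush(void)
{
        /*
         * Cheap plain-load check first: while a flush is in flight,
         * losers bail out on a shared cache line instead of all
         * performing a read-modify-write that bounces it between CPUs.
         */
        if (atomic_load_explicit(&flush_ongoing, memory_order_relaxed))
                return false;
        /* Race for the flag; exactly one concurrent caller sees 0. */
        return atomic_exchange(&flush_ongoing, 1) == 0;
}

static void end_flush(void)
{
        atomic_store(&flush_ongoing, 0);
}

Skipping is safe because every flush covers the entire tree: whatever
delta a skipped caller cared about is picked up by the flusher that
currently holds the flag.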

[1] https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
[2] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/

Link: https://lkml.kernel.org/r/20230330191801.1967435-5-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
parent a2174e95
mm/memcontrol.c

@@ -586,8 +586,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -637,15 +637,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 
 static void __mem_cgroup_flush_stats(void)
 {
-        unsigned long flag;
-
-        if (!spin_trylock_irqsave(&stats_flush_lock, flag))
+        /*
+         * We always flush the entire tree, so concurrent flushers can just
+         * skip. This avoids a thundering herd problem on the rstat global lock
+         * from memcg flushers (e.g. reclaim, refault, etc).
+         */
+        if (atomic_read(&stats_flush_ongoing) ||
+            atomic_xchg(&stats_flush_ongoing, 1))
                 return;
 
-        flush_next_time = jiffies_64 + 2*FLUSH_TIME;
+        WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
         cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
         atomic_set(&stats_flush_threshold, 0);
-        spin_unlock_irqrestore(&stats_flush_lock, flag);
+        atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
@@ -656,7 +660,7 @@ void mem_cgroup_flush_stats(void)
 }
 
 void mem_cgroup_flush_stats_ratelimited(void)
 {
-        if (time_after64(jiffies_64, flush_next_time))
+        if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
                 mem_cgroup_flush_stats();
 }