Merge tag 'slab-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka: "This time it's mostly random cleanups and fixes, with two performance fixes that might have significant impact, but limited to systems experiencing particular bad corner case scenarios rather than general performance improvements. The memcg hook changes are going through the mm tree due to dependencies. - Prevent stalls when reading /proc/slabinfo (Jianfeng Wang) This fixes the long-standing problem that can happen with workloads that have alloc/free patterns resulting in many partially used slabs (in e.g. dentry cache). Reading /proc/slabinfo will traverse the long partial slab list under spinlock with disabled irqs and thus can stall other processes or even trigger the lockup detection. The traversal is only done to count free objects so that <active_objs> column can be reported along with <num_objs>. To avoid affecting fast paths with another shared counter (attempted in the past) or complex partial list traversal schemes that allow rescheduling, the chosen solution resorts to approximation - when the partial list is over 10000 slabs long, we will only traverse first 5000 slabs from head and tail each and use the average of those to estimate the whole list. Both head and tail are used as the slabs near head to tend to have more free objects than the slabs towards the tail. It is expected the approximation should not break existing /proc/slabinfo consumers. The <num_objs> field is still accurate and reflects the overall kmem_cache footprint. The <active_objs> was already imprecise due to cpu and percpu-partial slabs, so can't be relied upon to determine exact cache usage. The difference between <active_objs> and <num_objs> is mainly useful to determine the slab fragmentation, and that will be possible even with the approximation in place. - Prevent allocating many slabs when a NUMA node is full (Chen Jun) Currently, on NUMA systems with a node under significantly bigger pressure than other nodes, the fallback strategy may result in each kmalloc_node() that can't be safisfied from the preferred node, to allocate a new slab on a fallback node, and not reuse the slabs already on that node's partial list. This is now fixed and partial lists of fallback nodes are checked even for kmalloc_node() allocations. It's still preferred to allocate a new slab on the requested node before a fallback, but only with a GFP_NOWAIT attempt, which will fail quickly when the node is under a significant memory pressure. - More SLAB removal related cleanups (Xiu Jianfeng, Hyunmin Lee) - Fix slub_kunit self-test with hardened freelists (Guenter Roeck) - Mark racy accesses for KCSAN (linke li) - Misc cleanups (Xiongwei Song, Haifeng Xu, Sangyun Kim)" * tag 'slab-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: remove the check for NULL kmalloc_caches mm/slub: create kmalloc 96 and 192 caches regardless cache size order mm/slub: mark racy access on slab->freelist slub: use count_partial_free_approx() in slab_out_of_memory() slub: introduce count_partial_free_approx() slub: Set __GFP_COMP in kmem_cache by default mm/slub: remove duplicate initialization for early_kmem_cache_node_alloc() mm/slub: correct comment in do_slab_free() mm/slub, kunit: Use inverted data to corrupt kmem cache mm/slub: simplify get_partial_node() mm/slub: add slub_get_cpu_partial() helper mm/slub: remove the check of !kmem_cache_has_cpu_partial() mm/slub: Reduce memory consumption in extreme scenarios mm/slub: mark racy accesses on slab->slabs mm/slub: remove dummy slabinfo functions

Merge tag 'slab-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka: "This time it's mostly random cleanups and fixes, with two performance fixes that might have significant impact, but limited to systems experiencing particular bad corner case scenarios rather than general performance improvements. The memcg hook changes are going through the mm tree due to dependencies. - Prevent stalls when reading /proc/slabinfo (Jianfeng Wang) This fixes the long-standing problem that can happen with workloads that have alloc/free patterns resulting in many partially used slabs (in e.g. dentry cache). Reading /proc/slabinfo will traverse the long partial slab list under spinlock with disabled irqs and thus can stall other processes or even trigger the lockup detection. The traversal is only done to count free objects so that <active_objs> column can be reported along with <num_objs>. To avoid affecting fast paths with another shared counter (attempted in the past) or complex partial list traversal schemes that allow rescheduling, the chosen solution resorts to approximation - when the partial list is over 10000 slabs long, we will only traverse first 5000 slabs from head and tail each and use the average of those to estimate the whole list. Both head and tail are used as the slabs near head to tend to have more free objects than the slabs towards the tail. It is expected the approximation should not break existing /proc/slabinfo consumers. The <num_objs> field is still accurate and reflects the overall kmem_cache footprint. The <active_objs> was already imprecise due to cpu and percpu-partial slabs, so can't be relied upon to determine exact cache usage. The difference between <active_objs> and <num_objs> is mainly useful to determine the slab fragmentation, and that will be possible even with the approximation in place. - Prevent allocating many slabs when a NUMA node is full (Chen Jun) Currently, on NUMA systems with a node under significantly bigger pressure than other nodes, the fallback strategy may result in each kmalloc_node() that can't be safisfied from the preferred node, to allocate a new slab on a fallback node, and not reuse the slabs already on that node's partial list. This is now fixed and partial lists of fallback nodes are checked even for kmalloc_node() allocations. It's still preferred to allocate a new slab on the requested node before a fallback, but only with a GFP_NOWAIT attempt, which will fail quickly when the node is under a significant memory pressure. - More SLAB removal related cleanups (Xiu Jianfeng, Hyunmin Lee) - Fix slub_kunit self-test with hardened freelists (Guenter Roeck) - Mark racy accesses for KCSAN (linke li) - Misc cleanups (Xiongwei Song, Haifeng Xu, Sangyun Kim)" * tag 'slab-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: remove the check for NULL kmalloc_caches mm/slub: create kmalloc 96 and 192 caches regardless cache size order mm/slub: mark racy access on slab->freelist slub: use count_partial_free_approx() in slab_out_of_memory() slub: introduce count_partial_free_approx() slub: Set __GFP_COMP in kmem_cache by default mm/slub: remove duplicate initialization for early_kmem_cache_node_alloc() mm/slub: correct comment in do_slab_free() mm/slub, kunit: Use inverted data to corrupt kmem cache mm/slub: simplify get_partial_node() mm/slub: add slub_get_cpu_partial() helper mm/slub: remove the check of !kmem_cache_has_cpu_partial() mm/slub: Reduce memory consumption in extreme scenarios mm/slub: mark racy accesses on slab->slabs mm/slub: remove dummy slabinfo functions
cd97950c · Linus Torvalds · c07ea940 · 7338999c · cd97950c · cd97950c
Commit cd97950c authored May 13, 2024 by Linus Torvalds
Hide whitespace changes
Inline Side-by-side

Showing with 96 additions and 54 deletions

lib/slub_kunit.c lib/slub_kunit.c +1 -1

mm/slab.h mm/slab.h +0 -3

mm/slab_common.c mm/slab_common.c +9 -18

mm/slub.c mm/slub.c +86 -32

No files found.
--- a/lib/slub_kunit.c
+++ b/lib/slub_kunit.c
@@ -55,7 +55,7 @@ static void test_next_pointer(struct kunit *test)

 	ptr_addr = (unsigned long *)(p + s->offset);
 	tmp = *ptr_addr;
-	p[s->offset] = 0x12;
+	p[s->offset] = ~p[s->offset];

 	/*
 	 * Expecting three errors.

--- a/mm/slab.h
+++ b/mm/slab.h
@@ -496,9 +496,6 @@ struct slabinfo {
 };

 void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
-void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s);
-ssize_t slabinfo_write(struct file *file, const char __user *buffer,
-		       size_t count, loff_t *ppos);

 #ifdef CONFIG_SLUB_DEBUG
 #ifdef CONFIG_SLUB_DEBUG_ON

--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -916,22 +916,15 @@ void __init create_kmalloc_caches(void)
 	 * Including KMALLOC_CGROUP if CONFIG_MEMCG_KMEM defined
 	 */
 	for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) {
-		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
-			if (!kmalloc_caches[type][i])
-				new_kmalloc_cache(i, type);
-
-			/*
-			 * Caches that are not of the two-to-the-power-of size.
-			 * These have to be created immediately after the
-			 * earlier power of two caches
-			 */
-			if (KMALLOC_MIN_SIZE <= 32 && i == 6 &&
-					!kmalloc_caches[type][1])
-				new_kmalloc_cache(1, type);
-			if (KMALLOC_MIN_SIZE <= 64 && i == 7 &&
-					!kmalloc_caches[type][2])
-				new_kmalloc_cache(2, type);
-		}
+		/* Caches that are NOT of the two-to-the-power-of size. */
+		if (KMALLOC_MIN_SIZE <= 32)
+			new_kmalloc_cache(1, type);
+		if (KMALLOC_MIN_SIZE <= 64)
+			new_kmalloc_cache(2, type);
+
+		/* Caches that are of the two-to-the-power-of size. */
+		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
+			new_kmalloc_cache(i, type);
 	}
 #ifdef CONFIG_RANDOM_KMALLOC_CACHES
 	random_kmalloc_seed = get_random_u64();
@@ -1078,7 +1071,6 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
 		   sinfo.limit, sinfo.batchcount, sinfo.shared);
 	seq_printf(m, " : slabdata %6lu %6lu %6lu",
 		   sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail);
-	slabinfo_show_stats(m, s);
 	seq_putc(m, '\n');
 }

@@ -1155,7 +1147,6 @@ static const struct proc_ops slabinfo_proc_ops = {
 	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_open	= slabinfo_open,
 	.proc_read	= seq_read,
-	.proc_write	= slabinfo_write,
 	.proc_lseek	= seq_lseek,
 	.proc_release	= seq_release,
 };

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -624,11 +624,21 @@ static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
 	nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
 	s->cpu_partial_slabs = nr_slabs;
 }
+
+static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
+{
+	return s->cpu_partial_slabs;
+}
 #else
 static inline void
 slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
 {
 }
+
+static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
+{
+	return 0;
+}
 #endif /* CONFIG_SLUB_CPU_PARTIAL */

 /*
@@ -2609,19 +2619,18 @@ static struct slab *get_partial_node(struct kmem_cache *s,
 		if (!partial) {
 			partial = slab;
 			stat(s, ALLOC_FROM_PARTIAL);
+
+			if ((slub_get_cpu_partial(s) == 0)) {
+				break;
+			}
 		} else {
 			put_cpu_partial(s, slab, 0);
 			stat(s, CPU_PARTIAL_NODE);
-			partial_slabs++;
-		}
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-		if (!kmem_cache_has_cpu_partial(s)
-			|| partial_slabs > s->cpu_partial_slabs / 2)
-			break;
-#else
-		break;
-#endif

+			if (++partial_slabs > slub_get_cpu_partial(s) / 2) {
+				break;
+			}
+		}
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return partial;
@@ -2704,7 +2713,7 @@ static struct slab *get_partial(struct kmem_cache *s, int node,
 		searchnode = numa_mem_id();

 	slab = get_partial_node(s, get_node(s, searchnode), pc);
-	if (slab || node != NUMA_NO_NODE)
+	if (slab || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
 		return slab;

 	return get_any_partial(s, pc);
@@ -2802,7 +2811,7 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 	struct slab new;
 	struct slab old;

-	if (slab->freelist) {
+	if (READ_ONCE(slab->freelist)) {
 		stat(s, DEACTIVATE_REMOTE_FREES);
 		tail = DEACTIVATE_TO_TAIL;
 	}
@@ -3234,6 +3243,43 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 #endif /* CONFIG_SLUB_DEBUG || SLAB_SUPPORTS_SYSFS */

 #ifdef CONFIG_SLUB_DEBUG
+#define MAX_PARTIAL_TO_SCAN 10000
+
+static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
+{
+	unsigned long flags;
+	unsigned long x = 0;
+	struct slab *slab;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
+		list_for_each_entry(slab, &n->partial, slab_list)
+			x += slab->objects - slab->inuse;
+	} else {
+		/*
+		 * For a long list, approximate the total count of objects in
+		 * it to meet the limit on the number of slabs to scan.
+		 * Scan from both the list's head and tail for better accuracy.
+		 */
+		unsigned long scanned = 0;
+
+		list_for_each_entry(slab, &n->partial, slab_list) {
+			x += slab->objects - slab->inuse;
+			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
+				break;
+		}
+		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
+			x += slab->objects - slab->inuse;
+			if (++scanned == MAX_PARTIAL_TO_SCAN)
+				break;
+		}
+		x = mult_frac(x, n->nr_partial, scanned);
+		x = min(x, node_nr_objs(n));
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return x;
+}
+
 static noinline void
 slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
 {
@@ -3260,7 +3306,7 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
 		unsigned long nr_objs;
 		unsigned long nr_free;

-		nr_free  = count_partial(n, count_free);
+		nr_free  = count_partial_free_approx(n);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);

@@ -3380,6 +3426,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	struct slab *slab;
 	unsigned long flags;
 	struct partial_context pc;
+	bool try_thisnode = true;

 	stat(s, ALLOC_SLOWPATH);

@@ -3506,6 +3553,21 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 new_objects:

 	pc.flags = gfpflags;
+	/*
+	 * When a preferred node is indicated but no __GFP_THISNODE
+	 *
+	 * 1) try to get a partial slab from target node only by having
+	 *    __GFP_THISNODE in pc.flags for get_partial()
+	 * 2) if 1) failed, try to allocate a new slab from target node with
+	 *    GPF_NOWAIT | __GFP_THISNODE opportunistically
+	 * 3) if 2) failed, retry with original gfpflags which will allow
+	 *    get_partial() try partial lists of other nodes before potentially
+	 *    allocating new page from other nodes
+	 */
+	if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
+		     && try_thisnode))
+		pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+
 	pc.orig_size = orig_size;
 	slab = get_partial(s, node, &pc);
 	if (slab) {
@@ -3527,10 +3589,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	}

 	slub_put_cpu_ptr(s->cpu_slab);
-	slab = new_slab(s, gfpflags, node);
+	slab = new_slab(s, pc.flags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);

 	if (unlikely(!slab)) {
+		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
+		    && try_thisnode) {
+			try_thisnode = false;
+			goto new_objects;
+		}
 		slab_out_of_memory(s, gfpflags, node);
 		return NULL;
 	}
@@ -4232,7 +4299,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	c = raw_cpu_ptr(s->cpu_slab);
 	tid = READ_ONCE(c->tid);

-	/* Same with comment on barrier() in slab_alloc_node() */
+	/* Same with comment on barrier() in __slab_alloc_node() */
 	barrier();

 	if (unlikely(slab != c->slab)) {
@@ -4853,7 +4920,6 @@ static void early_kmem_cache_node_alloc(int node)
 	BUG_ON(!n);
 #ifdef CONFIG_SLUB_DEBUG
 	init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
-	init_tracking(kmem_cache_node, n);
 #endif
 	n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
 	slab->freelist = get_freepointer(kmem_cache_node, n);
@@ -5066,9 +5132,7 @@ static int calculate_sizes(struct kmem_cache *s)
 	if ((int)order < 0)
 		return 0;

-	s->allocflags = 0;
-	if (order)
-		s->allocflags |= __GFP_COMP;
+	s->allocflags = __GFP_COMP;

 	if (s->flags & SLAB_CACHE_DMA)
 		s->allocflags |= GFP_DMA;
@@ -6042,7 +6106,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 				else if (flags & SO_OBJECTS)
 					WARN_ON_ONCE(1);
 				else
-					x = slab->slabs;
+					x = data_race(slab->slabs);
 				total += x;
 				nodes[node] += x;
 			}
@@ -6247,7 +6311,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));

 		if (slab)
-			slabs += slab->slabs;
+			slabs += data_race(slab->slabs);
 	}
 #endif

@@ -6261,7 +6325,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)

 		slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
 		if (slab) {
-			slabs = READ_ONCE(slab->slabs);
+			slabs = data_race(slab->slabs);
 			objects = (slabs * oo_objects(s->oo)) / 2;
 			len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
 					     cpu, objects, slabs);
@@ -7095,7 +7159,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
+		nr_free += count_partial_free_approx(n);
 	}

 	sinfo->active_objs = nr_objs - nr_free;
@@ -7105,14 +7169,4 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	sinfo->objects_per_slab = oo_objects(s->oo);
 	sinfo->cache_order = oo_order(s->oo);
 }
-
-void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s)
-{
-}
-
-ssize_t slabinfo_write(struct file *file, const char __user *buffer,
-		       size_t count, loff_t *ppos)
-{
-	return -EIO;
-}
 #endif /* CONFIG_SLUB_DEBUG */