1. 30 Nov, 2022 8 commits
    • Merge branch 'slub-tiny-v1r6' into slab/for-next · dc19745a
      Vlastimil Babka authored
      Merge my series [1] to deprecate the SLOB allocator.
      - Renames CONFIG_SLOB to CONFIG_SLOB_DEPRECATED with a deprecation notice.
      - The recommended replacement is CONFIG_SLUB, optionally with the new
        CONFIG_SLUB_TINY tweaks for systems with 16MB or less RAM.
      - Use cases that stopped working with CONFIG_SLUB_TINY instead of SLOB
        should be reported to linux-mm@kvack.org and the slab maintainers;
        otherwise SLOB will be removed in a few cycles.
      
      [1] https://lore.kernel.org/all/20221121171202.22080-1-vbabka@suse.cz/
      dc19745a
    • Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-next · 61766652
      Vlastimil Babka authored
      Add a new slub_kunit test for the extended kmalloc redzone check, by
      Feng Tang. Also prevent unwanted kfence interaction with all slub kunit
      tests.
      61766652
    • mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED · 149b6fa2
      Vlastimil Babka authored
      As explained in [1], we would like to remove SLOB if possible.
      
      - There are no known users that need its somewhat lower memory footprint
        so much that they cannot handle SLUB (after some modifications by the
        previous patches) instead.
      
      - It is an extra maintenance burden, and a number of features are
        incompatible with it.
      
      - It blocks the API improvement of allowing kfree() on objects allocated
        via kmem_cache_alloc().
      
      As the first step, rename the CONFIG_SLOB option in the slab allocator
      configuration choice to CONFIG_SLOB_DEPRECATED. Add CONFIG_SLOB
      depending on CONFIG_SLOB_DEPRECATED as an internal option to avoid code
      churn. This will cause existing .config files and defconfigs with
      CONFIG_SLOB=y to silently switch to the default (and recommended
      replacement) SLUB, while still allowing SLOB to be configured by anyone
      that notices and needs it. But those should contact the slab maintainers
      and linux-mm@kvack.org as explained in the updated help. If there are
      no valid objections, the plan is to update the existing defconfigs to
      SLUB and remove SLOB in a few cycles.
      
      To make SLUB a more suitable replacement for SLOB, a CONFIG_SLUB_TINY
      option was introduced to limit SLUB's memory overhead.
      There are a number of defconfigs specifying CONFIG_SLOB=y. As part of
      this patch, update them to select CONFIG_SLUB and CONFIG_SLUB_TINY.
      
      [1] https://lore.kernel.org/all/b35c3f82-f67b-2103-7d82-7a7ba7521439@suse.cz/
      
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
      Cc: Janusz Krzysztofik <jmkrzyszt@gmail.com>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Conor Dooley <conor@kernel.org>
      Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Aaro Koskinen <aaro.koskinen@iki.fi> # OMAP1
      Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> # riscv k210
      Acked-by: Arnd Bergmann <arnd@arndb.de> # arm
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      149b6fa2
    • mm, slub: don't aggressively inline with CONFIG_SLUB_TINY · be784ba8
      Vlastimil Babka authored
      SLUB fast paths use __always_inline to avoid function calls. With
      CONFIG_SLUB_TINY we would rather save memory, so add a
      __fastpath_inline macro that is __always_inline normally but expands
      to nothing with CONFIG_SLUB_TINY.
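
      A minimal sketch of such a macro (a simplified form; the merged
      definition may differ in detail):

        /* Sketch: keep forcing inlining of the fast paths normally, but let
         * the compiler decide (and typically save text) with SLUB_TINY. */
        #ifndef CONFIG_SLUB_TINY
        #define __fastpath_inline __always_inline
        #else
        #define __fastpath_inline
        #endif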
      
      bloat-o-meter results on x86_64 mm/slub.o:
      
      add/remove: 3/1 grow/shrink: 1/8 up/down: 865/-1784 (-919)
      Function                                     old     new   delta
      kmem_cache_free                               20     281    +261
      slab_alloc_node.isra                           -     245    +245
      slab_free.constprop.isra                       -     231    +231
      __kmem_cache_alloc_lru.isra                    -     128    +128
      __kmem_cache_release                          88      83      -5
      __kmem_cache_create                         1446    1436     -10
      __kmem_cache_free                            271     142    -129
      kmem_cache_alloc_node                        330     127    -203
      kmem_cache_free_bulk.part                    826     613    -213
      __kmem_cache_alloc_node                      230      10    -220
      kmem_cache_alloc_lru                         325      12    -313
      kmem_cache_alloc                             325      10    -315
      kmem_cache_free.part                         376       -    -376
      Total: Before=26103, After=25184, chg -3.52%
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      be784ba8
    • mm, slub: remove percpu slabs with CONFIG_SLUB_TINY · 0af8489b
      Vlastimil Babka authored
      SLUB gets most of its scalability from percpu slabs. However, for
      CONFIG_SLUB_TINY the goal is minimal memory overhead, not scalability.
      Thus, #ifdef out the whole kmem_cache_cpu percpu structure and the
      associated code. In addition to the slab page savings, this reduces
      percpu allocator usage and code size.
      
      This change builds on the recent commit c7323a5a ("mm/slub: restrict
      sysfs validation to debug caches and make it safe"), as caches with
      debugging enabled also avoid percpu slabs and all allocation and
      freeing ends up working with the partial list. With a bit more
      refactoring by the preceding patches, the same code paths can be used
      with CONFIG_SLUB_TINY.
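
      A simplified sketch of the pattern (field names are illustrative, not
      the kernel's full definitions):

        /* The per-CPU fast-path state only exists without CONFIG_SLUB_TINY;
         * with it, allocation and freeing go through the node partial lists,
         * as debug caches already do. The cpu_slab pointer in struct
         * kmem_cache is compiled out the same way. */
        #ifndef CONFIG_SLUB_TINY
        struct kmem_cache_cpu {
                void **freelist;        /* next available object */
                struct slab *slab;      /* slab we are allocating from */
        };
        #endif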
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      0af8489b
    • mm, slub: split out allocations from pre/post hooks · 56d5a2b9
      Vlastimil Babka authored
      In the following patch we want to introduce CONFIG_SLUB_TINY allocation
      paths that don't use the percpu slab. To prepare, refactor the
      allocation functions:
      
      Split out __slab_alloc_node() from slab_alloc_node() where the former
      does the actual allocation and the latter calls the pre/post hooks.
      
      Analogously, split out __kmem_cache_alloc_bulk() from
      kmem_cache_alloc_bulk().
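
      A rough sketch of the resulting shape (simplified signatures and
      placeholder bodies, illustrative only):

        /* does the actual allocation work (percpu fastpath, or later the
         * CONFIG_SLUB_TINY partial-list path) */
        static void *__slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
                                       int node)
        {
                return NULL;    /* placeholder in this sketch */
        }

        static __always_inline void *slab_alloc_node(struct kmem_cache *s,
                                                     gfp_t gfpflags, int node)
        {
                void *object;

                /* pre-alloc hooks (fault injection, memcg, kfence) go here */
                object = __slab_alloc_node(s, gfpflags, node);
                /* post-alloc hooks (zeroing, KASAN/KMSAN handling) go here */
                return object;
        }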
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      56d5a2b9
    • mm/slub, kunit: Add a test case for kmalloc redzone check · 6cd6d33c
      Feng Tang authored
      The kmalloc redzone check for SLUB has been merged, and it's better to
      add a kunit test case for it, inspired by a real-world case described
      in commit 120ee599 ("staging: octeon-usb: prevent memory corruption"):
      
      "
        octeon-hcd will crash the kernel when SLOB is used. This usually happens
        after the 18-byte control transfer when a device descriptor is read.
        The DMA engine is always transferring full 32-bit words and if the
        transfer is shorter, some random garbage appears after the buffer.
        The problem is not visible with SLUB since it rounds up the allocations
        to word boundary, and the extra bytes will go undetected.
      "
      
      To avoid interfering with the normal functioning of the kmalloc caches,
      a kmem_cache mimicking a kmalloc cache is created with similar flags,
      and kmalloc_trace() is used to really test the orig_size and redzone
      setup.
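
      A rough sketch of what such a test case could look like (the helper
      names validate_slab_cache()/slab_errors, the SLAB_KMALLOC flag, and the
      test_kmem_cache_create() wrapper shown under the next commit are
      assumptions based on the existing slub_kunit code; the merged test may
      differ in detail):

        static void test_kmalloc_redzone_access(struct kunit *test)
        {
                struct kmem_cache *s = test_kmem_cache_create("TestSlub_RZ_kmalloc",
                                32, SLAB_KMALLOC | SLAB_STORE_USER | SLAB_RED_ZONE);
                u8 *p = kmalloc_trace(s, GFP_KERNEL, 18);

                /* write past the 18 requested bytes, but within the 32-byte object */
                p[18] = 0xab;
                p[19] = 0xab;

                validate_slab_cache(s);
                KUNIT_EXPECT_EQ(test, 2, slab_errors);

                kmem_cache_free(s, p);
                kmem_cache_destroy(s);
        }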
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      6cd6d33c
    • mm/slub, kunit: add SLAB_SKIP_KFENCE flag for cache creation · 4d9dd4b0
      Feng Tang authored
      When kfence is enabled, the buffer allocated in the test case could
      come from a kfence pool, and the operation could also be caught and
      reported by kfence first, causing the test case to fail.
      
      With the default kfence settings, this is very difficult to trigger.
      After changing CONFIG_KFENCE_NUM_OBJECTS from 255 to 16383, and
      CONFIG_KFENCE_SAMPLE_INTERVAL from 100 to 5, allocations from kfence
      were hit 7 times in different slub_kunit cases out of 900 boot tests.
      
      To avoid this, we initially tried checking with is_kfence_address()
      and repeating the allocation until a non-kfence address was returned.
      Vlastimil Babka suggested that the SLAB_SKIP_KFENCE flag could be used
      to achieve this, and that it is better to add a wrapper function to
      simplify cache creation.
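
      A minimal sketch of such a wrapper (assuming kmem_cache_create() plus a
      post-creation flag tweak; the merged helper may differ in detail):

        static struct kmem_cache *test_kmem_cache_create(const char *name,
                                                         unsigned int size,
                                                         slab_flags_t flags)
        {
                struct kmem_cache *s = kmem_cache_create(name, size, 0, flags, NULL);

                /* make sure test allocations never come from the KFENCE pool */
                s->flags |= SLAB_SKIP_KFENCE;
                return s;
        }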
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      4d9dd4b0
  2. 27 Nov, 2022 8 commits
  3. 21 Nov, 2022 14 commits
    • Merge branch 'slab/for-6.2/alloc_size' into slab/for-next · b5e72d27
      Vlastimil Babka authored
      Two patches from Kees Cook [1]:
      
      These patches work around a deficiency in GCC (>=11) and Clang (<16)
      where the __alloc_size attribute does not apply to inlines.
      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503
      
      This manifests as reduced overflow detection coverage for many allocation
      sites under CONFIG_FORTIFY_SOURCE=y, where the allocation size was not
      actually being propagated to __builtin_dynamic_object_size().
      
      [1] https://lore.kernel.org/all/20221118034713.gonna.754-kees@kernel.org/
      b5e72d27
    • Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-next · 90e9b23a
      Vlastimil Babka authored
      kmalloc() redzone improvements by Feng Tang
      
      From cover letter [1]:
      
      The kmalloc() API family is critical for mm, and one of its
      characteristics is that it rounds up the request size to a fixed size
      (mostly a power of 2). When a user requests memory for '2^n + 1' bytes,
      2^(n+1) bytes could actually be allocated, so there is extra space
      beyond what was originally requested.

      This patchset extends the redzone sanity check to the extra kmalloc'ed
      space beyond the requested size, to better detect illegitimate accesses
      to it (depends on SLAB_STORE_USER & SLAB_RED_ZONE).
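
      As a hypothetical illustration of the rounding (function and sizes are
      examples only, not from the patchset):

        static void redzone_example(void)
        {
                /* a 10-byte request is served from the kmalloc-16 cache */
                char *buf = kmalloc(10, GFP_KERNEL);

                if (!buf)
                        return;
                /* within the 16-byte object but past the 10 requested bytes:
                 * with SLAB_STORE_USER and SLAB_RED_ZONE this extra space is
                 * now redzoned, so the stray write gets flagged */
                buf[12] = 0xab;
                kfree(buf);
        }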
      
      [1] https://lore.kernel.org/all/20221021032405.1825078-1-feng.tang@intel.com/
      90e9b23a
    • Merge branch 'slab/for-6.2/fit_rcu_head' into slab/for-next · 76537db3
      Vlastimil Babka authored
      A series by myself to reorder fields in struct slab to allow the
      embedded rcu_head to grow (for debugging purposes). It requires changes
      to isolate_movable_page() to skip slab pages, which can otherwise
      become false-positive __PageMovable due to the use of low bits in
      page->mapping.
      76537db3
    • Merge branch 'slab/for-6.2/tools' into slab/for-next · 1c1aaa33
      Vlastimil Babka authored
      A patch for tools/vm/slabinfo to give more useful feedback when not
      run as root, by Rong Tao.
      1c1aaa33
    • Merge branch 'slab/for-6.2/slub-sysfs' into slab/for-next · c64b95d3
      Vlastimil Babka authored
      - Two patches for SLUB's sysfs by Rasmus Villemoes to remove dead code
        and optimize boot time with late initialization.
      - Allow SLUB's sysfs 'failslab' parameter to be runtime-controllable
        again as it can be both useful and safe, by Alexander Atanasov.
      c64b95d3
    • Merge branch 'slab/for-6.2/locking' into slab/for-next · 14d3eb66
      Vlastimil Babka authored
      A patch from Jiri Kosina that makes SLAB's list_lock a raw_spinlock_t.
      While there are no plans to make SLAB actually compatible with
      PREEMPT_RT, now or in the future, it makes !PREEMPT_RT lockdep happy.
      14d3eb66
    • Merge branch 'slab/for-6.2/cleanups' into slab/for-next · 4b28ba9e
      Vlastimil Babka authored
      - Removal of dead code from deactivate_slab() by Hyeonggon Yoo.
      - Fix of BUILD_BUG_ON() for sufficient early percpu size by Baoquan He.
      - Make kmem_cache_alloc() kernel-doc less misleading, by myself.
      4b28ba9e
    • slab: Remove special-casing of const 0 size allocations · 6fa57d78
      Kees Cook authored
      Passing a constant-0 size allocation into kmalloc() or kmalloc_node()
      does not need to be a fast-path operation, so the static return value
      can be removed entirely. This makes sure that all paths through the
      inlines result in a full extern function call, where __alloc_size()
      hints will actually be seen[1] by GCC. (A constant return value of 0
      means the "0" allocation size won't be propagated by the inline.)
      
      [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503
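
      A small userspace illustration of the attribute involved (hypothetical
      my_alloc(), not kernel code; the reported size depends on compiler
      support and optimization level):

        #include <stdio.h>
        #include <stdlib.h>

        /* size hint analogous to the kernel's __alloc_size() annotation */
        __attribute__((alloc_size(1)))
        void *my_alloc(size_t size) { return malloc(size); }

        int main(void)
        {
                char *p = my_alloc(32);

                /* with the hint visible at the call site, recent GCC/Clang can
                 * report 32 here; when the call is hidden behind an inline that
                 * short-circuits the size, the hint is lost and (size_t)-1 is
                 * printed instead */
                printf("%zu\n", __builtin_dynamic_object_size(p, 0));
                free(p);
                return 0;
        }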
      
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: linux-mm@kvack.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      6fa57d78
    • slab: Clean up SLOB vs kmalloc() definition · 3bf01933
      Kees Cook authored
      As already done for kmalloc_node(), clean up the #ifdef usage in the
      definition of kmalloc() so that the SLOB-only version is an entirely
      separate and much more readable function.
      
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: linux-mm@kvack.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      3bf01933
    • mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head · 130d4df5
      Vlastimil Babka authored
      Joel reports [1] that increasing the rcu_head size for debugging
      purposes used to work before struct slab was split from struct page, but
      now runs into the various SLAB_MATCH() sanity checks of the layout.
      
      This is because the rcu_head in struct page is in union with large
      sub-structures and has space to grow without exceeding their size, while
      in struct slab (for SLAB and SLUB) it's in union only with a list_head.
      
      On closer inspection (and after the previous patch) we can put all
      fields except slab_cache into a union with rcu_head, as slab_cache is
      sufficient for the rcu freeing callbacks to work and the rest can be
      overwritten by rcu_head without causing issues.
      
      This is only somewhat complicated by the need to keep SLUB's
      freelist+counters aligned for cmpxchg_double. As a result the fields
      need to be reordered so that slab_cache is first (after page flags) and
      the union with rcu_head follows. For consistency, do that for SLAB as
      well, although not necessary there.
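
      Schematically, the resulting layout looks like this (a simplified
      sketch, omitting most of the config-dependent fields of the real
      struct slab):

        struct slab {
                unsigned long __page_flags;     /* must stay first (page flags) */
                struct kmem_cache *slab_cache;  /* outside the union: the RCU
                                                 * freeing callback needs it */
                union {
                        struct {
                                struct list_head slab_list;
                                void *freelist;         /* SLUB: kept aligned
                                                         * for cmpxchg_double */
                                unsigned long counters;
                        };
                        struct rcu_head rcu_head;       /* free to grow in
                                                         * debug configs */
                };
                /* ... remaining fields (refcount etc.) unchanged ... */
        };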
      
      As a result, the rcu_head field in struct page and struct slab is no
      longer at the same offset, but that doesn't matter as there is no
      casting that would rely on that in the slab freeing callbacks, so we can
      just drop the respective SLAB_MATCH() check.
      
      Also we need to update the SLAB_MATCH() for compound_head to reflect the
      new ordering.
      
      While at it, also add a static_assert to check the alignment needed for
      cmpxchg_double so mistakes are found sooner than a runtime GPF.
      
      [1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/
      Reported-by: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      130d4df5
    • mm/migrate: make isolate_movable_page() skip slab pages · 8b881763
      Vlastimil Babka authored
      In the next commit we want to rearrange struct slab fields to allow a
      larger rcu_head. Afterwards, the page->mapping field will overlap with
      SLUB's "struct list_head slab_list", where the value of the prev
      pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA.
      Unfortunately, bit 1 being set can confuse PageMovable() into a false
      positive and cause a GPF, as reported by lkp [1].

      To fix this, make isolate_movable_page() skip pages with the PageSlab
      flag set. This is a bit tricky, as we need to add memory barriers to
      SLAB's and SLUB's page allocation and freeing, and their counterparts
      to isolate_movable_page().

      Based on my RFC from [2]. Added a comment update from Matthew's variant
      in [3] and, as done there, moved the PageSlab checks to happen before
      trying to take the page lock.
      
      [1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/
      [2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/
      [3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/
      Reported-by: kernel test robot <yujie.liu@intel.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      8b881763
    • mm/slab: move and adjust kernel-doc for kmem_cache_alloc · 838de63b
      Vlastimil Babka authored
      Alexander reports an issue with the kmem_cache_alloc() comment in
      mm/slab.c:
      
      > The current comment mentioned that the flags only matters if the
      > cache has no available objects. It's different for the __GFP_ZERO
      > flag which will ensure that the returned object is always zeroed
      > in any case.
      
      > I have the feeling I run into this question already two times if
      > the user need to zero the object or not, but the user does not need
      > to zero the object afterwards. However another use of __GFP_ZERO
      > and only zero the object if the cache has no available objects would
      > also make no sense.
      
      and thus suggests mentioning __GFP_ZERO as the exception. But on closer
      inspection, the part about flags being only relevant if the cache has
      no available objects is misleading. The slab user has no reliable way
      to determine if there are available objects, and e.g. the might_sleep()
      debug check can be performed even if objects are available, so passing
      the correct flags given the allocation context always matters.
      
      Thus remove that sentence completely, and while at it, move the comment
      from the SLAB-specific mm/slab.c to the common include/linux/slab.h.
      The comment otherwise refers to the flags description for kmalloc(), so
      add a __GFP_ZERO comment there and remove a very misleading
      GFP_HIGHUSER (not applicable to slab) description from it. Mention the
      kzalloc() and kmem_cache_zalloc() shortcuts.
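
      For example, the shortcuts mentioned above are equivalent to passing
      __GFP_ZERO explicitly (foo_cache is a hypothetical cache here):

        static void zalloc_example(struct kmem_cache *foo_cache)
        {
                /* all three return zeroed memory */
                void *a = kmem_cache_alloc(foo_cache, GFP_KERNEL | __GFP_ZERO);
                void *b = kmem_cache_zalloc(foo_cache, GFP_KERNEL);
                void *c = kzalloc(128, GFP_KERNEL);     /* kmalloc + __GFP_ZERO */

                kfree(c);
                kmem_cache_free(foo_cache, b);
                kmem_cache_free(foo_cache, a);
        }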
      Reported-by: Alexander Aring <aahringo@redhat.com>
      Link: https://lore.kernel.org/all/20221011145413.8025-1-aahringo@redhat.com/
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      838de63b
    • mm/slub, percpu: correct the calculation of early percpu allocation size · a0dc161a
      Baoquan He authored
      The SLUB allocator relies on the percpu allocator to initialize its
      ->cpu_slab during early boot. For that, the dynamic percpu chunk which
      serves early allocations needs to be large enough to satisfy the
      kmalloc cache creation.

      However, the current BUILD_BUG_ON() in alloc_kmem_cache_cpus() doesn't
      take into account that there are NR_KMALLOC_TYPES kmalloc cache arrays.
      Fix that with the correct calculation.
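
      With the fix, the check reads (as also quoted in the build error of the
      next commit):

        BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
                     NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH *
                     sizeof(struct kmem_cache_cpu));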
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      a0dc161a
    • percpu: adjust the value of PERCPU_DYNAMIC_EARLY_SIZE · e8753e41
      Baoquan He authored
      LKP reported a build failure as below on the following patch "mm/slub,
      percpu: correct the calculation of early percpu allocation size"
      
      ~~~~~~
      In file included from <command-line>:
      In function 'alloc_kmem_cache_cpus',
         inlined from 'kmem_cache_open' at mm/slub.c:4340:6:
      >> >> include/linux/compiler_types.h:357:45: error: call to '__compiletime_assert_474' declared with attribute error:
      BUILD_BUG_ON failed: PERCPU_DYNAMIC_EARLY_SIZE < NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH * sizeof(struct kmem_cache_cpu)
           357 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      ~~~~~~
      
      From the kernel config file provided by LKP, the build was done on
      arm64 with the Kconfig items below enabled:
      
        CONFIG_ZONE_DMA=y
        CONFIG_SLUB_CPU_PARTIAL=y
        CONFIG_DEBUG_LOCK_ALLOC=y
        CONFIG_SLUB_STATS=y
        CONFIG_ARM64_PAGE_SHIFT=16
        CONFIG_ARM64_64K_PAGES=y
      
      Then we will have:
        NR_KMALLOC_TYPES:4
        KMALLOC_SHIFT_HIGH:17
        sizeof(struct kmem_cache_cpu):184
      
      Their product is 12512, which is bigger than PERCPU_DYNAMIC_EARLY_SIZE
      (12K, i.e. 12288). Hence the BUILD_BUG_ON() in alloc_kmem_cache_cpus()
      triggers.
      
      Earlier, in commit 099a19d9 ("percpu: allow limited allocation
      before slab is online"), PERCPU_DYNAMIC_EARLY_SIZE was introduced and
      set to 12K, which was equal to the then PERCPU_DYNAMIC_RESERVE.
      Later, in commit 1a4d7607 ("percpu: implement asynchronous chunk
      population"), PERCPU_DYNAMIC_RESERVE was increased by 8K, while
      PERCPU_DYNAMIC_EARLY_SIZE was kept unchanged.

      So, increase PERCPU_DYNAMIC_EARLY_SIZE by 8K too, to accommodate
      SLUB's requirement.
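
      The change boils down to bumping the define (assumed form of the
      definition in include/linux/percpu.h):

        #define PERCPU_DYNAMIC_EARLY_SIZE       (20 << 10)      /* was (12 << 10) */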
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      e8753e41
  4. 11 Nov, 2022 1 commit
    • mm/slub: extend redzone check to extra allocated kmalloc space than requested · 946fa0db
      Feng Tang authored
      kmalloc will round up the request size to a fixed size (mostly a power
      of 2), so there can be extra space beyond what is requested, whose size
      is the actual buffer size minus the original request size.

      To better detect out-of-bounds access or abuse of this space, add a
      redzone sanity check for it.
      
      In the current kernel, some kmalloc users already know about the
      existence of this space and utilize it after calling ksize() to learn
      the real size of the allocated buffer. So skip the sanity check for
      objects on which ksize() has been called, treating them as legitimate
      users. Kees Cook is working on sanitizing all these use cases by using
      kmalloc_size_roundup() to avoid ambiguous usages, and after that is
      done, this special handling for ksize() can be removed.
      
      In some cases, the free pointer can be saved in the latter part of the
      object data area, which may overlap the redzone (for small kmalloc
      objects). As suggested by Hyeonggon Yoo, force the free pointer into
      the metadata area when kmalloc redzone debugging is enabled, so that
      all kmalloc objects are covered by the redzone check.
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      946fa0db
  5. 10 Nov, 2022 3 commits
  6. 07 Nov, 2022 1 commit
  7. 06 Nov, 2022 1 commit
    • mm/slab_common: Restore passing "caller" for tracing · 32868715
      Kees Cook authored
      The "caller" argument was accidentally being ignored in a few places
      that were recently refactored. Restore the use of these "caller"
      arguments instead of _RET_IP_.
      
      Fixes: 11e9734b ("mm/slab_common: unify NUMA and UMA version of tracepoints")
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: linux-mm@kvack.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      32868715
  8. 04 Nov, 2022 1 commit
  9. 03 Nov, 2022 1 commit
  10. 24 Oct, 2022 2 commits
    • mm/slub: perform free consistency checks before call_rcu · bc29d5bd
      Vlastimil Babka authored
      For SLAB_TYPESAFE_BY_RCU caches we use call_rcu to perform empty slab
      freeing. The rcu callback rcu_free_slab() calls __free_slab() that
      currently includes checking the slab consistency for caches with
      SLAB_CONSISTENCY_CHECKS flags. This check needs the slab->objects field
      to be intact.
      
      Because in the next patch we want to allow rcu_head in struct slab to
      become larger in debug configurations and thus potentially overwrite
      more fields through a union than slab_list, we want to limit the fields
      used in rcu_free_slab().  Thus move the consistency checks to
      free_slab() before call_rcu(). This can be done safely even for
      SLAB_TYPESAFE_BY_RCU caches where accesses to the objects can still
      occur after freeing them.
      
      As a result, only the slab->slab_cache field has to be physically
      separate from rcu_head for the freeing callback to work. We also save
      some cycles in the rcu callback for caches with consistency checks
      enabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      bc29d5bd
    • mm/slab: Annotate kmem_cache_node->list_lock as raw · b539ce9f
      Jiri Kosina authored
      The list_lock can be taken in hardirq context when do_drain() is being
      called via IPI on all cores, and therefore lockdep complains about it,
      because it can't be preempted on PREEMPT_RT.
      
      That's not a real issue, as SLAB can't be built on PREEMPT_RT anyway, but
      we still want to get rid of the warning on non-PREEMPT_RT builds.
      
      Annotate it therefore as a raw lock in order to get rid of the lockdep
      warning below.
      
      	 =============================
      	 [ BUG: Invalid wait context ]
      	 6.1.0-rc1-00134-ge35184f3 #4 Not tainted
      	 -----------------------------
      	 swapper/3/0 is trying to lock:
      	 ffff8bc88086dc18 (&parent->list_lock){..-.}-{3:3}, at: do_drain+0x57/0xb0
      	 other info that might help us debug this:
      	 context-{2:2}
      	 no locks held by swapper/3/0.
      	 stack backtrace:
      	 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.1.0-rc1-00134-ge35184f3 #4
      	 Hardware name: LENOVO 20K5S22R00/20K5S22R00, BIOS R0IET38W (1.16 ) 05/31/2017
      	 Call Trace:
      	  <IRQ>
      	  dump_stack_lvl+0x6b/0x9d
      	  __lock_acquire+0x1519/0x1730
      	  ? build_sched_domains+0x4bd/0x1590
      	  ? __lock_acquire+0xad2/0x1730
      	  lock_acquire+0x294/0x340
      	  ? do_drain+0x57/0xb0
      	  ? sched_clock_tick+0x41/0x60
      	  _raw_spin_lock+0x2c/0x40
      	  ? do_drain+0x57/0xb0
      	  do_drain+0x57/0xb0
      	  __flush_smp_call_function_queue+0x138/0x220
      	  __sysvec_call_function+0x4f/0x210
      	  sysvec_call_function+0x4b/0x90
      	  </IRQ>
      	  <TASK>
      	  asm_sysvec_call_function+0x16/0x20
      	 RIP: 0010:mwait_idle+0x5e/0x80
      	 Code: 31 d2 65 48 8b 04 25 80 ed 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 0b 78 46 00 31 c0 48 89 c1 fb 0f 01 c9 <eb> 06 fb 0f 1f 44 00 00 65 48 8b 04 25 80 ed 01 00 f0 80 60 02 df
      	 RSP: 0000:ffffa90940217ee0 EFLAGS: 00000246
      	 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      	 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9bb9f93a
      	 RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
      	 R10: ffffa90940217ea8 R11: 0000000000000000 R12: ffffffffffffffff
      	 R13: 0000000000000000 R14: ffff8bc88127c500 R15: 0000000000000000
      	  ? default_idle_call+0x1a/0xa0
      	  default_idle_call+0x4b/0xa0
      	  do_idle+0x1f1/0x2c0
      	  ? _raw_spin_unlock_irqrestore+0x56/0x70
      	  cpu_startup_entry+0x19/0x20
      	  start_secondary+0x122/0x150
      	  secondary_startup_64_no_verify+0xce/0xdb
      	  </TASK>
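
      Schematically, the change amounts to switching the lock and its
      lock/unlock sites to the raw_ variants (an illustrative sketch, not the
      full patch):

        struct kmem_cache_node {
                raw_spinlock_t list_lock;       /* was spinlock_t */
                /* ... other fields unchanged ... */
        };

        static void drain_example(struct kmem_cache_node *n)
        {
                raw_spin_lock(&n->list_lock);   /* was spin_lock() */
                /* drain per-cpu array caches back to the shared/free lists */
                raw_spin_unlock(&n->list_lock);
        }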
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      b539ce9f