Commits · 617666521385ba1a07f9388bc80d24941104e412 · Kirill Smelkov / linux

30 Nov, 2022 3 commits

Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-next · 61766652

Vlastimil Babka authored Dec 01, 2022

Add a new slub_kunit test for the extended kmalloc redzone check, by
Feng Tang. Also prevent unwanted kfence interaction with all slub kunit
tests.

61766652

mm/slub, kunit: Add a test case for kmalloc redzone check · 6cd6d33c

Feng Tang authored Nov 30, 2022

kmalloc redzone check for slub has been merged, and it's better to add
a kunit case for it, which is inspired by a real-world case as described
in commit 120ee599 ("staging: octeon-usb: prevent memory corruption"):

"
  octeon-hcd will crash the kernel when SLOB is used. This usually happens
  after the 18-byte control transfer when a device descriptor is read.
  The DMA engine is always transferring full 32-bit words and if the
  transfer is shorter, some random garbage appears after the buffer.
  The problem is not visible with SLUB since it rounds up the allocations
  to word boundary, and the extra bytes will go undetected.
"

To avoid interrupting the normal functioning of kmalloc caches, a
kmem_cache mimicing kmalloc cache is created with similar flags, and
kmalloc_trace() is used to really test the orig_size and redzone setup.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

6cd6d33c

mm/slub, kunit: add SLAB_SKIP_KFENCE flag for cache creation · 4d9dd4b0

Feng Tang authored Nov 30, 2022

When kfence is enabled, the buffer allocated from the test case
could be from a kfence pool, and the operation could be also
caught and reported by kfence first, causing the case to fail.

With default kfence setting, this is very difficult to be triggered.
By changing CONFIG_KFENCE_NUM_OBJECTS from 255 to 16383, and
CONFIG_KFENCE_SAMPLE_INTERVAL from 100 to 5, the allocation from
kfence did hit 7 times in different slub_kunit cases out of 900
times of boot test.

To avoid this, initially we tried is_kfence_address() to check this
and repeated allocation till finding a non-kfence address. Vlastimil
Babka suggested SLAB_SKIP_KFENCE flag could be used to achieve this,
and better add a wrapper function for simplifying cache creation.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

4d9dd4b0

21 Nov, 2022 14 commits

Merge branch 'slab/for-6.2/alloc_size' into slab/for-next · b5e72d27

Vlastimil Babka authored Nov 21, 2022

Two patches from Kees Cook [1]:

These patches work around a deficiency in GCC (>=11) and Clang (<16)
where the __alloc_size attribute does not apply to inlines.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503

This manifests as reduced overflow detection coverage for many allocation
sites under CONFIG_FORTIFY_SOURCE=y, where the allocation size was not
actually being propagated to __builtin_dynamic_object_size().

[1] https://lore.kernel.org/all/20221118034713.gonna.754-kees@kernel.org/

b5e72d27

Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-next · 90e9b23a

Vlastimil Babka authored Nov 11, 2022

kmalloc() redzone improvements by Feng Tang

From cover letter [1]:

kmalloc's API family is critical for mm, and one of its nature is that
it will round up the request size to a fixed one (mostly power of 2).
When user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
could be allocated, so there is an extra space than what is originally
requested.

This patchset tries to extend the redzone sanity check to the extra
kmalloced buffer than requested, to better detect un-legitimate access
to it. (depends on SLAB_STORE_USER & SLAB_RED_ZONE)

[1] https://lore.kernel.org/all/20221021032405.1825078-1-feng.tang@intel.com/

90e9b23a

Merge branch 'slab/for-6.2/fit_rcu_head' into slab/for-next · 76537db3

Vlastimil Babka authored Nov 10, 2022

A series by myself to reorder fields in struct slab to allow the
embedded rcu_head to grow (for debugging purposes). Requires changes to
isolate_movable_page() to skip slab pages which can otherwise become
false-positive __PageMovable due to its use of low bits in
page->mapping.

76537db3

Merge branch 'slab/for-6.2/tools' into slab/for-next · 1c1aaa33
Vlastimil Babka authored Nov 10, 2022
```
A patch for tools/vm/slabinfo to give more useful feedback when not run
as a root, by Rong Tao.
```
1c1aaa33

Merge branch 'slab/for-6.2/slub-sysfs' into slab/for-next · c64b95d3

Vlastimil Babka authored Nov 10, 2022

- Two patches for SLUB's sysfs by Rasmus Villemoes to remove dead code
  and optimize boot time with late initialization.
- Allow SLUB's sysfs 'failslab' parameter to be runtime-controllable
  again as it can be both useful and safe, by Alexander Atanasov.

c64b95d3

Merge branch 'slab/for-6.2/locking' into slab/for-next · 14d3eb66

Vlastimil Babka authored Nov 10, 2022

A patch from Jiri Kosina that makes SLAB's list_lock a raw_spinlock_t.
While there are no plans to make SLAB actually compatible with
PREEMPT_RT or any other future, it makes !PREEMPT_RT lockdep happy.

14d3eb66

Merge branch 'slab/for-6.2/cleanups' into slab/for-next · 4b28ba9e

Vlastimil Babka authored Nov 10, 2022

- Removal of dead code from deactivate_slab() by Hyeonggon Yoo.
- Fix of BUILD_BUG_ON() for sufficient early percpu size by Baoquan He.
- Make kmem_cache_alloc() kernel-doc less misleading, by myself.

4b28ba9e

slab: Remove special-casing of const 0 size allocations · 6fa57d78

Kees Cook authored Nov 17, 2022

Passing a constant-0 size allocation into kmalloc() or kmalloc_node()
does not need to be a fast-path operation, so the static return value
can be removed entirely. This makes sure that all paths through the
inlines result in a full extern function call, where __alloc_size()
hints will actually be seen[1] by GCC. (A constant return value of 0
means the "0" allocation size won't be propagated by the inline.)

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

6fa57d78

slab: Clean up SLOB vs kmalloc() definition · 3bf01933

Kees Cook authored Nov 17, 2022

As already done for kmalloc_node(), clean up the #ifdef usage in the
definition of kmalloc() so that the SLOB-only version is an entirely
separate and much more readable function.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

3bf01933

mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head · 130d4df5

Vlastimil Babka authored Aug 26, 2022

Joel reports [1] that increasing the rcu_head size for debugging
purposes used to work before struct slab was split from struct page, but
now runs into the various SLAB_MATCH() sanity checks of the layout.

This is because the rcu_head in struct page is in union with large
sub-structures and has space to grow without exceeding their size, while
in struct slab (for SLAB and SLUB) it's in union only with a list_head.

On closer inspection (and after the previous patch) we can put all
fields except slab_cache to a union with rcu_head, as slab_cache is
sufficient for the rcu freeing callbacks to work and the rest can be
overwritten by rcu_head without causing issues.

This is only somewhat complicated by the need to keep SLUB's
freelist+counters aligned for cmpxchg_double. As a result the fields
need to be reordered so that slab_cache is first (after page flags) and
the union with rcu_head follows. For consistency, do that for SLAB as
well, although not necessary there.

As a result, the rcu_head field in struct page and struct slab is no
longer at the same offset, but that doesn't matter as there is no
casting that would rely on that in the slab freeing callbacks, so we can
just drop the respective SLAB_MATCH() check.

Also we need to update the SLAB_MATCH() for compound_head to reflect the
new ordering.

While at it, also add a static_assert to check the alignment needed for
cmpxchg_double so mistakes are found sooner than a runtime GPF.

[1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

130d4df5

mm/migrate: make isolate_movable_page() skip slab pages · 8b881763

Vlastimil Babka authored Nov 04, 2022

In the next commit we want to rearrange struct slab fields to allow a larger
rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct
list_head slab_list", where the value of prev pointer can become LIST_POISON2,
which is 0x122 + POISON_POINTER_DELTA.  Unfortunately the bit 1 being set can
confuse PageMovable() to be a false positive and cause a GPF as reported by lkp
[1].

To fix this, make isolate_movable_page() skip pages with the PageSlab flag set.
This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page
allocation and freeing, and their counterparts to isolate_movable_page().

Based on my RFC from [2]. Added a comment update from Matthew's variant in [3]
and, as done there, moved the PageSlab checks to happen before trying to take
the page lock.

[1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/
[2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/
[3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

8b881763

mm/slab: move and adjust kernel-doc for kmem_cache_alloc · 838de63b

Vlastimil Babka authored Nov 10, 2022

Alexander reports an issue with the kmem_cache_alloc() comment in
mm/slab.c:

> The current comment mentioned that the flags only matters if the
> cache has no available objects. It's different for the __GFP_ZERO
> flag which will ensure that the returned object is always zeroed
> in any case.

> I have the feeling I run into this question already two times if
> the user need to zero the object or not, but the user does not need
> to zero the object afterwards. However another use of __GFP_ZERO
> and only zero the object if the cache has no available objects would
> also make no sense.

and suggests thus mentioning __GFP_ZERO as the exception. But on closer
inspection, the part about flags being only relevant if cache has no
available objects is misleading. The slab user has no reliable way to
determine if there are available objects, and e.g. the might_sleep()
debug check can be performed even if objects are available, so passing
correct flags given the allocation context always matters.

Thus remove that sentence completely, and while at it, move the comment
to from SLAB-specific mm/slab.c to the common include/linux/slab.h
The comment otherwise refers flags description for kmalloc(), so add
__GFP_ZERO comment there and remove a very misleading GFP_HIGHUSER
(not applicable to slab) description from there. Mention kzalloc() and
kmem_cache_zalloc() shortcuts.
Reported-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/all/20221011145413.8025-1-aahringo@redhat.com/Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

838de63b

mm/slub, percpu: correct the calculation of early percpu allocation size · a0dc161a

Baoquan He authored Oct 24, 2022

SLUB allocator relies on percpu allocator to initialize its ->cpu_slab
during early boot. For that, the dynamic chunk of percpu which serves
the early allocation need be large enough to satisfy the kmalloc
creation.

However, the current BUILD_BUG_ON() in alloc_kmem_cache_cpus() doesn't
consider the kmalloc array with NR_KMALLOC_TYPES length. Fix that
with correct calculation.
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

a0dc161a

percpu: adjust the value of PERCPU_DYNAMIC_EARLY_SIZE · e8753e41

Baoquan He authored Nov 13, 2022

LKP reported a build failure as below on the following patch "mm/slub,
percpu: correct the calculation of early percpu allocation size"

~~~~~~
In file included from <command-line>:
In function 'alloc_kmem_cache_cpus',
   inlined from 'kmem_cache_open' at mm/slub.c:4340:6:
>> >> include/linux/compiler_types.h:357:45: error: call to '__compiletime_assert_474' declared with attribute error:
BUILD_BUG_ON failed: PERCPU_DYNAMIC_EARLY_SIZE < NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH * sizeof(struct kmem_cache_cpu)
     357 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
~~~~~~

From the kernel config file provided by LKP, the building was made on
arm64 with below Kconfig item enabled:

  CONFIG_ZONE_DMA=y
  CONFIG_SLUB_CPU_PARTIAL=y
  CONFIG_DEBUG_LOCK_ALLOC=y
  CONFIG_SLUB_STATS=y
  CONFIG_ARM64_PAGE_SHIFT=16
  CONFIG_ARM64_64K_PAGES=y

Then we will have:
  NR_KMALLOC_TYPES:4
  KMALLOC_SHIFT_HIGH:17
  sizeof(struct kmem_cache_cpu):184

The product of them is 12512, which is bigger than PERCPU_DYNAMIC_EARLY_SIZE,
12K. Hence, the BUILD_BUG_ON in alloc_kmem_cache_cpus() is triggered.

Earlier, in commit 099a19d9 ("percpu: allow limited allocation
before slab is online"), PERCPU_DYNAMIC_EARLY_SIZE was introduced and
set to 12K which is equal to the then PERPCU_DYNAMIC_RESERVE.
Later, in commit 1a4d7607 ("percpu: implement asynchronous chunk
population"), PERPCU_DYNAMIC_RESERVE was increased by 8K, while
PERCPU_DYNAMIC_EARLY_SIZE was kept unchanged.

So, here increase PERCPU_DYNAMIC_EARLY_SIZE by 8K too to accommodate to
the slub's requirement.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

e8753e41

11 Nov, 2022 1 commit

mm/slub: extend redzone check to extra allocated kmalloc space than requested · 946fa0db

Feng Tang authored Oct 21, 2022

kmalloc will round up the request size to a fixed size (mostly power
of 2), so there could be a extra space than what is requested, whose
size is the actual buffer size minus original request size.

To better detect out of bound access or abuse of this space, add
redzone sanity check for it.

In current kernel, some kmalloc user already knows the existence of
the space and utilizes it after calling 'ksize()' to know the real
size of the allocated buffer. So we skip the sanity check for objects
which have been called with ksize(), as treating them as legitimate
users. Kees Cook is working on sanitizing all these user cases,
by using kmalloc_size_roundup() to avoid ambiguous usages. And after
this is done, this special handling for ksize() can be removed.

In some cases, the free pointer could be saved inside the latter
part of object data area, which may overlap the redzone part(for
small sizes of kmalloc objects). As suggested by Hyeonggon Yoo,
force the free pointer to be in meta data area when kmalloc redzone
debug is enabled, to make all kmalloc objects covered by redzone
check.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

946fa0db

10 Nov, 2022 3 commits

mm: kasan: Extend kasan_metadata_size() to also cover in-object size · 5d1ba310

Feng Tang authored Oct 21, 2022

When kasan is enabled for slab/slub, it may save kasan' free_meta
data in the former part of slab object data area in slab object's
free path, which works fine.

There is ongoing effort to extend slub's debug function which will
redzone the latter part of kmalloc object area, and when both of
the debug are enabled, there is possible conflict, especially when
the kmalloc object has small size, as caught by 0Day bot [1].

To solve it, slub code needs to know the in-object kasan's meta
data size. Currently, there is existing kasan_metadata_size()
which returns the kasan's metadata size inside slub's metadata
area, so extend it to also cover the in-object meta size by
adding a boolean flag 'in_object'.

There is no functional change to existing code logic.

[1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

5d1ba310

mm/slub: only zero requested size of buffer for kzalloc when debug enabled · 9ce67395

Feng Tang authored Oct 21, 2022

kzalloc/kmalloc will round up the request size to a fixed size
(mostly power of 2), so the allocated memory could be more than
requested. Currently kzalloc family APIs will zero all the
allocated memory.

To detect out-of-bound usage of the extra allocated memory, only
zero the requested part, so that redzone sanity check could be
added to the extra space later.

For kzalloc users who will call ksize() later and utilize this
extra space, please be aware that the space is not zeroed any
more when debug is enabled. (Thanks to Kees Cook's effort to
sanitize all ksize() user cases [1], this won't be a big issue).

[1]. https://lore.kernel.org/all/20220922031013.2150682-1-keescook@chromium.org/#rSigned-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

9ce67395

tools/vm/slabinfo: indicates the cause of the EACCES error · 654058e6

Rong Tao authored Nov 08, 2022

If you don't run slabinfo with a superuser, return 0 when read_slab_dir()
reads get_obj_and_str("slabs", &t), because fopen() fails (sometimes
EACCES), causing slabcache() to return directly, without any error during
this time, we should tell the user about the EACCES problem instead of
running successfully($?=0) without any error printing.

 For example:
 $ ./slabinfo
 Permission denied, Try using superuser  <== What this submission did
 $ sudo ./slabinfo
 Name            Objects Objsize   Space Slabs/Part/Cpu  O/S O %Fr %Ef Flg
 Acpi-Namespace     5950      48  286.7K         65/0/5   85 0   0  99
 Acpi-Operand      13664      72  999.4K       231/0/13   56 0   0  98
 ...
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

654058e6

07 Nov, 2022 1 commit

mm, slab: remove duplicate kernel-doc comment for ksize() · c18c20f1

Vlastimil Babka authored Nov 07, 2022

Akira reports:

> "make htmldocs" reports duplicate C declaration of ksize() as follows:

> /linux/Documentation/core-api/mm-api:43: ./mm/slab_common.c:1428: WARNING: Duplicate C declaration, also defined at core-api/mm-api:212.
> Declaration is '.. c:function:: size_t ksize (const void *objp)'.

> This is due to the kernel-doc comment for ksize() declaration added in
> include/linux/slab.h by commit 05a94065 ("slab: Introduce
> kmalloc_size_roundup()").

There is an older kernel-doc comment for ksize() definition in
mm/slab_common.c, which is not only duplicated, but also contradicts the
new one - the additional storage discovered by ksize() should not be
used by callers anymore. Delete the old kernel-doc.
Reported-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/all/d33440f6-40cf-9747-3340-e54ffaf7afb8@gmail.com/
Fixes: 05a94065 ("slab: Introduce kmalloc_size_roundup()")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

c18c20f1

06 Nov, 2022 1 commit

mm/slab_common: Restore passing "caller" for tracing · 32868715

Kees Cook authored Nov 04, 2022

The "caller" argument was accidentally being ignored in a few places
that were recently refactored. Restore these "caller" arguments, instead
of _RET_IP_.

Fixes: 11e9734b ("mm/slab_common: unify NUMA and UMA version of tracepoints")
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

32868715

04 Nov, 2022 1 commit

mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace() · eb4940d4

Vlastimil Babka authored Nov 04, 2022

For !CONFIG_TRACING kernels, the kmalloc() implementation tries (in cases where
the allocation size is build-time constant) to save a function call, by
inlining kmalloc_trace() to a kmem_cache_alloc() call.

However since commit 6edf2576 ("mm/slub: enable debugging memory wasting of
kmalloc") this path now fails to pass the original request size to be
eventually recorded (for kmalloc caches with debugging enabled).

We could adjust the code to call __kmem_cache_alloc_node() as the
CONFIG_TRACING variant, but that would as a result inline a call with 5
parameters, bloating the kmalloc() call sites. The cost of extra function
call (to kmalloc_trace()) seems like a lesser evil.

It also appears that the !CONFIG_TRACING variant is incompatible with upcoming
hardening efforts [1] so it's easier if we just remove it now. Kernels with no
tracing are rare these days and the benefit is dubious anyway.

[1] https://lore.kernel.org/linux-mm/20221101222520.never.109-kees@kernel.org/T/#m20ecf14390e406247bde0ea9cce368f469c539ed

Link: https://lore.kernel.org/all/097d8fba-bd10-a312-24a3-a4068c4f424c@suse.cz/Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

eb4940d4

03 Nov, 2022 1 commit

mm/slab_common: repair kernel-doc for __ksize() · a2076201

Lukas Bulwahn authored Oct 31, 2022

Commit 445d41d7 ("Merge branch 'slab/for-6.1/kmalloc_size_roundup' into
slab/for-next") resolved a conflict of two concurrent changes to __ksize().

However, it did not adjust the kernel-doc comment of __ksize(), while the
name of the argument to __ksize() was renamed.

Hence, ./scripts/ kernel-doc -none mm/slab_common.c warns about it.

Adjust the kernel-doc comment for __ksize() for make W=1 happiness.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

a2076201

24 Oct, 2022 6 commits

mm/slub: perform free consistency checks before call_rcu · bc29d5bd

Vlastimil Babka authored Aug 26, 2022

For SLAB_TYPESAFE_BY_RCU caches we use call_rcu to perform empty slab
freeing. The rcu callback rcu_free_slab() calls __free_slab() that
currently includes checking the slab consistency for caches with
SLAB_CONSISTENCY_CHECKS flags. This check needs the slab->objects field
to be intact.

Because in the next patch we want to allow rcu_head in struct slab to
become larger in debug configurations and thus potentially overwrite
more fields through a union than slab_list, we want to limit the fields
used in rcu_free_slab().  Thus move the consistency checks to
free_slab() before call_rcu(). This can be done safely even for
SLAB_TYPESAFE_BY_RCU caches where accesses to the objects can still
occur after freeing them.

As a result, only the slab->slab_cache field has to be physically
separate from rcu_head for the freeing callback to work. We also save
some cycles in the rcu callback for caches with consistency checks
enabled.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

bc29d5bd

mm/slab: Annotate kmem_cache_node->list_lock as raw · b539ce9f

Jiri Kosina authored Oct 21, 2022

The list_lock can be taken in hardirq context when do_drain() is being
called via IPI on all cores, and therefore lockdep complains about it,
because it can't be preempted on PREEMPT_RT.

That's not a real issue, as SLAB can't be built on PREEMPT_RT anyway, but
we still want to get rid of the warning on non-PREEMPT_RT builds.

Annotate it therefore as a raw lock in order to get rid of he lockdep
warning below.

	 =============================
	 [ BUG: Invalid wait context ]
	 6.1.0-rc1-00134-ge35184f3 #4 Not tainted
	 -----------------------------
	 swapper/3/0 is trying to lock:
	 ffff8bc88086dc18 (&parent->list_lock){..-.}-{3:3}, at: do_drain+0x57/0xb0
	 other info that might help us debug this:
	 context-{2:2}
	 no locks held by swapper/3/0.
	 stack backtrace:
	 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.1.0-rc1-00134-ge35184f3 #4
	 Hardware name: LENOVO 20K5S22R00/20K5S22R00, BIOS R0IET38W (1.16 ) 05/31/2017
	 Call Trace:
	  <IRQ>
	  dump_stack_lvl+0x6b/0x9d
	  __lock_acquire+0x1519/0x1730
	  ? build_sched_domains+0x4bd/0x1590
	  ? __lock_acquire+0xad2/0x1730
	  lock_acquire+0x294/0x340
	  ? do_drain+0x57/0xb0
	  ? sched_clock_tick+0x41/0x60
	  _raw_spin_lock+0x2c/0x40
	  ? do_drain+0x57/0xb0
	  do_drain+0x57/0xb0
	  __flush_smp_call_function_queue+0x138/0x220
	  __sysvec_call_function+0x4f/0x210
	  sysvec_call_function+0x4b/0x90
	  </IRQ>
	  <TASK>
	  asm_sysvec_call_function+0x16/0x20
	 RIP: 0010:mwait_idle+0x5e/0x80
	 Code: 31 d2 65 48 8b 04 25 80 ed 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 0b 78 46 00 31 c0 48 89 c1 fb 0f 01 c9 <eb> 06 fb 0f 1f 44 00 00 65 48 8b 04 25 80 ed 01 00 f0 80 60 02 df
	 RSP: 0000:ffffa90940217ee0 EFLAGS: 00000246
	 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
	 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9bb9f93a
	 RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
	 R10: ffffa90940217ea8 R11: 0000000000000000 R12: ffffffffffffffff
	 R13: 0000000000000000 R14: ffff8bc88127c500 R15: 0000000000000000
	  ? default_idle_call+0x1a/0xa0
	  default_idle_call+0x4b/0xa0
	  do_idle+0x1f1/0x2c0
	  ? _raw_spin_unlock_irqrestore+0x56/0x70
	  cpu_startup_entry+0x19/0x20
	  start_secondary+0x122/0x150
	  secondary_startup_64_no_verify+0xce/0xdb
	  </TASK>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

b539ce9f

mm/slub: remove dead code for debug caches on deactivate_slab() · a8e53869

Hyeonggon Yoo authored Oct 14, 2022

After commit c7323a5a ("mm/slub: restrict sysfs validation to debug
caches and make it safe"), SLUB never installs percpu slab for debug caches
and thus never deactivates percpu slab for them.

Since only debug caches use the full list, SLUB no longer deactivates to
full list. Remove dead code in deactivate_slab().
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

a8e53869

mm: Make failslab writable again · 7c82b3b3

Alexander Atanasov authored Sep 20, 2022

In (060807f8 mm, slub: make remaining slub_debug related attributes
read-only) failslab was made read-only.
I think it became a collateral victim to the two other options for which
the reasons are perfectly valid.
Here is why:
 - sanity_checks and trace are slab internal debug options,
   failslab is used for fault injection.
 - for fault injections, which by presumption are random, it
   does not matter if it is not set atomically. And you need to
   set atleast one more option to trigger fault injection.
 - in a testing scenario you may need to change it at runtime
   example: module loading - you test all allocations limited
   by the space option. Then you move to test only your module's
   own slabs.
 - when set by command line flags it effectively disables all
   cache merges.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.czSigned-off-by: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

7c82b3b3

mm: slub: make slab_sysfs_init() a late_initcall · 1a5ad30b

Rasmus Villemoes authored Sep 30, 2022

Currently, slab_sysfs_init() is an __initcall aka device_initcall. It
is rather time-consuming; on my board it takes around 11ms. That's
about 1% of the time budget I have from U-Boot letting go and until
linux must assume responsibility of keeping the external watchdog
happy.

There's no particular reason this would need to run at device_initcall
time, so instead make it a late_initcall to allow vital functionality
to get started a bit sooner.

This actually ends up winning more than just those 11ms, because the
slab caches that get created during other device_initcalls (and before
my watchdog device gets probed) now don't end up doing the somewhat
expensive sysfs_slab_add() themselves. Some example lines (with
initcall_debug set) before/after:

initcall ext4_init_fs+0x0/0x1ac returned 0 after 1386 usecs
initcall journal_init+0x0/0x138 returned 0 after 517 usecs
initcall init_fat_fs+0x0/0x68 returned 0 after 294 usecs

initcall ext4_init_fs+0x0/0x1ac returned 0 after 240 usecs
initcall journal_init+0x0/0x138 returned 0 after 32 usecs
initcall init_fat_fs+0x0/0x68 returned 0 after 18 usecs

Altogether, this means I now get to petting the watchdog around 17ms
sooner. [Of course, the time the other initcalls save is instead spent
in slab_sysfs_init(), which goes from 11ms to 16ms, so there's no
overall change in boot time.]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

1a5ad30b

mm: slub: remove dead and buggy code from sysfs_slab_add() · 979857ea

Rasmus Villemoes authored Sep 30, 2022

The function sysfs_slab_add() has two callers:

One is slab_sysfs_init(), which first initializes slab_kset, and only
when that succeeds sets slab_state to FULL, and then proceeds to call
sysfs_slab_add() for all previously created slabs.

The other is __kmem_cache_create(), but only after a

	if (slab_state <= UP)
		return 0;

check.

So in other words, sysfs_slab_add() is never called without
slab_kset (aka the return value of cache_kset()) being non-NULL.

And this is just as well, because if we ever did take this path and
called kobject_init(&s->kobj), and then later when called again from
slab_sysfs_init() would end up calling kobject_init_and_add(), we
would hit

	if (kobj->state_initialized) {
		/* do not error out as sometimes we can recover */
		pr_err("kobject (%p): tried to init an initialized object, something is seriously wrong.\n",
		dump_stack();
	}

in kobject.c.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

979857ea

23 Oct, 2022 9 commits

Linux 6.1-rc2 · 247f34f7
Linus Torvalds authored Oct 23, 2022

247f34f7

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 05b4ebd2

Linus Torvalds authored Oct 23, 2022

Pull kvm fixes from Paolo Bonzini:
 "RISC-V:

   - Fix compilation without RISCV_ISA_ZICBOM

   - Fix kvm_riscv_vcpu_timer_pending() for Sstc

  ARM:

   - Fix a bug preventing restoring an ITS containing mappings for very
     large and very sparse device topology

   - Work around a relocation handling error when compiling the nVHE
     object with profile optimisation

   - Fix for stage-2 invalidation holding the VM MMU lock for too long
     by limiting the walk to the largest block mapping size

   - Enable stack protection and branch profiling for VHE

   - Two selftest fixes

  x86:

   - add compat implementation for KVM_X86_SET_MSR_FILTER ioctl

  selftests:

   - synchronize includes between include/uapi and tools/include/uapi"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  tools: include: sync include/api/linux/kvm.h
  KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTER
  KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()
  kvm: Add support for arch compat vm ioctls
  RISC-V: KVM: Fix kvm_riscv_vcpu_timer_pending() for Sstc
  RISC-V: Fix compilation without RISCV_ISA_ZICBOM
  KVM: arm64: vgic: Fix exit condition in scan_its_table()
  KVM: arm64: nvhe: Fix build with profile optimization
  KVM: selftests: Fix number of pages for memory slot in memslot_modification_stress_test
  KVM: arm64: selftests: Fix multiple versions of GIC creation
  KVM: arm64: Enable stack protection and branch profiling for VHE
  KVM: arm64: Limit stage2_apply_range() batch size to largest block
  KVM: arm64: Work out supported block level at compile time

05b4ebd2

Revert "mfd: syscon: Remove repetition of the regmap_get_val_endian()" · ca4582c2

Jason A. Donenfeld authored Oct 08, 2022

This reverts commit 72a95859.

It broke reboots on big-endian MIPS and MIPS64 malta QEMU instances,
which use the syscon driver.  Little-endian is not effected, which means
likely it's important to handle regmap_get_val_endian() in this function
after all.

Fixes: 72a95859 ("mfd: syscon: Remove repetition of the regmap_get_val_endian()")
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Lee Jones <lee@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ca4582c2

kernel/utsname_sysctl.c: Fix hostname polling · 52826d3b

Linus Torvalds authored Oct 23, 2022

Commit bfca3dd3 ("kernel/utsname_sysctl.c: print kernel arch") added
a new entry to the uts_kern_table[] array, but didn't update the
UTS_PROC_xyz enumerators of older entries, breaking anything that used
them.

Which is admittedly not many cases: it's really just the two uses of
uts_proc_notify() in kernel/sys.c. But apparently journald-systemd
actually uses this to detect hostname changes.
Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Fixes: bfca3dd3 ("kernel/utsname_sysctl.c: print kernel arch")
Link: https://lore.kernel.org/lkml/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/0c2b92a6-0f25-9538-178f-eee3b06da23f@secunet.com/
Cc: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

52826d3b

Merge tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a7038524

Linus Torvalds authored Oct 23, 2022

Pull perf fixes from Borislav Petkov:

 - Fix raw data handling when perf events are used in bpf

 - Rework how SIGTRAPs get delivered to events to address a bunch of
   problems with it. Add a selftest for that too

* tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  bpf: Fix sample_flags for bpf_perf_event_output
  selftests/perf_events: Add a SIGTRAP stress test with disables
  perf: Fix missing SIGTRAPs

a7038524

Merge tag 'sched_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c70055d8

Linus Torvalds authored Oct 23, 2022

Pull scheduler fixes from Borislav Petkov:

 - Adjust code to not trip up CFI

 - Fix sched group cookie matching

* tag 'sched_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Introduce struct balance_callback to avoid CFI mismatches
  sched/core: Fix comparison in sched_group_cookie_match()

c70055d8

Merge tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6204a81a

Linus Torvalds authored Oct 23, 2022

Pull objtool fix from Borislav Petkov:

 - Fix ORC stack unwinding when GCOV is enabled

* tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fix unreliable stack dump with gcov

6204a81a

Merge tag 'x86_urgent_for_v6.0_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 295dad10

Linus Torvalds authored Oct 23, 2022

Pull x86 fixes from Borislav Petkov:
 "As usually the case, right after a major release, the tip urgent
  branches accumulate a couple more fixes than normal. And here is the
  x86, a bit bigger, urgent pile.

   - Use the correct CPU capability clearing function on the error path
     in Intel perf LBR

   - A CFI fix to ftrace along with a simplification

   - Adjust handling of zero capacity bit mask for resctrl cache
     allocation on AMD

   - A fix to the AMD microcode loader to attempt patch application on
     every logical thread

   - A couple of topology fixes to handle CPUID leaf 0x1f enumeration
     info properly

   - Drop a -mabi=ms compiler option check as both compilers support it
     now anyway

   - A couple of fixes to how the initial, statically allocated FPU
     buffer state is setup and its interaction with dynamic states at
     runtime"

* tag 'x86_urgent_for_v6.0_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly
  perf/x86/intel/lbr: Use setup_clear_cpu_cap() instead of clear_cpu_cap()
  ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()
  x86/ftrace: Remove ftrace_epilogue()
  x86/resctrl: Fix min_cbm_bits for AMD
  x86/microcode/AMD: Apply the patch early on every logical thread
  x86/topology: Fix duplicated core ID within a package
  x86/topology: Fix multiple packages shown on a single-package system
  hwmon/coretemp: Handle large core ID value
  x86/Kconfig: Drop check for -mabi=ms for CONFIG_EFI_STUB
  x86/fpu: Exclude dynamic states from init_fpstate
  x86/fpu: Fix the init_fpstate size check with the actual size
  x86/fpu: Configure init_fpstate attributes orderly

295dad10

Merge tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux · 942e01ab

Linus Torvalds authored Oct 23, 2022

Pull io_uring follow-up from Jens Axboe:
 "Currently the zero-copy has automatic fallback to normal transmit, and
  it was decided that it'd be cleaner to return an error instead if the
  socket type doesn't support it.

  Zero-copy does work with UDP and TCP, it's more of a future proofing
  kind of thing (eg for samba)"

* tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux:
  io_uring/net: fail zc sendmsg when unsupported by socket
  io_uring/net: fail zc send when unsupported by socket
  net: flag sockets supporting msghdr originated zerocopy

942e01ab