- 28 Mar, 2023 33 commits
-
-
Kefeng Wang authored
Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-4-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-3-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Patch series "mm: remove cgroup_throttle_swaprate() completely", v2. Convert all the caller functions of cgroup_throttle_swaprate() to use folios, and use folio_throttle_swaprate(), which allows us to remove cgroup_throttle_swaprate() completely. This patch (of 7): Convert from page to folio within __do_huge_pmd_anonymous_page(), as we need the precise page which is to be stored at this PTE in the folio, the function still keep a page as the parameter. Link: https://lkml.kernel.org/r/20230302115835.105364-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20230302115835.105364-2-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Collingbourne authored
Instead of changing the page's tag solely in order to obtain a pointer with a match-all tag and then changing it back again, just convert the pointer that we get from kmap_atomic() into one with a match-all tag before passing it to clear_page(). On a certain microarchitecture, this has been observed to cause a measurable improvement in microbenchmark performance, presumably as a result of being able to avoid the atomic operations on the page tag. Link: https://lkml.kernel.org/r/20230216195924.3287772-1-pcc@google.comSigned-off-by: Peter Collingbourne <pcc@google.com> Link: https://linux-review.googlesource.com/id/I0249822cc29097ca7a04ad48e8eb14871f80e711Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
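The idea amounts to stripping the KASAN tag from the kernel mapping before the clear, rather than retagging the page itself. A minimal sketch along the lines of the clear_highpage_kasan_tagged() helper (treat this as illustrative rather than the exact hunk of the patch):

    static inline void clear_highpage_kasan_tagged(struct page *page)
    {
            void *kaddr = kmap_atomic(page);

            /*
             * Clear through a match-all (untagged) pointer so the page's
             * KASAN tag does not have to be changed and restored.
             */
            clear_page(kasan_reset_tag(kaddr));
            kunmap_atomic(kaddr);
    }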
-
Ivan Orlov authored
There are several 'malloc' calls in test_memcontrol which can be unsuccessful. This patch adds checks for 'malloc' failures in order to give more detail about the test's failure reasons and to avoid possible undefined behavior from a future NULL dereference (like the one in the alloc_anon_50M_check_swap function). Link: https://lkml.kernel.org/r/20230226131634.34366-1-ivan.orlov0322@gmail.com Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
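A hedged sketch of the kind of guard being added (the buffer name and error handling here are illustrative, not lifted from test_memcontrol.c):

    char *buf = malloc(size);

    if (buf == NULL) {
            fprintf(stderr, "malloc() failed\n");
            return -1;      /* report the failure instead of dereferencing NULL later */
    }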
-
Uros Bizjak authored
Use atomic_try_cmpxchg() instead of the atomic_cmpxchg(*ptr, old, new) == old pattern in set_tlb_ubc_flush_pending(). The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after cmpxchg (and the related move instruction in front of cmpxchg). Also, try_cmpxchg() implicitly assigns the old *ptr value to "old" when cmpxchg fails. No functional change intended. Link: https://lkml.kernel.org/r/20230227214228.3533299-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
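In generic terms, the conversion trades the compare-the-return-value idiom for atomic_try_cmpxchg(), which reports success as a boolean (mapping to the ZF flag on x86) and refreshes the expected value on failure. A sketch of both forms (FLAG and the atomic variable are placeholders, not the actual tlb_ubc code):

    atomic_t v;
    int old;

    /* Old pattern: an explicit compare of cmpxchg's return value. */
    do {
            old = atomic_read(&v);
    } while (atomic_cmpxchg(&v, old, old | FLAG) != old);

    /* New pattern: failure refreshes "old" in place, no extra compare. */
    old = atomic_read(&v);
    do {
    } while (!atomic_try_cmpxchg(&v, &old, old | FLAG));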
-
Hyeonggon Yoo authored
Some page flags are stored in page_type rather than ->flags field. Use newly introduced page type %pGt in dump_page(). Below are some examples:

page:00000000da7184dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x101cb3
flags: 0x2ffff0000000000(node=0|zone=2|lastcpupid=0xffff)
page_type: 0xffffffff()
raw: 02ffff0000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: newly allocated page

page:00000000da7184dd refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x101cb3
flags: 0x2ffff0000000000(node=0|zone=2|lastcpupid=0xffff)
page_type: 0xffffff7f(buddy)
raw: 02ffff0000000000 ffff88813fff8e80 ffff88813fff8e80 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
page dumped because: freed page

page:0000000042202316 refcount:3 mapcount:2 mapping:0000000000000000 index:0x7f634722a pfn:0x11994e
memcg:ffff888100135000
anon flags: 0x2ffff0000080024(uptodate|active|swapbacked|node=0|zone=2|lastcpupid=0xffff)
page_type: 0x1()
raw: 02ffff0000080024 0000000000000000 dead000000000122 ffff8881193398f1
raw: 00000007f634722a 0000000000000000 0000000300000001 ffff888100135000
page dumped because: user-mapped page

Link: https://lkml.kernel.org/r/20230130042514.2418-4-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hyeonggon Yoo authored
The %pGp format is used to display the 'flags' field of a struct page. However, some page flags (e.g. PG_buddy, see page-flags.h for more details) are stored in the page_type field. To display human-readable output of page_type, introduce the %pGt format. It is important to note that the meaning of bits is different in page_type: if page_type is 0xffffffff, no flags are set. Setting the PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f, so clearing a bit actually means setting a flag. Bits in page_type are therefore inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values; if it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part] Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
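As a usage sketch (a hypothetical call site, not taken from the patch), the new specifier is handed the address of the page_type word, just as %pGp takes the address of page->flags:

    /* prints e.g. "page_type: 0xffffff7f(buddy)" for a page in the buddy allocator */
    pr_alert("page_type: %pGt\n", &page->page_type);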
-
Hyeonggon Yoo authored
Patch series "mm, printk: introduce new format for page_type", v4. This series moves PG_slab page flag to page_type, freeing one bit in page->flags and introduces %pGt format that prints human-readable page_type like %pGp for printing page flags. See changelog of patch 2 for more implementation details. Thanks everyone that gave valuable comments. This patch (of 3): Use helper macro to decrease chances of typo when defining pageflag_names. Link: https://lkml.kernel.org/r/20230130042514.2418-1-42.hyeyoo@gmail.com Link: https://lore.kernel.org/lkml/Y6AycLbpjVzXM5I9@smile.fi.intel.com Link: https://lkml.kernel.org/r/20230130042514.2418-2-42.hyeyoo@gmail.comSigned-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Ogness <john.ogness@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Stefan Roesch authored
This adds the following tracepoints to ksm:

- start / stop scan
- ksm enter / exit
- merge a page
- merge a page with ksm
- remove a page
- remove a rmap item

This patch has been split off from the RFC patch series "mm: process/cgroup ksm support". Link: https://lkml.kernel.org/r/20230210214645.2720847-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Nicholas Piggin authored
On a 16-socket 192-core POWER8 system, the context_switch1_threads benchmark from will-it-scale (see earlier changelog), upstream can achieve a rate of about 1 million context switches per second, due to contention on the mm refcount. 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable the option. This increases the above benchmark to 118 million context switches per second. This generates 314 additional IPI interrupts on a 144 CPU system doing a kernel compile, which is in the noise in terms of kernel cycles. Link: https://lkml.kernel.org/r/20230203071837.1136453-6-npiggin@gmail.comSigned-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Nicholas Piggin authored
On big systems, the mm refcount can become highly contended when doing a lot of context switching with threaded applications. The user<->idle switch is one of the important cases. Abandoning lazy tlb entirely slows this switching down quite a bit in the common uncontended case, so that is not viable. Implement a scheme where lazy tlb mm references do not contribute to the refcount; instead they get explicitly removed when the refcount reaches zero. The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they switch away from this mm to init_mm if it was being used as the lazy tlb mm. Enabling the shoot lazies option therefore requires that the arch ensures that mm_cpumask contains all CPUs that could possibly be using mm. A DEBUG_VM option IPIs every CPU in the system after this to ensure there are no references remaining before the mm is freed. The cost of shootdown IPIs could be an issue, but they have not been observed to be a serious problem with this scheme, because short-lived processes tend not to migrate CPUs much, therefore they don't get much chance to leave lazy tlb mm references on remote CPUs. There are a lot of options to reduce them if necessary, described in comments. The near-worst-case can be benchmarked with will-it-scale:

  context_switch1_threads -t $(($(nproc) / 2))

This will create nproc threads (nproc / 2 switching pairs) all sharing the same mm that spread over all CPUs so each CPU does thread->idle->thread switching. [ Rik came up with basically the same idea a few years ago, so credit to him for that. ] Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/ Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/ Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Nicholas Piggin authored
Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, so the patch introduces no functional change. Link: https://lkml.kernel.org/r/20230203071837.1136453-4-npiggin@gmail.comSigned-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Nicholas Piggin authored
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.comSigned-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
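A sketch of what the annotated helpers look like before any behavioural change (simplified; in this patch they simply wrap the existing refcount operations, and later patches in the series make them config-dependent):

    /* Grab/drop a reference held only for the lazy tlb mm. */
    static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
    {
            mmgrab(mm);
    }

    static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
    {
            mmdrop(mm);
    }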
-
Nicholas Piggin authored
Patch series "shoot lazy tlbs (lazy tlb refcount scalability improvement)", v7. This series improves scalability of context switching between user and kernel threads on large systems with a threaded process spread across a lot of CPUs. Discussion of v6 here: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/ This patch (of 5): Remove the special case avoiding refcounting when the mm to be used is the same as the kernel thread's active (lazy tlb) mm. kthread_use_mm() should not be such a performance critical path that this matters much. This simplifies a later change to lazy tlb mm refcounting. Link: https://lkml.kernel.org/r/20230203071837.1136453-1-npiggin@gmail.com Link: https://lkml.kernel.org/r/20230203071837.1136453-2-npiggin@gmail.comSigned-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Taejoon Song authored
The worst-case scenario on finding same element pages is that almost all elements are same at the first glance but only the last few elements are different. Since the same element tends to be grouped from the beginning of the pages, if we check the first element against the last element before looping through all elements, we might have some chances to quickly detect non-same element pages.

1. Test is done under LG webOS TV (64-bit arch)
2. Dump the swap-out pages (~819200 pages)
3. Analyze the pages with a simple test script which counts the iteration number and measures the speed off-line

Under 64-bit arch, the worst iteration count is PAGE_SIZE / 8 bytes = 512. The speed is based on the time to consume the page_same_filled() function only. The result, on average, is listed as below:

                                   Num of Iter    Speed(MB/s)
  Looping-Forward (Orig)               38            99265
  Looping-Backward                     36           102725
  Last-element-check (This Patch)      33           125072

The result shows that the average iteration count decreases by 13% and the speed increases by 25% with this patch. This patch does not increase the overall time complexity, though. I also ran a simpler version which uses a backward loop. Just looping backward also makes some improvement, but less than this patch. A similar change has already been made to zram in 90f82cbf ("zram: try to avoid worst-case scenario on same element pages"). Link: https://lkml.kernel.org/r/20230205190036.1730134-1-taejoon.song@lge.com Signed-off-by: Taejoon Song <taejoon.song@lge.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Taejoon Song <taejoon.song@lge.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <yjay.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
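The change amounts to comparing the first and last words before the linear scan; a hedged sketch of zswap's page_same_filled() with that early check (variable names approximate):

    static bool page_same_filled(void *ptr, unsigned long *value)
    {
            unsigned long *page = (unsigned long *)ptr;
            unsigned long val = page[0];
            unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;

            /* Cheap early exit for "almost same-filled" pages. */
            if (val != page[last_pos])
                    return false;

            for (pos = 1; pos < last_pos; pos++) {
                    if (val != page[pos])
                            return false;
            }

            *value = val;
            return true;
    }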
-
T.J. Alumbaugh authored
This patch improves the design doc. Specifically, 1. add a section for the per-memcg mm_struct list, and 2. add a section for the PID controller. Link: https://lkml.kernel.org/r/20230214035445.1250139-2-talumbau@google.comSigned-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
T.J. Alumbaugh authored
This patch cleans up the sysfs code. Specifically, 1. use sysfs_emit(), 2. use __ATTR_RW(), and 3. constify multi-gen LRU struct attribute_group. Link: https://lkml.kernel.org/r/20230214035445.1250139-1-talumbau@google.comSigned-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Ma Wupeng authored
Syzbot reports a warning in untrack_pfn(). Digging into the root cause, we found that this is due to a memory allocation failure in pmd_alloc_one(), produced by failslab. In copy_page_range(), memory allocation for the pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untracking the pfn of this empty vma, the warning happens. Here's a simplified flow:

dup_mm
  dup_mmap
    copy_page_range
      copy_p4d_range
        copy_pud_range
          copy_pmd_range
            pmd_alloc
              __pmd_alloc
                pmd_alloc_one
                  page = alloc_pages(gfp, 0);
                  if (!page)
                    return NULL;
mmput
  exit_mmap
    unmap_vmas
      unmap_single_vma
        untrack_pfn
          follow_phys
            WARN_ON_ONCE(1);

Since this vma was not generated successfully, we can clear the VM_PAT flag. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muhammad Usama Anjum authored
mwriteprotect_range() errors out if [start, end) doesn't fall in one VMA. We are facing a use case where multiple VMAs are present in one range of interest. For example, the following pseudocode reproduces the error which we are trying to fix:

- Allocate memory of size 16 pages with PROT_NONE with mmap
- Register userfaultfd
- Change protection of the first half (1 to 8 pages) of memory to PROT_READ | PROT_WRITE. This breaks the memory area into two VMAs.
- Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors out.

(A sketch of such a reproducer is given after this entry.) This is a simple use case where the user may or may not know that the memory area has been divided into multiple VMAs. We need an implementation which doesn't disrupt the already present users. So, keeping things simple, stop going over all the VMAs only if one of the VMAs hasn't been registered in WP mode. While at it, remove the unneeded error check as well. [akpm@linux-foundation.org: s/VM_WARN_ON_ONCE/VM_WARN_ONCE/ to fix build] Link: https://lkml.kernel.org/r/20230217105558.832710-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reported-by: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
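A hedged, self-contained version of the reproducer sketched above (error handling is mostly elided and the exact mmap flags/sizes are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    int main(void)
    {
            long psize = sysconf(_SC_PAGESIZE);
            size_t len = 16 * psize;
            char *mem = mmap(NULL, len, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

            struct uffdio_api api = { .api = UFFD_API };
            ioctl(uffd, UFFDIO_API, &api);

            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)mem, .len = len },
                    .mode = UFFDIO_REGISTER_MODE_WP,
            };
            ioctl(uffd, UFFDIO_REGISTER, &reg);

            /* Splits the 16-page area into two VMAs. */
            mprotect(mem, len / 2, PROT_READ | PROT_WRITE);

            /* Write-protect the whole range: used to error out, now succeeds. */
            struct uffdio_writeprotect wp = {
                    .range = { .start = (unsigned long)mem, .len = len },
                    .mode = UFFDIO_WRITEPROTECT_MODE_WP,
            };
            if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                    perror("UFFDIO_WRITEPROTECT");
            return 0;
    }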
-
Vlastimil Babka authored
Historically, we have performed sanity checks on all struct pages being allocated or freed, making sure they have no unexpected page flags or certain field values. This can detect insufficient cleanup and some cases of use-after-free, although on its own it can't always identify the culprit. The result is a warning and the "bad page" being leaked.

The checks do need some cpu cycles, so in 4.7 with commits 479f854a ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") and 4db7548c ("mm, page_alloc: defer debugging checks of freed pages until a PCP drain") they were no longer performed in the hot paths when allocating and freeing from pcplists, but only when pcplists are bypassed, refilled or drained. For debugging purposes, with CONFIG_DEBUG_VM enabled the checks were instead still done in the hot paths and not when refilling or draining pcplists. With 4462b32c ("mm, page_alloc: more extensive free page checking with debug_pagealloc"), enabling debug_pagealloc also moved the sanity checks back to the hot paths. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled, the checks are done both in the hot paths and in pcplist refill/drain.

Even though the non-debug default today might seem to be a sensible tradeoff between overhead and ability to detect bad pages, on closer look it's arguably not. As most allocations go through the pcplists, catching any bad pages when refilling or draining pcplists has only a small chance, insufficient for debugging or serious hardening purposes. On the other hand the cost of the checks is concentrated in the already expensive drain/refill batching operations, and those are done under the often contended zone lock. That was recently identified as an issue for page allocation and the zone lock contention reduced by moving the checks outside of the locked section with the patch "mm: reduce lock contention of pcp buffer refill", but the cost of the checks is still visible compared to their removal [1]. In the pcplist draining path free_pcppages_bulk() the checks are still done under zone->lock.

Thus, remove the checks from the pcplist refill and drain paths completely. Introduce a static key check_pages_enabled to control checks during page allocation and freeing (whether the pcplist is used or bypassed); a sketch is given below. The static key is enabled if either is true:

- kernel is built with CONFIG_DEBUG_VM=y (debugging)
- debug_pagealloc or page poisoning is boot-time enabled (debugging)
- init_on_alloc or init_on_free is boot-time enabled (hardening)

The resulting user visible changes:

- no checks when draining/refilling pcplists - less overhead, with likely no practical reduction of the ability to catch bad pages
- no checks when bypassing pcplists in the default config (no debugging/hardening) - less overhead etc. as above
- on typical hardened kernels [2], checks are now performed on each page allocation/free (previously only when bypassing/draining/refilling pcplists) - having init_on_alloc/init_on_free enabled should be sufficient indication for preferring more costly alloc/free operations for hardening purposes and we shouldn't need to introduce another toggle
- code (various wrappers) removal and simplification

[1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/
[2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/

[akpm@linux-foundation.org: coding-style cleanups] [akpm@linux-foundation.org: make check_pages_enabled static] Link: https://lkml.kernel.org/r/20230216095131.17336-1-vbabka@suse.cz Reported-by: Alexander Halbuer <halbuer@sra.uni-hannover.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
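A sketch of the mechanism described above (the guard placement is illustrative; check_new_page() stands for the pre-existing allocation-time check):

    /*
     * Enabled for CONFIG_DEBUG_VM builds, or at boot when debug_pagealloc,
     * page poisoning, init_on_alloc or init_on_free is enabled.
     */
    static DEFINE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);

    /* In the allocation/free hot paths: */
    if (static_branch_unlikely(&check_pages_enabled)) {
            if (check_new_page(page))
                    return true;
    }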
-
Alexander Halbuer authored
rmqueue_bulk() batches the allocation of multiple elements to refill the per-CPU buffers into a single hold of the zone lock. Each element is allocated and checked using check_pcp_refill(). The check touches every related struct page which is especially expensive for higher order allocations (huge pages). This patch reduces the time holding the lock by moving the check out of the critical section similar to rmqueue_buddy() which allocates a single element. Measurements of parallel allocation-heavy workloads show a reduction of the average huge page allocation latency of 50 percent for two cores and nearly 90 percent for 24 cores. Link: https://lkml.kernel.org/r/20230201162549.68384-1-halbuer@sra.uni-hannover.deSigned-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Thomas Weißschuh authored
Since commit ee6d3dd4 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Link: https://lkml.kernel.org/r/20230220-kobj_type-mm-cma-v1-1-45996cff1a81@weissschuh.netSigned-off-by: Thomas Weißschuh <linux@weissschuh.net> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
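For illustration, the change is simply marking the structure definition const (the release callback and attribute group names here are illustrative, patterned on mm/cma_sysfs.c):

    static const struct kobj_type cma_kobj_type = {
            .release        = cma_kobj_release,
            .sysfs_ops      = &kobj_sysfs_ops,
            .default_groups = cma_groups,
    };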
-
Peter Xu authored
If the memory charge failed, instead of returning the hpage but with an error, allow the function to clean up the folio properly, which is normally what a function should do in this case - either return successfully, or return with no side effect of partial runs with an indicated error. This will also avoid the caller calling mem_cgroup_uncharge() unnecessarily with either the anon or shmem path (even if it's safe to do so). Link: https://lkml.kernel.org/r/20230222195247.791227-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Stevens <stevensd@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muchun Song authored
The check of IS_ENABLED(CONFIG_PROC_SYSCTL) is unnecessary since register_sysctl_init() will be empty in this case. So, there is no warnings after removing the check. Link: https://lkml.kernel.org/r/20230223065947.64134-1-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florian Fainelli authored
Link: https://lkml.kernel.org/r/20230324130737.3360169-1-f.fainelli@gmail.comSigned-off-by: Florian Fainelli <f.fainelli@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Colin Ian King <colin.i.king@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Leonard Crestez <cdleonard@gmail.com> Cc: Qais Yousef <qyousef@layalina.io> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Vasily Averin <vasily.averin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Muchun Song authored
The struct pages could be discontiguous when the kfence pool is allocated via alloc_contig_pages() with CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAP. This may result in setting PG_slab and memcg_data at an arbitrary address (which may not even be in use as a struct page), which in the worst case might corrupt the kernel. So the iteration should use nth_page(). Link: https://lkml.kernel.org/r/20230323025003.94447-1-songmuchun@bytedance.com Fixes: 0ce20dd8 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
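In other words, the pool must be walked with nth_page() rather than plain struct page pointer arithmetic; a hedged sketch (set_page_state() is a stand-in for whatever per-page setup or teardown the loop performs):

    /* Buggy with SPARSEMEM && !SPARSEMEM_VMEMMAP: struct pages may be discontiguous. */
    for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++)
            set_page_state(&pages[i]);

    /* Fixed: nth_page() correctly crosses memory section boundaries. */
    for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++)
            set_page_state(nth_page(pages, i));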
-
Muchun Song authored
PG_slab and memcg_data are not reset when KFENCE fails to initialize the kfence pool at runtime, which leads to a "Bad page state" report when the kfence pool is freed back to the buddy allocator. The check of whether it is a compound head page seems unnecessary, since we already guarantee this when allocating the kfence pool; remove the check to simplify the code. Link: https://lkml.kernel.org/r/20230320030059.20189-1-songmuchun@bytedance.com Fixes: 0ce20dd8 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: Marco Elver <elver@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sjpark@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shiyang Ruan authored
In a dedupe comparison iter loop, the length of iomap_iter decreases because it reflects the remaining length after each iteration. The dedupe command will fail with -EIO if the range is larger than one page size and not aligned to the page size. It also reports a warning in dmesg:

[ 4338.498374] ------------[ cut here ]------------
[ 4338.498689] WARNING: CPU: 3 PID: 1415645 at fs/iomap/iter.c:16 ...

The compare function should use the min length of the current iters, not the total length. Link: https://lkml.kernel.org/r/1679469958-2-1-git-send-email-ruansy.fnst@fujitsu.com Fixes: 0e79e373 ("fsdax: dedupe: iter two files at the same time") Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shiyang Ruan authored
unshare copies data from the source to the destination. But if the source is HOLE or UNWRITTEN extents, we should zero the destination; otherwise the HOLE or UNWRITTEN part will be user-visible old data of the newly allocated extent. Found by running generic/649 while mounting with -o dax=always on pmem. Link: https://lkml.kernel.org/r/1679483469-2-1-git-send-email-ruansy.fnst@fujitsu.com Fixes: d984648e ("fsdax,xfs: port unshare to fsdax") Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Tiezhu Yang authored
We can see the following definition in kernel/locking/lockdep_internals.h:

  #define STACK_TRACE_HASH_SIZE (1 << CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS)

CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS is related to STACK_TRACE_HASH_SIZE, not MAX_STACK_TRACE_ENTRIES; fix it. Link: https://lkml.kernel.org/r/1679380508-20830-1-git-send-email-yangtiezhu@loongson.cn Fixes: 5dc33592 ("lockdep: Allow tuning tracing capacity constants.") Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
ye xingchen authored
The path for SCHED_DEBUG is /sys/kernel/debug/sched. So, SCHED_DEBUG should depend on DEBUG_FS, not PROC_FS. Link: https://lkml.kernel.org/r/202301291110098787982@zte.com.cnSigned-off-by: ye xingchen <ye.xingchen@zte.com.cn> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Leonard Göhrs authored
My very first kernel commit: e4e1d47c ("ALSA: ppc: remove redundant checks in PS3 driver probe") was sent with the umlaut in my last name transcribed (Göhrs -> Goehrs). Add a mailmap entry so all my commits use the same name. Link: https://lkml.kernel.org/r/20230321145525.1317230-1-l.goehrs@pengutronix.deSigned-off-by: Leonard Göhrs <l.goehrs@pengutronix.de> Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 26 Mar, 2023 7 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Linus Torvalds authored
Pull USB / Thunderbolt driver fixes from Greg KH:
 "Here are a small set of USB and Thunderbolt driver fixes for reported problems and a documentation update, for 6.3-rc4.

  Included in here are:

   - documentation update for uvc gadget driver
   - small thunderbolt driver fixes
   - cdns3 driver fixes
   - dwc3 driver fixes
   - dwc2 driver fixes
   - chipidea driver fixes
   - typec driver fixes
   - onboard_usb_hub device id updates
   - quirk updates

  All of these have been in linux-next with no reported problems"

* tag 'usb-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (30 commits)
  usb: dwc2: fix a race, don't power off/on phy for dual-role mode
  usb: dwc2: fix a devres leak in hw_enable upon suspend resume
  usb: chipidea: core: fix possible concurrent when switch role
  usb: chipdea: core: fix return -EINVAL if request role is the same with current role
  thunderbolt: Rename shadowed variables bit to interrupt_bit and auto_clear_bit
  thunderbolt: Disable interrupt auto clear for rings
  thunderbolt: Use const qualifier for `ring_interrupt_index`
  usb: gadget: Use correct endianness of the wLength field for WebUSB
  uas: Add US_FL_NO_REPORT_OPCODES for JMicron JMS583Gen 2
  usb: cdnsp: changes PCI Device ID to fix conflict with CNDS3 driver
  usb: cdns3: Fix issue with using incorrect PCI device function
  usb: cdnsp: Fixes issue with redundant Status Stage
  MAINTAINERS: make me a reviewer of USB/IP
  thunderbolt: Use scale field when allocating USB3 bandwidth
  thunderbolt: Limit USB3 bandwidth of certain Intel USB4 host routers
  thunderbolt: Call tb_check_quirks() after initializing adapters
  thunderbolt: Add missing UNSET_INBOUND_SBTX for retimer access
  thunderbolt: Fix memory leak in margining
  usb: dwc2: drd: fix inconsistent mode if role-switch-default-mode="host"
  docs: usb: Add documentation for the UVC Gadget
  ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Linus Torvalds authored
Pull scheduler fix from Borislav Petkov:

 - Fix a corner case where vruntime of a task is not being sanitized

* tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Sanitize vruntime of entity being migrated
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Linus Torvalds authored
Pull perf fix from Borislav Petkov:

 - Properly clear perf event status tracking in the AMD perf event overflow handler

* tag 'perf_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Always clear status for idx
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Linus Torvalds authored
Pull core fixes from Borislav Petkov:

 - Do the delayed RCU wakeup for kthreads in the proper order so that the former doesn't get ignored

 - A noinstr warning fix

* tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
  entry: Fix noinstr warning in __enter_from_user_mode()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Linus Torvalds authored
Pull x86 fixes from Borislav Petkov:

 - Add an AMX ptrace self test

 - Prevent a false-positive warning when retrieving the (invalid) address of dynamic FPU features in their init state which are not saved in init_fpstate at all

 - Randomize per-CPU entry areas only when KASLR is enabled

* tag 'x86_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/x86/amx: Add a ptrace test
  x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_uabi_buf()
  x86/mm: Do not shuffle CPU entry areas without KASLR
-
git://git.samba.org/sfrench/cifs-2.6 Linus Torvalds authored
Pull cifs client fixes from Steve French:
 "Twelve cifs/smb3 client fixes (most also for stable)

   - forced umount fix

   - fix for two perf regressions

   - reconnect fixes

   - small debugging improvements

   - multichannel fixes"

* tag 'smb3-client-fixes-6.3-rc3' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix unusable share after force unmount failure
  cifs: fix dentry lookups in directory handle cache
  smb3: lower default deferred close timeout to address perf regression
  cifs: fix missing unload_nls() in smb2_reconnect()
  cifs: avoid race conditions with parallel reconnects
  cifs: append path to open_enter trace event
  cifs: print session id while listing open files
  cifs: dump pending mids for all channels in DebugData
  cifs: empty interface list when server doesn't support query interfaces
  cifs: do not poll server interfaces too regularly
  cifs: lock chan_lock outside match_session
  cifs: check only tcon status on tcon related functions
-