Commits · e2e3fdc7d4afdb8e7ba981eba7827993f2d390a8 · Kirill Smelkov / linux

13 May, 2022 40 commits

swap: turn get_swap_page() into folio_alloc_swap() · e2e3fdc7

Matthew Wilcox (Oracle) authored May 12, 2022

This removes an assumption that a large folio is HPAGE_PMD_NR pages
in size.

Link: https://lkml.kernel.org/r/20220504182857.4013401-8-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

e2e3fdc7

vmscan: convert the writeback handling in shrink_page_list() to folios · d33e4e14

Matthew Wilcox (Oracle) authored May 12, 2022

Slightly more efficient due to fewer calls to compound_head().

Link: https://lkml.kernel.org/r/20220504182857.4013401-7-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

d33e4e14

vmscan: use folio_mapped() in shrink_page_list() · 1bee2c16

Matthew Wilcox (Oracle) authored May 12, 2022

Remove some legacy function calls.

Link: https://lkml.kernel.org/r/20220504182857.4013401-6-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

1bee2c16

mm: remove alloc_pages_vma() · adf88aa8

Matthew Wilcox (Oracle) authored May 12, 2022

All callers have now been converted to use vma_alloc_folio(), so convert
the body of alloc_pages_vma() to allocate folios instead.

Link: https://lkml.kernel.org/r/20220504182857.4013401-5-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

adf88aa8

alpha: fix alloc_zeroed_user_highpage_movable() · f9c668d2

Matthew Wilcox (Oracle) authored May 12, 2022

Due to a typo, the final argument to alloc_page_vma() didn't refer to a
real variable. This only affected CONFIG_NUMA, which was marked BROKEN in
2006 and removed from alpha in 2021. Found due to a refactoring patch.

Link: https://lkml.kernel.org/r/20220504182857.4013401-4-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

f9c668d2

mm/huge_memory: convert do_huge_pmd_anonymous_page() to use vma_alloc_folio() · cb196ee1

Matthew Wilcox (Oracle) authored May 12, 2022

Remove the use of this old API, eliminating a call to
prep_transhuge_page().

Link: https://lkml.kernel.org/r/20220504182857.4013401-3-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cb196ee1

shmem: convert shmem_alloc_hugepage() to use vma_alloc_folio() · dfe98499

Matthew Wilcox (Oracle) authored May 12, 2022

Patch series "Folio patches for 5.19", v2.


This patch (of 26):

For now, return the head page of the folio, but remove use of the old
alloc_pages_vma() API.

Link: https://lkml.kernel.org/r/20220504182857.4013401-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20220504182857.4013401-2-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

dfe98499

mm/shmem: remove duplicate include in memory.c · 54943a1a

Wan Jiabing authored May 12, 2022

Fix following checkincludes.pl warning:
mm/memory.c: linux/mm_inline.h is included more than once.

The include is in line 44. Remove the duplicated here.

Link: https://lkml.kernel.org/r/20220427064717.803019-1-wanjiabing@vivo.comSigned-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

54943a1a

mm/vmscan: don't use NUMA_NO_NODE as indicator of page on different node · ed657e55

Wei Yang authored May 12, 2022

Now we are sure there is at least one page on page_list, so it is safe to
get the nid of it. This means it is not necessary to use NUMA_NO_NODE as
an indicator for the beginning of iteration or a page on different node.

Link: https://lkml.kernel.org/r/20220429014426.29223-2-richard.weiyang@gmail.comSigned-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ed657e55

mm/vmscan: filter empty page_list at the beginning · 1ae65e27

Wei Yang authored May 12, 2022

node_page_list would always be !empty on finishing the loop, except
page_list is empty.

Let's handle empty page_list before doing any real work including touching
PF_MEMALLOC flag.

Link: https://lkml.kernel.org/r/20220429014426.29223-1-richard.weiyang@gmail.comSigned-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

1ae65e27

mm/vmscan: use helper folio_is_file_lru() · f19a27e3

Miaohe Lin authored May 12, 2022

Use helper folio_is_file_lru() to check whether folio is file lru.  Minor
readability improvement.

[linmiaohe@huawei.com: use folio_is_file_lru()]
  Link: https://lkml.kernel.org/r/20220428105802.21389-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220425111232.23182-7-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

f19a27e3

mm/vmscan: remove obsolete comment in kswapd_run · 4355e4b2

Miaohe Lin authored May 12, 2022

Since commit 6b700b5b ("mm/vmscan.c: remove cpu online notification
for now"), cpu online notification is removed.  So kswapd won't move to
proper cpus if cpus are hot-added.  Remove this obsolete comment.

Link: https://lkml.kernel.org/r/20220425111232.23182-6-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

4355e4b2

mm/vmscan: take all base pages of THP into account when race with speculative reference · 9aafcffc

Miaohe Lin authored May 12, 2022

If the page has buffers, shrink_page_list will try to free the buffer
mappings associated with the page and try to free the page as well.  In
the rare race with speculative reference, the page will be freed shortly
by speculative reference.  But nr_reclaimed is not incremented correctly
when we come across the THP.  We need to account all the base pages in
this case.

Link: https://lkml.kernel.org/r/20220425111232.23182-5-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

9aafcffc

mm/vmscan: introduce helper function reclaim_page_list() · 1fe47c0b

Miaohe Lin authored May 12, 2022

Introduce helper function reclaim_page_list() to eliminate the duplicated
code of doing shrink_page_list() and putback_lru_page.  Also we can
separate node reclaim from node page list operation this way.  No
functional change intended.

Link: https://lkml.kernel.org/r/20220425111232.23182-3-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

1fe47c0b

mm/vmscan: add a comment about MADV_FREE pages check in folio_check_dirty_writeback · 32a331a7

Miaohe Lin authored May 12, 2022

Patch series "A few cleanup and fixup patches for vmscan

This series contains a few patches to remove obsolete comment, introduce
helper to remove duplicated code and so no.  Also we take all base pages
of THP into account in rare race condition.  More details can be found in
the respective changelogs.


This patch (of 6):

The MADV_FREE pages check in folio_check_dirty_writeback is a bit hard to
follow.  Add a comment to make the code clear.

Link: https://lkml.kernel.org/r/20220425111232.23182-2-linmiaohe@huawei.comSuggested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

32a331a7

mm/vmscan: not necessary to re-init the list for each iteration · 048f6e1a

Wei Yang authored May 12, 2022

node_page_list is defined with LIST_HEAD and be cleaned until
list_empty.

So it is not necessary to re-init it again.

[akpm@linux-foundation.org: remove unneeded braces]
Link: https://lkml.kernel.org/r/20220426021743.21007-1-richard.weiyang@gmail.comSigned-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

048f6e1a

mm: convert sysfs input to bool using kstrtobool() · 717aeab4

Jagdish Gediya authored May 12, 2022

Sysfs input conversion to corrosponding bool value e.g.  "false" or "0" to
false, "true" or "1" to true are currently handled through strncmp at
multiple places.  Use kstrtobool() to convert sysfs input to bool value.

[akpm@linux-foundation.org: propagate kstrtobool() return value, per Andy]
Link: https://lkml.kernel.org/r/20220426180203.70782-2-jvgediya@linux.ibm.comSigned-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Fitzgerald <rf@opensource.cirrus.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

717aeab4

lib/kstrtox.c: add "false"/"true" support to kstrtobool() · 0d6ea3ac

Jagdish Gediya authored May 12, 2022

At many places in kernel, It is necessary to convert sysfs input to
corresponding bool value e.g.  "false" or "0" need to be converted to bool
false, "true" or "1" need to be converted to bool true, places where such
conversion is needed currently check the input string manually,
kstrtobool() can be utilized at such places but currently it doesn't have
support to accept "false"/"true".

Add support to accept "false"/"true" as valid string in kstrtobool().

[akpm@linux-foundation.org: undo s/iff/if/, per Matthew]
Link: https://lkml.kernel.org/r/20220426180203.70782-1-jvgediya@linux.ibm.comSigned-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Richard Fitzgerald <rf@opensource.cirrus.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

0d6ea3ac

mm/vmscan: take min_slab_pages into account when try to call shrink_node · d8ff6fde

Miaohe Lin authored May 12, 2022

Since commit 6b4f7799 ("mm: vmscan: invoke slab shrinkers from
shrink_zone()"), slab reclaim and lru page reclaim are done together in
the shrink_node.  So we should take min_slab_pages into account when try
to call shrink_node.

Link: https://lkml.kernel.org/r/20220425112118.20924-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

d8ff6fde

drivers: virtio_mem: use pageblock size as the minimum virtio_mem size. · 448b8ec3

Zi Yan authored May 12, 2022

alloc_contig_range() now only needs to be aligned to pageblock_nr_pages,
drop virtio_mem size requirement that it needs to be MAX_ORDER_NR_PAGES.

Link: https://lkml.kernel.org/r/20220425143118.2850746-7-zi.yan@sent.comSigned-off-by: Zi Yan <ziy@nvidia.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

448b8ec3

mm: cma: use pageblock_order as the single alignment · 11ac3e87

Zi Yan authored May 12, 2022

Now alloc_contig_range() works at pageblock granularity.  Change CMA
allocation, which uses alloc_contig_range(), to use pageblock_nr_pages
alignment.

Link: https://lkml.kernel.org/r/20220425143118.2850746-6-zi.yan@sent.comSigned-off-by: Zi Yan <ziy@nvidia.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

11ac3e87

mm: page_isolation: enable arbitrary range page isolation. · 6e263fff

Zi Yan authored May 12, 2022

Now start_isolate_page_range() is ready to handle arbitrary range
isolation, so move the alignment check/adjustment into the function body. 
Do the same for its counterpart undo_isolate_page_range(). 
alloc_contig_range(), its caller, can pass an arbitrary range instead of a
MAX_ORDER_NR_PAGES aligned one.

Link: https://lkml.kernel.org/r/20220425143118.2850746-5-zi.yan@sent.comSigned-off-by: Zi Yan <ziy@nvidia.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

6e263fff

mm: make alloc_contig_range work at pageblock granularity · b2c9e2fb

Zi Yan authored May 12, 2022

alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
merging pageblocks with different migratetypes.  It might unnecessarily
convert extra pageblocks at the beginning and at the end of the range. 
Change alloc_contig_range() to work at pageblock granularity.

Special handling is needed for free pages and in-use pages across the
boundaries of the range specified by alloc_contig_range().  Because these=

Partially isolated pages causes free page accounting issues.  The free
pages will be split and freed into separate migratetype lists; the in-use=

Pages will be migrated then the freed pages will be handled in the
aforementioned way.

[ziy@nvidia.com: fix deadlock/crash]
  Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.comSigned-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

b2c9e2fb

mm: page_isolation: check specified range for unmovable pages · 844fbae6

Zi Yan authored May 12, 2022

Enable set_migratetype_isolate() to check specified range for unmovable
pages during isolation to prepare arbitrary range page isolation.  The
functionality will take effect in upcoming commits by adjusting the
callers of start_isolate_page_range(), which uses
set_migratetype_isolate().

For example, alloc_contig_range(), which calls start_isolate_page_range(),
accepts unaligned ranges, but because page isolation is currently done at
MAX_ORDER_NR_PAEGS granularity, pages that are out of the specified range
but withint MAX_ORDER_NR_PAEGS alignment might be attempted for isolation
and the failure of isolating these unrelated pages fails the whole
operation undesirably.

Link: https://lkml.kernel.org/r/20220425143118.2850746-3-zi.yan@sent.comSigned-off-by: Zi Yan <ziy@nvidia.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

844fbae6

mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c · b48d8a8e

Zi Yan authored May 12, 2022

Patch series "Use pageblock_order for cma and alloc_contig_range alignment", v11.

This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
and alloc_contig_range(). It prepares for my upcoming changes to make
MAX_ORDER adjustable at boot time[1].

The MAX_ORDER - 1 alignment requirement comes from that
alloc_contig_range() isolates pageblocks to remove free memory from buddy
allocator but isolating only a subset of pageblocks within a page spanning
across multiple pageblocks causes free page accounting issues.  Isolated
page might not be put into the right free list, since the code assumes the
migratetype of the first pageblock as the whole free page migratetype. 
This is based on the discussion at [2].

To remove the requirement, this patchset:
1. isolates pages at pageblock granularity instead of
   max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages);
2. splits free pages across the specified range or migrates in-use pages
   across the specified range then splits the freed page to avoid free page
   accounting issues (it happens when multiple pageblocks within a single page
   have different migratetypes);
3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
   range during isolation to avoid alloc_contig_range() failure when pageblocks
   within a MAX_ORDER - 1 aligned range are allocated separately.
4. returns pages not in the range as it did before.

One optimization might come later:
1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
   migratetypes when isolation fails in the middle of the range.

[1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
[2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/


This patch (of 6):

has_unmovable_pages() is only used in mm/page_isolation.c.  Move it from
mm/page_alloc.c and make it static.

Link: https://lkml.kernel.org/r/20220425143118.2850746-2-zi.yan@sent.comSigned-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Minchan Kim <minchan@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

b48d8a8e

cgroup: fix racy check in alloc_pagecache_max_30M() helper function · c1a31a2f

David Vernet authored May 12, 2022

alloc_pagecache_max_30M() in the cgroup memcg tests performs a 50MB
pagecache allocation, which it expects to be capped at 30MB due to the
calling process having a memory.high setting of 30MB.  After the
allocation, the function contains a check that verifies that MB(29) <
memory.current <= MB(30).  This check can actually fail
non-deterministically.

The testcases that use this function are test_memcg_high() and
test_memcg_max(), which set memory.min and memory.max to 30MB respectively
for the cgroup under test.  The allocation can slightly exceed this number
in both cases, and for memory.max, the process performing the allocation
will not have the OOM killer invoked as it's performing a pagecache
allocation.  This patchset therefore updates the above check to instead
use the verify_close() helper function.

Link: https://lkml.kernel.org/r/20220423155619.3669555-6-void@manifault.comSigned-off-by: David Vernet <void@manifault.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

c1a31a2f

cgroup: remove racy check in test_memcg_sock() · 83031680

David Vernet authored May 12, 2022

test_memcg_sock() in the cgroup memcg tests, verifies expected memory
accounting for sockets. The test forks a process which functions as a TCP
server, and sends large buffers back and forth between itself (as the TCP
client) and the forked TCP server. While doing so, it verifies that
memory.current and memory.stat.sock look correct.

There is currently a check in tcp_client() which asserts memory.current >=
memory.stat.sock. This check is racy, as between memory.current and
memory.stat.sock being queried, a packet could come in which causes
mem_cgroup_charge_skmem() to be invoked. This could cause
memory.stat.sock to exceed memory.current. Reversing the order of
querying doesn't address the problem either, as memory may be reclaimed
between the two calls. Instead, this patch just removes that assertion
altogether, and instead relies on the values_close() check that follows to
validate the expected accounting.

Link: https://lkml.kernel.org/r/20220423155619.3669555-5-void@manifault.comSigned-off-by: David Vernet <void@manifault.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

83031680

cgroup: account for memory_localevents in test_memcg_oom_group_leaf_events() · 72b1e03a

David Vernet authored May 12, 2022

The test_memcg_oom_group_leaf_events() testcase in the cgroup memcg tests
validates that processes in a group that perform allocations exceeding
memory.oom.group are killed. It also validates that the
memory.events.oom_kill events are properly propagated in this case.

Commit 06e11c907ea4 ("kselftests: memcg: update the oom group leaf events
test") fixed test_memcg_oom_group_leaf_events() to account for the fact
that the memory.events.oom_kill events in a child cgroup is propagated up
to its parent. This behavior can actually be configured by the
memory_localevents mount option, so this patch updates the testcase to
properly account for the possible presence of this mount option.

Link: https://lkml.kernel.org/r/20220423155619.3669555-4-void@manifault.comSigned-off-by: David Vernet <void@manifault.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

72b1e03a

cgroup: account for memory_recursiveprot in test_memcg_low() · cdc69458

David Vernet authored May 12, 2022

The test_memcg_low() testcase in test_memcontrol.c verifies the expected
behavior of groups using the memory.low knob.  Part of the testcase
verifies that a group with memory.low that experiences reclaim due to
memory pressure elsewhere in the system, observes memory.events.low events
as a result of that reclaim.

In commit 8a931f80 ("mm: memcontrol: recursive memory.low
protection"), the memory controller was updated to propagate memory.low
and memory.min protection from a parent group to its children via a
configurable memory_recursiveprot mount option.  This unfortunately broke
the memcg tests, which asserts that a sibling that experienced reclaim but
had a memory.low value of 0, would not observe any memory.low events. 
This patch updates test_memcg_low() to account for the new behavior
introduced by memory_recursiveprot.

So as to make the test resilient to multiple configurations, the patch
also adds a new proc_mount_contains() helper that checks for a string in
/proc/mounts, and is used to toggle behavior based on whether the default
memory_recursiveprot was present.

Link: https://lkml.kernel.org/r/20220423155619.3669555-3-void@manifault.comSigned-off-by: David Vernet <void@manifault.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cdc69458

cgroups: refactor children cgroups in memcg tests · f0cdaa56

David Vernet authored May 12, 2022

Patch series "Fix bugs in memcontroller cgroup tests", v2.

tools/testing/selftests/cgroup/test_memcontrol.c contains a set of
testcases which validate expected behavior of the cgroup memory
controller.  Roman Gushchin recently sent out a patchset that fixed a few
issues in the test.  This patchset continues that effort by fixing a few
more issues that were causing non-deterministic failures in the suite. 
With this patchset, I'm unable to reproduce any more errors after running
the tests in a continuous loop for many iterations.  Before, I was able to
reproduce at least one of the errors fixed in this patchset with just one
or two runs.


This patch (of 5):

In test_memcg_min() and test_memcg_low(), there is an array of four
sibling cgroups.  All but one of these sibling groups does a 50MB
allocation, and the group that does no allocation is the third of four in
the array.  This is not a problem per se, but makes it a bit tricky to do
some assertions in test_memcg_low(), as we want to make assertions on the
siblings based on whether or not they performed allocations.  Having a
static index before which all groups have performed an allocation makes
this cleaner.

This patch therefore reorders the sibling groups so that the group that
performs no allocations is the last in the array.  A follow-on patch will
leverage this to fix a bug in the test that incorrectly asserts that a
sibling group that had performed an allocation, but only had protection
from its parent, will not observe any memory.events.low events during
reclaim.

Link: https://lkml.kernel.org/r/20220423155619.3669555-1-void@manifault.com
Link: https://lkml.kernel.org/r/20220423155619.3669555-2-void@manifault.comSigned-off-by: David Vernet <void@manifault.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

f0cdaa56

mm/uffd: move USERFAULTFD configs into mm/ · 430529b5

Peter Xu authored May 12, 2022

We used to have USERFAULTFD configs stored in init/.  It makes sense as a
start because that's the default place for storing syscall related
configs.

However userfaultfd evolved a bit in the past few years and some more
config options were added.  They're no longer related to syscalls and
start to be not suitable to be kept in the init/ directory anymore,
because they're pure mm concepts.

But it's not ideal either to keep the userfaultfd configs separate from
each other.  Hence this patch moves the userfaultfd configs under init/ to
be under mm/ so that we'll start to group all userfaultfd configs
together.

We do have quite a few examples of syscall related configs that are not
put under init/Kconfig: FTRACE_SYSCALLS, SWAP, FILE_LOCKING,
MEMFD_CREATE..  They all reside in the dir where they're more suitable for
the concept.  So it seems there's no restriction to keep the role of
having syscall related CONFIG_* under init/ only.

Link: https://lkml.kernel.org/r/20220420144823.35277-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

430529b5

userfaultfd/selftests: use swap() instead of open coding it · 1bf08313

Guo Zhengkui authored May 12, 2022

Address the following coccicheck warning:

tools/testing/selftests/vm/userfaultfd.c:1536:21-22: WARNING opportunity
for swap().
tools/testing/selftests/vm/userfaultfd.c:1540:33-34: WARNING opportunity
for swap().

by using swap() for the swapping of variable values and drop
`tmp_area` that is not needed any more.

`swap()` macro in userfaultfd.c is introduced in commit 68169686
("selftests: vm: remove dependecy from internal kernel macros")

It has been tested with gcc (Debian 8.3.0-6) 8.3.0.

Link: https://lkml.kernel.org/r/20220407123141.4998-1-guozhengkui@vivo.comSigned-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

1bf08313

selftests/uffd: enable uffd-wp for shmem/hugetlbfs · c0eeeb02

Peter Xu authored May 12, 2022

After we added support for shmem and hugetlbfs, we can turn uffd-wp test
on always now.

Link: https://lkml.kernel.org/r/20220405014932.15212-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

c0eeeb02

mm: enable PTE markers by default · 81e0f15f

Peter Xu authored May 12, 2022

Enable PTE markers by default.  On x86_64 it means it'll auto-enable
PTE_MARKER_UFFD_WP as well.

[peterx@redhat.com: hide PTE_MARKER option]
  Link: https://lkml.kernel.org/r/20220419202531.27415-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220405014929.15158-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

81e0f15f

mm/uffd: enable write protection for shmem & hugetlbfs · b1f9e876

Peter Xu authored May 12, 2022

We've had all the necessary changes ready for both shmem and hugetlbfs. 
Turn on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too
because all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h.

Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

b1f9e876

mm/pagemap: recognize uffd-wp bit for shmem/hugetlbfs · 8e165e73

Peter Xu authored May 12, 2022

This requires the pagemap code to be able to recognize the newly
introduced swap special pte for uffd-wp, meanwhile the general case for
hugetlb that we recently start to support.  It should make pagemap uffd-wp
support complete.

Link: https://lkml.kernel.org/r/20220405014923.15047-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

8e165e73

mm/khugepaged: don't recycle vma pgtable if uffd-wp registered · deb4c93a

Peter Xu authored May 12, 2022

When we're trying to collapse a 2M huge shmem page, don't retract pgtable
pmd page if it's registered with uffd-wp, because that pgtable could have
pte markers installed.  Recycling of that pgtable means we'll lose the pte
markers.  That could cause data loss for an uffd-wp enabled application on
shmem.

Instead of disabling khugepaged on these files, simply skip retracting
these special VMAs, then the page cache can still be merged into a huge
thp, and other mm/vma can still map the range of file with a huge thp when
proper.

Note that checking VM_UFFD_WP needs to be done with mmap_sem held for
write, that avoids race like:

         khugepaged                             user thread
         ==========                             ===========
     check VM_UFFD_WP, not set
                                       UFFDIO_REGISTER with uffd-wp on shmem
                                       wr-protect some pages (install markers)
     take mmap_sem write lock
     erase pmd and free pmd page
      --> pte markers are dropped unnoticed!

Link: https://lkml.kernel.org/r/20220405014921.14994-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

deb4c93a

mm/hugetlb: handle uffd-wp during fork() · bc70fbf2

Peter Xu authored May 12, 2022

Firstly, we'll need to pass in dst_vma into copy_hugetlb_page_range()
because for uffd-wp it's the dst vma that matters on deciding how we
should treat uffd-wp protected ptes.

We should recognize pte markers during fork and do the pte copy if needed.

[lkp@intel.com: vma_needs_copy can be static]
  Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae
Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bc70fbf2

mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0

Peter Xu authored May 12, 2022

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
if unmapping an entire vma or synchronized such that faults can not race
with the unmap operation.  This requires passing zap_flags all the way to
the lowest level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originated in hugetlbfs code will pass the
ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
faults.  The exception is hole punch which will first unmap without any
synchronization.  Later when hole punch actually removes the page from the
file, it will check to see if there was a subsequent fault and if so take
the hugetlb fault mutex while unmapping again.  This second unmap will
pass in ZAP_FLAG_DROP_MARKER.

The justification of "whether to apply ZAP_FLAG_DROP_MARKER flag when
unmap a hugetlb range" is (IMHO): we should never reach a state when a
page fault could errornously fault in a page-cache page that was
wr-protected to be writable, even in an extremely short period.  That
could happen if e.g.  we pass ZAP_FLAG_DROP_MARKER when
hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
faults after that call and before remove_inode_hugepages() is executed,
the page cache can be mapped writable again in the small racy window, that
can cause unexpected data overwritten.

[peterx@redhat.com: fix sparse warning]
  Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
[akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

05e90bd0

mm/hugetlb: allow uffd wr-protect none ptes · 60dfaad6

Peter Xu authored May 12, 2022

Teach hugetlbfs code to wr-protect none ptes just in case the page cache
existed for that pte.  Meanwhile we also need to be able to recognize a
uffd-wp marker pte and remove it for uffd_wp_resolve.

Since at it, introduce a variable "psize" to replace all references to the
huge page size fetcher.

Link: https://lkml.kernel.org/r/20220405014912.14815-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

60dfaad6