1. 13 May, 2022 40 commits
    • mm/vmscan: introduce helper function reclaim_page_list() · 1fe47c0b
      Miaohe Lin authored
      Introduce the helper function reclaim_page_list() to eliminate the
      duplicated code that does shrink_page_list() followed by
      putback_lru_page().  This also separates node reclaim from the node page
      list operation.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20220425111232.23182-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1fe47c0b
    • mm/vmscan: add a comment about MADV_FREE pages check in folio_check_dirty_writeback · 32a331a7
      Miaohe Lin authored
      Patch series "A few cleanup and fixup patches for vmscan".
      
      This series contains a few patches to remove an obsolete comment,
      introduce a helper to remove duplicated code, and so on.  We also take all
      base pages of a THP into account in a rare race condition.  More details
      can be found in the respective changelogs.
      
      
      This patch (of 6):
      
      The MADV_FREE pages check in folio_check_dirty_writeback is a bit hard to
      follow.  Add a comment to make the code clear.
      
      Link: https://lkml.kernel.org/r/20220425111232.23182-2-linmiaohe@huawei.com
      Suggested-by: Huang, Ying <ying.huang@intel.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      32a331a7
    • mm/vmscan: not necessary to re-init the list for each iteration · 048f6e1a
      Wei Yang authored
      node_page_list is defined with LIST_HEAD and is always drained until
      list_empty() returns true.

      So it is not necessary to re-initialize it on each iteration.
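
      As a minimal sketch of the pattern in question (illustrative only, not
      the actual vmscan code; "page_list" stands for the incoming list):

          LIST_HEAD(node_page_list);              /* initialized once, here */
          struct page *page, *next;

          list_for_each_entry_safe(page, next, page_list, lru) {
                  list_move(&page->lru, &node_page_list);

                  /* ... process the batch collected for this node ... */

                  while (!list_empty(&node_page_list)) {
                          page = list_first_entry(&node_page_list,
                                                  struct page, lru);
                          list_del(&page->lru);
                          /* putback_lru_page(page); */
                  }
                  /* already drained here, so no INIT_LIST_HEAD() is needed */
          }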
      
      [akpm@linux-foundation.org: remove unneeded braces]
      Link: https://lkml.kernel.org/r/20220426021743.21007-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      048f6e1a
    • mm: convert sysfs input to bool using kstrtobool() · 717aeab4
      Jagdish Gediya authored
      Sysfs input conversion to the corresponding bool value, e.g.  "false" or
      "0" to false and "true" or "1" to true, is currently handled through
      strncmp() at multiple places.  Use kstrtobool() to convert sysfs input to
      a bool value instead.
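
      As a hedged sketch of the pattern being applied (the attribute name is
      made up for illustration; only kstrtobool() and the standard sysfs store
      prototype are assumed):

          static ssize_t example_enabled_store(struct kobject *kobj,
                                               struct kobj_attribute *attr,
                                               const char *buf, size_t count)
          {
                  bool enabled;
                  int ret;

                  /* replaces open-coded strncmp("true", ...) style checks */
                  ret = kstrtobool(buf, &enabled);
                  if (ret)
                          return ret;     /* propagate the error on bad input */

                  /* ... apply 'enabled' to the subsystem state ... */
                  return count;
          }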
      
      [akpm@linux-foundation.org: propagate kstrtobool() return value, per Andy]
      Link: https://lkml.kernel.org/r/20220426180203.70782-2-jvgediya@linux.ibm.com
      Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Fitzgerald <rf@opensource.cirrus.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      717aeab4
    • lib/kstrtox.c: add "false"/"true" support to kstrtobool() · 0d6ea3ac
      Jagdish Gediya authored
      At many places in the kernel, it is necessary to convert sysfs input to a
      corresponding bool value: e.g.  "false" or "0" needs to be converted to
      bool false, and "true" or "1" to bool true.  Places that need such a
      conversion currently check the input string manually.  kstrtobool() could
      be used there, but it currently does not accept "false"/"true".

      Add support to accept "false"/"true" as valid strings in kstrtobool().
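
      Illustrative usage once this lands (kstrtobool()'s existing
      int-return/bool-out signature is assumed):

          bool v;

          kstrtobool("1", &v);      /* already accepted: v == true  */
          kstrtobool("n", &v);      /* already accepted: v == false */
          kstrtobool("true", &v);   /* newly accepted:   v == true  */
          kstrtobool("false", &v);  /* newly accepted:   v == false */
          kstrtobool("maybe", &v);  /* still rejected, returns -EINVAL */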
      
      [akpm@linux-foundation.org: undo s/iff/if/, per Matthew]
      Link: https://lkml.kernel.org/r/20220426180203.70782-1-jvgediya@linux.ibm.com
      Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Richard Fitzgerald <rf@opensource.cirrus.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0d6ea3ac
    • mm/vmscan: take min_slab_pages into account when try to call shrink_node · d8ff6fde
      Miaohe Lin authored
      Since commit 6b4f7799 ("mm: vmscan: invoke slab shrinkers from
      shrink_zone()"), slab reclaim and LRU page reclaim are done together in
      shrink_node().  So we should take min_slab_pages into account when trying
      to call shrink_node().
      
      Link: https://lkml.kernel.org/r/20220425112118.20924-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d8ff6fde
    • drivers: virtio_mem: use pageblock size as the minimum virtio_mem size. · 448b8ec3
      Zi Yan authored
      alloc_contig_range() now only needs to be aligned to pageblock_nr_pages,
      so drop the virtio_mem requirement that its minimum size be
      MAX_ORDER_NR_PAGES.
      
      Link: https://lkml.kernel.org/r/20220425143118.2850746-7-zi.yan@sent.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      448b8ec3
    • mm: cma: use pageblock_order as the single alignment · 11ac3e87
      Zi Yan authored
      Now alloc_contig_range() works at pageblock granularity.  Change CMA
      allocation, which uses alloc_contig_range(), to use pageblock_nr_pages
      alignment.
      
      Link: https://lkml.kernel.org/r/20220425143118.2850746-6-zi.yan@sent.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      11ac3e87
    • mm: page_isolation: enable arbitrary range page isolation. · 6e263fff
      Zi Yan authored
      Now start_isolate_page_range() is ready to handle arbitrary range
      isolation, so move the alignment check/adjustment into the function body. 
      Do the same for its counterpart undo_isolate_page_range(). 
      alloc_contig_range(), its caller, can pass an arbitrary range instead of a
      MAX_ORDER_NR_PAGES aligned one.
      
      Link: https://lkml.kernel.org/r/20220425143118.2850746-5-zi.yan@sent.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6e263fff
    • mm: make alloc_contig_range work at pageblock granularity · b2c9e2fb
      Zi Yan authored
      alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
      merging pageblocks with different migratetypes.  It might unnecessarily
      convert extra pageblocks at the beginning and at the end of the range. 
      Change alloc_contig_range() to work at pageblock granularity.
      
      Special handling is needed for free pages and in-use pages across the
      boundaries of the range specified by alloc_contig_range(), because these
      partially isolated pages cause free page accounting issues.  The free
      pages will be split and freed into separate migratetype lists; the in-use
      pages will be migrated and then the freed pages will be handled in the
      aforementioned way.
      
      [ziy@nvidia.com: fix deadlock/crash]
        Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
      Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b2c9e2fb
    • mm: page_isolation: check specified range for unmovable pages · 844fbae6
      Zi Yan authored
      Enable set_migratetype_isolate() to check specified range for unmovable
      pages during isolation to prepare arbitrary range page isolation.  The
      functionality will take effect in upcoming commits by adjusting the
      callers of start_isolate_page_range(), which uses
      set_migratetype_isolate().
      
      For example, alloc_contig_range(), which calls start_isolate_page_range(),
      accepts unaligned ranges, but because page isolation is currently done at
      MAX_ORDER_NR_PAGES granularity, pages that are out of the specified range
      but within MAX_ORDER_NR_PAGES alignment might be attempted for isolation,
      and the failure to isolate these unrelated pages fails the whole operation
      undesirably.
      
      Link: https://lkml.kernel.org/r/20220425143118.2850746-3-zi.yan@sent.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      844fbae6
    • mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c · b48d8a8e
      Zi Yan authored
      Patch series "Use pageblock_order for cma and alloc_contig_range alignment", v11.
      
      This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
      and alloc_contig_range(). It prepares for my upcoming changes to make
      MAX_ORDER adjustable at boot time[1].
      
      The MAX_ORDER - 1 alignment requirement comes from the fact that
      alloc_contig_range() isolates pageblocks to remove free memory from the
      buddy allocator, but isolating only a subset of the pageblocks within a
      page spanning multiple pageblocks causes free page accounting issues: an
      isolated page might not be put on the right free list, since the code
      takes the migratetype of the first pageblock as the migratetype of the
      whole free page.  This is based on the discussion at [2].
      
      To remove the requirement, this patchset:
      1. isolates pages at pageblock granularity instead of
         max(MAX_ORDER_NR_PAGES, pageblock_nr_pages);
      2. splits free pages across the specified range or migrates in-use pages
         across the specified range then splits the freed page to avoid free page
         accounting issues (it happens when multiple pageblocks within a single page
         have different migratetypes);
      3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
         range during isolation to avoid alloc_contig_range() failure when pageblocks
         within a MAX_ORDER - 1 aligned range are allocated separately.
      4. returns pages not in the range as it did before.
      
      One optimization might come later:
      1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
         migratetypes when isolation fails in the middle of the range.
      
      [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
      [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/
      
      
      This patch (of 6):
      
      has_unmovable_pages() is only used in mm/page_isolation.c.  Move it from
      mm/page_alloc.c and make it static.
      
      Link: https://lkml.kernel.org/r/20220425143118.2850746-2-zi.yan@sent.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b48d8a8e
    • cgroup: fix racy check in alloc_pagecache_max_30M() helper function · c1a31a2f
      David Vernet authored
      alloc_pagecache_max_30M() in the cgroup memcg tests performs a 50MB
      pagecache allocation, which it expects to be capped at 30MB due to the
      calling process having a memory.high setting of 30MB.  After the
      allocation, the function contains a check that verifies that MB(29) <
      memory.current <= MB(30).  This check can actually fail
      non-deterministically.
      
      The testcases that use this function are test_memcg_high() and
      test_memcg_max(), which set memory.high and memory.max to 30MB
      respectively for the cgroup under test.  The allocation can slightly
      exceed this number in both cases, and for memory.max, the process
      performing the allocation will not have the OOM killer invoked as it's
      performing a pagecache allocation.  This patch therefore updates the above
      check to instead use the values_close() helper function.
      
      Link: https://lkml.kernel.org/r/20220423155619.3669555-6-void@manifault.com
      Signed-off-by: David Vernet <void@manifault.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c1a31a2f
    • cgroup: remove racy check in test_memcg_sock() · 83031680
      David Vernet authored
      test_memcg_sock() in the cgroup memcg tests verifies expected memory
      accounting for sockets.  The test forks a process which functions as a TCP
      server, and sends large buffers back and forth between itself (as the TCP
      client) and the forked TCP server.  While doing so, it verifies that
      memory.current and memory.stat.sock look correct.
      
      There is currently a check in tcp_client() which asserts memory.current >=
      memory.stat.sock.  This check is racy, as between memory.current and
      memory.stat.sock being queried, a packet could come in which causes
      mem_cgroup_charge_skmem() to be invoked.  This could cause
      memory.stat.sock to exceed memory.current.  Reversing the order of
      querying doesn't address the problem either, as memory may be reclaimed
      between the two calls.  Instead, this patch just removes that assertion
      altogether, and instead relies on the values_close() check that follows to
      validate the expected accounting.
      
      Link: https://lkml.kernel.org/r/20220423155619.3669555-5-void@manifault.com
      Signed-off-by: David Vernet <void@manifault.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      83031680
    • cgroup: account for memory_localevents in test_memcg_oom_group_leaf_events() · 72b1e03a
      David Vernet authored
      The test_memcg_oom_group_leaf_events() testcase in the cgroup memcg tests
      validates that processes in a group that perform allocations exceeding
      memory.oom.group are killed.  It also validates that the
      memory.events.oom_kill events are properly propagated in this case.
      
      Commit 06e11c907ea4 ("kselftests: memcg: update the oom group leaf events
      test") fixed test_memcg_oom_group_leaf_events() to account for the fact
      that the memory.events.oom_kill events in a child cgroup are propagated up
      to its parent.  This behavior can actually be configured by the
      memory_localevents mount option, so this patch updates the testcase to
      properly account for the possible presence of this mount option.
      
      Link: https://lkml.kernel.org/r/20220423155619.3669555-4-void@manifault.com
      Signed-off-by: David Vernet <void@manifault.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      72b1e03a
    • cgroup: account for memory_recursiveprot in test_memcg_low() · cdc69458
      David Vernet authored
      The test_memcg_low() testcase in test_memcontrol.c verifies the expected
      behavior of groups using the memory.low knob.  Part of the testcase
      verifies that a group with memory.low that experiences reclaim due to
      memory pressure elsewhere in the system, observes memory.events.low events
      as a result of that reclaim.
      
      In commit 8a931f80 ("mm: memcontrol: recursive memory.low
      protection"), the memory controller was updated to propagate memory.low
      and memory.min protection from a parent group to its children via a
      configurable memory_recursiveprot mount option.  This unfortunately broke
      the memcg tests, which assert that a sibling that experienced reclaim but
      had a memory.low value of 0 would not observe any memory.low events.
      This patch updates test_memcg_low() to account for the new behavior
      introduced by memory_recursiveprot.
      
      To make the test resilient to multiple configurations, the patch also
      adds a new proc_mount_contains() helper that checks for a string in
      /proc/mounts, and uses it to toggle behavior based on whether the default
      memory_recursiveprot mount option is present.
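
      A minimal userspace sketch of what such a helper can look like
      (illustrative; the real helper lives in the selftest sources and may
      differ, only <stdio.h>/<string.h> are assumed):

          static int proc_mount_contains(const char *option)
          {
                  char line[4096];
                  int found = 0;
                  FILE *fp = fopen("/proc/mounts", "r");

                  if (!fp)
                          return -1;
                  while (fgets(line, sizeof(line), fp)) {
                          if (strstr(line, option)) {
                                  found = 1;
                                  break;
                          }
                  }
                  fclose(fp);
                  /* e.g. proc_mount_contains("memory_recursiveprot") */
                  return found;
          }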
      
      Link: https://lkml.kernel.org/r/20220423155619.3669555-3-void@manifault.com
      Signed-off-by: David Vernet <void@manifault.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cdc69458
    • cgroups: refactor children cgroups in memcg tests · f0cdaa56
      David Vernet authored
      Patch series "Fix bugs in memcontroller cgroup tests", v2.
      
      tools/testing/selftests/cgroup/test_memcontrol.c contains a set of
      testcases which validate expected behavior of the cgroup memory
      controller.  Roman Gushchin recently sent out a patchset that fixed a few
      issues in the test.  This patchset continues that effort by fixing a few
      more issues that were causing non-deterministic failures in the suite. 
      With this patchset, I'm unable to reproduce any more errors after running
      the tests in a continuous loop for many iterations.  Before, I was able to
      reproduce at least one of the errors fixed in this patchset with just one
      or two runs.
      
      
      This patch (of 5):
      
      In test_memcg_min() and test_memcg_low(), there is an array of four
      sibling cgroups.  All but one of these sibling groups does a 50MB
      allocation, and the group that does no allocation is the third of four in
      the array.  This is not a problem per se, but makes it a bit tricky to do
      some assertions in test_memcg_low(), as we want to make assertions on the
      siblings based on whether or not they performed allocations.  Having a
      static index before which all groups have performed an allocation makes
      this cleaner.
      
      This patch therefore reorders the sibling groups so that the group that
      performs no allocations is the last in the array.  A follow-on patch will
      leverage this to fix a bug in the test that incorrectly asserts that a
      sibling group that had performed an allocation, but only had protection
      from its parent, will not observe any memory.events.low events during
      reclaim.
      
      Link: https://lkml.kernel.org/r/20220423155619.3669555-1-void@manifault.com
      Link: https://lkml.kernel.org/r/20220423155619.3669555-2-void@manifault.com
      Signed-off-by: David Vernet <void@manifault.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f0cdaa56
    • mm/uffd: move USERFAULTFD configs into mm/ · 430529b5
      Peter Xu authored
      We used to have USERFAULTFD configs stored in init/.  It makes sense as a
      start because that's the default place for storing syscall related
      configs.
      
      However, userfaultfd evolved a bit in the past few years and more config
      options were added.  They are no longer related to syscalls and are no
      longer suitable to keep in the init/ directory, because they are pure mm
      concepts.
      
      But it's not ideal either to keep the userfaultfd configs separate from
      each other.  Hence this patch moves the userfaultfd configs under init/ to
      be under mm/ so that we'll start to group all userfaultfd configs
      together.
      
      We already have quite a few examples of syscall-related configs that are
      not put under init/Kconfig: FTRACE_SYSCALLS, SWAP, FILE_LOCKING,
      MEMFD_CREATE, etc.  They all reside in the directory that best fits the
      concept.  So there seems to be no rule that syscall-related CONFIG_*
      options must live under init/ only.
      
      Link: https://lkml.kernel.org/r/20220420144823.35277-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      430529b5
    • userfaultfd/selftests: use swap() instead of open coding it · 1bf08313
      Guo Zhengkui authored
      Address the following coccicheck warning:
      
      tools/testing/selftests/vm/userfaultfd.c:1536:21-22: WARNING opportunity
      for swap().
      tools/testing/selftests/vm/userfaultfd.c:1540:33-34: WARNING opportunity
      for swap().
      
      by using swap() to swap the variable values and dropping `tmp_area`,
      which is no longer needed.

      The `swap()` macro in userfaultfd.c was introduced in commit 68169686
      ("selftests: vm: remove dependecy from internal kernel macros").

      It has been tested with gcc (Debian 8.3.0-6) 8.3.0.
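
      For reference, the shape of the change (simplified; the swap() macro
      shown is a typical definition and may differ in detail from the one the
      test defines):

          #define swap(a, b) \
                  do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

          /* before: open-coded swap via a temporary */
          tmp_area = area_src;
          area_src = area_dst;
          area_dst = tmp_area;

          /* after: no tmp_area needed */
          swap(area_src, area_dst);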
      
      Link: https://lkml.kernel.org/r/20220407123141.4998-1-guozhengkui@vivo.com
      Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1bf08313
    • selftests/uffd: enable uffd-wp for shmem/hugetlbfs · c0eeeb02
      Peter Xu authored
      Now that shmem and hugetlbfs are supported, the uffd-wp test can always
      be turned on.
      
      Link: https://lkml.kernel.org/r/20220405014932.15212-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c0eeeb02
    • mm: enable PTE markers by default · 81e0f15f
      Peter Xu authored
      Enable PTE markers by default.  On x86_64 this means PTE_MARKER_UFFD_WP
      will be auto-enabled as well.
      
      [peterx@redhat.com: hide PTE_MARKER option]
        Link: https://lkml.kernel.org/r/20220419202531.27415-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220405014929.15158-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      81e0f15f
    • mm/uffd: enable write protection for shmem & hugetlbfs · b1f9e876
      Peter Xu authored
      We've had all the necessary changes ready for both shmem and hugetlbfs. 
      Turn on all the shmem/hugetlbfs switches for userfaultfd-wp.
      
      We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too
      because all existing types now support write protection mode.
      
      Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h.
      
      Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b1f9e876
    • mm/pagemap: recognize uffd-wp bit for shmem/hugetlbfs · 8e165e73
      Peter Xu authored
      This requires the pagemap code to be able to recognize the newly
      introduced swap special pte for uffd-wp, as well as the general hugetlb
      case that we recently started to support.  This should make pagemap's
      uffd-wp support complete.
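
      A userspace sketch of how the new information can be consumed (the bit
      position is an assumption to be checked against
      Documentation/admin-guide/mm/pagemap.rst; only open()/pread() are relied
      on):

          static int page_is_uffd_wp(int pagemap_fd, void *addr, long pagesize)
          {
                  uint64_t ent;
                  off_t off = ((uintptr_t)addr / pagesize) * sizeof(ent);

                  if (pread(pagemap_fd, &ent, sizeof(ent), off) != sizeof(ent))
                          return -1;
                  /* assumed: bit 57 reports "pte is uffd-wp write-protected" */
                  return !!(ent & (1ULL << 57));
          }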
      
      Link: https://lkml.kernel.org/r/20220405014923.15047-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8e165e73
    • mm/khugepaged: don't recycle vma pgtable if uffd-wp registered · deb4c93a
      Peter Xu authored
      When we're trying to collapse a 2M huge shmem page, don't retract pgtable
      pmd page if it's registered with uffd-wp, because that pgtable could have
      pte markers installed.  Recycling of that pgtable means we'll lose the pte
      markers.  That could cause data loss for an uffd-wp enabled application on
      shmem.
      
      Instead of disabling khugepaged on these files, simply skip retracting
      these special VMAs; the page cache can still be merged into a huge thp,
      and other mms/vmas can still map the range of the file with a huge thp
      when appropriate.
      
      Note that checking VM_UFFD_WP needs to be done with mmap_sem held for
      write, which avoids races like:
      
               khugepaged                             user thread
               ==========                             ===========
           check VM_UFFD_WP, not set
                                             UFFDIO_REGISTER with uffd-wp on shmem
                                             wr-protect some pages (install markers)
           take mmap_sem write lock
           erase pmd and free pmd page
            --> pte markers are dropped unnoticed!
      
      Link: https://lkml.kernel.org/r/20220405014921.14994-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      deb4c93a
    • mm/hugetlb: handle uffd-wp during fork() · bc70fbf2
      Peter Xu authored
      Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(),
      because for uffd-wp it's the dst vma that matters when deciding how we
      should treat uffd-wp protected ptes.
      
      We should recognize pte markers during fork and do the pte copy if needed.
      
      [lkp@intel.com: vma_needs_copy can be static]
        Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae
      Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc70fbf2
    • mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0
      Peter Xu authored
      As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
      if unmapping an entire vma or synchronized such that faults can not race
      with the unmap operation.  This requires passing zap_flags all the way to
      the lowest level hugetlb unmap routine: __unmap_hugepage_range.
      
      In general, unmap calls originating in hugetlbfs code will pass the
      ZAP_FLAG_DROP_MARKER flag, as synchronization is in place to prevent
      faults.  The exception is hole punch, which will first unmap without any
      synchronization.  Later when hole punch actually removes the page from the
      file, it will check to see if there was a subsequent fault and if so take
      the hugetlb fault mutex while unmapping again.  This second unmap will
      pass in ZAP_FLAG_DROP_MARKER.
      
      The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag
      when unmapping a hugetlb range" is (IMHO): we should never reach a state
      where a page fault could erroneously fault in, as writable, a page-cache
      page that was wr-protected, even for an extremely short period.  That
      could happen if e.g.  we pass ZAP_FLAG_DROP_MARKER when
      hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
      faults after that call and before remove_inode_hugepages() is executed,
      the page cache can be mapped writable again in the small racy window,
      which can cause unexpected data to be overwritten.
      
      [peterx@redhat.com: fix sparse warning]
        Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
      [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
      Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05e90bd0
    • mm/hugetlb: allow uffd wr-protect none ptes · 60dfaad6
      Peter Xu authored
      Teach hugetlbfs code to wr-protect none ptes just in case the page cache
      existed for that pte.  Meanwhile we also need to be able to recognize a
      uffd-wp marker pte and remove it for uffd_wp_resolve.
      
      While at it, introduce a variable "psize" to replace all references to the
      huge page size fetcher.
      
      Link: https://lkml.kernel.org/r/20220405014912.14815-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      60dfaad6
    • mm/hugetlb: handle pte markers in page faults · c64e912c
      Peter Xu authored
      Allow hugetlb code to handle pte markers just like none ptes.  It's mostly
      there, we just need to make sure we don't assume hugetlb_no_page() only
      handles none pte, so when detecting pte change we should use pte_same()
      rather than pte_none().  We need to pass in the old_pte to do the
      comparison.
      
      Check the original pte to see whether it's a pte marker, if it is, we
      should recover uffd-wp bit on the new pte to be installed, so that the
      next write will be trapped by uffd.
      
      Link: https://lkml.kernel.org/r/20220405014909.14761-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c64e912c
    • mm/hugetlb: handle UFFDIO_WRITEPROTECT · 5a90d5a1
      Peter Xu authored
      This starts from passing cp_flags into hugetlb_change_protection() so
      hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.
      
      huge_pte_clear_uffd_wp() is introduced to handle the case where the
      UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
      
      Link: https://lkml.kernel.org/r/20220405014906.14708-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5a90d5a1
    • mm/hugetlb: take care of UFFDIO_COPY_MODE_WP · 6041c691
      Peter Xu authored
      Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the
      stack.  Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is used with
      UFFDIO_COPY.
      
      Hugetlb pages are only managed by hugetlbfs, so we're safe even without
      setting dirty bit in the huge pte if the page is installed as read-only. 
      However we'd better still keep the dirty bit set for a read-only
      UFFDIO_COPY pte (when UFFDIO_COPY_MODE_WP bit is set), not only to match
      what we do with shmem, but also because the page does contain dirty data
      that the kernel just copied from the userspace.
      
      Link: https://lkml.kernel.org/r/20220405014904.14643-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6041c691
    • mm/hugetlb: hook page faults for uffd write protection · 166f3ecc
      Peter Xu authored
      Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp
      faults.
      
      We do this slightly earlier than hugetlb_cow() so that we can avoid taking
      some extra locks that we definitely don't need.
      
      Link: https://lkml.kernel.org/r/20220405014901.14590-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      166f3ecc
    • mm/hugetlb: introduce huge pte version of uffd-wp helpers · 229f3fa7
      Peter Xu authored
      They will be used in the follow up patches to either check/set/clear
      uffd-wp bit of a huge pte.
      
      So far they reuse all the small pte helpers.  Archs can override these
      versions when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in
      the future.
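
      A sketch of the generic fallback pattern described above (simplified: a
      single guard is shown, while the real code may use one per helper; the
      small pte helpers are the pre-existing uffd-wp ones):

          #ifndef __HAVE_ARCH_HUGE_PTE_UFFD_WP
          static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
          {
                  return pte_mkuffd_wp(pte);      /* reuse the small pte helper */
          }

          static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
          {
                  return pte_clear_uffd_wp(pte);
          }

          static inline int huge_pte_uffd_wp(pte_t pte)
          {
                  return pte_uffd_wp(pte);
          }
          #endif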
      
      Link: https://lkml.kernel.org/r/20220405014858.14531-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      229f3fa7
    • mm/shmem: handle uffd-wp during fork() · c56d1b62
      Peter Xu authored
      Normally we skip copying pages at fork() time for VM_SHARED shmem, but we
      can't skip it anymore if uffd-wp is enabled on the dst vma.  This should
      only happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on a
      uffd-wp shmem vma, so that VM_UFFD_WP will be propagated onto the dst vma
      too; then we should copy the pgtables with the uffd-wp bit and pte
      markers, because this information will be lost otherwise.
      
      Since the condition checks will become even more complicated for deciding
      "whether a vma needs to copy the pgtable during fork()", introduce a
      helper vma_needs_copy() for it, so everything will be clearer.
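
      A sketch of the decision the new helper encodes (the uffd-wp case is the
      one added here; the other conditions paraphrase the pre-existing fork
      copy rules and may differ in detail):

          static bool vma_needs_copy(struct vm_area_struct *dst_vma,
                                     struct vm_area_struct *src_vma)
          {
                  /*
                   * Always copy pgtables when dst_vma has uffd-wp enabled,
                   * even for file-backed memory (e.g. shmem): the pgtable
                   * holds uffd-wp bits and pte markers that cannot be
                   * recovered from the page cache.
                   */
                  if (userfaultfd_wp(dst_vma))
                          return true;

                  /* pre-existing reasons to copy: special mappings ... */
                  if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
                          return true;

                  /* ... or anonymous pages that must stay CoW-shared */
                  if (src_vma->anon_vma)
                          return true;

                  /* otherwise let page faults repopulate the dst pgtables */
                  return false;
          }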
      
      Link: https://lkml.kernel.org/r/20220405014855.14468-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c56d1b62
    • mm/shmem: allows file-back mem to be uffd wr-protected on thps · 019c2d8b
      Peter Xu authored
      We don't have "huge" versions of pte markers; instead, when necessary we
      split the thp.

      However, splitting the thp is not enough, because a file-backed thp is
      handled totally differently from anonymous thps: rather than doing a real
      split, the thp pmd simply gets cleared in __split_huge_pmd_locked().

      That is not enough if e.g.  there is a thp covering the range [0, 2M) but
      we want to wr-protect a small page residing in the [4K, 8K) range, because
      after __split_huge_pmd() returns there will be a none pmd, and
      change_pmd_range() will just skip it right after the split.
      
      Here we leverage the previously introduced change_pmd_prepare() macro so
      that we'll populate the pmd with a pgtable page after the pmd split (in
      which process the pmd will be cleared for cases like shmem).  Then
      change_pte_range() will do all the rest for us by installing the uffd-wp
      pte marker at any none pte that we'd like to wr-protect.
      
      Link: https://lkml.kernel.org/r/20220405014852.14413-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      019c2d8b
    • mm/shmem: allow uffd wr-protect none pte for file-backed mem · fe2567eb
      Peter Xu authored
      File-backed memory differs from anonymous memory in that even if the pte
      is missing, the data could still reside either in the file or in the
      page/swap cache.  So when wr-protecting a pte, we need to consider none
      ptes too.
      
      We do that by installing the uffd-wp pte markers when necessary.  So when
      there's a future write to the pte, the fault handler will go the special
      path to first fault-in the page as read-only, then report to userfaultfd
      server with the wr-protect message.
      
      On the other hand, when unprotecting a page, it's also possible that the
      pte got unmapped but replaced by the special uffd-wp marker.  Then we'll
      need to be able to recover from a uffd-wp pte marker into a none pte, so
      that the page will fault in correctly as usual when it is accessed the
      next time.
      
      Special care needs to be taken throughout the change_protection_range()
      process.  Since now we allow user to wr-protect a none pte, we need to be
      able to pre-populate the page table entries if we see (!anonymous &&
      MM_CP_UFFD_WP) requests, otherwise change_protection_range() will always
      skip when the pgtable entry does not exist.
      
      For example, the pgtable can be missing for a whole chunk of 2M pmd, but
      the page cache can exist for the 2M range.  When we want to wr-protect one
      4K page within the 2M pmd range, we need to pre-populate the pgtable and
      install the pte marker showing that we want to get a message and block the
      thread when the page cache of that 4K page is written.  Without
      pre-populating the pmd, change_protection() will simply skip that whole
      pmd.
      
      Note that this patch only covers the small pages (pte level) but not
      covering any of the transparent huge pages yet.  That will be done later,
      and this patch will be a preparation for it too.
      
      Link: https://lkml.kernel.org/r/20220405014850.14352-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fe2567eb
    • mm/shmem: persist uffd-wp bit across zapping for file-backed · 999dad82
      Peter Xu authored
      File-backed memory is prone to being unmapped at any time.  It means all
      information in the pte will be dropped, including the uffd-wp flag.
      
      To persist the uffd-wp flag, we'll use the pte markers.  This patch
      teaches the zap code to understand uffd-wp and know when to keep or drop
      the uffd-wp bit.
      
      Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we
      don't want to persist such information, for example, when destroying the
      whole vma, or punching a hole in a shmem file.  For the remaining cases
      we should never drop the uffd-wp bit, or the wr-protect information will
      get lost.
      
      The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than
      memory.c because it'll be further referenced in hugetlb files later.
      
      Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      999dad82
    • mm/shmem: handle uffd-wp special pte in page fault handler · 9c28a205
      Peter Xu authored
      File-backed memory is prone to being unmapped or swapped out, so its ptes
      are always unstable, because they can easily be faulted back in later
      using the page cache.  This could lead to the uffd-wp bit getting lost
      when unmapping or swapping out such memory.  One example is shmem.  PTE
      markers are needed to store that information.
      
      This patch prepares for that by handling uffd-wp pte markers first,
      before they are installed elsewhere, so that the page fault handler can
      recognize uffd-wp pte markers.
      
      The handling of uffd-wp pte markers is similar to missing fault, it's just
      that we'll handle this "missing fault" when we see the pte markers,
      meanwhile we need to make sure the marker information is kept during
      processing the fault.
      
      This is a slow path of uffd-wp handling, because zapping of wr-protected
      shmem ptes should be rare.  So far it should only trigger in two
      conditions:
      
        (1) When trying to punch holes in shmem_fallocate(), there is an
            optimization to zap the pgtables before evicting the page.
      
        (2) When swapping out shmem pages.
      
      Because of this, the page fault handling is simplified too by not sending
      the wr-protect message in the 1st page fault; instead the page will be
      installed read-only, so the uffd-wp message will be generated in the next
      fault, which will trigger the do_wp_page() path of general uffd-wp
      handling.
      
      Disable fault-around for all uffd-wp registered ranges for extra safety
      just like uffd-minor fault, and clean the code up.
      
      Link: https://lkml.kernel.org/r/20220405014844.14239-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c28a205
    • mm/shmem: take care of UFFDIO_COPY_MODE_WP · 8ee79edf
      Peter Xu authored
      Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then apply
      the UFFD_WP bit properly when the UFFDIO_COPY on shmem is done with
      UFFDIO_COPY_MODE_WP.  wp_copy finally lands in
      mfill_atomic_install_pte().
      
      Note: we must do pte_wrprotect() if !writable in
      mfill_atomic_install_pte(), as mk_pte() could return a writable pte (e.g.,
      when VM_SHARED on a shmem file).
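
      A condensed sketch of the ordering that note is about (not the full
      mfill_atomic_install_pte() body; "writable" and "wp_copy" stand for the
      surrounding locals):

          pte_t entry = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));

          if (writable)
                  entry = pte_mkwrite(entry);
          else
                  /* required: mk_pte() may already be writable on VM_SHARED shmem */
                  entry = pte_wrprotect(entry);

          if (wp_copy)
                  entry = pte_mkuffd_wp(entry);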
      
      Link: https://lkml.kernel.org/r/20220405014841.14185-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ee79edf
    • mm/uffd: PTE_MARKER_UFFD_WP · 1db9dbc2
      Peter Xu authored
      This patch introduces the 1st user of pte marker: the uffd-wp marker.
      
      When the pte marker is installed with the uffd-wp bit set, it means this
      pte was wr-protected by uffd.
      
      We will use this special pte to arm the ptes that got either unmapped or
      swapped out for a file-backed region that was previously wr-protected. 
      This special pte could trigger a page fault just like swap entries.
      
      This idea is greatly inspired by Hugh and Andrea in the discussion, which
      is referenced in the links below.
      
      Some helpers are introduced to detect whether a swap pte is uffd
      wr-protected.  After the pte marker is introduced, a swap pte can be
      wr-protected in two forms: either it is a normal swap pte with
      _PAGE_SWP_UFFD_WP set, or it is a pte marker that has PTE_MARKER_UFFD_WP
      set.
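
      As a sketch, an "is this swap pte uffd wr-protected in either form?"
      check can be expressed roughly as below (the helper name is illustrative;
      pte_swp_uffd_wp() is the pre-existing check and the marker helpers are
      the ones this series introduces):

          static inline bool swap_pte_uffd_wp_any(pte_t pte)
          {
                  swp_entry_t entry;

                  if (!is_swap_pte(pte))
                          return false;
                  /* form 1: a normal swap pte carrying _PAGE_SWP_UFFD_WP */
                  if (pte_swp_uffd_wp(pte))
                          return true;
                  /* form 2: a pte marker with PTE_MARKER_UFFD_WP set */
                  entry = pte_to_swp_entry(pte);
                  if (is_pte_marker_entry(entry) &&
                      (pte_marker_get(entry) & PTE_MARKER_UFFD_WP))
                          return true;
                  return false;
          }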
      
      [peterx@redhat.com: fixup]
        Link: https://lkml.kernel.org/r/YkzKiM8tI4+qOfXF@xz-m1.local
      Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/
      Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/
      Link: https://lkml.kernel.org/r/20220405014838.14131-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1db9dbc2
    • mm: check against orig_pte for finish_fault() · f46f2ade
      Peter Xu authored
      This patch allows do_fault() to trigger on !pte_none() cases too.  This
      prepares for the pte markers to be handled by do_fault() just like none
      pte.
      
      To achieve this, instead of unconditionally checking against pte_none()
      in finish_fault(), we may hit the case that orig_pte was some pte marker,
      in which case what we want to do is replace the pte marker with a valid
      pte entry.  Then if orig_pte was set we'd want to check the current *pte
      (under the pgtable lock) against orig_pte rather than against a none pte.
      
      Right now there's no solid way to safely reference orig_pte because when
      pmd is not allocated handle_pte_fault() will not initialize orig_pte, so
      it's not safe to reference it.
      
      There's another solution proposed before this patch: do pte_clear() on
      vmf->orig_pte for the pmd==NULL case.  However, it turns out that would
      break arm32, because arm32 can assume that a pte_t* pointer always
      resides in a real pgtable in RAM, not in a kernel stack variable [1].
      
      To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID, and it'll be
      set along with orig_pte when there is valid orig_pte, or it'll be cleared
      when orig_pte was not initialized.
      
      It'll be updated every time we call handle_pte_fault(), so e.g.  if a page
      fault retry happened it'll be properly updated along with orig_pte.
      
      [1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/
      
      [akpm@linux-foundation.org: coding-style cleanups]
      [peterx@redhat.com: fix crash reported by Marek]
        Link: https://lkml.kernel.org/r/Ylb9rXJyPm8/ao8f@xz-m1.local
      Link: https://lkml.kernel.org/r/20220405014836.14077-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f46f2ade