1. 21 Apr, 2023 16 commits
  2. 18 Apr, 2023 24 commits
    • mm: ksm: support hwpoison for ksm page · 4248d008
      Longlong Xia authored
      Update hwpoison_user_mappings() to support ksm pages, and add
      collect_procs_ksm() to collect processes when the error hits a ksm page.
      Unlike collect_procs_anon(), it also needs to traverse the rmap-item
      list on the stable node of the ksm page.  Add add_to_kill_ksm() to
      handle ksm pages, and add task_in_to_kill_list() to avoid adding the
      same tsk to the to_kill list more than once: when scanning the list,
      if the pages that make up the ksm page all come from the same process,
      that process may otherwise be added repeatedly.
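
      The duplicate check described above can be pictured with a small
      userspace sketch.  This is not the kernel code: struct task,
      struct to_kill, task_in_list() and add_to_kill_once() are hypothetical
      stand-ins that only illustrate "skip the task if it is already queued".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's task and to_kill types. */
struct task { int pid; };

struct to_kill {
    struct to_kill *next;
    struct task *tsk;
    unsigned long addr;
};

/* Return true if tsk already has an entry on the singly linked list. */
static bool task_in_list(const struct to_kill *head, const struct task *tsk)
{
    for (const struct to_kill *tk = head; tk; tk = tk->next)
        if (tk->tsk == tsk)
            return true;
    return false;
}

/* Prepend entry unless its task is already queued; return true if added. */
static bool add_to_kill_once(struct to_kill **head, struct to_kill *entry)
{
    if (task_in_list(*head, entry->tsk))
        return false;
    entry->next = *head;
    *head = entry;
    return true;
}
```

      With two entries for the same task and one for a different task, only
      the first entry of each task ends up on the list.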
      
      Link: https://lkml.kernel.org/r/20230414021741.2597273-3-xialonglong1@huawei.com
      Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
      Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: refactor add_to_kill() · 4f775086
      Longlong Xia authored
      Patch series "mm: ksm: support hwpoison for ksm page", v2.
      
      Currently, ksm does not support hwpoison.  As ksm is being used more
      widely for deduplication at the system level, container level, and process
      level, supporting hwpoison for ksm has become increasingly important.
      However, hwpoison has not processed ksm pages since 2009 [1].
      
      The main method of implementation:
      
      1. Refactor add_to_kill() and add new add_to_kill_*() to better
         accommodate the handling of different types of pages.
      
      2. Add collect_procs_ksm() to collect processes when the error hits a
         ksm page.
      
      3. Add task_in_to_kill_list() to avoid duplicate addition of tsk to
         the to_kill list.
      
      4. try_to_unmap() the ksm page (already supported).
      
      5. Handle related processes such as sending SIGBUS.
      
      Tested by poisoning a ksm page whose pages come from
      1) different processes
      2) one process
      
      with and without memory_failure_early_kill set; the processes are
      killed as expected with the patchset.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=01e00f88
      
      
      This patch (of 2):
      
      add_to_kill() uses page_address_in_vma() to find the user virtual
      address of a page, but that does not work for ksm pages because the ksm
      page->index is unusable.  Add a ksm_addr parameter to add_to_kill() so
      that the caller can pass the address in, and rename the function to
      __add_to_kill().  Add add_to_kill_anon_file() for handling anonymous
      pages and file pages, and add_to_kill_fsdax() for handling fsdax pages.
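
      The shape of this refactoring can be sketched in userspace as a core
      helper that prefers a caller-supplied address and falls back to an
      index-based lookup.  Everything below is a hypothetical illustration
      (struct vma, struct page, kill_addr() and the 4KB page shift are
      assumptions), not the code in mm/memory-failure.c.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4KB pages */
#define NO_ADDR 0UL            /* hypothetical "caller passed nothing" marker */

/* Hypothetical, heavily simplified stand-ins for kernel structures. */
struct vma { unsigned long start; unsigned long pgoff; };
struct page { unsigned long index; };

/* Stand-in for an index-based lookup such as page_address_in_vma(). */
static unsigned long addr_from_index(const struct page *p, const struct vma *v)
{
    return v->start + ((p->index - v->pgoff) << SKETCH_PAGE_SHIFT);
}

/* Core helper: use the caller's address when given, else derive it. */
static unsigned long kill_addr(const struct page *p, const struct vma *v,
                               unsigned long ksm_addr)
{
    return ksm_addr != NO_ADDR ? ksm_addr : addr_from_index(p, v);
}

/* Wrapper for the anon/file case, where page->index is usable. */
static unsigned long kill_addr_anon_file(const struct page *p,
                                         const struct vma *v)
{
    return kill_addr(p, v, NO_ADDR);
}
```

      A ksm caller would pass the address it already knows from its rmap
      item, while the anon/file wrapper lets the index-based lookup run.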
      
      Link: https://lkml.kernel.org/r/20230414021741.2597273-1-xialonglong1@huawei.com
      Link: https://lkml.kernel.org/r/20230414021741.2597273-2-xialonglong1@huawei.com
      Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
      Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/memfd: fix test_sysctl · 3cc0c373
      Jeff Xu authored
      sysctl memfd_noexec is pid-namespaced, non-reservable, and inherited by
      the child process.
      
      Move the inheritance test from init ns to child ns, so init ns can keep
      the default value.
      
      Link: https://lkml.kernel.org/r/20230414022801.2545257-1-jeffxu@google.com
      Signed-off-by: Jeff Xu <jeffxu@google.com>
      Reported-by: kernel test robot <yujie.liu@intel.com>
        Link: https://lore.kernel.org/oe-lkp/202303312259.441e35db-yujie.liu@intel.com
      Tested-by: Yujie Liu <yujie.liu@intel.com>
      Cc: Daniel Verkamp <dverkamp@chromium.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: run hugetlb testcases of va switch · c025da0f
      Chaitanya S Prakash authored
      The va_high_addr_switch selftest is used to test mmap across the 128TB
      boundary.  It divides the selftest cases into two main categories on the
      basis of size.  One set is used to create mappings that are multiples of
      PAGE_SIZE while the other creates mappings that are multiples of
      HUGETLB_SIZE.
      
      To run the hugetlb testcases, "--run-hugetlb" must be appended to the
      binary invocation, but the file used to run the test only invokes the
      bare binary, thereby skipping the hugetlb testcases completely.  Hence,
      the required statement has been added.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-6-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: configure nr_hugepages for arm64 · 2f489e2e
      Chaitanya S Prakash authored
      Arm64 has a default hugepage size of 512MB when
      CONFIG_ARM64_64K_PAGES=y.  While testing on arm64 platforms with up to
      4PB of virtual address space, a minimum of 6 hugepages was required for
      all test cases to pass.  Support for this requirement has been added.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-5-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add platform independent in code comments · c2af2a41
      Chaitanya S Prakash authored
      The in-code comments for the selftest were written on the basis of the
      128TB switch, an architecture feature specific to PowerPC and x86
      platforms.  Now that support has been added for arm64 platforms, which
      implement a 256TB switch, a more generic explanation has been provided.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-4-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: rename va_128TBswitch to va_high_addr_switch · bbe16872
      Chaitanya S Prakash authored
      As the initial selftest only took into consideration the PowerPC and x86
      architectures, a platform-independent naming convention is chosen now
      that support for arm64 is being added.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-3-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add support for arm64 platform on va switch · cd834afa
      Chaitanya S Prakash authored
      Patch series "selftests/mm: Implement support for arm64 on va".
      
      The va_128TBswitch selftest is designed and implemented for PowerPC and
      x86 architectures, which support a 128TB switch, up to 256TB of virtual
      address space and hugepage sizes of 16MB and 2MB respectively.  Arm64
      platforms, on the other hand, support a 256TB switch, up to 4PB of
      virtual address space and a default hugepage size of 512MB when 64K
      pagesize is enabled.
      
      These architectural differences require introducing support for arm64
      platforms, after which a more generic naming convention is suggested.  The
      in-code comments are amended to give a more platform-independent
      explanation of how the code works, and nr_hugepages is configured as
      required.  Finally, the file running the testcase is modified to
      prevent the hugetlb testcases of va_high_addr_switch from being skipped.
      
      
      This patch (of 5):
      
      Arm64 platforms can support a 64KB pagesize, a 512MB default hugepage
      size and up to 4PB of virtual address space.  The address switch occurs
      at 256TB as opposed to 128TB.  Hence, the necessary support has been
      added.
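
      As a rough illustration of what "the address switch" means for a test
      mapping, the sketch below computes the two boundaries and checks whether
      a mapping straddles one of them.  The macro and function names are
      hypothetical and are not taken from the selftest.

```c
#include <stdbool.h>

#define TB            (1ULL << 40)
#define SWITCH_128TB  (128 * TB)   /* boundary discussed for PowerPC and x86 */
#define SWITCH_256TB  (256 * TB)   /* boundary discussed for arm64 */

/*
 * Return true if the range [addr, addr + len) starts below the boundary
 * and extends past it.  Written as "len > boundary - addr" so that the
 * comparison cannot overflow.
 */
static bool crosses_switch(unsigned long long addr, unsigned long long len,
                           unsigned long long boundary)
{
    return addr < boundary && len > boundary - addr;
}
```

      A two-page mapping that starts one page below 128TB crosses the 128TB
      switch but not the 256TB one, which is why the boundary has to be
      selected per architecture.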
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-1-chaitanyas.prakash@arm.com
      Link: https://lkml.kernel.org/r/20230323105243.2807166-2-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memfd: pass argument of memfd_fcntl as int · f7b8f70b
      Luca Vizzarro authored
      The interface for fcntl expects the argument passed for the command
      F_ADD_SEALS to be of type int.  The current code wrongly treats it as a
      long.  In order to avoid access to undefined bits, we should explicitly
      cast the argument to int.
      
      This commit changes the signature of all the related and helper functions
      so that they treat the argument as int instead of long.
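
      The effect of narrowing the argument can be sketched in userspace as
      follows.  The names memfd_cmd(), add_seals() and KNOWN_SEALS are
      hypothetical, and the sketch assumes a compiler such as GCC or Clang,
      where converting an out-of-range unsigned long to int keeps the low
      bits.

```c
/* Hypothetical mask of seal bits this sketch accepts. */
#define KNOWN_SEALS 0x1f

/* Handler that, like the fixed code, takes the seal mask as an int. */
static int add_seals(unsigned int *seals, int new_seals)
{
    if (new_seals & ~KNOWN_SEALS)
        return -1;                  /* unknown bit set: reject */
    *seals |= (unsigned int)new_seals;
    return 0;
}

/*
 * Dispatcher that receives the raw argument as unsigned long and narrows
 * it explicitly, so that any stray upper bits are never interpreted.
 */
static int memfd_cmd(unsigned int *seals, unsigned long arg)
{
    return add_seals(seals, (int)arg);
}
```

      On a 64-bit build, an argument whose upper 32 bits contain garbage is
      then handled exactly like the same value with clean upper bits.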
      
      Link: https://lkml.kernel.org/r/20230414152459.816046-5-Luca.Vizzarro@arm.com
      Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
      Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
      Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: David Laight <David.Laight@ACULAB.com>
      Cc: Mark Rutland <Mark.Rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: Multi-gen LRU: remove wait_event_killable() · 7f63cf2d
      Kalesh Singh authored
      Android 14 and later default to MGLRU [1] and field telemetry showed
      occasional long tail latency (>100ms) in the reclaim path.
      
      Tracing revealed priority inversion in the reclaim path.  In
      try_to_inc_max_seq(), when high priority tasks were blocked on
      wait_event_killable(), the preemption of the low priority task to call
      wake_up_all() caused those high priority tasks to wait longer than
      necessary.  In general, this problem is not different from others of its
      kind, e.g., one caused by mutex_lock().  However, it is specific to MGLRU
      because it introduced the new wait queue lruvec->mm_state.wait.
      
      The purpose of this new wait queue is to avoid the thundering herd
      problem.  If many direct reclaimers rush into try_to_inc_max_seq(), only
      one can succeed, i.e., the one to wake up the rest, and the rest who
      failed might cause premature OOM kills if they do not wait.  So far there
      is no evidence supporting this scenario, based on how often the wait has
      been hit.  This raises the question of how useful the wait queue is in
      practice.
      
      Based on Minchan's recommendation, which is in line with his commit
      6d4675e6 ("mm: don't be stuck to rmap lock on reclaim path") and the
      rest of the MGLRU code which also uses trylock when possible, remove the
      wait queue.
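
      The difference between waiting and the trylock style can be sketched
      with C11 atomics.  This is only an illustration of "give up instead of
      sleeping when someone else is already doing the work"; struct
      walk_state and try_inc_max_seq_sketch() are hypothetical and do not
      mirror the MGLRU data structures.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state: a busy flag plus the sequence number it protects. */
struct walk_state {
    atomic_flag busy;
    unsigned long max_seq;
};

/*
 * Try to advance max_seq past seen_seq.  If another caller holds the busy
 * flag, return false immediately rather than queueing up behind it, so a
 * high priority caller never sleeps waiting for a preempted low priority
 * one to issue a wakeup.
 */
static bool try_inc_max_seq_sketch(struct walk_state *ws,
                                   unsigned long seen_seq)
{
    bool advanced = false;

    if (atomic_flag_test_and_set(&ws->busy))
        return false;               /* someone else is walking: do not wait */

    if (seen_seq == ws->max_seq) {  /* nobody advanced it in the meantime */
        ws->max_seq++;
        advanced = true;
    }

    atomic_flag_clear(&ws->busy);
    return advanced;
}
```

      A caller that loses the race simply returns and carries on, which is
      the behaviour the commit message argues for.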
      
      [1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf
      
      Link: https://lkml.kernel.org/r/20230413214326.2147568-1-kaleshsingh@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Wei Wang <wvw@google.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: workingset: update description of the source file · ed8f3f99
      Yang Yang authored
      The calculation of the workingset size is the core logic of handling
      refaults.  It has been updated several times [1][2] since workingset.c
      was created [3], but the description was not updated accordingly, and
      this mismatch may confuse readers.  Update the description to make it
      consistent with the code.
      
      [1] commit 34e58cac ("mm: workingset: let cache workingset challenge anon")
      [2] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      [3] commit a528910e ("mm: thrash detection-based file cache sizing")
      
      Link: https://lkml.kernel.org/r/202304131634494948454@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • printk: export console trace point for kcsan/kasan/kfence/kmsan · 1f6ab566
      Pavankumar Kondeti authored
      The console tracepoint is used by kcsan/kasan/kfence/kmsan test modules. 
      Since this tracepoint is not exported, these modules iterate over all
      available tracepoints to find the console trace point.  Export the trace
      point so that it can be directly used.
      
      Link: https://lkml.kernel.org/r/20230413100859.1492323-1-quic_pkondeti@quicinc.com
      Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: refactor updating current->reclaim_state · c7b23b68
      Yosry Ahmed authored
      During reclaim, we keep track of pages reclaimed by means other than
      LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab;
      a pointer to the reclaim_state is stashed in the current task_struct.
      
      However, we keep track of more than just reclaimed slab pages through
      this.  We also use it for clean file pages dropped through pruned inodes,
      and xfs buffer pages freed.  Rename reclaimed_slab to reclaimed, and add a
      helper function that wraps updating it through current, so that future
      changes to this logic are contained within include/linux/swap.h.
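
      A helper of this kind could look roughly like the userspace sketch
      below.  The helper name account_reclaimed_pages(), struct task and the
      current_task pointer are hypothetical stand-ins for the kernel's
      "current"; only the reclaimed field name comes from the text above.

```c
#include <stddef.h>

/* Counter of pages freed outside LRU-based reclaim. */
struct reclaim_state { unsigned long reclaimed; };

/* Hypothetical stand-in for task_struct and the kernel's "current". */
struct task { struct reclaim_state *reclaim_state; };

static struct task this_task;
static struct task *current_task = &this_task;

/*
 * Single place where non-LRU reclaim is accounted.  Callers do not touch
 * the task or the counter directly; outside of reclaim (NULL pointer) the
 * call is a no-op.
 */
static void account_reclaimed_pages(unsigned long pages)
{
    if (current_task->reclaim_state)
        current_task->reclaim_state->reclaimed += pages;
}
```

      Callers such as a slab or inode pruning path would only call the
      helper, so a later change to how the counter is stored stays local to
      the helper.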
      
      Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state() · ef05e689
      Yosry Ahmed authored
      Move set_task_reclaim_state() near flush_reclaim_state() so that all
      helpers manipulating reclaim_state are in close proximity.
      
      Link: https://lkml.kernel.org/r/20230413104034.1086717-3-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim · 583c27a1
      Yosry Ahmed authored
      Patch series "Ignore non-LRU-based reclaim in memcg reclaim", v6.
      
      Upon running some proactive reclaim tests using memory.reclaim, we noticed
      some tests flaking where writing to memory.reclaim would be successful
      even though we did not fully reclaim the requested amount.  Looking
      further into it, I discovered that *sometimes* we overestimate the
      number of reclaimed pages in memcg reclaim.
      
      Pages reclaimed by means other than LRU-based reclaim are tracked
      through reclaim_state in struct scan_control, which is stashed in current
      task_struct.  These pages are added to the number of pages reclaimed
      through LRUs.  For memcg reclaim, these pages generally cannot be linked
      to the memcg under reclaim and can cause an overestimated count of
      reclaimed pages.  This short series tries to address that.
      
      Patch 1 ignores pages reclaimed outside of LRU reclaim in memcg reclaim. 
      The pages are uncharged anyway, so even if we end up under-reporting
      reclaimed pages we will still succeed in making progress during charging.
      
      Patches 2-3 are just refactoring.  Patch 2 moves the
      set_task_reclaim_state() helper next to flush_reclaim_state().  Patch 3
      adds a helper that wraps updating current->reclaim_state, and renames
      reclaim_state->reclaimed_slab to reclaim_state->reclaimed.
      
      
      This patch (of 3):
      
      We keep track of different types of reclaimed pages through
      reclaim_state->reclaimed_slab, and we add them to the reported number of
      reclaimed pages.  For non-memcg reclaim, this makes sense.  For memcg
      reclaim, we have no clue if those pages are charged to the memcg under
      reclaim.
      
      Slab pages are shared by different memcgs, so a freed slab page may have
      only been partially charged to the memcg under reclaim.  The same goes for
      clean file pages from pruned inodes (on highmem systems) or xfs buffer
      pages: there is currently no simple way to link them to the memcg under
      reclaim.
      
      Stop reporting those freed pages as reclaimed pages during memcg reclaim.
      This should make the return value of writing to memory.reclaim more
      accurate, and may help reduce unnecessary reclaim retries during memcg
      charging.  Writing to memory.reclaim on the root memcg is considered as
      cgroup_reclaim(), but for this case we want to include any freed pages,
      so use the global_reclaim() check instead of !cgroup_reclaim().
      
      Generally, this should make the return value of
      try_to_free_mem_cgroup_pages() more accurate.  In some limited cases (e.g.
      freed a slab page that was mostly charged to the memcg under reclaim),
      the return value of try_to_free_mem_cgroup_pages() can be underestimated,
      but this should be fine.  The freed pages will be uncharged anyway, and we
      can charge the memcg the next time around as we usually do memcg reclaim
      in a retry loop.
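
      The accounting rule described in this patch can be sketched in
      userspace as follows.  The structures are reduced to the fields needed
      here, the global flag is a hypothetical stand-in for the
      global_reclaim() check, and resetting the counter in both cases is an
      assumption of the sketch rather than a statement about the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Pages freed outside LRU-based reclaim (slab, pruned inodes, ...). */
struct reclaim_state { unsigned long reclaimed; };

/* Hypothetical, reduced scan_control. */
struct scan_control {
    bool global;                /* global reclaim, including the root memcg */
    unsigned long nr_reclaimed; /* pages reported back to the caller */
};

/*
 * Fold non-LRU reclaim into the reported total only for global reclaim.
 * For memcg reclaim the pages cannot be attributed to the target memcg,
 * so they are dropped from the report.
 */
static void flush_reclaimed(struct scan_control *sc, struct reclaim_state *rs)
{
    if (!rs)
        return;
    if (sc->global)
        sc->nr_reclaimed += rs->reclaimed;
    rs->reclaimed = 0;          /* assumption: start fresh either way */
}
```

      With 10 pages reclaimed from LRUs and 5 freed by other means, global
      reclaim reports 15 while memcg reclaim reports 10.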
      
      Link: https://lkml.kernel.org/r/20230413104034.1086717-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230413104034.1086717-2-yosryahmed@google.com
      Fixes: f2fe7b09 ("mm: memcg/slab: charge individual slab objects instead of pages")
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: apply __must_check to vmap_pages_range_noflush() · d905ae2b
      Alexander Potapenko authored
      To prevent errors when vmap_pages_range_noflush() or
      __vmap_pages_range_noflush() silently fail (see the link below for an
      example), annotate them with __must_check so that the callers do not
      unconditionally assume the mapping succeeded.
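
      What the annotation buys can be shown with a small userspace sketch.
      Here __must_check is defined through the GCC/Clang warn_unused_result
      attribute; map_range() and setup_mapping() are hypothetical functions
      used only for illustration.

```c
#include <errno.h>

#ifndef __must_check
#define __must_check __attribute__((__warn_unused_result__))
#endif

/* Hypothetical mapping routine that can fail. */
static __must_check int map_range(unsigned long start, unsigned long end)
{
    if (end <= start)
        return -EINVAL;
    return 0;
}

/*
 * Caller that propagates the error.  Writing just "map_range(start, end);"
 * here would make the compiler warn about the ignored return value.
 */
static int setup_mapping(unsigned long start, unsigned long end)
{
    int err = map_range(start, end);

    if (err)
        return err;
    return 0;
}
```

      The annotation does not change runtime behaviour; it only turns a
      silently dropped error into a compile-time diagnostic.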
      
      Link: https://lkml.kernel.org/r/20230413131223.4135168-4-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
        Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: kmsan: apply __must_check to non-void functions · bb1508c2
      Alexander Potapenko authored
      Non-void KMSAN hooks may return error codes that indicate that KMSAN
      failed to reflect the changed memory state in the metadata (e.g.  it could
      not create the necessary memory mappings).  In such cases the callers
      should handle the errors to prevent the tool from using the inconsistent
      metadata in the future.
      
      We mark non-void hooks with __must_check so that error handling is not
      skipped.
      
      Link: https://lkml.kernel.org/r/20230413131223.4135168-3-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dipanjan Das <mail.dipanjan.das@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hwpoison: support recovery from HugePage copy-on-write faults · 1cb9dc4b
      Liu Shixin authored
      Copy-on-write of hugetlb user pages with uncorrectable errors will result
      in a kernel crash.  This is because the copy is performed in kernel mode
      and in general we can not handle accessing memory with such errors while
      in kernel mode.  Commit a873dfe1 ("mm, hwpoison: try to recover from
      copy-on write faults") introduced the routine copy_mc_user_highpage() to
      gracefully handle copying of user pages with uncorrectable errors.
      However, the separate hugetlb copy-on-write code paths were not modified
      as part of commit a873dfe1.
      
      Modify hugetlb copy-on-write code paths to use copy_mc_user_highpage() so
      that they can also gracefully handle uncorrectable errors in user pages. 
      This involves changing the hugetlb specific routine
      copy_user_large_folio() from type void to int so that it can return an
      error.  Modify the hugetlb userfaultfd code in the same way so that it can
      return -EHWPOISON if it encounters an uncorrectable error.
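
      The void-to-int change can be pictured with the userspace sketch below.
      The types and functions (struct subpage, copy_subpage_mc(),
      copy_large_sketch()) are hypothetical; a poisoned source page is
      simulated with a flag instead of a real machine check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#ifndef EHWPOISON
#define EHWPOISON 133           /* Linux value, for non-Linux builds */
#endif

#define SUBPAGE_SIZE 4096       /* assumption: 4KB base pages */

/* Hypothetical subpage with a flag that simulates an uncorrectable error. */
struct subpage {
    unsigned char data[SUBPAGE_SIZE];
    bool poisoned;
};

/* Stand-in for a machine-check-safe copy: nonzero means the copy failed. */
static int copy_subpage_mc(struct subpage *dst, const struct subpage *src)
{
    if (src->poisoned)
        return -1;
    memcpy(dst->data, src->data, SUBPAGE_SIZE);
    return 0;
}

/* Copy n subpages; report -EHWPOISON instead of crashing on a bad one. */
static int copy_large_sketch(struct subpage *dst, const struct subpage *src,
                             size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (copy_subpage_mc(&dst[i], &src[i]))
            return -EHWPOISON;
    return 0;
}
```

      Because the routine now returns a value, the fault path can turn a bad
      source page into an error for the faulting task rather than taking the
      whole kernel down.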
      
      Link: https://lkml.kernel.org/r/20230413131349.2524210-1-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg: page_cgroup_ino() get memcg from the page's folio · ec342603
      Yosry Ahmed authored
      In a kernel with added WARN_ON_ONCE(PageTail) in page_memcg_check(), we
      observed a warning from page_cgroup_ino() when reading /proc/kpagecgroup. 
      This warning was added to catch fragile reads of a page memcg.  Make
      page_cgroup_ino() get memcg from the page's folio using
      folio_memcg_check(): that gives it the correct memcg for each page of a
      folio, so is the right fix.
      
      Note that page_folio() is racy, the page's folio can change from under us,
      but the entire function is racy and documented as such.
      
      I dithered between the right fix and the safer "fix": it's unlikely but
      conceivable that some userspace has learnt that /proc/kpagecgroup gives no
      memcg on tail pages, and compensates for that in some (racy) way: so
      continuing to give no memcg on tails, without warning, might be safer.
      
      But hwpoison_filter_task(), the only other user of page_cgroup_ino(),
      persuaded me.  It looks as if it currently leaves out tail pages of the
      selected memcg, by mistake: whereas hwpoison_inject() uses compound_head()
      and expects the tails to be included.  So hwpoison testing coverage has
      probably been restricted by the wrong output from page_cgroup_ino() (if
      that memcg filter is used at all): in the short term, it might be safer
      not to enable wider coverage there, but long term we would regret that.
      
      This is based on a patch originally written by Hugh Dickins and retains
      most of the original commit log [1].
      
      The patch was changed to use folio_memcg_check(page_folio(page)) instead
      of page_memcg_check(compound_head(page)) based on discussions with Matthew
      Wilcox; where he stated that callers of page_memcg_check() should stop
      using it due to the ambiguity around tail pages -- instead they should use
      folio_memcg_check() and handle tail pages themselves.
      
      Link: https://lkml.kernel.org/r/20230412003451.4018887-1-yosryahmed@google.com
      Link: https://lore.kernel.org/linux-mm/20230313083452.1319968-1-yosryahmed@google.com/ [1]
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb_vmemmap: rename ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP · 0b376f1e
      Aneesh Kumar K.V authored
      The ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP config option is now used to
      indicate both devdax and hugetlb vmemmap optimization support.  Hence
      rename it to the generic ARCH_WANT_OPTIMIZE_VMEMMAP.
      
      Link: https://lkml.kernel.org/r/20230412050025.84346-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Tarun Sahu <tsahu@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmemmap/devdax: fix kernel crash when probing devdax devices · 87a7ae75
      Aneesh Kumar K.V authored
      Commit 4917f55b ("mm/sparse-vmemmap: improve memory savings for
      compound devmaps") added support for using optimized vmemmap for devdax
      devices.  But how vmemmap mappings are created is architecture specific.
      For example, powerpc with hash translation doesn't have vmemmap mappings
      in the init_mm page table; instead they are bolted table entries in the
      hardware page table.
      
      vmemmap_populate_compound_pages() used by vmemmap optimization code is not
      aware of these architecture-specific mappings.  Hence allow architectures
      to opt in to this feature.  I selected architectures supporting the
      HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting this feature.
      
      This patch fixes the below crash on ppc64.
      
      BUG: Unable to handle kernel data access on write at 0xc00c000100400038
      Faulting instruction address: 0xc000000001269d90
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
      Modules linked in:
      CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
      Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
      NIP:  c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
      REGS: c000000003632c30 TRAP: 0300   Not tainted  (6.3.0-rc5-150500.34-default+)
      MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24842228  XER: 00000000
      CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
      ....
      NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
      LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
      Call Trace:
      [c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
      [c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
      [c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
      [c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
      [c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
      [c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
      [c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
      [c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
      [c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
      [c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
      [c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
      [c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
      [c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
      [c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
      [c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
      [c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
      [c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
      [c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
      [c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
      [c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
      [c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
      [c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
      [c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
      [c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
      [c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
      [c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
      [c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
      [c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
      [c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
      [c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
      [c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
      [c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64
      
      Link: https://lkml.kernel.org/r/20230411142214.64464-1-aneesh.kumar@linux.ibm.com
      Fixes: 4917f55b ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: Tarun Sahu <tsahu@linux.ibm.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add uffdio register ioctls test · 43759d44
      Peter Xu authored
      This new test checks the ioctls returned by UFFDIO_REGISTER, which are
      reported in uffdio_register.ioctls.
      
      It also tests the expected failure cases of UFFDIO_REGISTER, namely:
      
        - Register with empty mode should fail with -EINVAL
        - Register minor without page cache (anon) should fail with -EINVAL
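
      The two failure cases above can be probed with a minimal sketch such
      as the following.  This is not the selftest code: uffd_register_errno()
      is a hypothetical helper introduced only for illustration, and it
      assumes a kernel and headers that provide the userfaultfd syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the selftest): try UFFDIO_REGISTER on
 * one private anonymous page with the given mode.
 *
 * Returns 0 if registration succeeded, the errno value if the ioctl
 * failed, or -1 if userfaultfd could not be set up at all (for example
 * because the caller lacks the privilege to create one).
 */
int uffd_register_errno(uint64_t mode)
{
	long page = sysconf(_SC_PAGESIZE);
	int result = -1;
	int uffd;

	assert(page > 0);

	uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* The API handshake must precede any other userfaultfd ioctl. */
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };

	if (ioctl(uffd, UFFDIO_API, &api) == 0) {
		void *area = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (area != MAP_FAILED) {
			struct uffdio_register reg = {
				.range = {
					.start = (uintptr_t)area,
					.len = (uint64_t)page,
				},
				.mode = mode,
			};

			/* Record errno right after the failing ioctl. */
			result = ioctl(uffd, UFFDIO_REGISTER, &reg) ? errno : 0;
			munmap(area, (size_t)page);
		}
	}

	close(uffd);
	return result;
}
```

      Calling uffd_register_errno(0) should return EINVAL wherever
      userfaultfd is usable.  With headers that define
      UFFDIO_REGISTER_MODE_MINOR, passing that mode for this anonymous
      mapping is likewise expected to yield EINVAL.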
      
      Link: https://lkml.kernel.org/r/20230412164548.329376-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add shmem-private test to uffd-stress · 5aec236f
      Peter Xu authored
      The userfaultfd stress test never tested private shmem, an omission that
      I think is long overdue for a fix.  Add it so that the stress test
      matches the uffd unit test and covers all supported memory across the
      three memory types.
      
      Meanwhile, rename the memory types a bit.  Since shared memory is the
      major use case for both shmem and hugetlbfs, change from:
      
        anon, hugetlb, hugetlb_shared, shmem
      
      To (with shmem-private added):
      
        anon, hugetlb, hugetlb-private, shmem, shmem-private
      
      Add shmem-private to run_vmtests.sh too.
      
      Link: https://lkml.kernel.org/r/20230412164546.329355-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: drop sys/dev test in uffd-stress test · 111fd29b
      Peter Xu authored
      With the new uffd unit test covering both the /dev/userfaultfd path and
      the syscall path of uffd initialization, we can safely drop the devnode
      test in the old stress test.
      
      One benefit is that run_vmtests.sh no longer runs the stress test twice,
      which is overkill just to test the /dev/ interface.
      
      The other benefit is that all uffd tests (those that use
      userfaultfd_open) can now run automatically as long as any type of
      interface is enabled (either syscall or dev), so they are more likely to
      succeed rather than fail due to lack of privilege.
      
      Once this patch lands, we can drop all the "mem_type:XXX" handling too.
      
      Link: https://lkml.kernel.org/r/20230412164525.329176-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>