1. 12 Sep, 2022 31 commits
    • mm/mempolicy: fix lock contention on mems_allowed · 12c1dc8e
      Abel Wu authored
      The mems_allowed field can be modified by other tasks, so it isn't safe to
      access it with alloc_lock unlocked even in the current process context.
      
      Say there are two tasks: A from cpusetA is performing set_mempolicy(2),
      and B is changing cpusetA's cpuset.mems:
      
        A (set_mempolicy)		B (echo xx > cpuset.mems)
        -------------------------------------------------------
        pol = mpol_new();
      				update_tasks_nodemask(cpusetA) {
      				  foreach t in cpusetA {
      				    cpuset_change_task_nodemask(t) {
        mpol_set_nodemask(pol) {
      				      task_lock(t); // t could be A
          new = f(A->mems_allowed);
      				      update t->mems_allowed;
          pol.create(pol, new);
      				      task_unlock(t);
        }
      				    }
      				  }
      				}
        task_lock(A);
        A->mempolicy = pol;
        task_unlock(A);
      
      In this case A's pol->nodes is computed from the old mems_allowed, and
      could be inconsistent with A's new mems_allowed.
      
      The situation is different when replacing vmas' policy: pol->nodes can
      go wrong only when current_cpuset_is_being_rebound():
      
        A (mbind)			B (echo xx > cpuset.mems)
        -------------------------------------------------------
        pol = mpol_new();
        mmap_write_lock(A->mm);
      				cpuset_being_rebound = cpusetA;
      				update_tasks_nodemask(cpusetA) {
      				  foreach t in cpusetA {
      				    cpuset_change_task_nodemask(t) {
        mpol_set_nodemask(pol) {
      				      task_lock(t); // t could be A
          mask = f(A->mems_allowed);
      				      update t->mems_allowed;
          pol.create(pol, mask);
      				      task_unlock(t);
        }
      				    }
        foreach v in A->mm {
          if (cpuset_being_rebound == cpusetA)
            pol.rebind(pol, cpuset.mems);
          v->vma_policy = pol;
        }
        mmap_write_unlock(A->mm);
      				    mmap_write_lock(t->mm);
      				    mpol_rebind_mm(t->mm);
      				    mmap_write_unlock(t->mm);
      				  }
      				}
      				cpuset_being_rebound = NULL;
      
      In this case cpuset.mems, which has already been updated, is what is
      finally used for calculating pol->nodes, rather than A->mems_allowed.  So
      it is OK to call mpol_set_nodemask() with alloc_lock unlocked when doing
      mbind(2).
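
      The ordering requirement in the set_mempolicy(2) case can be illustrated
      in user space.  The sketch below is an analogy, not kernel code: struct
      task_sketch, its fields and all three functions are hypothetical
      stand-ins, and a pthread mutex plays the role of alloc_lock.  It shows
      the nodemask being derived from mems_allowed and installed inside one
      critical section, so a concurrent writer cannot slip in between.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for the task fields discussed above. */
struct task_sketch {
	pthread_mutex_t alloc_lock;  /* plays the role of task_lock()/task_unlock() */
	unsigned long mems_allowed;  /* bitmask of allowed nodes */
	unsigned long policy_nodes;  /* plays the role of pol->nodes */
};

static void task_sketch_init(struct task_sketch *t, unsigned long mems_allowed)
{
	pthread_mutex_init(&t->alloc_lock, NULL);
	t->mems_allowed = mems_allowed;
	t->policy_nodes = 0;
}

/* Writer side: what a cpuset.mems update does to one task. */
static void update_mems_allowed(struct task_sketch *t, unsigned long new_mask)
{
	pthread_mutex_lock(&t->alloc_lock);
	t->mems_allowed = new_mask;
	pthread_mutex_unlock(&t->alloc_lock);
}

/*
 * Reader side: derive the nodemask and install it without dropping the
 * lock in between.  Returns 0 on success, -1 if no requested node is allowed.
 */
static int set_policy_nodes(struct task_sketch *t, unsigned long requested)
{
	unsigned long nodes;
	int ret = 0;

	pthread_mutex_lock(&t->alloc_lock);
	nodes = requested & t->mems_allowed;
	if (nodes)
		t->policy_nodes = nodes;
	else
		ret = -1;
	pthread_mutex_unlock(&t->alloc_lock);
	return ret;
}
```

      With this shape, a policy computed against one mems_allowed value can
      never be installed after a different value has been written.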
      
      Link: https://lkml.kernel.org/r/20220811124157.74888-1-wuyun.abel@bytedance.com
      Fixes: 78b132e9 ("mm/mempolicy: remove or narrow the lock on current")
      Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/cma_debug: show complete cma name in debugfs directories · 9a79443d
      Charan Teja Kalla authored
      Currently only 12 characters of the cma name are used for the debug
      directory names, whereas the cma name can be up to CMA_MAX_NAME (=64)
      characters long.  One side effect is that two cma areas whose names share
      the same first 12 characters end up trying to create directories with the
      same name, which fails with -EEXIST and thus limits cma debug
      functionality.
      
      The 'cma-' prefix was used initially, when cma areas didn't have names
      and were represented by simple integer values.  Since each cma now has
      its own name, drop the 'cma-' prefix for the cma debug directories: it is
      already evident that they are for cma debug because they are created
      under the /sys/kernel/debug/cma/ path.
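
      The collision is easy to reproduce with plain string formatting.  The
      sketch below is illustrative only: both helper functions and the SKETCH_
      constants are hypothetical, and the limits are taken from the numbers
      quoted in this message rather than from the kernel source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_OLD_NAME_CHARS 12  /* characters of the name used before, per the message */
#define SKETCH_CMA_MAX_NAME   64  /* maximum cma name length, per the message */

/* Old scheme: "cma-" plus at most the first 12 characters of the area name. */
static void old_debug_dir_name(char *buf, size_t len, const char *cma_name)
{
	snprintf(buf, len, "cma-%.*s", SKETCH_OLD_NAME_CHARS, cma_name);
}

/* New scheme: the complete area name, without a prefix. */
static void new_debug_dir_name(char *buf, size_t len, const char *cma_name)
{
	snprintf(buf, len, "%s", cma_name);
}
```

      Two areas named "multimedia_region_a" and "multimedia_region_b" map to
      the same directory name under the old scheme and to distinct names under
      the new one.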
      
      Link: https://lkml.kernel.org/r/1660223729-22461-1-git-send-email-quic_charante@quicinc.com
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/swap: remove the end_write_func argument to __swap_writepage · cf1e3fe4
      Christoph Hellwig authored
      The argument is always set to end_swap_bio_write, so remove the argument
      and mark end_swap_bio_write static.
      
      Link: https://lkml.kernel.org/r/20220811141741.660214-1-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: remove unnecessary size_class NULL check · f24263a5
      Alexey Romanov authored
      pool->size_class array elements can't be NULL, so this check
      is not needed.
      
      In the whole code, we assign pool->size_class[i] values that are
      not NULL. Releasing memory for these values occurs in the
      zs_destroy_pool() function, which also releases and destroys the pool.
      
      In addition, in the zs_stats_size_show() and async_free_zspage(),
      with similar iterations over the array, we don't check it for NULL
      pointer.
      
      Link: https://lkml.kernel.org/r/20220811153755.16102-3-avromanov@sberdevices.ru
      Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: zs_object_copy: add clarifying comment · 050a388b
      Alexey Romanov authored
      Patch series "tidy up zsmalloc implementation"
      
      This patchset removes some unnecessary checks and adds a clarifying
      comment.  While analysing the zs_object_copy() code, I spent some time
      working out what the kunmap_atomic(d_addr) call is for.  The point is not
      trivial, so it is worth adding a comment.
      
      
      This patch (of 2):
      
      It's not obvious why the kunmap_atomic(d_addr) call is needed.
      
      [akpm@linux-foundation.org: tweak comment layout]
      Link: https://lkml.kernel.org/r/20220811153755.16102-1-avromanov@sberdevices.ru
      Link: https://lkml.kernel.org/r/20220811153755.16102-2-avromanov@sberdevices.ru
      Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmscan: define macros for refaults in struct lruvec · e9c2dbc8
      Yang Yang authored
      The magic numbers 0 and 1 are used in several places in vmscan.c.
      Define macros for them to improve code readability.
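
      As a rough illustration of the readability gain, the sketch below indexes
      a two-element refaults array through named macros.  The message does not
      give the macro names, so SKETCH_REFAULTS_ANON and SKETCH_REFAULTS_FILE,
      struct lruvec_sketch and snapshot_refaults() are all hypothetical, as is
      the assumption that the two slots distinguish anon from file.

```c
#include <assert.h>

/* Hypothetical names; the real macros are not named in the message. */
#define SKETCH_REFAULTS_ANON 0
#define SKETCH_REFAULTS_FILE 1

/* Hypothetical, stripped-down stand-in for the refaults field of struct lruvec. */
struct lruvec_sketch {
	unsigned long refaults[2];
};

/* Record both counters by name instead of by bare index. */
static void snapshot_refaults(struct lruvec_sketch *lruvec,
			      unsigned long anon, unsigned long file)
{
	lruvec->refaults[SKETCH_REFAULTS_ANON] = anon;
	lruvec->refaults[SKETCH_REFAULTS_FILE] = file;
}
```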
      
      Link: https://lkml.kernel.org/r/20220808005644.1721066-1-yang.yang29@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh · 4a7e9225
      Axel Rasmussen authored
      This new mode was recently added to the userfaultfd selftest. We want to
      exercise both userfaultfd(2) as well as /dev/userfaultfd, so add both
      test cases to the script.
      
      Link: https://lkml.kernel.org/r/20220808175614.3885028-6-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: update documentation to describe /dev/userfaultfd · 816284a3
      Axel Rasmussen authored
      Explain the different ways to create a new userfaultfd, and how access
      control works for each way.
      
      [axelrasmussen@google.com: improve wording in documentation, per Mike]
        Link: https://lkml.kernel.org/r/20220819205201.658693-5-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20220808175614.3885028-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: selftests: modify selftest to use /dev/userfaultfd · 77c07f7c
      Axel Rasmussen authored
      We clearly want to ensure both userfaultfd(2) and /dev/userfaultfd keep
      working into the future, so just run the test twice, using each interface.
      
      Instead of always testing both userfaultfd(2) and /dev/userfaultfd, let
      the user choose which to test.
      
      As with other test features, change the behavior based on a new command
      line flag.  Introduce the idea of "test mods", which are generic (not
      specific to a test type) modifications to the behavior of the test.  This
      is sort of borrowed from this RFC patch series [1], but simplified a bit.
      
      The benefit is, in "typical" configurations this test is somewhat slow
      (say, 30sec or something).  Testing both clearly doubles it, so it may not
      always be desirable, as users are likely to use one or the other, but
      never both, in the "real world".
      
      [1]: https://patchwork.kernel.org/project/linux-mm/patch/20201129004548.1619714-14-namit@vmware.com/
      
      [axelrasmussen@google.com: modify selftest to exit with KSFT_SKIP *only* when features are unsupported, per Mike]
        Link: https://lkml.kernel.org/r/20220819205201.658693-4-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20220808175614.3885028-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: add /dev/userfaultfd for fine grained access control · 2d5de004
      Axel Rasmussen authored
      Historically, it has been shown that intercepting kernel faults with
      userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
      time) can be exploited, or at least can make some kinds of exploits
      easier.  So, in 37cd0575 "userfaultfd: add UFFD_USER_MODE_ONLY" we
      changed things so that, in order for kernel faults to be handled by
      userfaultfd, either the process needs CAP_SYS_PTRACE, or the
      vm.unprivileged_userfaultfd sysctl must be configured so that any
      unprivileged user can do it.
      
      In a typical implementation of a hypervisor with live migration (take
      QEMU/KVM as one such example), we do indeed need to be able to handle
      kernel faults.  But, both options above are less than ideal:
      
      - Toggling the sysctl increases attack surface by allowing any
        unprivileged user to do it.
      
      - Granting the live migration process CAP_SYS_PTRACE gives it this
        ability, but *also* the ability to "observe and control the
        execution of another process [...], and examine and change [its]
        memory and registers" (from ptrace(2)). This isn't something we need
        or want to be able to do, so granting this permission violates the
        "principle of least privilege".
      
      This is all a long winded way to say: we want a more fine-grained way to
      grant access to userfaultfd, without granting other additional permissions
      at the same time.
      
      To achieve this, add a /dev/userfaultfd misc device.  This device provides
      an alternative to the userfaultfd(2) syscall for the creation of new
      userfaultfds.  The idea is, any userfaultfds created this way will be able
      to handle kernel faults, without the caller having any special
      capabilities.  Access to this mechanism is instead restricted using e.g. 
      standard filesystem permissions.
      
      [axelrasmussen@google.com: Handle misc_register() failure properly]
        Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Nadav Amit <namit@vmware.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh · a722d705
      Axel Rasmussen authored
      Patch series "userfaultfd: add /dev/userfaultfd for fine grained access
      control", v7.
      
      Why not ...?
      ============
      
      - Why not /proc/[pid]/userfaultfd? Two main points (additional discussion [1]):
      
          - /proc/[pid]/* files are all owned by the user/group of the process, and
            they don't really support chmod/chown. So, without extending procfs it
            doesn't solve the problem this series is trying to solve.
      
          - The main argument *for* this was to support creating UFFDs for remote
            processes. But, that use case clearly calls for CAP_SYS_PTRACE, so to
            support this we could just use the UFFD syscall as-is.
      
      - Why not use a syscall? Access to syscalls is generally controlled by
        capabilities. We don't have a capability which is used for userfaultfd access
        without also granting more / other permissions as well, and adding a new
        capability was rejected [2].
      
          - It's possible a LSM could be used to control access instead, but I have
            some concerns. I don't think this approach would be as easy to use,
            particularly if we were to try to solve this with something heavyweight
            like SELinux. Maybe we could pursue adding a new LSM specifically for
            this use case, but it may be too narrow a case to justify that.
      
      [1]: https://patchwork.kernel.org/project/linux-mm/cover/20220719195628.3415852-1-axelrasmussen@google.com/
      [2]: https://lore.kernel.org/lkml/686276b9-4530-2045-6bd8-170e5943abe4@schaufler-ca.com/T/
      
      
      This patch (of 5):
      
      This not being included was just a simple oversight.  There are certain
      features (like minor fault support) which are only enabled on shared
      mappings, so without including hugetlb_shared we actually lose a
      significant amount of test coverage.
      
      Link: https://lkml.kernel.org/r/20220808175614.3885028-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20220808175614.3885028-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/dbgfs: use kmalloc for allocating only one element · b2d4c646
      Kenneth Lee authored
      Use kmalloc(...) rather than kmalloc_array(1, ...).  Since the number of
      elements specified here is 1, kmalloc accomplishes the same thing and the
      call is simpler.
      
      Link: https://lkml.kernel.org/r/20220808220019.1680469-1-klee33@uw.edu
      Signed-off-by: Kenneth Lee <klee33@uw.edu>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/filemap.c: convert page_endio() to use a folio · 223ce491
      Shaoqin Huang authored
      Replace three calls to compound_head() with one.
      
      Link: https://lkml.kernel.org/r/20220809023256.178194-1-shaoqin.huang@intel.com
      Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: cleanup try_to_split_thp_page() · 2ace36f0
      Kefeng Wang authored
      Since commit 5d1fd5dc ("mm,hwpoison: introduce MF_MSG_UNSPLIT_THP"),
      action_result(,MF_MSG_UNSPLIT_THP,) is called to show the memory error
      event in memory_failure(), so the pr_info() in try_to_split_thp_page() is
      only needed in soft_offline_in_use_page().
      
      Meanwhile this could also fix the unexpected prefix for "thp split failed"
      due to commit 96f96763 ("mm: memory-failure: convert to pr_fmt()").
      
      Link: https://lkml.kernel.org/r/20220809111813.139690-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: align larger anonymous mappings on THP boundaries · f35b5d7d
      Rik van Riel authored
      Align larger anonymous memory mappings on THP boundaries by going through
      thp_get_unmapped_area if THPs are enabled for the current process.
      
      With this patch, larger anonymous mappings are now THP aligned.  When a
      malloc library allocates a 2MB or larger arena, that arena can now be
      mapped with THPs right from the start, which can result in better TLB hit
      rates and execution time.
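
      Why alignment matters can be seen with a little address arithmetic.  The
      sketch below counts how many fully aligned huge-page-sized ranges fit in
      a mapping; thp_candidates() and SKETCH_HPAGE_SIZE are hypothetical, and
      the 2MB huge page size is an assumption taken from the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HPAGE_SIZE (2UL << 20)  /* 2MB, assumed; must be a power of two */

/*
 * Number of huge-page-aligned, huge-page-sized ranges that lie entirely
 * inside [addr, addr + len).  Only such ranges can be backed by a THP.
 */
static size_t thp_candidates(uintptr_t addr, size_t len, size_t hpage)
{
	uintptr_t first = (addr + hpage - 1) & ~(uintptr_t)(hpage - 1);
	uintptr_t end = addr + len;

	if (first >= end)
		return 0;
	return (end - first) / hpage;
}
```

      A 2MB arena that starts one base page past a 2MB boundary contains no
      such range, while the same arena placed on the boundary contains one.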
      
      Link: https://lkml.kernel.org/r/20220809142457.4751229f@imladris.surriel.com
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_ext: remove unused variable in offline_page_ext · 7b5a0b66
      Charan Teja Kalla authored
      Remove the unused variable 'nid' in offline_page_ext().  It has not been
      used since the inception of the page_ext code.
      
      Link: https://lkml.kernel.org/r/1659330397-11817-1-git-send-email-quic_charante@quicinc.com
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add selftest to verify multi THP collapse · 9d0d9468
      Zach O'Keefe authored
      Add support to allocate and verify collapse of multiple hugepage-sized
      regions into multiple THPs.
      
      Add "nr" argument to check_huge() that instructs check_huge() to check for
      exactly "nr_hpages" THPs.  This has the added benefit of now being able to
      check for exactly 0 THPs, and so callsites that previously checked the
      negation of exactly 1 THP are now more correct.
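
      An exact-count check of this kind could be sketched as below.  It assumes
      the check is based on the AnonHugePages line of /proc/self/smaps and
      takes that line as a string; anon_huge_matches() is a hypothetical helper
      and not the selftest's actual check_huge().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Parse an "AnonHugePages: <n> kB" line and report whether it accounts for
 * exactly nr_hpages huge pages of hpage_pmd_size bytes each.
 * Lines that do not match the expected format yield false.
 */
static bool anon_huge_matches(const char *smaps_line, int nr_hpages,
			      unsigned long hpage_pmd_size)
{
	unsigned long kb;

	if (sscanf(smaps_line, "AnonHugePages: %lu kB", &kb) != 1)
		return false;
	return kb == (unsigned long)nr_hpages * (hpage_pmd_size >> 10);
}
```

      Passing nr_hpages == 0 expresses "no THPs at all", which is stricter than
      negating a check for exactly one THP.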
      
      The ->collapse hook of struct collapse_context has been expanded with a
      "nr_hpages" argument to collapse "nr_hpages" hugepages.  The
      collapse_full() test has been repurposed to collapse 4 THPs at once.  It
      is expected that more tests will want to test multi THP collapse (e.g.
      file/shmem).
      
      This is of particular benefit to madvise collapse context given that it
      may do many THP collapses during a single syscall.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-19-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add selftest to verify recollapse of THPs · 1370a21f
      Zach O'Keefe authored
      Add a selftest specific to the madvise collapse context that tests that
      MADV_COLLAPSE is "successful" if a hugepage-aligned/sized region is
      already pmd-mapped.
      
      This test also verifies that MADV_COLLAPSE can collapse memory into THPs
      even in "madvise" THP mode when the memory isn't marked VM_HUGEPAGE.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-18-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add MADV_COLLAPSE collapse context to selftests · 9330694d
      Zach O'Keefe authored
      Add madvise collapse context to hugepage collapse selftests.  This context
      is tested with /sys/kernel/mm/transparent_hugepage/enabled set to "never"
      in order to avoid unwanted interaction with khugepaged during testing.
      
      Also, refactor updates to sysfs THP settings using a stack so that the THP
      settings from nested callers can be restored.
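
      The stack idea can be sketched as follows.  Everything here is
      hypothetical (the struct, its two fields, the depth limit and the
      function names), and the sysfs writes a real implementation would
      perform are only marked by comments.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DEPTH 8  /* arbitrary limit for the sketch */

/* Hypothetical subset of THP settings. */
struct thp_settings_sketch {
	int thp_enabled;             /* e.g. 0 = never, 1 = madvise, 2 = always */
	unsigned int max_ptes_none;
};

static struct thp_settings_sketch settings_stack[SKETCH_MAX_DEPTH];
static int settings_depth;

/* Settings currently in effect, or NULL if nothing has been pushed. */
static const struct thp_settings_sketch *current_settings(void)
{
	return settings_depth ? &settings_stack[settings_depth - 1] : NULL;
}

/* Apply new settings on top of whatever the caller's caller applied. */
static int push_settings(const struct thp_settings_sketch *s)
{
	if (settings_depth >= SKETCH_MAX_DEPTH)
		return -1;
	settings_stack[settings_depth++] = *s;
	/* A real implementation would write *s out to sysfs here. */
	return 0;
}

/* Drop the top entry so the previous caller's settings apply again. */
static int pop_settings(void)
{
	if (settings_depth == 0)
		return -1;
	settings_depth--;
	/* A real implementation would rewrite sysfs from current_settings() here. */
	return 0;
}
```

      Each test pushes what it needs and pops on exit, so an outer test's
      settings come back automatically when a nested helper returns.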
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-17-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: dedup hugepage allocation logic · be6667b0
      Zach O'Keefe authored
      The code
      
      	p = alloc_mapping();
      	printf("Allocate huge page...");
      	madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
      	fill_memory(p, 0, hpage_pmd_size);
      	if (check_huge(p))
      		success("OK");
      	else
      		fail("Fail");
      
      is repeated many times in different tests.  Add a helper, alloc_hpage(),
      to handle this.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-16-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be6667b0
    • Zach O'Keefe's avatar
      selftests/vm: modularize collapse selftests · 61c2c676
      Zach O'Keefe authored
      Modularize the collapse action of khugepaged collapse selftests by
      introducing a struct collapse_context which specifies how to collapse a
      given memory range and the expected semantics of the collapse.  This can
      be reused later to test other collapse contexts.
      
      Additionally, all tests have logic that checks if a collapse occurred via
      reading /proc/self/smaps, and report if this is different than expected. 
      Move this logic into the per-context ->collapse() hook instead of
      repeating it in every test.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-15-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      61c2c676
    • Zach O'Keefe's avatar
      mm/madvise: add MADV_COLLAPSE to process_madvise() · 876b4a18
      Zach O'Keefe authored
      Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
      CAP_SYS_ADMIN or is requesting collapse of its own memory.
      
      This is useful for the development of userspace agents that seek to
      optimize THP utilization system-wide by using userspace signals to
      prioritize what memory is most deserving of being THP-backed.
      
      [zokeefe@google.com: remove CAP_SYS_ADMIN requirement for process_madvise(MADV_COLLAPSE)]
        Link: https://lkml.kernel.org/r/20220801210946.3069083-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-13-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      876b4a18
    • Zach O'Keefe's avatar
      mm/khugepaged: rename prefix of shared collapse functions · 7d2c4385
      Zach O'Keefe authored
      The following functions are shared between khugepaged and madvise collapse
      contexts.  Replace the "khugepaged_" prefix with generic "hpage_collapse_"
      prefix in such cases:
      
      khugepaged_test_exit() -> hpage_collapse_test_exit()
      khugepaged_scan_abort() -> hpage_collapse_scan_abort()
      khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
      khugepaged_find_target_node() -> hpage_collapse_find_target_node()
      khugepaged_alloc_page() -> hpage_collapse_alloc_page()
      
      The kernel ABI (e.g.  huge_memory:mm_khugepaged_scan_pmd tracepoint) is
      unaltered.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-11-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d2c4385
    • Zach O'Keefe's avatar
      mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse · 7d8faaf1
      Zach O'Keefe authored
      This idea was introduced by David Rientjes[1].
      
      Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
      a synchronous collapse of memory at their own expense.
      
      The benefits of this approach are:
      
      * CPU is charged to the process that wants to spend the cycles for the
        THP
      * Avoid unpredictable timing of khugepaged collapse
      
      Semantics
      
      This call is independent of the system-wide THP sysfs settings, but will
      fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
      multiple VMAs, the semantics of the collapse over each VMA is independent
      from the others.  This implies a hugepage cannot cross a VMA boundary.  If
      collapse of a given hugepage-aligned/sized region fails, the operation may
      continue to attempt collapsing the remainder of memory specified.
      
      The memory ranges provided must be page-aligned, but are not required to
      be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
      start/end of the range will be clamped to the first/last hugepage-aligned
      address covered by said range.  The memory ranges must span at least one
      hugepage-sized region.
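      
      The clamping described above can be sketched in userspace C.  This is
      an illustrative model only, not the kernel's code: clamp_to_hpage() and
      EXAMPLE_HPAGE_SIZE are hypothetical names, and the 2 MiB hugepage size
      is an assumption (the PMD size on x86-64).  The sketch ignores
      overflow for addresses near the top of the address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed hugepage size for illustration: 2 MiB (PMD size on x86-64). */
#define EXAMPLE_HPAGE_SIZE ((uintptr_t)2 << 20)

/*
 * Clamp [start, end) inward to hugepage boundaries.
 * Stores the clamped bounds and returns true if at least one whole
 * hugepage-sized region is covered, false otherwise.
 */
static bool clamp_to_hpage(uintptr_t start, uintptr_t end,
			   uintptr_t hpage_size,
			   uintptr_t *hstart, uintptr_t *hend)
{
	uintptr_t mask = hpage_size - 1;

	/* The mask arithmetic requires a nonzero power-of-two size. */
	assert(hpage_size != 0 && (hpage_size & mask) == 0);

	*hstart = (start + mask) & ~mask;	/* round start up */
	*hend = end & ~mask;			/* round end down */

	return *hstart < *hend;
}
```

      For example, with 2 MiB hugepages the page-aligned range
      [0x201000, 0x801000) clamps to [0x400000, 0x800000), while
      [0x201000, 0x5ff000) covers no whole hugepage-sized region.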
      
      All non-resident pages covered by the range will first be
      swapped/faulted-in, before being internally copied onto a freshly
      allocated hugepage.  Unmapped pages will have their data directly
      initialized to 0 in the new hugepage.  However, for every eligible
      hugepage aligned/sized region to-be collapsed, at least one page must
      currently be backed by memory (a PMD covering the address range must
      already exist).
      
      Allocation for the new hugepage may enter direct reclaim and/or
      compaction, regardless of VMA flags.  When the system has multiple NUMA
      nodes, the hugepage will be allocated from the node providing the most
      native pages.  This operation acts on the current state of the
      specified process and makes no persistent changes or guarantees on how
      pages will be mapped, constructed, or faulted in the future.
      
      Return Value
      
      If all hugepage-sized/aligned regions covered by the provided range were
      either successfully collapsed, or were already PMD-mapped THPs, this
      operation will be deemed successful.  On success, process_madvise(2)
      returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
      is returned and errno is set to indicate the error for the most-recently
      attempted hugepage collapse.  Note that many failures might have occurred,
      since the operation may continue to collapse in the event a single
      hugepage-sized/aligned region fails.
      
      	ENOMEM	Memory allocation failed or VMA not found
      	EBUSY	Memcg charging failed
      	EAGAIN	Required resource temporarily unavailable.  Trying again
      		might succeed.
      	EINVAL	Other error: No PMD found, subpage doesn't have Present
      		bit set, "Special" page not backed by struct page, VMA
      		incorrectly sized, address not page-aligned, ...
      
      Most notable here are ENOMEM and EBUSY (new to madvise), which are intended
      to provide the caller with actionable feedback so they may take an
      appropriate fallback measure.
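      
      A caller might act on these errno values roughly as follows.  This is
      a sketch under stated assumptions: the collapse_action enum, the
      function names, and the mapping from errno to action are hypothetical
      policy choices, not part of the interface.  MADV_COLLAPSE is defined
      as 25 (the asm-generic value) only if the installed libc headers do
      not already provide it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* asm-generic value; assumed for older headers */
#endif

/* Hypothetical caller-side policy, not defined by the kernel. */
enum collapse_action {
	COLLAPSE_DONE,		/* range is now, or already was, THP-backed */
	COLLAPSE_RETRY,		/* EAGAIN: transient, trying again might succeed */
	COLLAPSE_BACK_OFF,	/* ENOMEM/EBUSY: allocation or memcg charge failed */
	COLLAPSE_GIVE_UP,	/* EINVAL and others: range ineligible or unsupported */
};

static enum collapse_action classify_collapse_errno(int err)
{
	/* Expects a positive errno value (or 0), not a negated kernel code. */
	assert(err >= 0);

	switch (err) {
	case 0:
		return COLLAPSE_DONE;
	case EAGAIN:
		return COLLAPSE_RETRY;
	case ENOMEM:	/* may also mean that no VMA was found */
	case EBUSY:
		return COLLAPSE_BACK_OFF;
	default:
		return COLLAPSE_GIVE_UP;
	}
}

/* Request a synchronous collapse and translate the outcome. */
static enum collapse_action try_collapse(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return COLLAPSE_DONE;
	return classify_collapse_errno(errno);
}
```

      Since errno reflects only the most recently attempted hugepage, a
      caller that needs per-region results could invoke try_collapse() once
      per hugepage-sized region instead of once for the whole range.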
      
      Use Cases
      
      Immediate users of this new functionality are malloc() implementations
      that manage memory in hugepage-sized chunks, but sometimes subrelease
      memory back to the system in native-sized chunks via MADV_DONTNEED;
      zapping the pmd.  Later, when the memory is hot, the implementation could
      madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
      coverage and dTLB performance.  TCMalloc is such an implementation that
      could benefit from this[2].
      
      Only privately-mapped anon memory is supported for now, but additional
      support for file, shmem, and HugeTLB high-granularity mappings[2] is
      expected.  File and tmpfs/shmem support would permit:
      
      * Backing executable text by THPs.  Current support provided by
        CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
        might impair services from serving at their full rated load after
        (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
        immediately realize iTLB performance prevents page sharing and demand
        paging, both of which increase steady state memory footprint.  With
        MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
        and lower RAM footprints.
      * Backing guest memory by hugepages after the memory contents have been
        migrated in native-page-sized chunks to a new host, in a
        userfaultfd-based live-migration stack.
      
      [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
      [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
      
      [jrdr.linux@gmail.com: avoid possible memory leak in failure path]
        Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
      [zokeefe@google.com: add missing kfree() to madvise_collapse()]
        Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
        Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
      [zokeefe@google.com: delay computation of hpage boundaries until use]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d8faaf1
    • Zach O'Keefe's avatar
      mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage · 50722804
      Zach O'Keefe authored
      When scanning an anon pmd to see if it's eligible for collapse, return
      SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
      SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
      file-collapse path, since the latter might identify pte-mapped compound
      pages.  This is required by MADV_COLLAPSE which necessarily needs to know
      what hugepage-aligned/sized regions are already pmd-mapped.
      
      In order to determine if a pmd already maps a hugepage, refactor
      mm_find_pmd():
      
      Return mm_find_pmd() to its pre-commit f72e7dcd ("mm: let mm_find_pmd
      fix buggy race with THP fault") behavior.  ksm was the only caller that
      explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
      there (pmd_present() and pmd_trans_huge() checks).
      
      Undo the change made in commit f72e7dcd ("mm: let mm_find_pmd fix buggy
      race with THP fault") that open-coded the pmd lookup in
      split_huge_pmd_address(), and use mm_find_pmd() there instead.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      50722804
    • Zach O'Keefe's avatar
      mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() · a7f4e6e4
      Zach O'Keefe authored
      MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
      
      hugepage_vma_check() is the authority on determining if a VMA is eligible
      for THP allocation/collapse, and currently enforces the sysfs THP
      settings.  Add a flag to disable these checks.  For now, only apply this
      arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled. 
      We can expand this to shmem, which uses
      /sys/kernel/transparent_hugepage/shmem_enabled, later.
      
      Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
      passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
      VM_HUGEPAGE check in "madvise" THP mode.  Prior to "mm: khugepaged: check
      THP flag in hugepage_vma_check()", this check also didn't check "never"
      THP mode.  As such, this restores the previous behavior of
      collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
      comment in code for justification why this is OK.
      
      [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a7f4e6e4
    • Zach O'Keefe's avatar
      mm/khugepaged: add flag to predicate khugepaged-only behavior · d8ea7cc8
      Zach O'Keefe authored
      Add .is_khugepaged flag to struct collapse_control so khugepaged-specific
      behavior can be elided by MADV_COLLAPSE context.
      
      Start by protecting khugepaged-specific heuristics by this flag.  In
      MADV_COLLAPSE, the user presumably has reason to believe the collapse will
      be beneficial and khugepaged heuristics shouldn't prevent the user from
      doing so:
      
      1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]
      
      2) requirement that some pages in region being collapsed be young or
         referenced
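      
      The gating described above can be modeled in a few lines of userspace
      C.  This is a simplified sketch, not the kernel implementation: only
      the is_khugepaged flag and the max_ptes_none knob come from the text
      above; struct region_stats and heuristics_allow_collapse() are
      hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct collapse_control. */
struct collapse_control {
	bool is_khugepaged;	/* true in khugepaged, false for MADV_COLLAPSE */
};

/* Hypothetical summary of one hugepage-sized region after scanning. */
struct region_stats {
	unsigned int none_ptes;	/* number of unmapped PTEs in the region */
	bool referenced;	/* at least one page is young or referenced */
};

/* Decide whether the khugepaged-style heuristics permit a collapse. */
static bool heuristics_allow_collapse(const struct collapse_control *cc,
				      const struct region_stats *st,
				      unsigned int max_ptes_none)
{
	assert(cc != NULL && st != NULL);

	/* MADV_COLLAPSE: the user asked explicitly, so skip the heuristics. */
	if (!cc->is_khugepaged)
		return true;

	/* khugepaged: respect the sysfs-controlled limit ... */
	if (st->none_ptes > max_ptes_none)
		return false;

	/* ... and require some recently used page in the region. */
	return st->referenced;
}
```

      In this model a sparse, cold region is rejected when the flag is set
      and accepted when it is clear, which mirrors the intent that
      khugepaged heuristics should not block an explicit user request.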
      
      [zokeefe@google.com: consistently order cc->is_khugepaged and pte_* checks]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-3-zokeefe@google.com
        Link: https://lore.kernel.org/linux-mm/Ys2qJm6FaOQcxkha@google.com/
      Link: https://lkml.kernel.org/r/20220706235936.2197195-7-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d8ea7cc8
    • Zach O'Keefe's avatar
      mm/khugepaged: propagate enum scan_result codes back to callers · 50ad2f24
      Zach O'Keefe authored
      Propagate enum scan_result codes back through return values of
      functions downstream of khugepaged_scan_file() and
      khugepaged_scan_pmd() to inform callers if the operation was
      successful, and if not, why.
      
      Since khugepaged_scan_pmd()'s return value already has a specific meaning
      (whether mmap_lock was unlocked or not), add a bool* argument to
      khugepaged_scan_pmd() to retrieve this information.
      
      Change khugepaged to take action based on the return values of
      khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting deep
      within the collapsing functions themselves.
      
      hugepage_vma_revalidate() now returns SCAN_SUCCEED on success to be more
      consistent with enum scan_result propagation.
      
      Remove dependency on error pointers to communicate to khugepaged that
      allocation failed and it should sleep; instead just use the result of the
      scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).
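      
      The calling convention described above can be illustrated with a
      small userspace model.  This is a sketch, not the kernel code: the
      enum lists only a subset of enum scan_result with illustrative
      values, and scan_pmd_model() and khugepaged_should_sleep_model() are
      hypothetical names.  The model assumes, for simplicity, that mmap_lock
      is dropped whenever a collapse is attempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Subset of the kernel's enum scan_result; values here are illustrative. */
enum scan_result {
	SCAN_FAIL,
	SCAN_SUCCEED,
	SCAN_ALLOC_HUGE_PAGE_FAIL,
	SCAN_CGROUP_CHARGE_FAIL,
};

/*
 * Model of a scan function: the result code is the return value, and
 * the lock state travels through the bool pointer argument.
 */
static enum scan_result scan_pmd_model(bool eligible, bool alloc_ok,
				       bool charge_ok, bool *mmap_locked)
{
	assert(mmap_locked != NULL);

	*mmap_locked = true;
	if (!eligible)
		return SCAN_FAIL;	/* nothing attempted, lock still held */

	*mmap_locked = false;		/* model: a collapse attempt drops the lock */
	if (!alloc_ok)
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
	if (!charge_ok)
		return SCAN_CGROUP_CHARGE_FAIL;
	return SCAN_SUCCEED;
}

/* The caller decides what to do from the result, not the callee. */
static bool khugepaged_should_sleep_model(enum scan_result result)
{
	return result == SCAN_ALLOC_HUGE_PAGE_FAIL;
}
```

      Keeping the decision in the caller means a second context, such as
      MADV_COLLAPSE, can reuse the same scan function and react to the same
      result codes differently.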
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-6-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      50ad2f24
    • Zach O'Keefe's avatar
      mm/khugepaged: dedup and simplify hugepage alloc and charging · 9710a78a
      Zach O'Keefe authored
      The following code is duplicated in collapse_huge_page() and
      collapse_file():
      
      	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
      
      	new_page = khugepaged_alloc_page(hpage, gfp, node);
      	if (!new_page) {
      		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
      		goto out;
      	}
      
      	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
      		result = SCAN_CGROUP_CHARGE_FAIL;
      		goto out;
      	}
      	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
      
      Also, "node" is passed as an argument to both collapse_huge_page() and
      collapse_file() and obtained the same way, via
      khugepaged_find_target_node().
      
      Move all this into a new helper, alloc_charge_hpage(), and remove the
      duplicate code from collapse_huge_page() and collapse_file().  Also,
      simplify khugepaged_alloc_page() by returning a bool indicating allocation
      success instead of a copy of the allocated struct page *.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-5-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9710a78a
    • Zach O'Keefe's avatar
      mm/khugepaged: add struct collapse_control · 34d6b470
      Zach O'Keefe authored
      Modularize hugepage collapse by introducing struct collapse_control.  This
      structure serves to describe the properties of the requested collapse, as
      well as serve as a local scratch pad to use during the collapse itself.
      
      Start by moving global per-node khugepaged statistics into this new
      structure.  Note that this structure is still statically allocated since
      CONFIG_NODES_SHIFT might be arbitrarily large, and stack-allocating a
      MAX_NUMNODES-sized array could cause -Wframe-larger-than= errors.
      
      [zokeefe@google.com: use minimal bits to store num page < HPAGE_PMD_NR]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-2-zokeefe@google.com
        Link: https://lore.kernel.org/linux-mm/Ys2CeIm%2FQmQwWh9a@google.com/
      [sfr@canb.auug.org.au: fix build]
        Link: https://lkml.kernel.org/r/20220721195508.15f1e07a@canb.auug.org.au
      [zokeefe@google.com: fix struct collapse_control load_node definition]
        Link: https://lore.kernel.org/linux-mm/202209021349.F73i5d6X-lkp@intel.com/
        Link: https://lkml.kernel.org/r/20220903021221.1130021-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-4-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA · c6a7f445
      Yang Shi authored
      Patch series "mm: userspace hugepage collapse", v7.
      
      Introduction
      --------------------------------
      
      This series provides a mechanism for userspace to induce a collapse of
      eligible ranges of memory into transparent hugepages in process context,
      thus permitting users to more tightly control their own hugepage
      utilization policy at their own expense.
      
      This idea was introduced by David Rientjes[5].
      
      Interface
      --------------------------------
      
      The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
      leverages the new process_madvise(2) call.
      
      process_madvise(2)
      
      	Performs a synchronous collapse of the native pages
      	mapped by the list of iovecs into transparent hugepages.
      
      	This operation is independent of the system THP sysfs settings,
      	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.
      
      	THP allocation may enter direct reclaim and/or compaction.
      
      	When a range spans multiple VMAs, the semantics of the collapse
      	over each VMA are independent of the others.
      
      	Caller must have CAP_SYS_ADMIN if not acting on self.
      
      	Return value follows existing process_madvise(2) conventions.  A
      	“success” indicates that all hugepage-sized/aligned regions
      	covered by the provided range were either successfully
      	collapsed, or were already pmd-mapped THPs.
      
      madvise(2)
      
      	Equivalent to process_madvise(2) on self, with 0 returned on
      	“success”.
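      
      As an illustration of the madvise(2) mode described above, the following
      is a minimal userspace sketch, not code from this series.  It assumes
      2 MiB PMD-sized hugepages, and it falls back to the asm-generic UAPI
      value 25 when the libc headers do not yet define MADV_COLLAPSE.  The
      names HPAGE_SIZE, hpage_align_up() and try_collapse() are hypothetical
      helpers introduced only for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Fallback for older userspace headers; 25 is the asm-generic UAPI value. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB PMD-sized hugepages (e.g. x86-64 with 4 KiB base pages). */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Round addr up to the next hugepage boundary. */
static inline uintptr_t hpage_align_up(uintptr_t addr)
{
	/* Guard against wrap-around near the top of the address space. */
	assert(addr <= UINTPTR_MAX - (HPAGE_SIZE - 1));
	return (addr + HPAGE_SIZE - 1) & ~(uintptr_t)(HPAGE_SIZE - 1);
}

/*
 * Ask the kernel to synchronously collapse [addr, addr + len) into THPs.
 * Returns 0 on success or a negative errno value, for example -EINVAL on
 * kernels that do not know MADV_COLLAPSE.
 */
static inline int try_collapse(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return 0;
	return -errno;
}
```

      A caller would typically pass a hugepage-aligned address inside a
      populated private anonymous mapping, and treat a negative return as a
      hint to keep using native pages.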
      
      Current Use-Cases
      --------------------------------
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system which might impair services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevent
      	page sharing and demand paging, both of which increase steady state
      	memory footprint.  With MADV_COLLAPSE, we get the best of both
      	worlds: Peak upfront performance and lower RAM footprints.  Note
      	that subsequent support for file-backed memory is required here.
      
      (2)	malloc() implementations that manage memory in hugepage-sized
      	chunks, but sometimes subrelease memory back to the system in
      	native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later,
      	when the memory is hot, the implementation could
      	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
      	hugepage coverage and dTLB performance.  TCMalloc is such an
      	implementation that could benefit from this[6].  A prior study of
      	Google internal workloads during evaluation of Temeraire, a
      	hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
      	all cpu cycles were spent in dTLB stalls, and that increasing
      	hugepage coverage by even a small amount can help with that[7].
      
      (3)	userfaultfd-based live migration of virtual machines satisfies UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.  Note that
      	subsequent support for file/shmem-backed memory is required here.
      
      (4)	HugeTLB high-granularity mapping allows a HugeTLB page to
      	be mapped at different levels in the page tables[8].  As it's not
      	"transparent" like THP, HugeTLB high-granularity mappings require
      	an explicit user API. It is intended that MADV_COLLAPSE be co-opted
      	for this use case[9].  Note that subsequent support for HugeTLB
      	memory is required here.
      
      Future work
      --------------------------------
      
      Only private anonymous memory is supported by this series. File and
      shmem memory support will be added later.
      
      One possible user of this functionality is a userspace agent that
      attempts to optimize THP utilization system-wide by allocating THPs
      based on, for example, task priority, task performance requirements, or
      heatmaps.  For the latter, one idea that has already surfaced is using
      DAMON to identify hot regions, and driving THP collapse through a new
      DAMOS_COLLAPSE scheme[10].
      
      
      This patch (of 17):
      
      khugepaged has an optimization that reduces huge page allocation calls for
      !CONFIG_NUMA by carrying a huge page that was allocated but failed to
      collapse over to the next loop.  CONFIG_NUMA doesn't do so since the next
      loop may try to collapse a huge page from a different node, so it doesn't
      make much sense to carry it.
      
      But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
      before scanning the address space, so a huge page may be allocated even
      though there is no suitable range for collapsing.  The page is then simply
      freed if khugepaged has already made enough progress.  This could make a
      NUMA=n run have 5 times as much thp_collapse_alloc as a NUMA=y run.  The
      extra THP allocations are wasted work, which defeats the purpose of the
      optimization.
      
      This could be fixed by carrying the huge page across scans, but that would
      complicate the code further and the huge page may be carried indefinitely.
      Taking a step back, the optimization itself no longer seems worth keeping
      since:
      
        * Not many users build a NUMA=n kernel nowadays, even when the kernel is
          actually running on a non-NUMA machine.  Some small devices may run a
          NUMA=n kernel, but I don't think they actually use THP.
        * Since commit 44042b44 ("mm/page_alloc: allow high-order pages to be
          stored on the per-cpu lists"), THP could be cached by pcp.  This
          already does, to some extent, the job done by the optimization.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-3-zokeefe@google.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Co-developed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 28 Aug, 2022 9 commits
    • Linux 6.0-rc3 · b90cb105
      Linus Torvalds authored
    • Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm · b467192e
      Linus Torvalds authored
      Pull more hotfixes from Andrew Morton:
       "Seventeen hotfixes.  Mostly memory management things.
      
        Ten patches are cc:stable, addressing pre-6.0 issues"
      
      * tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        .mailmap: update Luca Ceresoli's e-mail address
        mm/mprotect: only reference swap pfn page if type match
        squashfs: don't call kmalloc in decompressors
        mm/damon/dbgfs: avoid duplicate context directory creation
        mailmap: update email address for Colin King
        asm-generic: sections: refactor memory_intersects
        bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
        ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
        Revert "memcg: cleanup racy sum avoidance code"
        mm/zsmalloc: do not attempt to free IS_ERR handle
        binder_alloc: add missing mmap_lock calls when using the VMA
        mm: re-allow pinning of zero pfns (again)
        vmcoreinfo: add kallsyms_num_syms symbol
        mailmap: update Guilherme G. Piccoli's email addresses
        writeback: avoid use-after-free after removing device
        shmem: update folio if shmem_replace_page() updates the page
        mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
    • Merge tag 'bitmap-6.0-rc3' of github.com:/norov/linux · 373eff57
      Linus Torvalds authored
      Pull bitmap fixes from Yury Norov:
       "Fix the reported issues, and implement the suggested improvements,
        for the version of the cpumask tests [1] that was merged with commit
        c41e8866 ("lib/test: introduce cpumask KUnit test suite").
      
        These changes include fixes for the tests, and better alignment with
        the KUnit style guidelines"
      
      * tag 'bitmap-6.0-rc3' of github.com:/norov/linux:
        lib/cpumask_kunit: add tests file to MAINTAINERS
        lib/cpumask_kunit: log mask contents
        lib/test_cpumask: follow KUnit style guidelines
        lib/test_cpumask: fix cpu_possible_mask last test
        lib/test_cpumask: drop cpu_possible_mask full test
    • .mailmap: update Luca Ceresoli's e-mail address · 0ebafe2e
      Luca Ceresoli authored
      My Bootlin address is preferred from now on.
      
      Link: https://lkml.kernel.org/r/20220826130515.3011951-1-luca.ceresoli@bootlin.com
      Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Atish Patra <atishp@atishpatra.org>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: only reference swap pfn page if type match · 3d2f78f0
      Peter Xu authored
      Yu Zhao reported a bug after the commit "mm/swap: Add swp_offset_pfn() to
      fetch PFN from swap entry" added a check in swp_offset_pfn() for swap type [1]:
      
        kernel BUG at include/linux/swapops.h:117!
        CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S         O L 6.0.0-dbg-DEV #2
        RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
        Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
        c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b
        48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
        RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
        RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
        RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
        RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
        R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
        R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
        FS:  00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <TASK>
         change_pte_range+0x36e/0x880
         change_p4d_range+0x2e8/0x670
         change_protection_range+0x14e/0x2c0
         mprotect_fixup+0x1ee/0x330
         do_mprotect_pkey+0x34c/0x440
         __x64_sys_mprotect+0x1d/0x30
      
      It triggers because pfn_swap_entry_to_page() could be called upon e.g. a
      genuine swap entry.
      
      Fix it by only calling it when it's a write migration entry where the page*
      is used.
      
      [1] https://lore.kernel.org/lkml/CAOUHufaVC2Za-p8m0aiHw6YkheDcrO-C3wRGixwDS32VTS+k1w@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20220823221138.45602-1-peterx@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Yu Zhao <yuzhao@google.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • squashfs: don't call kmalloc in decompressors · 1f13dff0
      Phillip Lougher authored
      The decompressors may be called while in an atomic section.  So move the
      kmalloc() out of this path, and into the "page actor" init function.
      
      This fixes a regression introduced by commit
      f268eedd ("squashfs: extend "page actor" to handle missing pages")
      
      Link: https://lkml.kernel.org/r/20220822215430.15933-1-phillip@squashfs.org.uk
      Fixes: f268eedd ("squashfs: extend "page actor" to handle missing pages")
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/dbgfs: avoid duplicate context directory creation · d26f6070
      Badari Pulavarty authored
      When a user tries to create a DAMON context via the DAMON debugfs interface
      with the name of an already existing context, the context directory creation
      fails, but a new context is still created and added to the internal data
      structure because the directory creation result is not checked.  As a
      result, memory could leak and DAMON cannot be turned on.  An example test
      case is as below:
      
          # cd /sys/kernel/debug/damon/
          # echo "off" >  monitor_on
          # echo paddr > target_ids
          # echo "abc" > mk_context
          # echo "abc" > mk_context
          # echo $$ > abc/target_ids
          # echo "on" > monitor_on  <<< fails
      
      The return value of 'debugfs_create_dir()' is expected to be ignored in
      general, but this is an exceptional case because the DAMON feature depends
      on the debugfs functionality and has the potential duplicate name
      issue.  This commit therefore fixes the issue by checking for directory
      creation failure and immediately returning the error in that case.
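      
      The difference between the unchecked and checked flows can be sketched
      with a small userspace model.  This is not the kernel code: the
      directory table, the counters and every function name below are
      hypothetical stand-ins for debugfs and for DAMON's internal context
      array, chosen only to show why the creation result has to be checked
      before a context is registered.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_DIRS 8
#define NAME_LEN 32

/* Toy stand-ins for the debugfs tree and the internal context list. */
static char dirs[MAX_DIRS][NAME_LEN];
static int nr_dirs;
static int nr_ctxs;

/* Hypothetical directory creation: fails with -EEXIST on a duplicate name. */
static int toy_create_dir(const char *name)
{
	if (strlen(name) >= NAME_LEN)
		return -ENAMETOOLONG;
	for (int i = 0; i < nr_dirs; i++)
		if (strcmp(dirs[i], name) == 0)
			return -EEXIST;
	if (nr_dirs == MAX_DIRS)
		return -ENOSPC;
	strcpy(dirs[nr_dirs++], name);
	return 0;
}

/* Unchecked flow: the result of directory creation is ignored. */
static int mk_context_unchecked(const char *name)
{
	(void)toy_create_dir(name);
	nr_ctxs++;	/* a context is registered even without a directory */
	return 0;
}

/* Checked flow: return the error before registering anything. */
static int mk_context_checked(const char *name)
{
	int err = toy_create_dir(name);

	if (err)
		return err;
	nr_ctxs++;
	return 0;
}
```

      In this model, calling mk_context_unchecked("abc") twice leaves one
      directory but two registered contexts, while the second
      mk_context_checked("abc") call returns -EEXIST and registers nothing.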
      
      Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org
      Fixes: 75c1c2b5 ("mm/damon/dbgfs: support multiple contexts")
      Signed-off-by: Badari Pulavarty <badari.pulavarty@intel.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[ 5.15.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mailmap: update email address for Colin King · ac733f65
      Colin Ian King authored
      Colin King is working on kernel janitorial fixes in his spare time and
      using his Intel email is confusing.  Use his gmail account as the default
      email address.
      
      Link: https://lkml.kernel.org/r/20220817212753.101109-1-colin.i.king@gmail.com
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • asm-generic: sections: refactor memory_intersects · 0c7d7cc2
      Quanyang Wang authored
      There are two problems with the current code of memory_intersects:
      
      First, it doesn't check whether the region (begin, end) falls inside the
      region (virt, vend), that is (virt < begin && vend > end).
      
      Second, if vend is equal to begin, it returns true, but this is wrong
      since vend (virt + size) is not the last address of the memory region;
      (virt + size - 1) is.  The wrong result triggers a false report when
      check_for_illegal_area calls memory_intersects to check whether the DMA
      region intersects with the stext region.
      
      The misreporting is as below (stext is at 0x80100000):
       WARNING: CPU: 0 PID: 77 at kernel/dma/debug.c:1073 check_for_illegal_area+0x130/0x168
       DMA-API: chipidea-usb2 e0002000.usb: device driver maps memory from kernel text or rodata [addr=800f0000] [len=65536]
       Modules linked in:
       CPU: 1 PID: 77 Comm: usb-storage Not tainted 5.19.0-yocto-standard #5
       Hardware name: Xilinx Zynq Platform
        unwind_backtrace from show_stack+0x18/0x1c
        show_stack from dump_stack_lvl+0x58/0x70
        dump_stack_lvl from __warn+0xb0/0x198
        __warn from warn_slowpath_fmt+0x80/0xb4
        warn_slowpath_fmt from check_for_illegal_area+0x130/0x168
        check_for_illegal_area from debug_dma_map_sg+0x94/0x368
        debug_dma_map_sg from __dma_map_sg_attrs+0x114/0x128
        __dma_map_sg_attrs from dma_map_sg_attrs+0x18/0x24
        dma_map_sg_attrs from usb_hcd_map_urb_for_dma+0x250/0x3b4
        usb_hcd_map_urb_for_dma from usb_hcd_submit_urb+0x194/0x214
        usb_hcd_submit_urb from usb_sg_wait+0xa4/0x118
        usb_sg_wait from usb_stor_bulk_transfer_sglist+0xa0/0xec
        usb_stor_bulk_transfer_sglist from usb_stor_bulk_srb+0x38/0x70
        usb_stor_bulk_srb from usb_stor_Bulk_transport+0x150/0x360
        usb_stor_Bulk_transport from usb_stor_invoke_transport+0x38/0x440
        usb_stor_invoke_transport from usb_stor_control_thread+0x1e0/0x238
        usb_stor_control_thread from kthread+0xf8/0x104
        kthread from ret_from_fork+0x14/0x2c
      
      Refactor memory_intersects to fix the two problems above.
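      
      The two problems can be reproduced in a userspace sketch.  The
      "old" predicate below is reconstructed from the description above
      rather than copied from the tree, and the "new" one is the usual
      half-open interval overlap test; both use uintptr_t instead of the
      kernel's void pointers.  The function names and the 0x80200000 end
      address used in the usage note are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-past-the-end address of [virt, virt + size), with an overflow guard. */
static inline uintptr_t range_end(uintptr_t virt, size_t size)
{
	assert(size <= UINTPTR_MAX - virt);
	return virt + size;
}

/*
 * Endpoint-only check: true if either virt or vend lies in [begin, end).
 * It misses a region that fully encloses [begin, end) and wrongly
 * accepts vend == begin.
 */
static inline bool intersects_old(uintptr_t begin, uintptr_t end,
				  uintptr_t virt, size_t size)
{
	uintptr_t vend = range_end(virt, size);

	return (virt >= begin && virt < end) || (vend >= begin && vend < end);
}

/*
 * Half-open overlap test: [begin, end) and [virt, vend) intersect
 * exactly when each range starts before the other one ends.
 */
static inline bool intersects_new(uintptr_t begin, uintptr_t end,
				  uintptr_t virt, size_t size)
{
	uintptr_t vend = range_end(virt, size);

	return virt < end && vend > begin;
}
```

      With stext at 0x80100000 and a hypothetical section end of 0x80200000,
      the DMA region from the warning (addr=800f0000, len=65536) ends exactly
      at stext: the endpoint-only check reports an intersection and the
      overlap test does not.  A region that starts below stext and ends above
      the section end is missed by the endpoint-only check and caught by the
      overlap test.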
      
      Before commit 1d7db834 ("dma-debug: use memory_intersects()
      directly"), memory_intersects was called only by printk_late_init:
      
      printk_late_init -> init_section_intersects -> memory_intersects.
      
      There were few places where memory_intersects was called.
      
      When commit 1d7db834 ("dma-debug: use memory_intersects()
      directly") was merged and CONFIG_DMA_API_DEBUG is enabled, the DMA
      subsystem uses it to check for an illegal area and the calltrace above
      is triggered.
      
      [akpm@linux-foundation.org: fix nearby comment typo]
      Link: https://lkml.kernel.org/r/20220819081145.948016-1-quanyang.wang@windriver.com
      Fixes: 97955936 ("asm/sections: add helpers to check for section data")
      Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>