1. 12 Oct, 2022 2 commits
    • mm/compaction: fix set skip in fast_find_migrateblock · 7efc3b72
      Chuyi Zhou authored
      When we successfully find a pageblock in fast_find_migrateblock(), the
      pageblock has its skip flag set through set_pageblock_skip().  However,
      when entering isolate_migratepages_block(), the whole pageblock is then
      skipped due to the branch 'if (!valid_page && IS_ALIGNED(low_pfn,
      pageblock_nr_pages))'.  Eventually we go to isolate_abort and isolate
      nothing, which makes fast_find_migrateblock() useless.
      
      In this patch, when we find a suitable pageblock in
      fast_find_migrateblock(), we do nothing and instead let
      isolate_migratepages_block() set the skip flag on the pageblock after
      scanning it.  Normally, we would isolate some pages from the fast-find
      block.
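
      The interaction can be sketched as follows (an illustrative, simplified
      rendering of the early-abort branch quoted above, not the exact kernel
      code):

         /* isolate_migratepages_block(), simplified: if the fast finder has
          * already set the pageblock skip bit, isolation_suitable() fails on
          * the first aligned pfn and the whole block is abandoned. */
         if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
                 if (!isolation_suitable(cc, page))
                         goto isolate_abort;     /* nothing isolated */
                 valid_page = page;
         }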
      
      I used mmtest/thpscale-madvhugepage to test it.  Here are the results:
                                  baseline               patch
      Amean     fault-both-1      1331.66 (   0.00%)     1261.04 *   5.30%*
      Amean     fault-both-3      1383.95 (   0.00%)     1191.69 *  13.89%*
      Amean     fault-both-5      1568.13 (   0.00%)     1445.20 *   7.84%*
      Amean     fault-both-7      1819.62 (   0.00%)     1555.13 *  14.54%*
      Amean     fault-both-12     1106.96 (   0.00%)     1149.43 *  -3.84%*
      Amean     fault-both-18     2196.93 (   0.00%)     1875.77 *  14.62%*
      Amean     fault-both-24     2642.69 (   0.00%)     2671.21 *  -1.08%*
      Amean     fault-both-30     2901.89 (   0.00%)     2857.32 *   1.54%*
      Amean     fault-both-32     3747.00 (   0.00%)     3479.23 *   7.15%*
      
      Link: https://lkml.kernel.org/r/20220713062009.597255-1-zhouchuyi@bytedance.com
      Fixes: 70b44595 ("mm, compaction: use free lists to quickly locate a migration source")
      Signed-off-by: zhouchuyi <zhouchuyi@bytedance.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() static · acfac378
      Andrew Morton authored
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 07 Oct, 2022 5 commits
    • hugetlb: allocate vma lock for all sharable vmas · bbff39cc
      Mike Kravetz authored
      The hugetlb vma lock was originally designed to synchronize pmd sharing. 
      As such, it was only necessary to allocate the lock for vmas that were
      capable of pmd sharing.  Later in the development cycle, it was discovered
      that it could also be used to simplify fault/truncation races as described
      in [1].  However, a subsequent change to allocate the lock for all vmas
      that use the page cache was never made.  A fault/truncation race could
      leave pages in a file past i_size until the file is removed.
      
      Remove the previous restriction and allocate the lock for all VM_MAYSHARE
      vmas.  Warn in the unlikely event of allocation failure.
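
      A hedged sketch of the unconditional allocation for sharable vmas (the
      function and field names here are illustrative assumptions, not
      necessarily the exact ones in the patch):

         static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
         {
                 struct hugetlb_vma_lock *vma_lock;

                 /* Only VM_MAYSHARE (page-cache backed) vmas need the lock. */
                 if (!(vma->vm_flags & VM_MAYSHARE) || vma->vm_private_data)
                         return;

                 vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
                 if (!vma_lock) {
                         /* Unlikely; warn so the failure is visible. */
                         WARN_ON_ONCE(1);
                         return;
                 }

                 init_rwsem(&vma_lock->rw_sema);
                 vma_lock->vma = vma;
                 vma->vm_private_data = vma_lock;
         }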
      
      [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
      Fixes: "hugetlb: clean up code checking for fault/truncation races"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer · ecfbd733
      Mike Kravetz authored
      hugetlb file truncation/hole punch code may need to back out and take the
      locks in the correct order in the routine hugetlb_unmap_file_folio().
      This code could race with vma freeing, as pointed out in [1], and result
      in accessing a stale vma pointer.  To address this, take the vma_lock when
      clearing the vma_lock->vma pointer.
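
      A minimal sketch of the idea (the rw_sema field name is an assumption for
      illustration):

         /* Clear the back-pointer only while holding the lock itself, so a
          * racing hugetlb_unmap_file_folio() cannot observe a freed vma. */
         down_write(&vma_lock->rw_sema);
         vma_lock->vma = NULL;
         vma->vm_private_data = NULL;
         up_write(&vma_lock->rw_sema);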
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      [mike.kravetz@oracle.com: address build issues]
        Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
      Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
      Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: fix vma lock handling during split vma and range unmapping · 131a79b4
      Mike Kravetz authored
      Patch series "hugetlb: fixes for new vma lock series".
      
      In review of the series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", Miaohe Lin pointed out two key issues:
      
      1) There is a race in the routine hugetlb_unmap_file_folio when locks
         are dropped and reacquired in the correct order [1].
      
      2) With the switch to using vma lock for fault/truncate synchronization,
         we need to make sure lock exists for all VM_MAYSHARE vmas, not just
         vmas capable of pmd sharing.
      
      These two issues are addressed here.  In addition, having a vma lock
      present in all VM_MAYSHARE vmas uncovered some issues around vma
      splitting.  Those are also addressed.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      
      This patch (of 3):
      
      The hugetlb vma lock hangs off the vm_private_data field and is specific
      to the vma.  When vm_area_dup() is called as part of vma splitting, the
      vma lock pointer is copied to the new vma.  This will result in issues
      such as double freeing of the structure.  Update the hugetlb open vm_ops
      to allocate a new vma lock for the new vma.
      
      The routine __unmap_hugepage_range_final unconditionally unsets VM_MAYSHARE
      to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
      anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
      only VM_MAYSHARE was set we would miss the free.  With the introduction of
      the vma lock, a vma cannot participate in pmd sharing if vm_private_data
      is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
      free the vma lock to prevent sharing.  Also, update the sharing code to
      make sure the vma lock is indeed a condition for pmd sharing.
      hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
      Fixes: "hugetlb: add vma based lock for pmd sharing"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mglru: don't sync disk for each aging cycle · 14aa8b2d
      Yu Zhao authored
      wakeup_flusher_threads() was added under the assumption that if a system
      runs out of clean cold pages, it might want to write back dirty pages more
      aggressively so that they can become clean and be dropped.
      
      However, doing so can breach the rate limit a system wants to impose on
      writeback, resulting in early SSD wearout.
      
      Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 03 Oct, 2022 33 commits
    • mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol · e55b9f96
      Johannes Weiner authored
      Since 2d1c4980 ("mm: memcontrol: make swap tracking an integral part
      of memory control"), CONFIG_MEMCG_SWAP hasn't been a user-visible config
      option anymore, it just means CONFIG_MEMCG && CONFIG_SWAP.
      
      Update the sites accordingly and drop the symbol.
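
      For illustration, a guarded site would change along these lines (a
      schematic before/after, not a specific hunk from the patch):

         /* before */
         #ifdef CONFIG_MEMCG_SWAP
         /* ... swap-accounting declarations ... */
         #endif

         /* after: the same condition, spelled out */
         #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
         /* ... swap-accounting declarations ... */
         #endif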
      
      [ While touching the docs, remove two references to CONFIG_MEMCG_KMEM,
        which hasn't been a user-visible symbol for over half a decade. ]
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: use do_memsw_account() in a few more places · b94c4e94
      Johannes Weiner authored
      It's slightly more descriptive and consistent with other places that
      distinguish cgroup1's combined memory+swap accounting scheme from
      cgroup2's dedicated swap accounting.
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: deprecate swapaccounting=0 mode · b25806dc
      Johannes Weiner authored
      The swapaccounting= commandline option already does very little today.  To
      close a trivial containment failure case, the swap ownership tracking part
      of the swap controller has recently become mandatory (see commit
      2d1c4980 ("mm: memcontrol: make swap tracking an integral part of
      memory control") for details), which makes up the majority of the work
      during swapout, swapin, and the swap slot map.
      
      The only thing left under this flag is the page_counter operations and the
      visibility of the swap control files in the first place, which are rather
      meager savings.  There also aren't many scenarios, if any, where
      controlling the memory of a cgroup while allowing it unlimited access to a
      global swap space is a workable resource isolation strategy.
      
      On the other hand, there have been several bugs and confusion around the
      many possible swap controller states (cgroup1 vs cgroup2 behavior, memory
      accounting without swap accounting, memcg runtime disabled).
      
      This puts the maintenance overhead of retaining the toggle above its
      practical benefits.  Deprecate it.
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled · c91bdc93
      Johannes Weiner authored
      Patch series "memcg swap fix & cleanups".
      
      
      This patch (of 4):
      
      Since commit 2d1c4980 ("mm: memcontrol: make swap tracking an integral
      part of memory control"), the cgroup swap arrays are used to track memory
      ownership at the time of swap readahead and swapoff, even if swap space
      *accounting* has been turned off by the user via swapaccount=0 (which sets
      cgroup_memory_noswap).
      
      However, the patch was overzealous: by simply dropping the
      cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge
      paths, it caused the cgroup arrays to be allocated even when the memory
      controller as a whole is disabled.  This is a waste of that memory.
      
      Restore mem_cgroup_disabled() checks, implied previously by
      cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free
      callbacks.
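
      Conceptually, each of those callbacks regains an early bail-out of this
      shape (a hedged sketch of the check, not the exact hunks):

         /* No memcg at all, so no per-swap-device ownership array is needed. */
         if (mem_cgroup_disabled())
                 return 0;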
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-1-hannes@cmpxchg.org
      Link: https://lkml.kernel.org/r/20220926135704.400818-2-hannes@cmpxchg.org
      Fixes: 2d1c4980 ("mm: memcontrol: make swap tracking an integral part of memory control")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/secretmem: remove reduntant return value · f7c5b1aa
      Xiu Jianfeng authored
      The return value @ret is always 0, so remove it and return 0 directly.
      
      Link: https://lkml.kernel.org/r/20220920012205.246217-1-xiujianfeng@huawei.com
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: add available_huge_pages() func · 8346d69d
      Xin Hao authored
      In hugetlb.c there are several places which compare the values of
      'h->free_huge_pages' and 'h->resv_huge_pages'.  It looks a bit messy, so
      add a new available_huge_pages() function to do this check in one place.
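
      A minimal sketch of such a helper, assuming it simply wraps the comparison
      mentioned above:

         static bool available_huge_pages(struct hstate *h)
         {
                 /* true if there are free huge pages beyond the reserves */
                 return h->free_huge_pages - h->resv_huge_pages > 0;
         }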
      
      Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove unused inline functions from include/linux/mm_inline.h · 6b91e5df
      Gaosheng Cui authored
      Remove the following unused inline functions from mm_inline.h:
      
      1.  All uses of add_page_to_lru_list_tail() have been removed since
         commit 7a3dbfe8 ("mm/swap: convert lru_deactivate_file to a
         folio_batch"), and it can be replaced by lruvec_add_folio_tail().
      
      2.  All uses of __clear_page_lru_flags() have been removed since commit
         188e8cae ("mm/swap: convert __page_cache_release() to use a
         folio"), and it can be replaced by __folio_clear_lru_flags().
      
      They are useless, so remove them.
      
      Link: https://lkml.kernel.org/r/20220922110935.1495099-1-cuigaosheng1@huawei.com
      Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory · 0f633baa
      Zach O'Keefe authored
      Add :collapse mod to userfaultfd selftest.  Currently this mod is only
      valid for "shmem" test type, but could be used for other test types.
      
      When provided, memory allocated by ->allocate_area() will be
      hugepage-aligned and enforced to be hugepage-sized.  userfaultfd_minor_test,
      after the UFFD-registered mapping has been populated by the UFFD minor fault
      handler, attempts to MADV_COLLAPSE the UFFD-registered mapping to collapse
      the memory into a pmd-mapped THP.
      
      This test is meant to be a functional test of what occurs during
      UFFD-driven live migration of VMs backed by huge tmpfs where, after a
      hugepage-sized region has been successfully migrated (in native page-sized
      chunks, to avoid the latency of fetching a hugepage over the network), we
      want to reclaim previous VM performance by remapping it at the PMD level.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-11-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-11-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd · 69d9428c
      Zach O'Keefe authored
      This test exercises MADV_COLLAPSE acting on file/shmem memory for which
      (1) the file extent mapped by the memory is already a huge page in the
      page cache, and (2) the pmd mapping this memory in the target process is
      none.

      In practice, (1)+(2) is the state left over after khugepaged has
      successfully collapsed file/shmem memory for a target VMA, but the memory
      has not yet been refaulted.  So, this test in effect tests MADV_COLLAPSE
      racing with khugepaged to collapse the memory first.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-10-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-10-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add thp collapse shmem testing · d0d35b60
      Zach O'Keefe authored
      Add memory operations for shmem (memfd) memory, and reuse existing tests
      with the new memory operations.
      
      Shmem tests can be called with "shmem" mem_type, and shmem tests are run
      with "all" mem_type as well.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-9-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-9-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add thp collapse file and tmpfs testing · 1b03d0d5
      Zach O'Keefe authored
      Add memory operations for file-backed and tmpfs memory.  Call existing
      tests with these new memory operations to test collapse functionality of
      khugepaged and MADV_COLLAPSE on file-backed and tmpfs memory.  Not all
      tests are reusable; for example, collapse_swapin_single_pte() which checks
      swap usage.
      
      Refactor test arguments.  Usage is now:
      
      Usage: ./khugepaged <test type> [dir]
      
              <test type>     : <context>:<mem_type>
              <context>       : [all|khugepaged|madvise]
              <mem_type>      : [all|anon|file]
      
              "file,all" mem_type requires [dir] argument
      
              "file,all" mem_type requires kernel built with
              CONFIG_READ_ONLY_THP_FOR_FS=y
      
              if [dir] is a (sub)directory of a tmpfs mount, tmpfs must be
              mounted with huge=madvise option for khugepaged tests to work
      
      Refactor calling tests to make it clear what collapse context / memory
      operations they support, but only invoke tests requested by the user.  Also
      log what test is being run, and with what context / memory, to make test
      logs more human readable.
      
      A new test file is created and deleted for every test to ensure no pages
      remain in the page cache between tests (tests also may attempt to collapse
      different amounts of memory).
      
      For file-backed memory where the file is stored on a block device, disable
      /sys/block/<device>/queue/read_ahead_kb so that pages don't find their way
      into the page cache without the tests faulting them in.
      
      Add file and shmem wrappers to vm_util to check for file and shmem
      hugepages in smaps.
      
      [zokeefe@google.com: fix "add thp collapse file and tmpfs testing" for
        tmpfs]
        Link: https://lkml.kernel.org/r/20220913212517.3163701-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220907144521.3115321-8-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-8-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: modularize thp collapse memory operations · 8e638707
      Zach O'Keefe authored
      Modularize operations to setup, cleanup, fault, and check for huge pages,
      for a given memory type.  This allows reusing existing tests with
      additional memory types by defining new memory operations.  Following
      patches will add file and shmem memory types.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-7-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-7-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: dedup THP helpers · c07c343c
      Zach O'Keefe authored
      These files:
      
      tools/testing/selftests/vm/vm_util.c
      tools/testing/selftests/vm/khugepaged.c
      
      Both contain logic to:
      
      1) Determine hugepage size on current system
      2) Read /proc/self/smaps to determine number of THPs at an address
      
      Refactor selftests/vm/khugepaged.c to use the vm_util common helpers and
      add it as a build dependency.
      
      Since selftests/vm/khugepaged.c is the largest user of check_huge(),
      change the signature of check_huge() to match selftests/vm/khugepaged.c's
      usage: take an expected number of hugepages, and return a bool indicating
      whether the correct number of hugepages was found.  Add a wrapper,
      check_huge_anon(), in anticipation of checking smaps for file and shmem
      hugepages.
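
      Illustratively, the anon wrapper could look like this (a hedged sketch;
      the exact parameter list and the shared smaps-parsing helper
      __check_huge() are assumptions for illustration):

         /* Returns true iff exactly nr_hp anonymous hugepages back addr. */
         bool check_huge_anon(void *addr, int nr_hp, uint64_t hpage_size)
         {
                 return __check_huge(addr, "AnonHugePages: ", nr_hp, hpage_size);
         }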
      
      Update existing callsites to use the new pattern / function.
      
      Likewise, check_for_pattern() was duplicated, and it's a general enough
      helper to include in vm_util helpers as well.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-6-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-6-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: add tracepoint to hpage_collapse_scan_file() · d41fd201
      Zach O'Keefe authored
      Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
      hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
      
      While this change is targeted at debugging the MADV_COLLAPSE pathway, the
      "mm_khugepaged" prefix is retained for symmetry with
      huge_memory:trace_mm_khugepaged_scan_pmd, which retains its legacy name
      to avoid changing the kernel ABI as much as possible.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-5-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-5-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: add file and shmem support to MADV_COLLAPSE · 34488399
      Zach O'Keefe authored
      Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
      memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
      
      On success, the backing memory will be a hugepage.  For the memory range
      and process provided, the page tables will synchronously have a huge pmd
      installed, mapping the THP.  Other mappings of the file extent mapped by
      the memory range may be added to a set of entries that khugepaged will
      later process and attempt to update their page tables to map the THP by a
      pmd.
      
      This functionality unlocks two important uses:
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system, which might keep services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevent
      	page sharing and demand paging, both of which increase steady-state
      	memory footprint.  Now, we can have the best of both worlds: peak
      	upfront performance and lower RAM footprints.
      
      (2)	userfaultfd-based live migration of virtual machines satisfies UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	the latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.
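
      For example, a userspace caller might request the collapse of an
      already-mapped, hugepage-aligned range like this (a hedged sketch;
      MADV_COLLAPSE is assumed to come from the installed uapi headers, with a
      fallback define for older ones):

         #include <stdio.h>
         #include <sys/mman.h>

         #ifndef MADV_COLLAPSE
         #define MADV_COLLAPSE 25        /* fallback if headers are older */
         #endif

         /* Try to back an already-mapped, hugepage-aligned range by a THP. */
         static int collapse_range(void *addr, size_t len)
         {
                 if (madvise(addr, len, MADV_COLLAPSE)) {
                         perror("madvise(MADV_COLLAPSE)");
                         return -1;
                 }
                 return 0;
         }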
      
      Since khugepaged is single threaded, this change now introduces the
      possibility of collapse contexts racing in the file collapse path.  There
      are a few important places to consider:
      
      (1)	hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
      	We could have the memory collapsed out from under us, but
      	the next xas_for_each() iteration will correctly pick up the
      	hugepage.  The hugepage might not be up to date (insofar as
      	copying of small page contents might not have completed - the
      	page still may be locked), but regardless what small page index
      	we were iterating over, we'll find the hugepage and identify it
      	as a suitably aligned compound page of order HPAGE_PMD_ORDER.
      
      	In the khugepaged path, we locklessly check the value of the pmd,
      	and only add it to the deferred collapse array if we find a pmd
      	mapping a pte table.  This is fine, since other values that could
      	have raced in right afterwards denote failure, or that the
      	memory was successfully collapsed, so we don't need further
      	processing.
      
      	In the madvise path, we'll take mmap_lock in write mode to serialize
      	against page table updates and will know what to do based on the
      	true value of the pmd: recheck all ptes if we point to a pte table;
      	directly install the pmd if the pmd has been cleared but the
      	memory not yet faulted; or do nothing at all if we find a huge pmd.
      
      	It's worth putting emphasis here on how we treat the none pmd
      	here.  If khugepaged has processed this mm's page tables
      	already, it will have left the pmd cleared (ready for refault by
      	the process).  Depending on the VMA flags and sysfs settings, the
      	amount of RAM on the machine, and the current load, this could be a
      	relatively common occurrence - and as such is one we'd like to
      	handle successfully in MADV_COLLAPSE.  When we see the none pmd
      	in collapse_pte_mapped_thp(), we've locked mmap_lock in write
      	and checked (a) hugepage_vma_check() to see if the backing
      	memory is still appropriate, along with VMA sizing and
      	appropriate hugepage alignment within the file, and (b) we've
      	found a hugepage head of order HPAGE_PMD_ORDER at the offset
      	in the file mapped by our hugepage-aligned virtual address.
      	Even though the common case is likely a race with khugepaged,
      	given these checks (regardless of how we got here - we could be
      	operating on a completely different file than originally checked
      	in hpage_collapse_scan_file() for all we know) it should be safe
      	to directly make the pmd a huge pmd pointing to this hugepage.
      
      (2)	collapse_file() is mostly serialized on the same file extent by
      	lock sequence:
      
      		|	lock hugepage
      		|		lock mapping->i_pages
      		|			lock 1st page
      		|		unlock mapping->i_pages
      		|				<page checks>
      		|		lock mapping->i_pages
      		|				page_ref_freeze(3)
      		|				xas_store(hugepage)
      		|		unlock mapping->i_pages
      		|				page_ref_unfreeze(1)
      		|			unlock 1st page
      		V	unlock hugepage
      
      	Once a context (who already has their fresh hugepage locked)
      	locks mapping->i_pages exclusively, it will hold said lock
      	until it locks the first page, and it will hold that lock until
      	after the hugepage has been added to the page cache (and
      	will unlock the hugepage after page table update, though that
      	isn't important here).
      
      	A racing context that loses the race for mapping->i_pages will
      	then lose the race to locking the first page.  Here - depending
      	on how far the other racing context has gotten - we might find
      	the new hugepage (in which case we'll exit cleanly when we
      	check PageTransCompound()), or we'll find the "old" 1st small
      	page (in which we'll exit cleanly when we discover unexpected
      	refcount of 2 after isolate_lru_page()).  This is assuming we
      	are able to successfully lock the page we find - in shmem path,
      	we could just fail the trylock and exit cleanly anyways.
      
      	Failure path in collapse_file() is similar: once we hold lock
      	on 1st small page, we are serialized against other collapse
      	contexts.  Before the 1st small page is unlocked, we add it
      	back to the pagecache and unfreeze the refcount appropriately.
      	Contexts who lost the race to the 1st small page will then find
      	the same 1st small page with the correct refcount and will be
      	able to proceed.
      
      [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
        Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
      [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
      	check for multi-add in khugepaged_add_pte_mapped_thp()]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds · 58ac9a89
      Zach O'Keefe authored
      The main benefit of THPs is that they can be mapped at the pmd level,
      increasing the likelihood of TLB hits and spending fewer cycles in page
      table walks.  pte-mapped hugepages - that is - hugepage-aligned compound
      pages of order HPAGE_PMD_ORDER mapped by ptes - although being contiguous
      in physical memory, don't have this advantage.  In fact, one could argue
      they are detrimental to system performance overall since they occupy a
      precious hugepage-aligned/sized region of physical memory that could
      otherwise be used more effectively.  Additionally, pte-mapped hugepages
      can be the cheapest memory to collapse for khugepaged since no new
      hugepage allocation or copying of memory contents is necessary - we only
      need to update the mapping page tables.
      
      In the anonymous collapse path, we are able to collapse pte-mapped
      hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
      effort when compound pages (of any order) are encountered.
      
      Identify pte-mapped hugepages in the file/shmem collapse path.  The
      final step of this makes a racy check of the value of the pmd to
      ensure it maps a pte table.  This should be fine, since races that
      result in false-positives (i.e.  attempting collapse even though we
      shouldn't) will fail later in collapse_pte_mapped_thp() once we
      actually lock mmap_lock and reinspect the pmd value.  Races that result
      in false-negatives (i.e.  where we decide not to attempt collapse, but
      should have) shouldn't be an issue, since in the worst case, we do
      nothing - which is what we've done up to this point.  We make a similar
      check in retract_page_tables().  If we do think we've found a
      pte-mapped hugepage in khugepaged context, attempt to update the page
      tables mapping this hugepage.
      
      Note that these collapses still count towards the
      /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
      and if the pte-mapped hugepage was also mapped into multiple processes'
      address spaces, it could be incremented for each page table update.  Since we
      increment the counter when a pte-mapped hugepage is successfully added to
      the list of to-collapse pte-mapped THPs, it's possible that we never
      actually update the page table either.  This is different from how
      file/shmem pages_collapsed accounting works today where only a successful
      page cache update is counted (it's also possible here that no page tables
      are actually changed).  Though it incurs some slop, this is preferred to
      either not accounting for the event at all, or plumbing through data in
      struct mm_slot on whether to account for the collapse or not.
      
      Also note that work still needs to be done to support arbitrary compound
      pages, and that this should all be converted to using folios.
      
      [shy828301@gmail.com: Spelling mistake, update comment, and add Documentation]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-3-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-3-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() · 7c6c6cc4
      Zach O'Keefe authored
      Patch series "mm: add file/shmem support to MADV_COLLAPSE", v4.
      
      This series builds on top of the previous "mm: userspace hugepage
      collapse" series which introduced the MADV_COLLAPSE madvise mode and added
      support for private, anonymous mappings[2], by adding support for file and
      shmem backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.
      
      File and shmem support have been added with effort to align with existing
      MADV_COLLAPSE semantics and policy decisions[3].  Collapse of shmem-backed
      memory ignores kernel-guiding directives and heuristics including all
      sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount
      options (shmem always supports large folios).  Like anonymous mappings, on
      successful return of MADV_COLLAPSE on file/shmem memory, the contents of
      memory mapped by the addresses provided will be synchronously pmd-mapped
      THPs.
      
      This functionality unlocks two important uses:
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system, which might keep services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevent
      	page sharing and demand paging, both of which increase steady-state
      	memory footprint.  Now, we can have the best of both worlds: peak
      	upfront performance and lower RAM footprints.
      
      (2)	userfaultfd-based live migration of virtual machines satisfies UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	the latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.
      
      khugepaged has received a small improvement by association and can now
      detect and collapse pte-mapped THPs.  However, there is still work to be
      done along the file collapse path.  Compound pages of arbitrary order
      still need to be supported and THP collapse needs to be converted to
      using folios in general.  Eventually, we'd like to move away from the
      read-only and executable-mapped constraints currently imposed on eligible
      files and support any inode claiming huge folio support.  That said, I
      think the series as-is covers enough to claim that MADV_COLLAPSE supports
      file/shmem memory.
      
      Patches 1-3	Implement the guts of the series.
      Patch 4 	Is a tracepoint for debugging.
      Patches 5-9 	Refactor existing khugepaged selftests to work with new
      		memory types + new collapse tests.
      Patch 10 	Adds a userfaultfd selftest mode to mimic a functional test
      		of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
      		(v4 note: "userfaultfd shmem" selftest is failing as of
      		Sep 22 mm-unstable)
      
      [1] https://lore.kernel.org/linux-mm/YyiK8YvVcrtZo0z3@google.com/
      [2] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
      [3] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/
      [4] https://lore.kernel.org/linux-mm/20220922222731.1124481-1-zokeefe@google.com/
      [5] https://lore.kernel.org/linux-mm/20220922184651.1016461-1-zokeefe@google.com/
      
      
      This patch (of 10):
      
      Extend 'mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()' to
      shmem, allowing callers to ignore
      /sys/kernel/mm/transparent_hugepage/shmem_enabled and the tmpfs huge= mount
      option.
      
      This is intended to be used by MADV_COLLAPSE, and the rationale is
      analogous to the anon/file case: MADV_COLLAPSE is not coupled to
      directives that advise the kernel's decisions on when THPs should be
      considered eligible.  shmem/tmpfs always claims large folio support,
      regardless of sysfs or mount options.
      
      [shy828301@gmail.com: test shmem_huge_force explicitly]
        Link: https://lore.kernel.org/linux-mm/CAHbLzko3A5-TpS0BgBeKkx5cuOkWgLvWXQH=TdgW-baO4rPtdg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220922224046.1143204-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220907144521.3115321-2-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-2-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: retry on EAGAIN for MADV_COLLAPSE selftest · 3505c8e6
      Zach O'Keefe authored
      MADV_COLLAPSE is a best-effort request that will set errno to an
      actionable value if the request cannot be performed.
      
      For example, if pages are not found on the LRU, or if they are currently
      locked by something else, MADV_COLLAPSE will fail and set errno to EAGAIN
      to inform callers that they may try again.
      
      Since the khugepaged selftest is the first public use of MADV_COLLAPSE,
      set a best practice of checking errno and retrying on EAGAIN.
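
      A hedged sketch of that retry pattern (the retry count is arbitrary and
      MADV_COLLAPSE is assumed to be provided by the installed headers):

         #include <errno.h>
         #include <sys/mman.h>

         /* Retry MADV_COLLAPSE a few times while the kernel reports EAGAIN. */
         static int madvise_collapse_retry(void *addr, size_t len, int retries)
         {
                 int ret;

                 do {
                         ret = madvise(addr, len, MADV_COLLAPSE);
                 } while (ret && errno == EAGAIN && retries-- > 0);

                 return ret;
         }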
      
      Link: https://lkml.kernel.org/r/20220922184651.1016461-2-zokeefe@google.com
      Fixes: 9330694d ("selftests/vm: add MADV_COLLAPSE collapse context to selftests")
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: MADV_COLLAPSE return EAGAIN when page cannot be isolated · 0f3e2a2c
      Zach O'Keefe authored
      MADV_COLLAPSE is a best-effort request that attempts to set an actionable
      errno value if the request cannot be fulfilled at the time.  EAGAIN should
      be used to communicate that a resource was temporarily unavailable, but
      that the user may try again immediately.
      
      SCAN_DEL_PAGE_LRU is an internal result code used when a page cannot be
      isolated from its LRU list.  Since this, like SCAN_PAGE_LRU, is likely a
      transitory state, make MADV_COLLAPSE return EAGAIN so that users know they
      may reattempt the operation.
      
      Another important scenario to consider is a race with khugepaged.
      khugepaged might isolate a page while MADV_COLLAPSE is interested in it. 
      Even though racing with khugepaged might mean that the memory has already
      been collapsed, signalling an errno that is non-intrinsic to that memory
      or arguments provided to madvise(2) lets the user know that future
      attempts might (and in this case likely would) succeed, and avoids
      false-negative assumptions by the user.
      
      Link: https://lkml.kernel.org/r/20220922184651.1016461-1-zokeefe@google.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: check compound_order() in collapse_pte_mapped_thp() · 780a4b6f
      Zach O'Keefe authored
      By the time we lock a page in collapse_pte_mapped_thp(), the page mapped
      by the address pushed onto the slot's .pte_mapped_thp[] array might have
      changed arbitrarily since we last looked at it.  We revalidate that the
      page is still the head of a compound page, but we don't revalidate that the
      compound page is of order HPAGE_PMD_ORDER before applying rmap and page
      table updates.
      
      Since the kernel now supports large folios of arbitrary order, and since
      replacing a page's pte mappings by a pmd mapping only makes sense for
      compound pages of order HPAGE_PMD_ORDER, revalidate that the compound
      order is indeed HPAGE_PMD_ORDER before proceeding.
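
      The added revalidation boils down to a check of roughly this shape (a
      hedged sketch, not the exact hunk):

         /* Only a PMD-order compound head can be remapped by a huge pmd. */
         if (!PageHead(page) || compound_order(page) != HPAGE_PMD_ORDER)
                 goto out;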
      
      Link: https://lore.kernel.org/linux-mm/CAHbLzkon+2ky8v9ywGcsTUgXM_B35jt5NThYqQKXW2YV_GUacw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220922222731.1124481-1-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb: fix UAF in hugetlb_handle_userfault · 958f32ce
      Liu Shixin authored
      The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
      and reacquired again after handle_userfault(), but reacquiring the
      vma_lock could lead to a UAF [1,2] due to the following race:
      
      hugetlb_fault
        hugetlb_no_page
          /*unlock vma_lock */
          hugetlb_handle_userfault
            handle_userfault
              /* unlock mm->mmap_lock*/
                                                 vm_mmap_pgoff
                                                   do_mmap
                                                     mmap_region
                                                       munmap_vma_range
                                                         /* clean old vma */
              /* lock vma_lock again  <--- UAF */
          /* unlock vma_lock */
      
      Since the caller unlocks the vma_lock immediately after
      hugetlb_handle_userfault() returns anyway, drop the unneeded lock and
      unlock in hugetlb_handle_userfault() to fix the issue.
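
      A hedged sketch of the fixed flow (names follow the existing hugetlb
      fault path, but this is illustrative rather than the exact diff): drop
      both locks before calling handle_userfault() and never touch the
      possibly-freed vma_lock again on the way out.

         hugetlb_vma_unlock_read(vma);
         hash = hugetlb_fault_mutex_hash(mapping, idx);
         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
         return handle_userfault(&vmf, reason);   /* no re-lock after this */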
      
      [1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
      [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
      Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
      Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
      Signed-off-by: default avatarLiu Shixin <liushixin2@huawei.com>
      Signed-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
      Reported-by: default avatarLiu Zixian <liuzixian4@huawei.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      958f32ce
    • Kairui Song's avatar
      mm: memcontrol: make cgroup_memory_noswap a static key · c1b8fdae
      Kairui Song authored
      cgroup_memory_noswap is used on many hot paths, so make it a static key
      to lower the kernel overhead.
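
      The core of the change is turning the boolean into a jump label so that
      the hot-path check compiles down to a patched branch instead of a memory
      load.  A minimal, hedged sketch (key and helper names are illustrative):

         static DEFINE_STATIC_KEY_FALSE(memcg_swap_enabled_key);

         static inline bool memcg_swap_enabled(void)
         {
            return static_branch_likely(&memcg_swap_enabled_key);
         }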
      
      Using 8G of ZRAM as SWAP, benchmark with `perf stat -d -d -d --repeat 100`
      running the following code snippet in a non-root cgroup:
      
         #include <stdio.h>
         #include <string.h>
         #include <linux/mman.h>
         #include <sys/mman.h>
         #define MB (1024UL * 1024UL)
         int main(int argc, char **argv){
            /* 8000 MB of private anonymous memory */
            void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
               return 1;
            memset(p, 0xff, 8000 * MB);           /* populate all pages */
            madvise(p, 8000 * MB, MADV_PAGEOUT);  /* push them out to (ZRAM) swap */
            memset(p, 0xff, 8000 * MB);           /* touch again: swap-in path */
            return 0;
         }
      
      Before:
                7,021.43 msec task-clock                #    0.967 CPUs utilized            ( +-  0.03% )
                   4,010      context-switches          #  573.853 /sec                     ( +-  0.01% )
                       0      cpu-migrations            #    0.000 /sec
               2,052,057      page-faults               #  293.661 K/sec                    ( +-  0.00% )
          12,616,546,027      cycles                    #    1.805 GHz                      ( +-  0.06% )  (39.92%)
             156,823,666      stalled-cycles-frontend   #    1.25% frontend cycles idle     ( +-  0.10% )  (40.25%)
             310,130,812      stalled-cycles-backend    #    2.47% backend cycles idle      ( +-  4.39% )  (40.73%)
          18,692,516,591      instructions              #    1.49  insn per cycle
                                                        #    0.01  stalled cycles per insn  ( +-  0.04% )  (40.75%)
           4,907,447,976      branches                  #  702.283 M/sec                    ( +-  0.05% )  (40.30%)
              13,002,578      branch-misses             #    0.26% of all branches          ( +-  0.08% )  (40.48%)
           7,069,786,296      L1-dcache-loads           #    1.012 G/sec                    ( +-  0.03% )  (40.32%)
             649,385,847      L1-dcache-load-misses     #    9.13% of all L1-dcache accesses  ( +-  0.07% )  (40.10%)
           1,485,448,688      L1-icache-loads           #  212.576 M/sec                    ( +-  0.15% )  (39.49%)
              31,628,457      L1-icache-load-misses     #    2.13% of all L1-icache accesses  ( +-  0.40% )  (39.57%)
               6,667,311      dTLB-loads                #  954.129 K/sec                    ( +-  0.21% )  (39.50%)
               5,668,555      dTLB-load-misses          #   86.40% of all dTLB cache accesses  ( +-  0.12% )  (39.03%)
                     765      iTLB-loads                #  109.476 /sec                     ( +- 21.81% )  (39.44%)
               4,370,351      iTLB-load-misses          # 214320.09% of all iTLB cache accesses  ( +-  1.44% )  (39.86%)
             149,207,254      L1-dcache-prefetches      #   21.352 M/sec                    ( +-  0.13% )  (40.27%)
      
                 7.25869 +- 0.00203 seconds time elapsed  ( +-  0.03% )
      
      After:
                6,576.16 msec task-clock                #    0.953 CPUs utilized            ( +-  0.10% )
                   4,020      context-switches          #  605.595 /sec                     ( +-  0.01% )
                       0      cpu-migrations            #    0.000 /sec
               2,052,056      page-faults               #  309.133 K/sec                    ( +-  0.00% )
          11,967,619,180      cycles                    #    1.803 GHz                      ( +-  0.36% )  (38.76%)
             161,259,240      stalled-cycles-frontend   #    1.38% frontend cycles idle     ( +-  0.27% )  (36.58%)
             253,605,302      stalled-cycles-backend    #    2.16% backend cycles idle      ( +-  4.45% )  (34.78%)
          19,328,171,892      instructions              #    1.65  insn per cycle
                                                        #    0.01  stalled cycles per insn  ( +-  0.10% )  (31.46%)
           5,213,967,902      branches                  #  785.461 M/sec                    ( +-  0.18% )  (30.68%)
              12,385,170      branch-misses             #    0.24% of all branches          ( +-  0.26% )  (34.13%)
           7,271,687,822      L1-dcache-loads           #    1.095 G/sec                    ( +-  0.12% )  (35.29%)
             649,873,045      L1-dcache-load-misses     #    8.93% of all L1-dcache accesses  ( +-  0.11% )  (41.41%)
           1,950,037,608      L1-icache-loads           #  293.764 M/sec                    ( +-  0.33% )  (43.11%)
              31,365,566      L1-icache-load-misses     #    1.62% of all L1-icache accesses  ( +-  0.39% )  (45.89%)
               6,767,809      dTLB-loads                #    1.020 M/sec                    ( +-  0.47% )  (48.42%)
               6,339,590      dTLB-load-misses          #   95.43% of all dTLB cache accesses  ( +-  0.50% )  (46.60%)
                     736      iTLB-loads                #  110.875 /sec                     ( +-  1.79% )  (48.60%)
               4,314,836      iTLB-load-misses          # 518653.73% of all iTLB cache accesses  ( +-  0.63% )  (42.91%)
             144,950,156      L1-dcache-prefetches      #   21.836 M/sec                    ( +-  0.37% )  (41.39%)
      
                 6.89935 +- 0.00703 seconds time elapsed  ( +-  0.10% )
      
      The performance is clearly better.  There is no single significant hotspot
      improvement according to perf report; rather, there are quite a few
      callers of memcg_swap_enabled and do_memsw_account (which calls
      memcg_swap_enabled), and many small optimizations add up to lower
      branch-predictor overhead and better overall performance.
      
      Link: https://lkml.kernel.org/r/20220919180634.45958-3-ryncsn@gmail.com
      Signed-off-by: default avatarKairui Song <kasong@tencent.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Acked-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c1b8fdae
    • Kairui Song's avatar
      mm: memcontrol: use memcg_kmem_enabled in count_objcg_event · 2eb98919
      Kairui Song authored
      Patch series "mm: memcontrol: cleanup and optimize for two accounting
      params", v2.
      
      
      This patch (of 2):
      
      There are currently two helpers for checking if cgroup kmem
      accounting is enabled:
      
      - mem_cgroup_kmem_disabled
      - memcg_kmem_enabled
      
      mem_cgroup_kmem_disabled is a simple helper that returns true
      if cgroup.memory=nokmem is specified, and false otherwise.

      memcg_kmem_enabled is a bit different: it returns true if
      cgroup.memory=nokmem is not specified and at least one non-root
      memory-controller-enabled cgroup has ever been created.  This helps
      improve performance when kmem accounting is not actually activated,
      and it is optimized with a static branch.

      mem_cgroup_kmem_disabled is meant for sub-systems that need to
      preallocate data for kmem accounting, since they could be initialized
      before kmem accounting is activated.  But count_objcg_event doesn't
      need that, so using memcg_kmem_enabled is better here.
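
      A hedged sketch of the resulting helper (close to, but not necessarily
      identical to, the actual patch):

         static inline void count_objcg_event(struct obj_cgroup *objcg,
                                              enum vm_event_item idx)
         {
            struct mem_cgroup *memcg;

            /* static-branch fast path; was mem_cgroup_kmem_disabled() */
            if (!memcg_kmem_enabled())
               return;

            rcu_read_lock();
            memcg = obj_cgroup_memcg(objcg);
            count_memcg_events(memcg, idx, 1);
            rcu_read_unlock();
         }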
      
      Link: https://lkml.kernel.org/r/20220919180634.45958-1-ryncsn@gmail.com
      Link: https://lkml.kernel.org/r/20220919180634.45958-2-ryncsn@gmail.com
      Signed-off-by: default avatarKairui Song <kasong@tencent.com>
      Acked-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Acked-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2eb98919
    • Kaixu Xia's avatar
      mm/damon: deduplicate damon_{reclaim,lru_sort}_apply_parameters() · 233f0b31
      Kaixu Xia authored
      The bodies of damon_{reclaim,lru_sort}_apply_parameters() contain
      duplicated code.  This commit adds a common function,
      damon_set_region_biggest_system_ram_default(), to remove the duplication.
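
      A hedged, illustrative sketch of how a caller can delegate to the shared
      helper (the signature and variable names here are assumptions, not the
      exact patch):

         /* in damon_reclaim_apply_parameters(), and similarly for lru_sort */
         err = damon_set_region_biggest_system_ram_default(target,
                           &monitor_region_start, &monitor_region_end);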
      
      Link: https://lkml.kernel.org/r/6329f00d.a70a0220.9bb29.3678SMTPIN_ADDED_BROKEN@mx.google.com
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Suggested-by: default avatarSeongJae Park <sj@kernel.org>
      Reviewed-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      233f0b31
    • Xin Hao's avatar
      mm/damon/sysfs: return 'err' value when call kstrtoul() failed · 30b6242c
      Xin Hao authored
      We had better return the 'err' value when kstrtoul() fails, so the user
      will know why the write really failed.  It is a small change: simply
      return 'err' on failure.
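
      A hedged, illustrative sketch of the pattern in the sysfs store handlers
      (names are placeholders, not the exact code):

         err = kstrtoul(buf, 0, &val);
         if (err)
            return err;   /* propagate the real reason to the writer */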
      
      Link: https://lkml.kernel.org/r/6329ebe0.050a0220.ec4bd.297cSMTPIN_ADDED_BROKEN@mx.google.com
      Suggested-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarXin Hao <xhao@linux.alibaba.com>
      Reviewed-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarSeongJae Park <sj@kernel.org>
      Reviewed-by: default avatarXin Hao <xhao@linux.alibaba.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      30b6242c
    • Ran Xiaokai's avatar
      mm/page_alloc: update comments for rmqueue() · a57ae9ef
      Ran Xiaokai authored
      Since commit 44042b44 ("mm/page_alloc: allow high-order pages to be
      stored on the per-cpu lists"), the per-cpu page allocator (PCP) is no
      longer only for order-0 pages.  Update the comments accordingly.
      
      Link: https://lkml.kernel.org/r/20220918025640.208586-1-ran.xiaokai@zte.com.cn
      Signed-off-by: default avatarRan Xiaokai <ran.xiaokai@zte.com.cn>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a57ae9ef
    • Kaixu Xia's avatar
      mm/damon: rename damon_pageout_score() to damon_cold_score() · e3e486e6
      Kaixu Xia authored
      In the beginning there was only one damos_action, 'DAMOS_PAGEOUT', that
      needed the coldness score of a region for a scheme, and it used
      damon_pageout_score() to get it.  Now other damos_action actions need the
      coldness score as well, so rename the function to damon_cold_score() to
      make more sense.
      
      Link: https://lkml.kernel.org/r/1663423014-28907-1-git-send-email-kaixuxia@tencent.com
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e3e486e6
    • Mike Kravetz's avatar
      hugetlb: freeze allocated pages before creating hugetlb pages · 2b21624f
      Mike Kravetz authored
      When creating hugetlb pages, the hugetlb code must first allocate
      contiguous pages from a low level allocator such as buddy, cma or
      memblock.  The pages returned from these low level allocators are ref
      counted.  This creates potential issues with other code taking speculative
      references on these pages before they can be transformed to a hugetlb
      page.  This issue has been addressed with methods and code such as that
      provided in [1].
      
      Recent discussions about vmemmap freeing [2] have indicated that it would
      be beneficial to freeze all sub pages, including the head page of pages
      returned from low level allocators before converting to a hugetlb page. 
      This helps avoid races if we want to replace the page containing vmemmap
      for the head page.
      
      There have been proposals to change at least the buddy allocator to return
      frozen pages as described at [3].  If such a change is made, it can be
      employed by the hugetlb code.  However, as mentioned above hugetlb uses
      several low level allocators so each would need to be modified to return
      frozen pages.  For now, we can manually freeze the returned pages.  This
      is done in two places:
      
      1) alloc_buddy_huge_page, only the returned head page is ref counted.
         We freeze the head page, retrying once in the VERY rare case where
         there may be an inflated ref count (see the sketch after this list).
      2) prep_compound_gigantic_page, for gigantic pages the current code
         freezes all pages except the head page.  New code will simply freeze
         the head page as well.
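
      A hedged sketch of the freeze-with-one-retry from (1); the real
      alloc_buddy_huge_page() has more surrounding logic and the names here
      are illustrative:

         page = __alloc_pages(gfp_mask, order, nid, nmask);
         if (page && !page_ref_freeze(page, 1)) {
            /* a speculative reference inflated the count; retry once */
            __free_pages(page, order);
            page = __alloc_pages(gfp_mask, order, nid, nmask);
            if (page && !page_ref_freeze(page, 1)) {
               __free_pages(page, order);   /* give up after one retry */
               page = NULL;
            }
         }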
      
      In a few other places, code checks for inflated ref counts on newly
      allocated hugetlb pages.  With the modifications to freeze after
      allocating, this code can be removed.
      
      After hugetlb pages are freshly allocated, they are often added to the
      hugetlb free lists.  Since these pages were previously ref counted, this
      was done via put_page() which would end up calling the hugetlb destructor:
      free_huge_page.  With changes to freeze pages, we simply call
      free_huge_page directly to add the pages to the free list.
      
      In a few other places, freshly allocated hugetlb pages were immediately
      put into use, and the expectation was they were already ref counted.  In
      these cases, we must manually ref count the page.
      
      [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
      [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/
      
      [mike.kravetz@oracle.com: fix NULL pointer dereference]
        Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2b21624f
    • Miaohe Lin's avatar
      mm/page_alloc: fix obsolete comment in deferred_pfn_valid() · c9b3637f
      Miaohe Lin authored
      There are no architectures that can have holes in the memory map within a
      pageblock since commit 859a85dd ("mm: remove pfn_valid_within() and
      CONFIG_HOLES_IN_ZONE").  Update the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20220916072257.9639-17-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c9b3637f
    • Miaohe Lin's avatar
      mm/page_alloc: remove obsolete gfpflags_normal_context() · def76fd5
      Miaohe Lin authored
      Since commit dacb5d88 ("tcp: fix page frag corruption on page fault"),
      there's no caller of gfpflags_normal_context().  Remove it, as this helper
      is strictly tied to the sk page frag usage and there won't be other users
      in the future.
      
      [linmiaohe@huawei.com: fix htmldocs]
        Link: https://lkml.kernel.org/r/1bc55727-9b66-0e9e-c306-f10c4716ea89@huawei.com
      Link: https://lkml.kernel.org/r/20220916072257.9639-16-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      def76fd5
    • Miaohe Lin's avatar
      mm/page_alloc: use costly_order in WARN_ON_ONCE_GFP() · 896c4d52
      Miaohe Lin authored
      There's no need to check whether order > PAGE_ALLOC_COSTLY_ORDER again. 
      Minor readability improvement.
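
      A hedged, illustrative fragment of what this looks like in
      __alloc_pages_slowpath() (not the exact hunk):

         const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;

         /* ... later, in the __GFP_NOFAIL handling ... */
         WARN_ON_ONCE_GFP(costly_order, gfp_mask);   /* reuse instead of recomputing */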
      
      Link: https://lkml.kernel.org/r/20220916072257.9639-15-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      896c4d52
    • Miaohe Lin's avatar
      mm/page_alloc: init local variable buddy_pfn · dae37a5d
      Miaohe Lin authored
      The local variable buddy_pfn could be passed to buddy_merge_likely()
      without initialization if the passed-in order is MAX_ORDER - 1.  This
      looks buggy, but buddy_pfn won't actually be used in that case because of
      the order >= MAX_ORDER - 2 check.  Initialize buddy_pfn to 0 anyway to
      avoid possible future misuse.
      
      Link: https://lkml.kernel.org/r/20220916072257.9639-14-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      dae37a5d
    • Miaohe Lin's avatar
      mm/page_alloc: use helper macro SZ_1{K,M} · c940e020
      Miaohe Lin authored
      Use the helper macros SZ_1K and SZ_1M to do the size conversions.  Minor
      readability improvement.
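
      A hedged, illustrative example of the substitution (not the exact hunks;
      the constants come from <linux/sizes.h> and the variable names are
      placeholders):

         batch = min(nr_pages, SZ_1M / PAGE_SIZE);   /* instead of (1024 * 1024) / PAGE_SIZE */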
      
      Link: https://lkml.kernel.org/r/20220916072257.9639-13-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c940e020