1. 19 May, 2022 31 commits
  2. 13 May, 2022 9 commits
    • mm, compaction: fast_find_migrateblock() should return pfn in the target zone · bbe832b9
      Rei Yamamoto authored
      At present, pages not in the target zone are added to cc->migratepages
      list in isolate_migratepages_block().  As a result, pages may migrate
      between nodes unintentionally.
      
      This would be a serious problem for older kernels without commit
      a984226f ("mm: memcontrol: remove the pgdata parameter of
      mem_cgroup_page_lruvec"), because it can corrupt the lru list by
      handling pages on a list without holding the proper lru_lock.
      
      Avoid returning a pfn outside the target zone in the case that it is
      not aligned with a pageblock boundary.  Otherwise
      isolate_migratepages_block() will handle pages not in the target zone.
      
      Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com
      Fixes: 70b44595 ("mm, compaction: use free lists to quickly locate a migration source")
      Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Wonhyuk Yang <vvghjk1234@gmail.com>
      Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: add documentation for Enum value · d4a157f5
      Gautam Menghani authored
      Fix the warning "Enum value 'NR_DAMON_OPS' not described in enum
      'damon_ops_id'" generated by the command "make pdfdocs".
      
      Link: https://lkml.kernel.org/r/20220508073316.141401-1-gautammenghani201@gmail.com
      Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memcontrol: export memcg->watermark via sysfs for v2 memcg · 8e20d4b3
      Ganesan Rajagopal authored
      We run a lot of automated tests when building our software and run into
      OOM scenarios when the tests run unbounded.  v1 memcg exports
      memcg->watermark as "memory.max_usage_in_bytes" in sysfs.  We use this
      metric to heuristically limit the number of tests that can run in parallel
      based on per test historical data.
      
      This metric is currently not exported for v2 memcg and there is no other
      easy way of getting this information.  The getrusage() syscall returns
      "ru_maxrss", which can be used as an approximation, but that is the max
      RSS of a single child process across all children instead of the
      aggregated max for all child processes.  The only workaround is to
      periodically poll "memory.current", but that is not practical for
      short-lived one-off cgroups.
      
      Hence, expose memcg->watermark as "memory.peak" for v2 memcg.
      
      Link: https://lkml.kernel.org/r/20220507050916.GA13577@us192.sjc.aristanetworks.com
      Signed-off-by: Ganesan Rajagopal <rganesan@arista.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl · 78f39084
      Muchun Song authored
      We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
      reboot the server to enable or disable the feature of optimizing vmemmap
      pages associated with HugeTLB pages.  However, rebooting usually takes a
      long time.  So add a sysctl to enable or disable the feature at runtime
      without rebooting.  Why do we need this?  There are three use cases.
      
      1) The feature of minimizing the overhead of struct page associated with
         each HugeTLB page is disabled by default unless
         "hugetlb_free_vmemmap=on" is passed on the boot cmdline.  When we
         (ByteDance) deliver servers to users who want to enable this feature,
         they have to configure grub (change the boot cmdline) and reboot the
         servers, and rebooting usually takes a long time (we have thousands
         of servers).  It is a very bad experience for the users.  So we need
         an approach to enable this feature at runtime.  This is a use case
         from our production environment.
      
      2) In some use cases, HugeTLB pages are allocated 'on the fly' instead
         of being pulled from the HugeTLB pool; those workloads would be
         affected with this feature enabled.  They can be identified by the
         fact that they never explicitly allocate huge pages via
         'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let
         the pages be allocated from the buddy allocator at fault time.
         Commit 099730d6 confirms this is a real use case.  For those
         workloads, the page fault time could be ~2x slower than before.  We
         suspect those users would want to disable this feature if the system
         enabled it earlier and they do not think the memory savings are
         enough to make up for the performance drop.
      
      3) A workload that wants its vmemmap pages optimized may be deployed on
         the same server as a workload that sets 'nr_overcommit_hugepages'
         and does not want the extra overhead at fault time when the
         overcommitted pages are allocated from the buddy allocator.  The
         user can enable this feature, set 'nr_hugepages' and
         'nr_overcommit_hugepages', and then disable the feature.  In this
         case, the overcommitted HugeTLB pages will not incur the extra
         overhead at fault time.
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: use kstrtobool for hugetlb_vmemmap param parsing · 9c54c522
      Muchun Song authored
      Use kstrtobool rather than open coding "on" and "off" parsing in
      mm/hugetlb_vmemmap.c; it also handles the other accepted spellings,
      such as 'Yy1Nn0' or [oO][NnFf], for "on" and "off".
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=on · 6e02c46b
      Muchun Song authored
      Optimizing HugeTLB vmemmap pages is not compatible with allocating
      memmap on hot-added memory.  If "hugetlb_free_vmemmap=on" and
      "memory_hotplug.memmap_on_memory" are both passed on the kernel command
      line, optimizing hugetlb pages takes precedence.  However, the global
      variable memmap_on_memory will still be set to 1, even though we will
      not try to allocate memmap on hot-added memory.
      
      Also introduce the mhp_memmap_on_memory() helper to move the definition
      of "memmap_on_memory" into the scope of CONFIG_MHP_MEMMAP_ON_MEMORY.
      In the next patch, mhp_memmap_on_memory() will also be exported for use
      in hugetlb_vmemmap.c.
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: disable hugetlb_optimize_vmemmap when struct page crosses page boundaries · 0effdf46
      Muchun Song authored
      Patch series "add hugetlb_optimize_vmemmap sysctl", v11.
      
      This series aims to add hugetlb_optimize_vmemmap sysctl to enable or
      disable the feature of optimizing vmemmap pages associated with HugeTLB
      pages.
      
      
      This patch (of 4):
      
      If the size of "struct page" is not a power of two while the feature of
      minimizing the overhead of struct page associated with each HugeTLB
      page is enabled, then the vmemmap pages of HugeTLB will be corrupted
      after remapping (in theory, a panic is imminent).  But this combination
      only exists with !CONFIG_MEMCG && !CONFIG_SLUB on x86_64, which is not
      a conventional configuration nowadays.  So it is not a real-world
      issue, just the result of a code review.
      
      However, we cannot prevent anyone from building with that combined
      configuration, so hugetlb_optimize_vmemmap should be disabled in this
      case to fix the issue.
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20220512041142.39501-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: rmap: fix CONT-PTE/PMD size hugetlb issue when unmapping · a00a8759
      Baolin Wang authored
      Some architectures (like ARM64) support CONT-PTE/PMD size hugetlb,
      meaning they support not only the PMD/PUD sizes (2M and 1G) but also
      the CONT-PTE/PMD sizes (64K and 32M when a 4K base page size is used).
      
      When unmapping a hugetlb page, we will get the relevant page table entry
      by huge_pte_offset() only once to nuke it.  This is correct for PMD or PUD
      size hugetlb, since they always contain only one pmd entry or pud entry in
      the page table.
      
      However, this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
      since they can contain several contiguous pte or pmd entries with the
      same page table attributes, so we nuke only one pte or pmd entry for
      such a CONT-PTE/PMD size hugetlb page.
      
      Currently, try_to_unmap() is only passed a hugetlb page when that page
      is poisoned.  That means we unmap only one pte entry of a CONT-PTE or
      CONT-PMD size poisoned hugetlb page, so the other subpages remain
      accessible, which can possibly cause serious issues.
      
      So we should change to huge_ptep_clear_flush() to nuke the hugetlb
      page table entries; it already handles CONT-PTE and CONT-PMD size
      hugetlb, which fixes this issue.
      
      We already use set_huge_swap_pte_at() to set a poisoned swap entry for
      a poisoned hugetlb page.  Meanwhile, add a VM_BUG_ON() to make sure
      the hugetlb page passed to try_to_unmap() is poisoned.
      
      Link: https://lkml.kernel.org/r/0a2e547238cad5bc153a85c3e9658cb9d55f9cac.1652270205.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/730ea4b6d292f32fb10b7a4e87dad49b0eb30474.1652147571.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: rmap: fix CONT-PTE/PMD size hugetlb issue when migration · 5d4af619
      Baolin Wang authored
      Some architectures (like ARM64) support CONT-PTE/PMD size hugetlb,
      meaning they support not only the PMD/PUD sizes (2M and 1G) but also
      the CONT-PTE/PMD sizes (64K and 32M when a 4K base page size is used).
      
      When migrating a hugetlb page, we will get the relevant page table entry
      by huge_pte_offset() only once to nuke it and remap it with a migration
      pte entry.  This is correct for PMD or PUD size hugetlb, since they always
      contain only one pmd entry or pud entry in the page table.
      
      However, this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
      since they can contain several contiguous pte or pmd entries with the
      same page table attributes.  So we nuke or remap only one pte or pmd
      entry of such a CONT-PTE/PMD size hugetlb page, which is not what
      hugetlb migration expects.  The problem is that the subpages' data of
      the hugetlb page can still be modified while it is being migrated,
      which can cause a serious data consistency issue, since we did not nuke
      the page table entries and set migration entries for the subpages of
      the hugetlb page.
      
      To fix this issue, change to huge_ptep_clear_flush() to nuke the
      hugetlb page table entries, and remap with set_huge_pte_at() and
      set_huge_swap_pte_at() when migrating a hugetlb page; these already
      handle CONT-PTE and CONT-PMD size hugetlb.
      
      [akpm@linux-foundation.org: fix nommu build]
      [baolin.wang@linux.alibaba.com: fix build errors for !CONFIG_MMU]
        Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>