  1. 26 Apr, 2024 2 commits
  2. 25 Apr, 2024 4 commits
    • mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() · 52ccdde1
      Miaohe Lin authored
      When I did memory failure tests recently, the warning below occurred:
      
      DEBUG_LOCKS_WARN_ON(1)
      WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
      Modules linked in: mce_inject hwpoison_inject
      CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      RIP: 0010:__lock_acquire+0xccb/0x1ca0
      RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
      RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
      RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
      RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
      R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
      R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
      FS:  00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       lock_acquire+0xbe/0x2d0
       _raw_spin_lock_irqsave+0x3a/0x60
       hugepage_subpool_put_pages.part.0+0xe/0xc0
       free_huge_folio+0x253/0x3f0
       dissolve_free_huge_page+0x147/0x210
       __page_handle_poison+0x9/0x70
       memory_failure+0x4e6/0x8c0
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff9f3114887
      RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
      RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
      RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
       </TASK>
      Kernel panic - not syncing: kernel: panic_on_warn set ...
      CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       panic+0x326/0x350
       check_panic_on_warn+0x4f/0x50
       __warn+0x98/0x190
       report_bug+0x18e/0x1a0
       handle_bug+0x3d/0x70
       exc_invalid_op+0x18/0x70
       asm_exc_invalid_op+0x1a/0x20
      RIP: 0010:__lock_acquire+0xccb/0x1ca0
      RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
      RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
      RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
      RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
      R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
      R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
       lock_acquire+0xbe/0x2d0
       _raw_spin_lock_irqsave+0x3a/0x60
       hugepage_subpool_put_pages.part.0+0xe/0xc0
       free_huge_folio+0x253/0x3f0
       dissolve_free_huge_page+0x147/0x210
       __page_handle_poison+0x9/0x70
       memory_failure+0x4e6/0x8c0
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff9f3114887
      RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
      RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
      RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
       </TASK>
      
      After git bisecting and digging into the code, I believe the root cause is
      that the _deferred_list field of the folio is unioned with the
      _hugetlb_subpool field.  In __update_and_free_hugetlb_folio(),
      folio->_deferred_list is initialized, corrupting folio->_hugetlb_subpool
      when the folio is a hugetlb folio.  Later, free_huge_folio() will use
      _hugetlb_subpool and the above warning happens.
      
      But it is assumed that the hugetlb flag must have been cleared before
      calling folio_put() in update_and_free_hugetlb_folio().  This assumption
      is broken by the race below:
      
      CPU1					CPU2
      dissolve_free_huge_page			update_and_free_pages_bulk
       update_and_free_hugetlb_folio		 hugetlb_vmemmap_restore_folios
      					  folio_clear_hugetlb_vmemmap_optimized
        clear_flag = folio_test_hugetlb_vmemmap_optimized
        if (clear_flag) <-- False, it's already cleared.
         __folio_clear_hugetlb(folio) <-- Hugetlb is not cleared.
        folio_put
         free_huge_folio <-- free_the_page is expected.
      					 list_for_each_entry()
      					  __folio_clear_hugetlb <-- Too late.
      
      Fix this issue by checking directly whether the folio is hugetlb,
      instead of checking clear_flag, to close the race window.
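
      The aliasing hazard can be demonstrated with a small standalone userspace
      C sketch (the structure and field names below are stand-ins for struct
      folio's union, not the kernel definitions):

      /*
       * Two logically distinct fields share storage via a union, so
       * initializing one of them silently corrupts the other, the same
       * hazard as folio->_deferred_list vs. folio->_hugetlb_subpool.
       */
      #include <stdio.h>

      struct list_head { struct list_head *next, *prev; };

      struct fake_folio {
          union {
              struct list_head deferred_list;  /* used for THP folios     */
              void *hugetlb_subpool;           /* used for hugetlb folios */
          };
      };

      int main(void)
      {
          struct fake_folio f = { .hugetlb_subpool = (void *)0x1234 };

          /* "Initializing" the list reuses the same bytes ... */
          f.deferred_list.next = &f.deferred_list;
          f.deferred_list.prev = &f.deferred_list;

          /* ... so the subpool pointer is now garbage. */
          printf("subpool pointer after list init: %p\n", f.hugetlb_subpool);
          return 0;
      }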
      
      Link: https://lkml.kernel.org/r/20240419085819.1901645-1-linmiaohe@huawei.com
      Fixes: 32c87719 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: check for anon_vma prior to folio allocation · 37641efa
      Vishal Moola (Oracle) authored
      Commit 9acad7ba ("hugetlb: use vmf_anon_prepare() instead of
      anon_vma_prepare()") may bailout after allocating a folio if we do not
      hold the mmap lock.  When this occurs, vmf_anon_prepare() will release the
      vma lock.  Hugetlb then attempts to call restore_reserve_on_error(), which
      depends on the vma lock being held.
      
      We can move vmf_anon_prepare() prior to the folio allocation in order to
      avoid calling restore_reserve_on_error() without the vma lock.
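
      The reordering can be modelled in a few lines of standalone C (the
      functions and the lock flag below are illustrative stand-ins, not the
      hugetlb code):

      #include <stdbool.h>
      #include <stdio.h>
      #include <stdlib.h>

      static bool vma_locked = true;

      /* Stands in for vmf_anon_prepare(): on failure it drops the vma lock. */
      static int anon_prepare(bool fail)
      {
          if (fail) {
              vma_locked = false;
              return -1;
          }
          return 0;
      }

      /* Stands in for restore_reserve_on_error(): needs the vma lock held. */
      static void restore_reserve_on_error(void)
      {
          if (!vma_locked) {
              fprintf(stderr, "BUG: restore called without vma lock\n");
              abort();
          }
          puts("reservation restored safely");
      }

      int main(void)
      {
          /* Fixed ordering: prepare first, allocate only if that succeeded. */
          if (anon_prepare(false) < 0)
              return 1;               /* nothing allocated, nothing to restore */

          void *folio = malloc(4096); /* stand-in for the folio allocation */
          restore_reserve_on_error(); /* later error paths are safe: lock held */
          free(folio);
          return 0;
      }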
      
      Link: https://lkml.kernel.org/r/ZiFqSrSRLhIV91og@fedora
      Fixes: 9acad7ba ("hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()")
      Reported-by: syzbot+ad1b592fc4483655438b@syzkaller.appspotmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: turn folio_test_hugetlb into a PageType · d99e3140
      Matthew Wilcox (Oracle) authored
      The current folio_test_hugetlb() can be fooled by a concurrent folio split
      into returning true for a folio which has never belonged to hugetlbfs. 
      This can't happen if the caller holds a refcount on it, but we have a few
      places (memory-failure, compaction, procfs) which do not and should not
      take a speculative reference.
      
      Since hugetlb pages do not use individual page mapcounts (they are always
      fully mapped and use the entire_mapcount field to record the number of
      mappings), the PageType field is available now that page_mapcount()
      ignores the value in this field.
      
      In compaction and with CONFIG_DEBUG_VM enabled, the current implementation
      can result in an oops, as reported by Luis. This happens since 9c5ccf2d
      ("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks
      in the PageHuge() testing path.
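
      As a rough userspace model of a PageType-style check (constants and
      layout below are illustrative stand-ins, not the kernel's exact
      definitions): the type lives in a single word that is unioned with the
      mapcount, so a reader classifies the page with one load rather than
      deriving the answer from fields a concurrent split may be rewriting.

      #include <stdio.h>

      #define PAGE_TYPE_BASE  0xf0000000u
      #define PG_hugetlb_type 0x00000800u      /* stand-in flag bit */

      struct fake_page { unsigned int page_type; };  /* unioned with _mapcount */

      static int page_type_hugetlb(const struct fake_page *p)
      {
          /* A type flag is "set" by clearing its bit from the all-ones word. */
          return (p->page_type & (PAGE_TYPE_BASE | PG_hugetlb_type)) ==
                 PAGE_TYPE_BASE;
      }

      int main(void)
      {
          struct fake_page mapped  = { .page_type = 0x00000001u };  /* mapcount in use */
          struct fake_page hugetlb = { .page_type = 0xffffffffu & ~PG_hugetlb_type };

          printf("mapped page  -> hugetlb? %d\n", page_type_hugetlb(&mapped));
          printf("hugetlb page -> hugetlb? %d\n", page_type_hugetlb(&hugetlb));
          return 0;
      }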
      
      [willy@infradead.org: update vmcoreinfo]
        Link: https://lkml.kernel.org/r/ZgGZUvsdhaT1Va-T@casper.infradead.org
      Link: https://lkml.kernel.org/r/20240321142448.1645400-6-willy@infradead.org
      Fixes: 9c5ccf2d ("mm: remove HUGETLB_PAGE_DTOR")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Luis Chamberlain <mcgrof@kernel.org>
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: fix missing hugetlb_lock for resv uncharge · b76b4690
      Peter Xu authored
      There is a recent report on UFFDIO_COPY over hugetlb:
      
      https://lore.kernel.org/all/000000000000ee06de0616177560@google.com/
      
      350:	lockdep_assert_held(&hugetlb_lock);
      
      It should be an issue in hugetlb, but it is triggered in a userfault
      context, where it goes into the unlikely path in which two threads modify
      the resv map together.  Mike has a fix in that path for the resv
      uncharge, but it looks like the locking criteria were overlooked:
      hugetlb_cgroup_uncharge_folio_rsvd() will update the cgroup pointer, so
      it needs to be called with the lock held.
      
      Link: https://lkml.kernel.org/r/20240417211836.2742593-3-peterx@redhat.com
      Fixes: 79aa925b ("hugetlb_cgroup: fix reservation accounting")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: syzbot+4b8077a5fccc61c385a1@syzkaller.appspotmail.com
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 16 Apr, 2024 1 commit
  4. 12 Mar, 2024 1 commit
    • mm: add an explicit smp_wmb() to UFFDIO_CONTINUE · b14d1671
      James Houghton authored
      Users of UFFDIO_CONTINUE may reasonably assume that a write memory barrier
      is included as part of UFFDIO_CONTINUE.  That is, a user may believe that
      all writes it has done to a page that it is now UFFDIO_CONTINUE'ing are
      guaranteed to be visible to anyone subsequently reading the page through
      the newly mapped virtual memory region.
      
      Today, such a user happens to be correct.  mmget_not_zero(), for example,
      is called as part of UFFDIO_CONTINUE (and comes before any PTE updates),
      and it implicitly gives us a write barrier.
      
      To be resilient against future changes, include an explicit smp_wmb(). 
      While we're at it, optimize the smp_wmb() that is already incidentally
      present for the HugeTLB case.
      
      Merely making a syscall does not generally imply the memory ordering
      constraints that we need (including on x86).
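
      The ordering requirement itself (make the page contents visible before
      the mapping is published) can be sketched with plain C11 atomics in
      userspace; names are illustrative, and the release fence plays the role
      the smp_wmb() plays on the kernel side:

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static int page_contents;              /* stands in for the page data   */
      static atomic_int page_mapped;         /* stands in for the new mapping */

      static void *producer(void *arg)
      {
          page_contents = 42;                          /* write the page      */
          atomic_thread_fence(memory_order_release);   /* smp_wmb() analogue  */
          atomic_store_explicit(&page_mapped, 1, memory_order_relaxed);
          return NULL;
      }

      static void *consumer(void *arg)
      {
          while (!atomic_load_explicit(&page_mapped, memory_order_acquire))
              ;                                        /* wait for the mapping */
          printf("reader sees %d\n", page_contents);   /* guaranteed to be 42  */
          return NULL;
      }

      int main(void)
      {
          pthread_t p, c;

          pthread_create(&p, NULL, producer, NULL);
          pthread_create(&c, NULL, consumer, NULL);
          pthread_join(p, NULL);
          pthread_join(c, NULL);
          return 0;
      }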
      
      Link: https://lkml.kernel.org/r/20240307010250.3847179-1-jthoughton@google.com
      Signed-off-by: James Houghton <jthoughton@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. 06 Mar, 2024 6 commits
  6. 05 Mar, 2024 4 commits
  7. 22 Feb, 2024 3 commits
  8. 08 Jan, 2024 1 commit
  9. 29 Dec, 2023 4 commits
    • mm/rmap: introduce and use hugetlb_try_dup_anon_rmap() · ebe2e35e
      David Hildenbrand authored
      hugetlb rmap handling differs quite a lot from "ordinary" rmap code.  For
      example, hugetlb currently only supports entire mappings, and treats any
      mapping as mapped using a single "logical PTE".  Let's move it out of the
      way so we can overhaul our "ordinary" rmap implementation/interface.
      
      So let's introduce and use hugetlb_try_dup_anon_rmap() to make all hugetlb
      handling use dedicated hugetlb_* rmap functions.
      
      Add sanity checks that we end up with the right folios in the right
      functions.
      
      Note that is_device_private_page() does not apply to hugetlb.
      
      Link: https://lkml.kernel.org/r/20231220224504.646757-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: introduce and use hugetlb_add_file_rmap() · 44887f39
      David Hildenbrand authored
      hugetlb rmap handling differs quite a lot from "ordinary" rmap code.  For
      example, hugetlb currently only supports entire mappings, and treats any
      mapping as mapped using a single "logical PTE".  Let's move it out of the
      way so we can overhaul our "ordinary" rmap implementation/interface.
      
      Right now we're using page_dup_file_rmap() in some cases where "ordinary"
      rmap code would have used page_add_file_rmap().  So let's introduce and
      use hugetlb_add_file_rmap() instead.  We won't be adding a
      "hugetlb_dup_file_rmap()" function for the fork() case, as it would be
      doing the same: "dup" is just an optimization for "add".
      
      What remains is a single page_dup_file_rmap() call in fork() code.
      
      Add sanity checks that we end up with the right folios in the right
      functions.
      
      Link: https://lkml.kernel.org/r/20231220224504.646757-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: introduce and use hugetlb_remove_rmap() · e135826b
      David Hildenbrand authored
      hugetlb rmap handling differs quite a lot from "ordinary" rmap code.  For
      example, hugetlb currently only supports entire mappings, and treats any
      mapping as mapped using a single "logical PTE".  Let's move it out of the
      way so we can overhaul our "ordinary" rmap implementation/interface.
      
      Let's introduce and use hugetlb_remove_rmap() and remove the hugetlb code
      from page_remove_rmap().  This effectively removes one check on the
      small-folio path as well.
      
      Add sanity checks that we end up with the right folios in the right
      functions.
      
      Note: all possible candidates that need care are page_remove_rmap() that
            pass compound=true.
      
      Link: https://lkml.kernel.org/r/20231220224504.646757-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: rename hugepage_add* to hugetlb_add* · 9d5fafd5
      David Hildenbrand authored
      Patch series "mm/rmap: interface overhaul", v2.
      
      This series overhauls the rmap interface, to get rid of the "bool
      compound" / RMAP_COMPOUND parameter with the goal of making the interface
      less error prone, more future proof, and more natural to extend to
      "batching".  Also, this converts the interface to always consume
      folio+subpage, which speeds up operations on large folios.
      
      Further, this series adds PTE-batching variants for 4 rmap functions,
      whereby only folio_add_anon_rmap_ptes() is used for batching in this
      series when PTE-remapping a PMD-mapped THP.  folio_remove_rmap_ptes(),
      folio_try_dup_anon_rmap_ptes() and folio_dup_file_rmap_ptes() will soon
      come in handy[1,2].
      
      This series performs a lot of folio conversion along the way.  Most of the
      added LOC in the diff are only due to documentation.
      
      As we're moving to a pte/pmd interface where we clearly express the
      mapping granularity we are dealing with, we first get the remainder of
      hugetlb out of the way, as it is special and expected to remain special:
      it treats everything as a "single logical PTE" and only currently allows
      entire mappings.
      
      Even if we'd ever support partial mappings, I strongly assume the
      interface and implementation will still differ heavily: hopefully we can
      avoid working on subpages/subpage mapcounts completely and only add a
      "count" parameter for them to enable batching.
      
      New (extended) hugetlb interface that operates on entire folio:
       * hugetlb_add_new_anon_rmap() -> Already existed
       * hugetlb_add_anon_rmap() -> Already existed
       * hugetlb_try_dup_anon_rmap()
       * hugetlb_try_share_anon_rmap()
       * hugetlb_add_file_rmap()
       * hugetlb_remove_rmap()
      
      New "ordinary" interface for small folios / THP:
       * folio_add_new_anon_rmap() -> Already existed
       * folio_add_anon_rmap_[pte|ptes|pmd]()
       * folio_try_dup_anon_rmap_[pte|ptes|pmd]()
       * folio_try_share_anon_rmap_[pte|pmd]()
       * folio_add_file_rmap_[pte|ptes|pmd]()
       * folio_dup_file_rmap_[pte|ptes|pmd]()
       * folio_remove_rmap_[pte|ptes|pmd]()
      
      folio_add_new_anon_rmap() will always map at the largest granularity
      possible (currently, a single PMD to cover a PMD-sized THP).  Could be
      extended if ever required.
      
      In the future, we might want "_pud" variants and eventually "_pmds"
      variants for batching.
      
      I ran some simple microbenchmarks on an Intel(R) Xeon(R) Silver 4210R:
      measuring munmap(), fork(), cow, MADV_DONTNEED on each PTE ...  and PTE
      remapping PMD-mapped THPs on 1 GiB of memory.
      
      For small folios, there is barely a change (< 1% improvement for me).
      
      For PTE-mapped THP:
      * PTE-remapping a PMD-mapped THP is more than 10% faster.
      * fork() is more than 4% faster.
      * MADV_DONTNEED is 2% faster
      * COW when writing only a single byte on a COW-shared PTE is 1% faster
      * munmap() barely changes (< 1%).
      
      [1] https://lkml.kernel.org/r/20230810103332.3062143-1-ryan.roberts@arm.com
      [2] https://lkml.kernel.org/r/20231204105440.61448-1-ryan.roberts@arm.com
      
      
      This patch (of 40):
      
      Let's just call it "hugetlb_".
      
      Yes, it's all already inconsistent and confusing because we have a lot of
      "hugepage_" functions for legacy reasons.  But "hugetlb" cannot possibly
      be confused with transparent huge pages, and it matches "hugetlb.c" and
      "folio_test_hugetlb()".  So let's minimize confusion in rmap code.
      
      Link: https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20231220224504.646757-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  10. 07 Dec, 2023 1 commit
    • hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write · 187da0f8
      Mike Kravetz authored
      The routine __vma_private_lock tests for the existence of a reserve map
      associated with a private hugetlb mapping.  A pointer to the reserve map
      is in vma->vm_private_data.  __vma_private_lock was checking the pointer
      for NULL.  However, it is possible that the low bits of the pointer could
      be used as flags.  In such instances, vm_private_data is not NULL and not
      a valid pointer.  This results in the null-ptr-deref reported by syzbot:
      
      general protection fault, probably for non-canonical address 0xdffffc000000001d:
       0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
      CPU: 0 PID: 5048 Comm: syz-executor139 Not tainted 6.6.0-rc7-syzkaller-00142-g888cf78c29e2 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
      RIP: 0010:__lock_acquire+0x109/0x5de0 kernel/locking/lockdep.c:5004
      ...
      Call Trace:
       <TASK>
       lock_acquire kernel/locking/lockdep.c:5753 [inline]
       lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5718
       down_write+0x93/0x200 kernel/locking/rwsem.c:1573
       hugetlb_vma_lock_write mm/hugetlb.c:300 [inline]
       hugetlb_vma_lock_write+0xae/0x100 mm/hugetlb.c:291
       __hugetlb_zap_begin+0x1e9/0x2b0 mm/hugetlb.c:5447
       hugetlb_zap_begin include/linux/hugetlb.h:258 [inline]
       unmap_vmas+0x2f4/0x470 mm/memory.c:1733
       exit_mmap+0x1ad/0xa60 mm/mmap.c:3230
       __mmput+0x12a/0x4d0 kernel/fork.c:1349
       mmput+0x62/0x70 kernel/fork.c:1371
       exit_mm kernel/exit.c:567 [inline]
       do_exit+0x9ad/0x2a20 kernel/exit.c:861
       __do_sys_exit kernel/exit.c:991 [inline]
       __se_sys_exit kernel/exit.c:989 [inline]
       __x64_sys_exit+0x42/0x50 kernel/exit.c:989
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Mask off the low bit flags before checking for a NULL pointer.  In
      addition, the reserve map only 'belongs' to the OWNER (the parent in
      parent/child relationships), so also check for the OWNER flag.
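
      The hazard and the fix can be modelled in standalone C (the flag values
      and names below are stand-ins for the kernel's vm_private_data encoding):

      #include <stdint.h>
      #include <stdio.h>

      #define FLAG_OWNER 0x1UL       /* stand-in for the low-bit flags */
      #define FLAG_MASK  0x3UL

      static void *decode_resv_map(void *priv)
      {
          return (void *)((uintptr_t)priv & ~FLAG_MASK);
      }

      int main(void)
      {
          /* No reserve map attached, but a flag bit is set: not NULL. */
          void *vm_private_data = (void *)FLAG_OWNER;

          if (vm_private_data)
              printf("naive NULL check: looks like a valid pointer!\n");
          if (!decode_resv_map(vm_private_data))
              printf("masked check: no reserve map attached\n");
          return 0;
      }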
      
      Link: https://lkml.kernel.org/r/20231114012033.259600-1-mike.kravetz@oracle.com
      Reported-by: syzbot+6ada951e7c0f7bc8a71e@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/linux-mm/00000000000078d1e00608d7878b@google.com/
      Fixes: bf491692 ("hugetlbfs: extend hugetlb_vma_lock to private VMAs")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Cc: Edward Adam Davis <eadavis@qq.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  11. 21 Nov, 2023 1 commit
  12. 25 Oct, 2023 6 commits
    • mempolicy: mmap_lock is not needed while migrating folios · 72e315f7
      Hugh Dickins authored
      mbind(2) holds down_write of current task's mmap_lock throughout
      (exclusive because it needs to set the new mempolicy on the vmas);
      migrate_pages(2) holds down_read of pid's mmap_lock throughout.
      
      They both hold mmap_lock across the internal migrate_pages(), under which
      all new page allocations (huge or small) are made.  I'm nervous about it;
      and migrate_pages() certainly does not need mmap_lock itself.  It's done
      this way for mbind(2), because its page allocator is vma_alloc_folio() or
      alloc_hugetlb_folio_vma(), both of which depend on vma and address.
      
      Now that we have alloc_pages_mpol(), depending on (refcounted) memory
      policy and interleave index, mbind(2) can be modified to use that or
      alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the
      internal migrate_pages() at all: add alloc_migration_target_by_mpol() to
      replace mbind's new_page().
      
      (After that change, alloc_hugetlb_folio_vma() is used by nothing but a
      userfaultfd function: move it out of hugetlb.h and into the #ifdef.)
      
      migrate_pages(2) has chosen its target node before migrating, so can
      continue to use the standard alloc_migration_target(); but let it take and
      drop mmap_lock just around migrate_to_node()'s queue_pages_range():
      neither the node-to-node calculations nor the page migrations need it.
      
      It seems unlikely, but it is conceivable that some userspace depends on
      the kernel's mmap_lock exclusion here, instead of doing its own locking:
      more likely in a testsuite than in real life.  It is also possible, of
      course, that some pages on the list will be munmapped by another thread
      before they are migrated, or a newer memory policy applied to the range by
      that time: but such races could happen before, as soon as mmap_lock was
      dropped, so it does not appear to be a concern.
      
      Link: https://lkml.kernel.org/r/21e564e8-269f-6a89-7ee2-fd612831c289@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb_vmemmap: use folio argument for hugetlb_vmemmap_* functions · c5ad3233
      Usama Arif authored
      Most function calls in hugetlb.c are made with folio arguments.  This
      brings the hugetlb_vmemmap calls in line with them by using a folio
      instead of the head struct page.  The head struct page is still needed
      within these functions.
      
      The set/clear/test functions for hugepages are also changed to folio
      versions.
      
      Link: https://lkml.kernel.org/r/20231011144557.1720481-2-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: perform vmemmap restoration on a list of pages · cfb8c750
      Mike Kravetz authored
      The routine update_and_free_pages_bulk already performs vmemmap
      restoration on the list of hugetlb pages in a separate step.  In
      preparation for more functionality to be added in this step, create a new
      routine hugetlb_vmemmap_restore_folios() that will restore vmemmap for a
      list of folios.
      
      This new routine must provide sufficient feedback about errors and actual
      restoration performed so that update_and_free_pages_bulk can perform
      optimally.
      
      Special care must be taken when encountering an error from
      hugetlb_vmemmap_restore_folios.  We want to continue making as much
      forward progress as possible.  A new routine bulk_vmemmap_restore_error
      handles this specific situation.
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: perform vmemmap optimization on a list of pages · 79359d6d
      Mike Kravetz authored
      When adding hugetlb pages to the pool, we first create a list of the
      allocated pages before adding to the pool.  Pass this list of pages to a
      new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.
      
      Due to significant differences in vmemmap initialization for bootmem
      allocated hugetlb pages, a new routine prep_and_add_bootmem_folios is
      created.
      
      We also modify the routine vmemmap_should_optimize() to check for pages
      that are already optimized.  There are code paths that might request
      vmemmap optimization twice and we want to make sure this is not attempted.
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: restructure pool allocations · d67e32f2
      Mike Kravetz authored
      Allocation of a hugetlb page for the hugetlb pool is done by the routine
      alloc_pool_huge_page.  This routine will allocate contiguous pages from a
      low level allocator, prep the pages for usage as a hugetlb page and then
      add the resulting hugetlb page to the pool.
      
      In the 'prep' stage, optional vmemmap optimization is done.  For
      performance reasons we want to perform vmemmap optimization on multiple
      hugetlb pages at once.  To do this, restructure the hugetlb pool
      allocation code such that vmemmap optimization can be isolated and later
      batched.
      
      The code to allocate hugetlb pages from bootmem was also modified to
      allow batching.
      
      No functional changes, only code restructure.
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles · d2cf88c2
      Mike Kravetz authored
      Patch series "Batch hugetlb vmemmap modification operations", v8.
      
      When hugetlb vmemmap optimization was introduced, the overhead of enabling
      the option was measured as described in commit 426e5c42 [1].  The
      summary states that allocating a hugetlb page should be ~2x slower with
      optimization and freeing a hugetlb page should be ~2-3x slower.  Such
      overhead was deemed an acceptable trade off for the memory savings
      obtained by freeing vmemmap pages.
      
      It was recently reported that the overhead associated with enabling
      vmemmap optimization could be as high as 190x for hugetlb page
      allocations.  Yes, 190x!  Some actual numbers from other environments are:
      
      Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
      ------------------------------------------------
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m4.119s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m4.477s
      
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m28.973s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m36.748s
      
      VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
      -----------------------------------------------------------
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m2.463s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m2.931s
      
      Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    2m27.609s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    2m29.924s
      
      In the VM environment, the slowdown of enabling hugetlb vmemmap optimization
      resulted in allocation times being 61x slower.
      
      A quick profile showed that the vast majority of this overhead was due to
      TLB flushing.  Each time we modify the kernel pagetable we need to flush
      the TLB.  For each hugetlb that is optimized, there could be potentially
      two TLB flushes performed.  One for the vmemmap pages associated with the
      hugetlb page, and potentially another one if the vmemmap pages are mapped
      at the PMD level and must be split.  The TLB flushes required for the
      kernel pagetable, result in a broadcast IPI with each CPU having to flush
      a range of pages, or do a global flush if a threshold is exceeded.  So,
      the flush time increases with the number of CPUs.  In addition, in virtual
      environments the broadcast IPI can’t be accelerated by hypervisor
      hardware and leads to traps that need to wakeup/IPI all vCPUs which is
      very expensive.  Because of this the slowdown in virtual environments is
      even worse than bare metal as the number of vCPUS/CPUs is increased.
      
      The following series attempts to reduce amount of time spent in TLB
      flushing.  The idea is to batch the vmemmap modification operations for
      multiple hugetlb pages.  Instead of doing one or two TLB flushes for each
      page, we do two TLB flushes for each batch of pages.  One flush after
      splitting pages mapped at the PMD level, and another after remapping
      vmemmap associated with all hugetlb pages.  Results of such batching are
      as follows:
      
      Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
      ------------------------------------------------
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m4.719s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m4.245s
      
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
      time echo 500000 > .../hugepages-2048kB/nr_hugepages
      real    0m7.267s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m13.199s
      
      VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
      -----------------------------------------------------------
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m2.715s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m3.186s
      
      next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
      time echo 524288 > .../hugepages-2048kB/nr_hugepages
      real    0m4.799s
      time echo 0 > .../hugepages-2048kB/nr_hugepages
      real    0m5.273s
      
      With batching, results are back in the 2-3x slowdown range.
      
      
      This patch (of 8):
      
      update_and_free_pages_bulk is designed to free a list of hugetlb pages
      back to their associated lower level allocators.  This may require
      allocating vmemmap pages associated with each hugetlb page.  The hugetlb
      page destructor must be changed before pages are freed to lower level
      allocators.  However, the destructor must be changed under the hugetlb
      lock.  This means there is potentially one lock cycle per page.
      
      Minimize the number of lock cycles in update_and_free_pages_bulk (see the
      sketch below) by:
      1) allocating the necessary vmemmap for all hugetlb pages on the list
      2) taking the hugetlb lock once and clearing the destructor for all pages
         on the list
      3) freeing all pages on the list back to the low level allocators
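
      A userspace model of that restructuring (a pthread mutex stands in for
      hugetlb_lock; the loop structure is the point, not the names):

      #include <pthread.h>
      #include <stdio.h>

      #define NPAGES 4

      static pthread_mutex_t hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;

      struct fake_page { int vmemmap_restored; int dtor_cleared; };
      static struct fake_page pages[NPAGES];

      int main(void)
      {
          int i;

          /* 1) restore vmemmap for every page, no lock needed */
          for (i = 0; i < NPAGES; i++)
              pages[i].vmemmap_restored = 1;

          /* 2) one lock cycle covers clearing the destructor on all pages */
          pthread_mutex_lock(&hugetlb_lock);
          for (i = 0; i < NPAGES; i++)
              pages[i].dtor_cleared = 1;
          pthread_mutex_unlock(&hugetlb_lock);

          /* 3) free everything back to the low level allocator, unlocked */
          for (i = 0; i < NPAGES; i++)
              printf("freeing page %d\n", i);

          return 0;
      }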
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20231019023113.345257-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: James Houghton <jthoughton@google.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  13. 18 Oct, 2023 6 commits
    • hugetlb: memcg: account hugetlb-backed memory in memory controller · 8cba9576
      Nhat Pham authored
      Currently, hugetlb memory usage is not accounted for in the memory
      controller, which could lead to memory overprotection for cgroups with
      hugetlb-backed memory.  This has been observed in our production system.
      
      For instance, here is one of our usecases: suppose there are two 32G
      containers.  The machine is booted with hugetlb_cma=6G, and each container
      may or may not use up to 3 gigantic pages, depending on the workload within
      it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
      limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
      difficult to configure memory.max to keep overall consumption, including
      anon, cache, slab, etc., fair.
      
      What we have had to resort to is constantly polling hugetlb usage and
      readjusting memory.max.  A similar procedure is done for other memory
      limits (memory.low, for example).  However, this is rather cumbersome and
      buggy.  Furthermore, when there is a delay in correcting the memory
      limits (for example, when hugetlb usage changes between consecutive runs
      of the userspace agent), the system could be in an over/underprotected
      state.
      
      This patch rectifies this issue by charging the memcg when the hugetlb
      folio is utilized, and uncharging when the folio is freed (analogous to
      the hugetlb controller).  Note that we do not charge when the folio is
      allocated to the hugetlb pool, because at this point it is not owned by
      any memcg.
      
      Some caveats to consider:
        * This feature is only available on cgroup v2.
        * There is no hugetlb pool management involved in the memory
          controller. As stated above, hugetlb folios are only charged towards
          the memory controller when it is used. Host overcommit management
          has to consider it when configuring hard limits.
        * Failure to charge towards the memcg results in SIGBUS. This could
          happen even if the hugetlb pool still has pages (but the cgroup
          limit is hit and reclaim attempt fails).
        * When this feature is enabled, hugetlb pages contribute to memory
          reclaim protection. low, min limits tuning must take into account
          hugetlb memory.
        * Hugetlb pages utilized while this option is not selected will not
          be tracked by the memory controller (even if cgroup v2 is remounted
          later on).
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER · 59838b25
      Frank van der Linden authored
      Originally, hugetlb_cgroup was the only hugetlb user of tail page
      structure fields.  So, the code defined and checked against
      HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use.
      
      However, by now, tail page #2 is used to store hugetlb hwpoison and
      subpool information as well.  In other words, without that tail page
      hugetlb doesn't work.
      
      Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and
      checks against it.  Instead, just check for the minimum viable page order
      at hstate creation time.
      
      Link: https://lkml.kernel.org/r/20231004153248.3842997-1-fvdl@google.com
      Signed-off-by: Frank van der Linden <fvdl@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap() · 06968625
      David Hildenbrand authored
      Let's convert it to consume a folio.
      
      [akpm@linux-foundation.org: fix kerneldoc]
      Link: https://lkml.kernel.org/r/20231002142949.235104-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: move SetPageAnonExclusive() out of page_move_anon_rmap() · 5ca43289
      David Hildenbrand authored
      Patch series "mm/rmap: convert page_move_anon_rmap() to
      folio_move_anon_rmap()".
      
      Convert page_move_anon_rmap() to folio_move_anon_rmap(), letting the
      callers handle PageAnonExclusive.  I'm including cleanup patch #3 because
      it fits into the picture and can be done cleaner by the conversion.
      
      
      This patch (of 3):
      
      Let's move it into the caller: there is a difference between whether an
      anon folio can only be mapped by one process (e.g., into one VMA), and
      whether it is truly exclusive (e.g., no references -- including GUP --
      from other processes).
      
      Further, for large folios the page might not actually be pointing at the
      head page of the folio, so it better be handled in the caller.  This is a
      preparation for converting page_move_anon_rmap() to consume a folio.
      
      Link: https://lkml.kernel.org/r/20231002142949.235104-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20231002142949.235104-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs · 52526ca7
      Muhammad Usama Anjum authored
      The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally
      clear the info about page table entries. The following operations are
      supported in this IOCTL:
      - Scan the address range and get the memory ranges matching the provided
        criteria. This is performed when the output buffer is specified.
      - Write-protect the pages. The PM_SCAN_WP_MATCHING is used to write-protect
        the pages of interest. The PM_SCAN_CHECK_WPASYNC aborts the operation if
        non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING``
        can be used with or without PM_SCAN_CHECK_WPASYNC.
      - Both of those operations can be combined into one atomic operation where
        we can get and write protect the pages as well.
      
      Following flags about pages are currently supported:
      - PAGE_IS_WPALLOWED - Page has async-write-protection enabled
      - PAGE_IS_WRITTEN - Page has been written to from the time it was write protected
      - PAGE_IS_FILE - Page is file backed
      - PAGE_IS_PRESENT - Page is present in the memory
      - PAGE_IS_SWAPPED - Page is swapped out
      - PAGE_IS_PFNZERO - Page has zero PFN
      - PAGE_IS_HUGE - Page is THP or Hugetlb backed
      
      This IOCTL can be extended to get information about more PTE bits. The
      entire address range passed by user [start, end) is scanned until either
      the user provided buffer is full or max_pages have been found.
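
      A minimal usage sketch, assuming the PAGEMAP_SCAN uapi exported through
      <linux/fs.h> by a kernel carrying this series (error handling trimmed;
      the written-bit categories additionally require uffd-wp async
      registration, so this example only asks for PAGE_IS_PRESENT):

      #include <fcntl.h>
      #include <linux/fs.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
          size_t len = 16 * 4096;
          char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          struct page_region regions[16];
          struct pm_scan_arg arg = {
              .size = sizeof(arg),
              .start = (unsigned long)mem,
              .end = (unsigned long)mem + len,
              .vec = (unsigned long)regions,
              .vec_len = 16,
              .category_mask = PAGE_IS_PRESENT,   /* match present pages ...  */
              .return_mask = PAGE_IS_PRESENT,     /* ... and report that bit  */
          };
          int fd = open("/proc/self/pagemap", O_RDONLY);
          long n, i;

          memset(mem, 1, 4096);                   /* fault in the first page  */
          n = ioctl(fd, PAGEMAP_SCAN, &arg);      /* returns #regions filled  */
          if (n < 0)
              perror("PAGEMAP_SCAN");
          for (i = 0; i < n; i++)
              printf("present: %llx-%llx\n",
                     (unsigned long long)regions[i].start,
                     (unsigned long long)regions[i].end);
          close(fd);
          munmap(mem, len);
          return 0;
      }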
      
      [akpm@linux-foundation.org: update it for "mm: hugetlb: add huge page size param to set_huge_pte_at()"]
      [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n warning]
      [arnd@arndb.de: hide unused pagemap_scan_backout_range() function]
        Link: https://lkml.kernel.org/r/20230927060257.2975412-1-arnd@kernel.org
      [sfr@canb.auug.org.au: fix "fs/proc/task_mmu: hide unused pagemap_scan_backout_range() function"]
        Link: https://lkml.kernel.org/r/20230928092223.0625c6bf@canb.auug.org.au
      Link: https://lkml.kernel.org/r/20230821141518.870589-3-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andrei Vagin <avagin@gmail.com>
      Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: UFFD_FEATURE_WP_ASYNC · d61ea1cb
      Peter Xu authored
      Patch series "Implement IOCTL to get and optionally clear info about
      PTEs", v33.
      
      *Motivation*
      The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
      GetWriteWatch() and ResetWriteWatch() syscalls [1].  The GetWriteWatch()
      retrieves the addresses of the pages that are written to in a region of
      virtual memory.
      
      This syscall is used in Windows applications and games, etc.  It is
      currently emulated in userspace in a pretty slow manner.  Our purpose is
      to enhance the kernel so that it can be translated efficiently.
      Currently, some out-of-tree hack patches are used to emulate it
      efficiently in some kernels.  We intend to replace those with these
      patches.  So gaming on Linux as a whole can effectively benefit from
      this.  It means there would be tons of users of this code.
      
      CRIU use case [2] was mentioned by Andrei and Danylo:
      > Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
      > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
      > shadow memory [4]. Being able to migrate such binaries allows to highly
      > reduce the amount of work needed to identify and fix post-migration
      > crashes, which happen constantly.
      
      Andrei defines the following uses of this code:
      * it is more granular and allows us to track changed pages more
        effectively. The current interface can clear dirty bits for the entire
        process only. In addition, reading info about pages is a separate
        operation. It means we must freeze the process to read information
        about all its pages, reset dirty bits, only then we can start dumping
        pages. The information about pages becomes more and more outdated,
        while we are processing pages. The new interface solves both these
        downsides. First, it allows us to read pte bits and clear the
        soft-dirty bit atomically. It means that CRIU will not need to freeze
        processes to pre-dump their memory. Second, it clears soft-dirty bits
        for a specified region of memory. It means CRIU will have actual info
        about pages to the moment of dumping them.
      * The new interface has to be much faster because basic page filtering
        is happening in the kernel. With the old interface, we have to read
        pagemap for each page.
      
      *Implementation Evolution (Short Summary)*
      From the definition of GetWriteWatch(), we feel like the kernel's
      soft-dirty feature can be used under the hood with some additions like:
      * reset soft-dirty flag for only a specific region of memory instead of
      clearing the flag for the entire process
      * get and clear soft-dirty flag for a specific region atomically
      
      So we decided to use ioctl on pagemap file to read or/and reset soft-dirty
      flag. But when using the soft-dirty flag, we sometimes get extra pages
      which weren't even written. They had become soft-dirty because of VMA
      merging and the VM_SOFTDIRTY flag. This breaks the definition of
      GetWriteWatch(). We were able to bypass this shortcoming by ignoring
      VM_SOFTDIRTY until David
      reported that mprotect etc messes up the soft-dirty flag while ignoring
      VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
      discussed whether we could revert these patches, but we could not reach
      any conclusion. So at this point, I made a couple of attempts to solve
      this whole
      VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
      * [7] Correct the bug that was fixed wrongly back in 2014. It had the
      potential to cause a regression, so we left it behind.
      * [8] Keep a list of the soft-dirty parts of a VMA across splits and
      merges. I got the reply not to increase the size of the VMA by 8 bytes.
      
      At this point, we left soft-dirty behind, considering it too delicate,
      and userfaultfd [9] seemed like the only way forward. From there onward,
      we have been basing the soft-dirty emulation on the userfaultfd wp
      feature, where the kernel resolves the faults itself when the WP_ASYNC
      feature is used. It was straightforward to add the WP_ASYNC feature to
      userfaultfd. Now we only get those pages reported as dirty or written-to
      which really were written. (PS: another userfaultfd feature,
      WP_UNPOPULATED, is required to avoid pre-faulting memory before
      write-protecting [9].)
      
      All the different masks were added at the request of the CRIU devs to
      make the interface more generic and better.
      
      [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
      [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
      [3] https://github.com/google/sanitizers
      [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
      [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
      [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
      [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
      [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
      [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
      [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
      
      
      This patch (of 6):
      
      Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows
      userfaultfd wr-protect faults to be resolved by the kernel directly.
      
      It can be used like a high accuracy version of soft-dirty, without vma
      modifications during tracking, and also with ranged support by default
      rather than for a whole mm when resetting the protections, due to the
      existence of ioctl(UFFDIO_WRITEPROTECT).
      
      Several goals of such a dirty tracking interface:
      
      1. All types of memory should be supported and traceable. This is natural
         for soft-dirty but should be mentioned when the context is userfaultfd,
         because it used to only support anon/shmem/hugetlb. The problem is that
         for a dirty tracking purpose these three types may not be enough, and
         it's legal to track anything, e.g. any page cache writes from mmap.
      
      2. Protections can be applied to part of a memory range, without vma
         split/merge fuss.  The hope is that the tracking itself should not
         affect any vma layout change.  It also helps when reset happens because
         the reset will not need mmap write lock which can block the tracee.
      
      3. Accuracy needs to be maintained.  This means we need pte markers to work
         on any type of VMA.
      
      One could question that the whole concept of async dirty tracking is not
      really close to what userfaultfd fundamentally used to be: it's not "a
      fault to be serviced by userspace" anymore. However, using userfaultfd-wp
      here as a framework is convenient for us in at least:
      
      1. VM_UFFD_WP vma flag, which has a very good name to suit something like
         this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
         feature bit to identify from a sync version of uffd-wp registration.
      
      2. PTE markers logic can be leveraged across the whole kernel to maintain
         the uffd-wp bit as long as an arch supports it; this also applies to
         this case, where the uffd-wp bit will be a hint to dirty information
         and it will not get lost easily (e.g. when some page cache ptes got
         zapped).
      
      3. Reuse ioctl(UFFDIO_WRITEPROTECT) interface for either starting or
         resetting a range of memory, while there's no counterpart in the old
         soft-dirty world; hence, if this were wanted in a new design, we would
         need a new interface for it.
      
      We can somehow understand that commonality because uffd-wp was
      fundamentally a similar idea of write-protecting pages just like
      soft-dirty.
      
      This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so
      far WP_ASYNC does not seem usable without WP_UNPOPULATED.  This also
      gives us a chance to modify the implementation of WP_ASYNC in case it no
      longer needs to depend on WP_UNPOPULATED in future kernels.  It's also fine
      to imply that because both features will rely on PTE_MARKER_UFFD_WP config
      option, so they'll show up together (or both missing) in an UFFDIO_API
      probe.
      
      vma_can_userfault() now allows any VMA if the userfaultfd registration is
      only about async uffd-wp.  So we can track dirty for all kinds of memory
      including generic file systems (like XFS, EXT4 or BTRFS).
      
      One trick worth mentioning in do_wp_page() is that we need to manually update
      vmf->orig_pte here because it can be used later with a pte_same() check -
      this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
      
      The major defect of this approach of dirty tracking is we need to populate
      the pgtables when tracking starts.  Soft-dirty doesn't do it like that. 
      It's unwanted in the case where the range of memory to track is huge and
      unpopulated (e.g., tracking updates on a 10G file with mmap() on top,
      without having any page cache installed yet).  One way to improve this is
      to allow pte markers to exist at levels larger than PTE (PMD+).  That will
      not change the interface if implemented, so we can leave that for
      later.
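
      A minimal registration sketch, assuming headers that expose
      UFFD_FEATURE_WP_ASYNC and UFFD_FEATURE_WP_UNPOPULATED in
      <linux/userfaultfd.h> (error handling trimmed):

      #include <fcntl.h>
      #include <linux/userfaultfd.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main(void)
      {
          size_t len = 16 * 4096;
          char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

          struct uffdio_api api = {
              .api = UFFD_API,
              .features = UFFD_FEATURE_WP_ASYNC | UFFD_FEATURE_WP_UNPOPULATED,
          };
          ioctl(uffd, UFFDIO_API, &api);

          struct uffdio_register reg = {
              .range = { .start = (unsigned long)mem, .len = len },
              .mode = UFFDIO_REGISTER_MODE_WP,
          };
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          /* Start a tracking window: write-protect the whole range. */
          struct uffdio_writeprotect wp = {
              .range = { .start = (unsigned long)mem, .len = len },
              .mode = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

          /* With WP_ASYNC the kernel resolves the wp fault itself, so this
           * write needs no userspace fault handler; which pages were written
           * can later be read back via pagemap / PAGEMAP_SCAN. */
          mem[0] = 1;
          printf("wrote a byte with no fault-handler thread running\n");

          close(uffd);
          munmap(mem, len);
          return 0;
      }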
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
      Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>