1. 27 Sep, 2022 12 commits
    • Peter Xu's avatar
      mm: remember young/dirty bit for page migrations · 2e346877
      Peter Xu authored
      When page migration happens, we always ignore the young/dirty bit settings
      in the old pgtable, and marking the page as old in the new page table
      using either pte_mkold() or pmd_mkold(), and keeping the pte clean.
      
      That's fine from functional-wise, but that's not friendly to page reclaim
      because the moving page can be actively accessed within the procedure. 
      Not to mention hardware setting the young bit can bring quite some
      overhead on some systems, e.g.  x86_64 needs a few hundreds nanoseconds to
      set the bit.  The same slowdown problem to dirty bits when the memory is
      first written after page migration happened.
      
      Actually we can easily remember the A/D bit configuration and recover the
      information after the page is migrated.  To achieve it, define a new set
      of bits in the migration swap offset field to cache the A/D bits for old
      pte.  Then when removing/recovering the migration entry, we can recover
      the A/D bits even if the page changed.
      
      One thing to mention is that here we used max_swapfile_size() to detect
      how many swp offset bits we have, and we'll only enable this feature if we
      know the swp offset is big enough to store both the PFN value and the A/D
      bits.  Otherwise the A/D bits are dropped like before.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2e346877
    • Peter Xu's avatar
      mm/thp: carry over dirty bit when thp splits on pmd · 0ccf7f16
      Peter Xu authored
      Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
      shouldn't be a correctness issue since when pmd_dirty() we'll have the
      page marked dirty anyway, however having dirty bit carried over helps the
      next initial writes of split ptes on some archs like x86.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarHuang Ying <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0ccf7f16
    • Peter Xu's avatar
      mm/swap: add swp_offset_pfn() to fetch PFN from swap entry · 0d206b5d
      Peter Xu authored
      We've got a bunch of special swap entries that stores PFN inside the swap
      offset fields.  To fetch the PFN, normally the user just calls
      swp_offset() assuming that'll be the PFN.
      
      Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
      max possible length of a PFN on the host, meanwhile doing proper check
      with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the
      PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
      
      One reason to do so is we never tried to sanitize whether swap offset can
      really fit for storing PFN.  At the meantime, this patch also prepares us
      with the future possibility to store more information inside the swp
      offset field, so assuming "swp_offset(entry)" to be the PFN will not stand
      any more very soon.
      
      Replace many of the swp_offset() callers to use swp_offset_pfn() where
      proper.  Note that many of the existing users are not candidates for the
      replacement, e.g.:
      
        (1) When the swap entry is not a pfn swap entry at all, or,
        (2) when we wanna keep the whole swp_offset but only change the swp type.
      
      For the latter, it can happen when fork() triggered on a write-migration
      swap entry pte, we may want to only change the migration type from
      write->read but keep the rest, so it's not "fetching PFN" but "changing
      swap type only".  They're left aside so that when there're more
      information within the swp offset they'll be carried over naturally in
      those cases.
      
      Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
      the new swp_offset_pfn() is about.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0d206b5d
    • Peter Xu's avatar
      mm/swap: comment all the ifdef in swapops.h · eba4d770
      Peter Xu authored
      swapops.h contains quite a few layers of ifdef, some of the "else" and
      "endif" doesn't get proper comment on the macro so it's hard to follow on
      what are they referring to.  Add the comments.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-3-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Suggested-by: default avatarNadav Amit <nadav.amit@gmail.com>
      Reviewed-by: default avatarHuang Ying <ying.huang@intel.com>
      Reviewed-by: default avatarAlistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      eba4d770
    • Peter Xu's avatar
      mm/x86: use SWP_TYPE_BITS in 3-level swap macros · 9c61d532
      Peter Xu authored
      Patch series "mm: Remember a/d bits for migration entries", v4.
      
      
      Problem
      =======
      
      When migrating a page, right now we always mark the migrated page as old &
      clean.
      
      However that could lead to at least two problems:
      
        (1) We lost the real hot/cold information while we could have persisted.
            That information shouldn't change even if the backing page is changed
            after the migration,
      
        (2) There can be always extra overhead on the immediate next access to
            any migrated page, because hardware MMU needs cycles to set the young
            bit again for reads, and dirty bits for write, as long as the
            hardware MMU supports these bits.
      
      Many of the recent upstream works showed that (2) is not something trivial
      and actually very measurable.  In my test case, reading 1G chunk of memory
      - jumping in page size intervals - could take 99ms just because of the
      extra setting on the young bit on a generic x86_64 system, comparing to
      4ms if young set.
      
      This issue is originally reported by Andrea Arcangeli.
      
      Solution
      ========
      
      To solve this problem, this patchset tries to remember the young/dirty
      bits in the migration entries and carry them over when recovering the
      ptes.
      
      We have the chance to do so because in many systems the swap offset is not
      really fully used.  Migration entries use swp offset to store PFN only,
      while the PFN is normally not as large as swp offset and normally smaller.
      It means we do have some free bits in swp offset that we can use to store
      things like A/D bits, and that's how this series tried to approach this
      problem.
      
      max_swapfile_size() is used here to detect per-arch offset length in swp
      entries.  We'll automatically remember the A/D bits when we find that we
      have enough swp offset field to keep both the PFN and the extra bits.
      
      Since max_swapfile_size() can be slow, the last two patches cache the
      results for it and also swap_migration_ad_supported as a whole.
      
      Known Issues / TODOs
      ====================
      
      We still haven't taught madvise() to recognize the new A/D bits in
      migration entries, namely MADV_COLD/MADV_FREE.  E.g.  when MADV_COLD upon
      a migration entry.  It's not clear yet on whether we should clear the A
      bit, or we should just drop the entry directly.
      
      We didn't teach idle page tracking on the new migration entries, because
      it'll need larger rework on the tree on rmap pgtable walk.  However it
      should make it already better because before this patchset page will be
      old page after migration, so the series will fix potential false negative
      of idle page tracking when pages were migrated before observing.
      
      The other thing is migration A/D bits will not start to working for
      private device swap entries.  The code is there for completeness but since
      private device swap entries do not yet have fields to store A/D bits, even
      if we'll persistent A/D across present pte switching to migration entry,
      we'll lose it again when the migration entry converted to private device
      swap entry.
      
      Tests
      =====
      
      After the patchset applied, the immediate read access test [1] of above 1G
      chunk after migration can shrink from 99ms to 4ms.  The test is done by
      moving 1G pages from node 0->1->0 then read it in page size jumps.  The
      test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
      
      Similar effect can also be measured when writting the memory the 1st time
      after migration.
      
      After applying the patchset, both initial immediate read/write after page
      migrated will perform similarly like before migration happened.
      
      Patch Layout
      ============
      
      Patch 1-2:  Cleanups from either previous versions or on swapops.h macros.
      
      Patch 3-4:  Prepare for the introduction of migration A/D bits
      
      Patch 5:    The core patch to remember young/dirty bit in swap offsets.
      
      Patch 6-7:  Cache relevant fields to make migration_entry_supports_ad() fast.
      
      [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
      
      
      This patch (of 7):
      
      Replace all the magic "5" with the macro.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarHuang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9c61d532
    • Miaohe Lin's avatar
      mm, hwpoison: cleanup some obsolete comments · 9cf28191
      Miaohe Lin authored
      1.Remove meaningless comment in kill_proc(). That doesn't tell anything.
      2.Fix the wrong function name get_hwpoison_unless_zero(). It should be
      get_page_unless_zero().
      3.The gate keeper for free hwpoison page has moved to check_new_page().
      Update the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-7-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      9cf28191
    • Miaohe Lin's avatar
      mm, hwpoison: check PageTable() explicitly in hwpoison_user_mappings() · b680dae9
      Miaohe Lin authored
      PageTable can't be handled by memory_failure(). Filter it out explicitly in
      hwpoison_user_mappings(). This will also make code more consistent with the
      relevant check in unpoison_memory().
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-6-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b680dae9
    • Miaohe Lin's avatar
      mm, hwpoison: avoid unneeded page_mapped_in_vma() overhead in collect_procs_anon() · 36537a67
      Miaohe Lin authored
      If vma->vm_mm != t->mm, there's no need to call page_mapped_in_vma() as
      add_to_kill() won't be called in this case. Move up the mm check to avoid
      possible unneeded calling to page_mapped_in_vma().
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-5-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      36537a67
    • Miaohe Lin's avatar
      mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages · 21c9e90a
      Miaohe Lin authored
      Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
      num_poisoned_pages_dec() can be killed as there's no caller now.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      21c9e90a
    • Miaohe Lin's avatar
      mm, hwpoison: use __PageMovable() to detect non-lru movable pages · da294991
      Miaohe Lin authored
      It's more recommended to use __PageMovable() to detect non-lru movable
      pages. We can avoid bumping page refcnt via isolate_movable_page() for
      the isolated lru pages. Also if pages become PageLRU just after they're
      checked but before trying to isolate them, isolate_lru_page() will be
      called to do the right work.
      
      [linmiaohe@huawei.com: fixes per Naoya Horiguchi]
        Link: https://lkml.kernel.org/r/1f7ee86e-7d28-0d8c-e0de-b7a5a94519e8@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-3-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      da294991
    • Miaohe Lin's avatar
      mm, hwpoison: use ClearPageHWPoison() in memory_failure() · 2fe62e22
      Miaohe Lin authored
      Patch series "A few cleanup patches for memory-failure".
      
      his series contains a few cleanup patches to use __PageMovable() to detect
      non-lru movable pages, use num_poisoned_pages_sub() to reduce multiple
      atomic ops overheads and so on.  More details can be found in the
      respective changelogs.
      
      
      This patch (of 6):
      
      Use ClearPageHWPoison() instead of TestClearPageHWPoison() to clear page
      hwpoison flags to avoid unneeded full memory barrier overhead.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220830123604.25763-2-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2fe62e22
    • Yang Shi's avatar
      mm: MADV_COLLAPSE: refetch vm_end after reacquiring mmap_lock · 4d24de94
      Yang Shi authored
      The syzbot reported the below problem:
      
      BUG: Bad page map in process syz-executor198  pte:8000000071c00227 pmd:74b30067
      addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
      file:(null) fault:0x0 mmap:0x0 read_folio:0x0
      CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
       vm_normal_page+0x10c/0x2a0 mm/memory.c:636
       hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
       madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
       do_madvise mm/madvise.c:1428 [inline]
       __do_sys_madvise mm/madvise.c:1428 [inline]
       __se_sys_madvise mm/madvise.c:1426 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f770ba87929
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
      R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
      R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
      
      Basically the test program does the below conceptually:
      1. mmap 0x2000000 - 0x21000000 as anonymous region
      2. mmap io_uring SQ stuff at 0x20563000 with MAP_FIXED, io_uring_mmap()
         actually remaps the pages with special PTEs
      3. call MADV_COLLAPSE for 0x20000000 - 0x21000000
      
      It actually triggered the below race:
      
                   CPU A                                          CPU B
      mmap 0x20000000 - 0x21000000 as anon
                                                 madvise_collapse is called on this area
                                                   Retrieve start and end address from the vma (NEVER updated later!)
                                                   Collapsed the first 2M area and dropped mmap_lock
      Acquire mmap_lock
      mmap io_uring file at 0x20563000
      Release mmap_lock
                                                   Reacquire mmap_lock
                                                   revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000
                                                   scan the next 2M (0x20200000 - 0x20400000), but due to whatever reason it didn't release mmap_lock
                                                   scan the 3rd 2M area (start from 0x20400000)
                                                     get into the vma created by io_uring
      
      The hend should be updated after MADV_COLLAPSE reacquire mmap_lock since
      the vma may be shrunk.  We don't have to worry about shink from the other
      direction since it could be caught by hugepage_vma_revalidate().  Either
      no valid vma is found or the vma doesn't fit anymore.
      
      Link: https://lkml.kernel.org/r/20220914162220.787703-1-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Reported-by: syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarZach O'Keefe <zokeefe@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4d24de94
  2. 26 Sep, 2022 12 commits
  3. 12 Sep, 2022 16 commits