1. 18 Apr, 2023 24 commits
  2. 16 Apr, 2023 12 commits
    • Peter Xu's avatar
      Revert "userfaultfd: don't fail on unrecognized features" · 2ff559f3
      Peter Xu authored
      This is a proposal to revert commit 914eedcb.
      
      I found this when writing a simple UFFDIO_API test to be the first unit
      test in this set.  Two things break with the commit:
      
        - The UFFDIO_API version check was lost.  According to the man page,
        the kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xAA.
        This check is needed in case the API version is extended in the
        future; without it, a user app cannot identify whether it is running
        on a new kernel.

        - The feature flag checks were removed, which means UFFDIO_API with a
        feature that does not exist will also succeed.  According to the man
        page (and it makes sense), we should reject ioctl(UFFDIO_API) if
        unknown features are passed in.  A minimal userspace probe of this
        handshake is sketched below.
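      As context, here is a minimal userspace sketch of the UFFDIO_API
      handshake this revert restores (illustrative only, not part of the
      revert; it assumes a kernel with userfaultfd available and enough
      privilege to create a uffd):

      /* Hedged sketch: probe UFFDIO_API and expect bogus versions to fail. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <linux/userfaultfd.h>

      static int open_uffd(void)
      {
              return (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
      }

      int main(void)
      {
              int uffd = open_uffd();
              struct uffdio_api api = { .api = UFFD_API, .features = 0 };

              if (uffd < 0) {
                      perror("userfaultfd");
                      return 1;
              }

              /* Correct api version and no unknown features: should succeed. */
              if (ioctl(uffd, UFFDIO_API, &api))
                      perror("UFFDIO_API");
              else
                      printf("supported features: 0x%llx\n",
                             (unsigned long long)api.features);
              close(uffd);

              /* On a fresh fd, an unknown api version must be rejected. */
              uffd = open_uffd();
              if (uffd >= 0) {
                      struct uffdio_api bad = { .api = 0x1234, .features = 0 };

                      if (ioctl(uffd, UFFDIO_API, &bad) == 0)
                              fprintf(stderr, "unknown api version accepted!\n");
                      close(uffd);
              }
              return 0;
      }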
      
      Link: https://lore.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20230412163922.327282-2-peterx@redhat.com
      Fixes: 914eedcb ("userfaultfd: don't fail on unrecognized features")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2ff559f3
    • Baokun Li's avatar
      writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs · 1ba1199e
      Baokun Li authored
      KASAN reports a null-ptr-deref:
      ==================================================================
      BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
      Write of size 8 at addr 0000000000000000 by task sync/943
      CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
      Call Trace:
       <TASK>
       dump_stack_lvl+0x7f/0xc0
       print_report+0x2ba/0x340
       kasan_report+0xc4/0x120
       kasan_check_range+0x1b7/0x2e0
       __kasan_check_write+0x24/0x40
       bdi_split_work_to_wbs+0x5c5/0x7b0
       sync_inodes_sb+0x195/0x630
       sync_inodes_one_sb+0x3a/0x50
       iterate_supers+0x106/0x1b0
       ksys_sync+0x98/0x160
      [...]
      ==================================================================
      
      The race that causes the above issue is as follows:
      
                 cpu1                     cpu2
      -------------------------|-------------------------
      inode_switch_wbs
       INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
       queue_rcu_work(isw_wq, &isw->work)
       // queue_work async
        inode_switch_wbs_work_fn
         wb_put_many(old_wb, nr_switched)
          percpu_ref_put_many
           ref->data->release(ref)
           cgwb_release
            queue_work(cgwb_release_wq, &wb->release_work)
            // queue_work async
             &wb->release_work
             cgwb_release_workfn
                                  ksys_sync
                                   iterate_supers
                                    sync_inodes_one_sb
                                     sync_inodes_sb
                                      bdi_split_work_to_wbs
                                       kmalloc(sizeof(*work), GFP_ATOMIC)
                                       // alloc memory failed
              percpu_ref_exit
               ref->data = NULL
               kfree(data)
                                       wb_get(wb)
                                        percpu_ref_get(&wb->refcnt)
                                         percpu_ref_get_many(ref, 1)
                                          atomic_long_add(nr, &ref->data->count)
                                           atomic64_add(i, v)
                                           // trigger null-ptr-deref
      
      bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
      wbs.  If the allocation of a new work item fails, the on-stack fallback
      is used and the reference count of the current wb is increased
      afterwards.  If a cgroup writeback membership switch occurs before the
      reference count is taken and the current wb is released as old_wb, then
      calling wb_get() or wb_put() will trigger the null pointer dereference
      above.
      
      This issue was introduced in v4.3-rc7 (see fix tag1).  Both the
      sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
      bdi_split_work_to_wbs() can trigger it.  For the sync_inodes_sb() path,
      commit 7fc5854f ("writeback: synchronize sync(2) against cgroup
      writeback membership switches") originally reduced the likelihood of
      the issue by adding wb_switch_rwsem, but a change in v5.14-rc1 (see fix
      tag2) removed "inode_io_list_del_locked(inode, old_wb)" from
      inode_switch_wbs_work_fn(), so wb->state keeps WB_has_dirty_io and
      old_wb is no longer skipped when traversing wbs in
      bdi_split_work_to_wbs(), making the issue easily reproducible again.
      
      To solve this problem, call percpu_ref_exit() under RCU protection to
      avoid the race between cgwb_release_workfn() and bdi_split_work_to_wbs().
      Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
      and skip the current wb if wb_tryget() fails because the wb has already
      been shut down.
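      A userspace sketch of the get-vs-tryget distinction (illustrative only,
      not the kernel's percpu_ref code): an unconditional get blindly bumps a
      counter that may already be gone, while a tryget refuses to take a
      reference on a dying object, which is why the dying wb can now simply
      be skipped:

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct obj {
              atomic_long count;      /* 0 means the object is being torn down */
      };

      /* Only take a reference if the object is still alive. */
      static bool obj_tryget(struct obj *o)
      {
              long c = atomic_load(&o->count);

              while (c > 0)
                      if (atomic_compare_exchange_weak(&o->count, &c, c + 1))
                              return true;
              return false;
      }

      int main(void)
      {
              struct obj live = { .count = 1 }, dying = { .count = 0 };

              printf("live:  %s\n", obj_tryget(&live) ? "got ref" : "skipped");
              printf("dying: %s\n", obj_tryget(&dying) ? "got ref" : "skipped");
              return 0;
      }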
      
      Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
      Fixes: b817525a ("writeback: bdi_writeback iteration must not skip dying ones")
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Cc: yangerkun <yangerkun@huawei.com>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1ba1199e
    • Peng Zhang's avatar
      maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug · 1f5f12ec
      Peng Zhang authored
      In mas_alloc_nodes(), "node->node_count = 0" is meant to initialize the
      node_count field of a new node, but the node may not be new.  It may be
      a node that existed before with node_count already set; zeroing it
      causes a memory leak.  At that point mas->alloc->total is greater than
      the actual number of nodes in the linked list, which can cause many
      other errors, for example an out-of-bounds access in mas_pop_node(), or
      mas_pop_node() returning addresses that should not be used.  Fix it by
      initializing node_count only for new nodes.
      
      Also, by the way, an if-else statement was removed to simplify the code.
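      A standalone sketch of the accounting mismatch (plain C, not the maple
      tree code; the slab-like "block" and the pool-wide "total" are
      illustrative stand-ins): zeroing the fill count of a block that is not
      actually new leaks its entries and leaves the pool total claiming more
      nodes than really exist.

      #include <stdio.h>

      #define SLOTS 4

      struct block {
              unsigned int node_count;        /* filled slots in this block */
              void *slot[SLOTS];
      };

      int main(void)
      {
              int a, b;
              struct block blk = { .node_count = 2, .slot = { &a, &b } };
              unsigned int total = 2;         /* pool-wide counter */

              /* Buggy path: the block is reused but treated as brand new. */
              blk.node_count = 0;

              /* A pop that trusts "total" would now index past the valid slots. */
              if (total > 0)
                      printf("total=%u but the block reports %u entries\n",
                             total, blk.node_count);
              return 0;
      }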
      
      Link: https://lkml.kernel.org/r/20230411041005.26205-1-zhangpeng.00@bytedance.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1f5f12ec
    • Steve Chou's avatar
      tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used · 92357568
      Steve Chou authored
      When using the cull option with the 'tg' flag, the fprintf() uses pid
      instead of tgid.  It should print tgid instead.
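      A hedged sketch of the class of fix (not the actual page_owner_sort.c
      hunk; the record fields here are illustrative): when records are culled
      by thread group, the printed identifier must be the tgid they were
      grouped by, not the pid.

      #include <stdio.h>

      struct block_record {
              int pid;
              int tgid;
              int num;        /* how many allocations were culled together */
      };

      static void print_culled_by_tg(const struct block_record *b)
      {
              /* before the fix, the tgid column was fed b->pid */
              printf("%d times, tgid %d:\n", b->num, b->tgid);
      }

      int main(void)
      {
              struct block_record r = { .pid = 1234, .tgid = 1200, .num = 3 };

              print_culled_by_tg(&r);
              return 0;
      }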
      
      Link: https://lkml.kernel.org/r/20230411034929.2071501-1-steve_chou@pesi.com.tw
      Fixes: 9c8a0a8e ("tools/vm/page_owner_sort.c: support for user-defined culling rules")
      Signed-off-by: Steve Chou <steve_chou@pesi.com.tw>
      Cc: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      92357568
    • Jonathan Toppins's avatar
      mailmap: update jtoppins' entry to reference correct email · d2c115ba
      Jonathan Toppins authored
      Link: https://lkml.kernel.org/r/d79bc6eaf65e68bd1c2a1e1510ab6291ce5926a6.1681162487.git.jtoppins@redhat.com
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Cc: Colin Ian King <colin.i.king@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2c115ba
    • Liam R. Howlett's avatar
      mm/mempolicy: fix use-after-free of VMA iterator · f4e9e0e6
      Liam R. Howlett authored
      set_mempolicy_home_node() iterates over a list of VMAs and calls
      mbind_range() on each VMA, which also iterates over the singular list of
      the VMA passed in and potentially splits the VMA.  Since the VMA iterator
      is not passed through, set_mempolicy_home_node() may now point to a stale
      node in the VMA tree.  This can result in a UAF as reported by syzbot.
      
      Avoid the stale maple tree node by passing the VMA iterator through to the
      underlying call to split_vma().
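      A plain-C sketch of the stale-iterator hazard (illustrative only; an
      array of ranges stands in for the VMA tree and split_at() for
      split_vma()): a helper that can split the element it was handed must
      work through the caller's iterator, because any pointer the caller
      saved beforehand may no longer be valid afterwards.

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      struct range { int start, end; };

      struct iter {
              struct range *ranges;
              int nr, pos;
      };

      /* Split ranges[pos] at 'at'; may reallocate, invalidating old pointers. */
      static void split_at(struct iter *it, int at)
      {
              struct range cur = it->ranges[it->pos];
              struct range *grown = realloc(it->ranges,
                                            (it->nr + 1) * sizeof(*grown));

              if (!grown)
                      return;
              it->ranges = grown;
              memmove(&it->ranges[it->pos + 2], &it->ranges[it->pos + 1],
                      (it->nr - it->pos - 1) * sizeof(*it->ranges));
              it->ranges[it->pos] = (struct range){ cur.start, at };
              it->ranges[it->pos + 1] = (struct range){ at, cur.end };
              it->nr++;
      }

      int main(void)
      {
              struct iter it = { .nr = 1, .pos = 0 };

              it.ranges = malloc(sizeof(*it.ranges));
              it.ranges[0] = (struct range){ 0, 100 };

              struct range *saved = &it.ranges[0];    /* like a saved VMA pointer */
              (void)saved;    /* stale after split_at(); must not be dereferenced */

              split_at(&it, 50);

              /* Safe: keep walking from the iterator, not the saved pointer. */
              for (int i = 0; i < it.nr; i++)
                      printf("[%d,%d) ", it.ranges[i].start, it.ranges[i].end);
              printf("\n");
              free(it.ranges);
              return 0;
      }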
      
      mbind_range() is also overly complicated, since there are two calling
      functions and one already handles iterating over the VMAs.  Simplify
      mbind_range() to only handle merging and splitting of the VMAs.
      
      Align the new loop in do_mbind() and existing loop in
      set_mempolicy_home_node() to use the reduced mbind_range() function.  This
      allows for a single location of the range calculation and avoids
      constantly looking up the previous VMA (since this is a loop over the
      VMAs).
      
      Link: https://lore.kernel.org/linux-mm/000000000000c93feb05f87e24ad@google.com/
      Fixes: 66850be5 ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
        Link: https://lkml.kernel.org/r/20230410152205.2294819-1-Liam.Howlett@oracle.com
      Tested-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f4e9e0e6
    • Naoya Horiguchi's avatar
      mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO · 4737edbb
      Naoya Horiguchi authored
      split_huge_page_to_list() WARNs when called for huge zero pages, which
      sounds too harsh to me because it does not imply a kernel bug; it merely
      notifies admins of the event.  On the other hand, syzkaller treats the
      warning as critical, which makes its testing less efficient and seems
      harmful.
      
      So replace the VM_WARN_ON_ONCE_FOLIO with pr_warn_ratelimited.
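      A userspace sketch of the rate-limited warning idea (illustrative; the
      window and burst values are arbitrary and this is not the kernel's
      printk ratelimit machinery): keep telling admins about the event, but
      cap the message rate instead of tripping a WARN that fuzzers treat as a
      crash.

      #include <stdio.h>
      #include <time.h>

      static void warn_ratelimited(const char *msg)
      {
              static time_t window_start;
              static int emitted;
              time_t now = time(NULL);

              if (now - window_start >= 5) {  /* new 5-second window */
                      window_start = now;
                      emitted = 0;
              }
              if (emitted < 3) {              /* at most 3 messages per window */
                      emitted++;
                      fprintf(stderr, "warning: %s\n", msg);
              }
      }

      int main(void)
      {
              for (int i = 0; i < 10; i++)
                      warn_ratelimited("request to split huge zero page");
              return 0;
      }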
      
      Link: https://lkml.kernel.org/r/20230406082004.2185420-1-naoya.horiguchi@linux.dev
      Fixes: 478d134e ("mm/huge_memory: do not overkill when splitting huge_zero_page")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: syzbot+07a218429c8d19b1fb25@syzkaller.appspotmail.com
        Link: https://lore.kernel.org/lkml/000000000000a6f34a05e6efcd01@google.com/
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4737edbb
    • Liam R. Howlett's avatar
      mm/mprotect: fix do_mprotect_pkey() return on error · 82f95134
      Liam R. Howlett authored
      When the loop over the VMA is terminated early due to an error, the return
      code could be overwritten with ENOMEM.  Fix the return code by only
      setting the error on early loop termination when the error is not set.
      
      User-visible effects include: attempts to run mprotect() against a
      special mapping or with a poorly-aligned hugetlb address should return
      -EINVAL, but they presently return -ENOMEM.  In other cases -EACCES
      should be returned.
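      A minimal sketch of the error-preservation pattern described above
      (illustrative C, not the mprotect.c hunk; the loop bound and error
      values are arbitrary): when the loop stops early, only fall back to the
      generic error if no more specific one was already recorded.

      #include <errno.h>
      #include <stdio.h>

      static int process_ranges(int fail_at, int specific_err)
      {
              int error = 0;
              int i;

              for (i = 0; i < 8; i++) {
                      if (i == fail_at) {
                              error = specific_err;   /* e.g. -EINVAL, -EACCES */
                              break;
                      }
              }

              /* Early termination: keep the specific error, don't clobber it. */
              if (i < 8 && !error)
                      error = -ENOMEM;

              return error;
      }

      int main(void)
      {
              printf("%d\n", process_ranges(3, -EINVAL)); /* -EINVAL, not -ENOMEM */
              printf("%d\n", process_ranges(3, 0));       /* generic -ENOMEM */
              return 0;
      }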
      
      Link: https://lkml.kernel.org/r/20230406193050.1363476-1-Liam.Howlett@oracle.com
      Fixes: 2286a691 ("mm: change mprotect_fixup to vma iterator")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      82f95134
    • Peter Xu's avatar
      mm/khugepaged: check again on anon uffd-wp during isolation · dd47ac42
      Peter Xu authored
      Khugepaged collapses an anonymous THP in two rounds of scans.  The 2nd
      round is done in __collapse_huge_page_isolate() after
      hpage_collapse_scan_pmd(), and all the locks are released temporarily
      in between.  It means the pgtable can change during this window before
      the 2nd round starts.
      
      It's logically possible that some ptes get wr-protected during this
      window, and we could erroneously collapse a THP without noticing that
      some ptes are wr-protected by userfault.  Commit e1e267c7 wanted to
      avoid this, but it only did so for the 1st phase, not the 2nd.
      
      Since __collapse_huge_page_isolate() happens after a round of small page
      swapins, we don't need to worry about any !present ptes - if one
      existed, khugepaged would already have bailed out.  So we only need to
      check present ptes with the uffd-wp bit set there.
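      A small pthread sketch of the "re-check after the locks were dropped"
      pattern the fix adds (illustrative; the mutex and flag stand in for the
      page table locks and the uffd-wp bit): whatever was validated in the
      first pass must be validated again in the second pass if the protection
      was released in between.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static bool wr_protected;       /* may change while the lock is dropped */

      static bool first_scan(void)
      {
              bool ok;

              pthread_mutex_lock(&lock);
              ok = !wr_protected;             /* round 1: looks collapsible */
              pthread_mutex_unlock(&lock);    /* state may change from here on */
              return ok;
      }

      static bool isolate(void)
      {
              bool ok;

              pthread_mutex_lock(&lock);
              ok = !wr_protected;             /* round 2: must check again */
              pthread_mutex_unlock(&lock);
              return ok;
      }

      int main(void)
      {
              if (!first_scan())
                      return 0;
              wr_protected = true;            /* simulate a concurrent wrprotect */
              printf("round 2: %s\n", isolate() ? "collapse" : "bail out");
              return 0;
      }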
      
      This is something I found only but never had a reproducer, I thought it
      was one caused a bug in Muhammad's recent pagemap new ioctl work, but it
      turns out it's not the cause of that but an userspace bug.  However this
      seems to still be a real bug even with a very small race window, still
      worth to have it fixed and copy stable.
      
      Link: https://lkml.kernel.org/r/20230405155120.3608140-1-peterx@redhat.com
      Fixes: e1e267c7 ("khugepaged: skip collapse if uffd-wp detected")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dd47ac42
    • David Hildenbrand's avatar
      mm/userfaultfd: fix uffd-wp handling for THP migration entries · 24bf08c4
      David Hildenbrand authored
      Looks like what we fixed for hugetlb in commit 44f86392 ("mm/hugetlb:
      fix uffd-wp handling for migration entries in
      hugetlb_change_protection()") similarly applies to THP.
      
      Setting/clearing uffd-wp on THP migration entries is not implemented
      properly.  Further, while removing migration PMDs considers the uffd-wp
      bit, inserting migration PMDs does not consider the uffd-wp bit.
      
      We have to set/clear independently of the migration entry type in
      change_huge_pmd() and properly copy the uffd-wp bit in
      set_pmd_migration_entry().
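      A tiny sketch of the flag-preservation requirement (illustrative
      structs, not the pmd/swap entry encoding): when an entry is rewritten
      into its migration form and later restored, a per-entry software bit
      such as uffd-wp has to be carried over explicitly or it is silently
      lost.

      #include <stdbool.h>
      #include <stdio.h>

      struct present_entry { unsigned long pfn; bool uffd_wp; };
      struct migration_entry { unsigned long target; bool uffd_wp; };

      static struct migration_entry make_migration(struct present_entry e)
      {
              struct migration_entry m = { .target = e.pfn };

              m.uffd_wp = e.uffd_wp;  /* the copy this fix adds for PMDs */
              return m;
      }

      static struct present_entry restore(struct migration_entry m)
      {
              struct present_entry e = { .pfn = m.target, .uffd_wp = m.uffd_wp };

              return e;
      }

      int main(void)
      {
              struct present_entry e = { .pfn = 42, .uffd_wp = true };
              struct present_entry back = restore(make_migration(e));

              printf("uffd-wp preserved: %d\n", back.uffd_wp);
              return 0;
      }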
      
      Verified using a simple reproducer that triggers migration of a THP that
      set_pmd_migration_entry() no longer loses the uffd-wp bit.
      
      Link: https://lkml.kernel.org/r/20230405160236.587705-2-david@redhat.com
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      24bf08c4
    • Qi Zheng's avatar
      mm: swap: fix performance regression on sparsetruncate-tiny · 998ad18b
      Qi Zheng authored
      The ->percpu_pvec_drained flag was originally introduced by commit
      d9ed0d08 ("mm: only drain per-cpu pagevecs once per pagevec usage") to
      drain per-cpu pagevecs only once per pagevec usage.  But after the swap
      code was converted to be more folio-based, commit c2bc1681 ("mm/swap:
      add folio_batch_move_lru()") broke this logic by resetting
      ->percpu_pvec_drained to false, which means per-cpu pagevecs are drained
      multiple times per pagevec usage.
      
      In theory, there should be no functional changes when converting code to
      be more folio-based.  We should call folio_batch_reinit() in
      folio_batch_move_lru() instead of folio_batch_init().  And to verify that
      we still need ->percpu_pvec_drained, I ran mmtests/sparsetruncate-tiny and
      got the following data:
      
                                          baseline             with patch
      Min       Time      326.00 (   0.00%)      328.00 (  -0.61%)
      1st-qrtle Time      334.00 (   0.00%)      336.00 (  -0.60%)
      2nd-qrtle Time      338.00 (   0.00%)      341.00 (  -0.89%)
      3rd-qrtle Time      343.00 (   0.00%)      347.00 (  -1.17%)
      Max-1     Time      326.00 (   0.00%)      328.00 (  -0.61%)
      Max-5     Time      327.00 (   0.00%)      330.00 (  -0.92%)
      Max-10    Time      328.00 (   0.00%)      331.00 (  -0.91%)
      Max-90    Time      350.00 (   0.00%)      357.00 (  -2.00%)
      Max-95    Time      395.00 (   0.00%)      390.00 (   1.27%)
      Max-99    Time      508.00 (   0.00%)      434.00 (  14.57%)
      Max       Time      547.00 (   0.00%)      476.00 (  12.98%)
      Amean     Time      344.61 (   0.00%)      345.56 *  -0.28%*
      Stddev    Time       30.34 (   0.00%)       19.51 (  35.69%)
      CoeffVar  Time        8.81 (   0.00%)        5.65 (  35.87%)
      BAmean-99 Time      342.38 (   0.00%)      344.27 (  -0.55%)
      BAmean-95 Time      338.58 (   0.00%)      341.87 (  -0.97%)
      BAmean-90 Time      336.89 (   0.00%)      340.26 (  -1.00%)
      BAmean-75 Time      335.18 (   0.00%)      338.40 (  -0.96%)
      BAmean-50 Time      332.54 (   0.00%)      335.42 (  -0.87%)
      BAmean-25 Time      329.30 (   0.00%)      332.00 (  -0.82%)
      
      From the above it can be seen that we get similar data to when
      ->percpu_pvec_drained was introduced, so we still need it.  Let's call
      folio_batch_reinit() in folio_batch_move_lru() to restore the original
      logic.
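      A standalone sketch of the init-vs-reinit distinction (the field names
      echo the commit text, but this is not the kernel's folio_batch code): a
      full init wipes the drained flag, a reinit only resets the element
      count, so the "already drained once for this usage" state survives.

      #include <stdbool.h>
      #include <stdio.h>

      struct batch {
              unsigned int nr;
              bool percpu_pvec_drained;
      };

      static void batch_init(struct batch *b)         /* full init: flag lost */
      {
              b->nr = 0;
              b->percpu_pvec_drained = false;
      }

      static void batch_reinit(struct batch *b)       /* reuse: flag kept */
      {
              b->nr = 0;
      }

      int main(void)
      {
              struct batch b;

              batch_init(&b);
              b.nr = 15;
              b.percpu_pvec_drained = true;   /* drained once for this usage */

              batch_reinit(&b);
              printf("drained flag after reinit: %d\n", b.percpu_pvec_drained);
              return 0;
      }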
      
      Link: https://lkml.kernel.org/r/20230405161854.6931-1-zhengqi.arch@bytedance.com
      Fixes: c2bc1681 ("mm/swap: add folio_batch_move_lru()")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      998ad18b
  3. 06 Apr, 2023 4 commits
    • Raghavendra K T's avatar
      sched/numa: use hash_32 to mix up PIDs accessing VMA · d46031f4
      Raghavendra K T authored
      Before: the last 6 bits of the PID are used as an index to store
      information about the tasks accessing a VMA.

      After: hash_32() is used instead, to take care of cases where tasks are
      created over a period of time and thus reduce the chance of collisions.
      
      Result:
      The patch series overall improves autonuma cost.
      
      Kernbench showed more than 5% improvement, and system time in the
      mmtests autonuma run showed more than 80% improvement.
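      A standalone sketch of the mixing hash_32() provides (the golden-ratio
      constant mirrors include/linux/hash.h, but this copy is only an
      illustration): instead of indexing by the low 6 bits of the PID, the
      PID is multiplied by a 32-bit constant and the top bits are used, so
      PIDs that differ only in their higher bits no longer share a slot.

      #include <stdint.h>
      #include <stdio.h>

      #define GOLDEN_RATIO_32 0x61C88647u

      static uint32_t hash_32(uint32_t val, unsigned int bits)
      {
              /* multiplicative hash: keep the top 'bits' bits of the product */
              return (val * GOLDEN_RATIO_32) >> (32 - bits);
      }

      int main(void)
      {
              /* PIDs spaced 64 apart all collide with "pid & 63". */
              for (uint32_t pid = 1000; pid <= 1256; pid += 64)
                      printf("pid %u: low-bits slot %u, hashed slot %u\n",
                             pid, pid & 63, hash_32(pid, 6));
              return 0;
      }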
      
      Link: https://lkml.kernel.org/r/d5a9f75513300caed74e5c8570bba9317b963c2b.1677672277.git.raghavendra.kt@amd.com
      Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Disha Talreja <dishaa.talreja@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d46031f4
    • Raghavendra K T's avatar
      sched/numa: implement access PID reset logic · 20f58648
      Raghavendra K T authored
      This helps to ensure that only recently accessed PIDs scan the VMAs.
      
      Current implementation (idea supported by PeterZ):

       1. Access PID information is maintained in two windows, with
          access_pids[1] being the newest.

       2. The old access PID info, i.e. access_pids[0], is reset every
          (4 * sysctl_numa_balancing_scan_delay) interval after the initial
          scan delay period expires.
      
      The above interval was found to be experimentally optimal since it
      avoids frequent resets of the access info while still clearing out old
      access info regularly.  The reset logic is implemented in the scan path.
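      A standalone sketch of the two-window bookkeeping (illustrative, not
      the scheduler code; the 64-bit masks and rotation helper are
      assumptions of the sketch): access_pids[1] collects the newest
      accessors, and every reset interval the windows rotate, so an entry
      ages out at most two intervals after its last access.

      #include <stdint.h>
      #include <stdio.h>

      struct vma_access {
              uint64_t access_pids[2];        /* [1] is the newest window */
      };

      static void record_access(struct vma_access *v, int pid_hash)
      {
              v->access_pids[1] |= 1ULL << (pid_hash & 63);
      }

      static int recently_accessed(const struct vma_access *v, int pid_hash)
      {
              uint64_t bit = 1ULL << (pid_hash & 63);

              return ((v->access_pids[0] | v->access_pids[1]) & bit) != 0;
      }

      static void reset_window(struct vma_access *v)
      {
              /* age the newest window into the old slot, start a fresh one */
              v->access_pids[0] = v->access_pids[1];
              v->access_pids[1] = 0;
      }

      int main(void)
      {
              struct vma_access v = { { 0, 0 } };

              record_access(&v, 17);
              printf("%d\n", recently_accessed(&v, 17));      /* 1 */
              reset_window(&v);
              printf("%d\n", recently_accessed(&v, 17));      /* 1: old window */
              reset_window(&v);
              printf("%d\n", recently_accessed(&v, 17));      /* 0: aged out */
              return 0;
      }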
      
      Link: https://lkml.kernel.org/r/f7a675f66d1442d048b4216b2baf94515012c405.1677672277.git.raghavendra.kt@amd.com
      Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Disha Talreja <dishaa.talreja@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      20f58648
    • Raghavendra K T's avatar
      sched/numa: enhance vma scanning logic · fc137c0d
      Raghavendra K T authored
      During NUMA scanning, make sure only the relevant VMAs of a task are
      scanned.
      
      Before:
       All the tasks of a process participate in scanning a VMA even if they
       never access the VMA in their lifespan.

      Now:
       Except for the first few unconditional scans, a task that has not
       touched a VMA (excluding false positives from PID collisions) no
       longer scans it.
      
      Logic used:
      
      1) 6 bits of the PID are used to mark an active bit in the vma numab
         status during fault, to remember the PIDs accessing the vma.
         (Thanks Mel)

      2) Subsequently, in the scan path, vma scanning is skipped if the
         current PID has not accessed the vma.

      3) The first two times we still allow an unconditional scan, to
         preserve the earlier scanning behaviour.  (A sketch of this flow
         follows below.)
      
      Acknowledgements to Bharata B Rao <bharata@amd.com> for the initial
      patch to store PID information, and to Peter Zijlstra
      <peterz@infradead.org> for the use of test-and-set bit.
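      The sketch referenced above (illustrative, not the scheduler code; the
      struct and the limit of two unconditional scans are stand-ins for the
      description): the fault path records the accessing task in the VMA, and
      the scan path skips VMAs the current task never touched, except for the
      first few unconditional scans.

      #include <stdbool.h>
      #include <stdio.h>

      struct vma_info {
              unsigned long long access_pids; /* one bit per hashed PID */
              int scans_done;
      };

      static void numa_fault(struct vma_info *v, int pid)
      {
              v->access_pids |= 1ULL << (pid & 63);   /* remember the accessor */
      }

      static bool should_scan(struct vma_info *v, int pid)
      {
              if (v->scans_done < 2) {        /* unconditional early scans */
                      v->scans_done++;
                      return true;
              }
              return v->access_pids & (1ULL << (pid & 63));
      }

      int main(void)
      {
              struct vma_info v = { 0, 0 };

              numa_fault(&v, 100);
              printf("task 100: %d\n", should_scan(&v, 100)); /* 1 */
              printf("task 200: %d\n", should_scan(&v, 200)); /* 1: early scan */
              printf("task 200: %d\n", should_scan(&v, 200)); /* 0: never faulted */
              return 0;
      }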
      
      Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com
      Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Disha Talreja <dishaa.talreja@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fc137c0d
    • Mel Gorman's avatar
      sched/numa: apply the scan delay to every new vma · ef6a22b7
      Mel Gorman authored
      Patch series "sched/numa: Enhance vma scanning", v3.
      
      The patchset proposes one of the enhancements to NUMA vma scanning
      suggested by Mel.  This is a continuation of [3].

      Reposting the rebased patchset against the akpm mm-unstable tree
      (March 1).
      
      The existing scan period mechanism derives the scan period from
      per-thread stats.  Process Adaptive autoNUMA [1] proposed gathering
      NUMA fault stats at the per-process level to capture application
      behaviour better.
      
      During the course of that discussion, Mel proposed several ideas to
      enhance current NUMA balancing.  One of the suggestions was:

      Track what threads access a VMA.  The suggestion was to use an unsigned
      long pid_mask and use the lower bits to tag approximately which threads
      access a VMA.  Skip VMAs that did not trap a fault.  This would be
      approximate because of PID collisions, but would reduce scanning of
      areas the thread is not interested in.  The suggestion intends not to
      penalize threads that have no interest in the vma, thus reducing
      scanning overhead.
      
      V3 changes are mostly based on PeterZ comments (details below in changes)
      
      Summary of patchset:
      
      Current patchset implements:
      
      1. Delay the vma scanning logic for newly created VMA's so that
         additional overhead of scanning is not incurred for short lived tasks
         (implementation by Mel)
      
      2. Store the information of tasks accessing a VMA in 2 windows.  It is
         regularly cleared at a (4 * sysctl_numa_balancing_scan_delay)
         interval.  This interval was derived from experimentation (suggested
         by PeterZ) to balance frequent clearing against obsolete access data.

      3. hash_32() is used to encode the accessing task's index in the VMA
         access information.

      4. The VMA's access information is used to skip scanning for tasks
         which had not accessed the VMA.
      
      Changes since V2:
      patch1: 
       - Renaming of structure, macro to function,
       - Add explanation to heuristics
       - Adding more details from result (PeterZ)
       Patch2:
       - Usage of test and set bit (PeterZ)
       - Move storing access PID info to numa_migrate_prep()
       - Add a note on fairness among tasks allowed to scan
         (PeterZ)
       Patch3:
       - Maintain two windows of access PID information
        (PeterZ supported implementation and Gave idea to extend
         to N if needed)
       Patch4:
       - Apply hash_32 function to track VMA accessing PIDs (PeterZ)
      
      Changes since RFC V1:
       - Include Mel's vma scan delay patch
       - Change the accessing pid store logic (Thanks Mel)
       - Fencing structure / code to NUMA_BALANCING (David, Mel)
       - Adding clearing access PID logic (Mel)
       - Descriptive change log (Mike Rapoport)
      
      Things to ponder over:
      ==========================================
      
      - Improvement to the logic for clearing accessing PIDs (discussed in
        detail in patch 3 itself; done in this patchset by implementing a
        2-window history)
      
      - Current scan period is not changed in the patchset, so we do see
        frequent tries to scan.  Relaxing scan period dynamically could improve
        results further.
      
      [1] sched/numa: Process Adaptive autoNUMA 
       Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
      
      [2] RFC V1 Link: 
        https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/
      
      [3] V2 Link:
        https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/
      
      
      Results:
      Summary: A huge autonuma cost reduction is seen in mmtests.  Kernbench
      improvement is more than 5%, and there is a huge system time (80%+)
      improvement in the mmtests autonuma run.  (dbench had too large a
      standard deviation to post.)
      
      kernbench
      ===========
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
      Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
      Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*
      
      Duration User       66017.43    67959.84
      Duration System     30503.15    24657.03
      Duration Elapsed      504.61      493.12
      
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Ops NUMA alloc hit                1738835089.00  1738780310.00
      Ops NUMA alloc local              1738834448.00  1738779711.00
      Ops NUMA base-page range updates      477310.00      392566.00
      Ops NUMA PTE updates                  477310.00      392566.00
      Ops NUMA hint faults                   96817.00       87555.00
      Ops NUMA hint local faults %           10150.00        2192.00
      Ops NUMA hint local percent               10.48           2.50
      Ops NUMA pages migrated                86660.00       85363.00
      Ops AutoNUMA cost                        489.07         442.14
      
      autonumabench
      ===============
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
      Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
      Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
      Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
      Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
      Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
      Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
      Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*
      
      Duration User      396433.47   324835.96
      Duration System      2808.70      376.66
      Duration Elapsed     2258.61     2258.12
      
                            6.2.0-mmunstable-base  6.2.0-mmunstable-patched
      Ops NUMA alloc hit                  59921806.00    49623489.00
      Ops NUMA alloc miss                        0.00           0.00
      Ops NUMA interleave hit                    0.00           0.00
      Ops NUMA alloc local                59920880.00    49622594.00
      Ops NUMA base-page range updates   152259275.00       50075.00
      Ops NUMA PTE updates               152259275.00       50075.00
      Ops NUMA PMD updates                       0.00           0.00
      Ops NUMA hint faults               154660352.00       39014.00
      Ops NUMA hint local faults %       138550501.00       23139.00
      Ops NUMA hint local percent               89.58          59.31
      Ops NUMA pages migrated              8179067.00       14147.00
      Ops AutoNUMA cost                     774522.98         195.69
      
      
      This patch (of 4):
      
      Currently, whenever a new task is created we wait for
      sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead.
      Extend the same logic to new or very short-lived VMAs.
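      A minimal sketch of the per-VMA scan delay (illustrative; the field
      name and time units are stand-ins, not the actual mm_types.h layout): a
      newly created VMA records when it first becomes eligible for NUMA
      scanning, and the scan path skips it until then, just as a new task
      waits out sysctl_numa_balancing_scan_delay.

      #include <stdbool.h>
      #include <stdio.h>

      struct vma_numab {
              unsigned long next_scan;        /* earliest time scanning is allowed */
      };

      static bool vma_scannable(const struct vma_numab *nb, unsigned long now)
      {
              return now >= nb->next_scan;
      }

      int main(void)
      {
              unsigned long now = 1000, delay = 250;
              struct vma_numab nb = { .next_scan = now + delay };

              printf("t=%lu: %s\n", now, vma_scannable(&nb, now) ? "scan" : "skip");
              now += delay;
              printf("t=%lu: %s\n", now, vma_scannable(&nb, now) ? "scan" : "skip");
              return 0;
      }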
      
      [raghavendra.kt@amd.com: add initialization in vm_area_dup()]
      Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
      Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Disha Talreja <dishaa.talreja@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef6a22b7