sched/numa: apply the scan delay to every new vma · ef6a22b7
    Author: Mel Gorman
    Patch series "sched/numa: Enhance vma scanning", v3.
    
    This patchset implements one of the enhancements to NUMA vma scanning
    suggested by Mel.  It is a continuation of [3].
    
    Reposting the patchset rebased onto the akpm mm-unstable tree (March 1).
    
    In the existing mechanism, the scan period is derived from per-thread
    stats.  Process Adaptive autoNUMA [1] proposed gathering NUMA fault
    stats at the per-process level to capture application behaviour better.
    
    During the course of that discussion, Mel proposed several ideas to
    enhance current NUMA balancing.  One of the suggestions was as below:
    
    Track which threads access a VMA.  The suggestion was to use an
    unsigned long pid_mask and use the lower bits to tag approximately
    which threads access a VMA.  Skip VMAs that did not trap a fault.
    This would be approximate because of PID collisions, but it would
    reduce scanning of areas the thread is not interested in.  The intent
    is not to penalize threads that have no interest in the vma, thus
    reducing scanning overhead.
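
    A minimal sketch of that suggestion, assuming a hypothetical per-VMA
    pid_mask field (the helper names are illustrative, not actual kernel
    code):

    /*
     * Tag and test threads in a per-VMA bitmask.  The lower bits of the
     * PID index an unsigned long, so PID collisions make the filter
     * approximate but very cheap.
     */
    static inline void vma_mark_accessed(unsigned long *pid_mask)
    {
            __set_bit(current->pid % BITS_PER_LONG, pid_mask);
    }

    static inline bool vma_was_accessed(unsigned long *pid_mask)
    {
            /* A clear bit: this thread never trapped a fault here, skip */
            return test_bit(current->pid % BITS_PER_LONG, pid_mask);
    }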
    
    V3 changes are mostly based on PeterZ's comments (details below in
    the changelog).
    
    Summary of patchset:
    
    Current patchset implements:
    
    1. Delay the vma scanning logic for newly created VMAs so that the
       additional overhead of scanning is not incurred for short-lived
       tasks (implementation by Mel)
    
    2. Store the information about tasks accessing a VMA in two windows.
       The windows are cleared at a (4*sysctl_numa_balancing_scan_delay)
       interval.  This period was derived from experimentation (suggested
       by PeterZ) to balance frequent clearing against stale access data.

    3. hash_32 is used to encode the index of the task accessing the VMA
       into the access information.

    4. A VMA's access information is used to skip scanning for tasks
       which have not accessed the VMA (points 2-4 are sketched after
       this list).
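
    Roughly, points 2-4 combine as in the sketch below.  The field and
    helper names here approximate the patchset and may differ from the
    final code; treat it as illustrative, not authoritative.

    #define VMA_PID_RESET_PERIOD  (4 * sysctl_numa_balancing_scan_delay)

    /* Record the faulting task in the current window. */
    static void vma_set_access_pid_bit(struct vm_area_struct *vma)
    {
            unsigned int pid_bit;

            /* hash_32() spreads PIDs over a BITS_PER_LONG-wide bitmap */
            pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
            if (vma->numab_state &&
                !test_bit(pid_bit, &vma->numab_state->access_pids[1]))
                    __set_bit(pid_bit, &vma->numab_state->access_pids[1]);
    }

    /* Scan the VMA only if this task appears in either window. */
    static bool vma_is_accessed(struct vm_area_struct *vma)
    {
            unsigned long pids = vma->numab_state->access_pids[0] |
                                 vma->numab_state->access_pids[1];

            return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)),
                            &pids);
    }

    /*
     * In task_numa_work(): rotate the windows every reset period, so
     * obsolete access data ages out after at most two periods.
     */
    if (time_after(jiffies, vma->numab_state->next_pid_reset)) {
            vma->numab_state->next_pid_reset = jiffies +
                    msecs_to_jiffies(VMA_PID_RESET_PERIOD);
            vma->numab_state->access_pids[0] =
                    vma->numab_state->access_pids[1];
            vma->numab_state->access_pids[1] = 0;
    }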
    
    Changes since V2:
     Patch1:
     - Rename structure; convert macro to function
     - Add explanation of the heuristics
     - Add more details from the results (PeterZ)
     Patch2:
     - Use test and set bit (PeterZ)
     - Move storing of access PID info to numa_migrate_prep()
     - Add a note on fairness among tasks allowed to scan (PeterZ)
     Patch3:
     - Maintain two windows of access PID information
       (PeterZ supported the implementation and gave the idea to extend
       to N windows if needed)
     Patch4:
     - Apply hash_32 function to track VMA-accessing PIDs (PeterZ)
    
    Changes since RFC V1:
     - Include Mel's vma scan delay patch
     - Change the accessing PID store logic (Thanks Mel)
     - Fence the structure / code with NUMA_BALANCING (David, Mel)
     - Add logic for clearing access PIDs (Mel)
     - More descriptive changelog (Mike Rapoport)
    
    Things to ponder over:
    ==========================================
    
    - Improvement to the logic for clearing accessed PIDs (discussed in
      detail in patch3 itself; done in this patchset by implementing a
      two-window history)

    - The current scan period is not changed in the patchset, so we do
      see frequent attempts to scan.  Relaxing the scan period
      dynamically could improve results further.
    
    [1] sched/numa: Process Adaptive autoNUMA 
     Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/
    
    [2] RFC V1 Link: 
      https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/
    
    [3] V2 Link:
      https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/
    
    
    Results:
    Summary: A huge reduction in AutoNUMA cost is seen in mmtests.  The
    kernbench improvement is more than 5%, and mmtests autonumabench
    shows a huge (80%+) system time improvement.  (dbench had too large
    a standard deviation to post.)
    
    kernbench
    ===========
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Amean     user-256    22002.51 (   0.00%)    22649.95 *  -2.94%*
    Amean     syst-256    10162.78 (   0.00%)     8214.13 *  19.17%*
    Amean     elsp-256      160.74 (   0.00%)      156.92 *   2.38%*
    
    Duration User       66017.43    67959.84
    Duration System     30503.15    24657.03
    Duration Elapsed      504.61      493.12
    
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Ops NUMA alloc hit                1738835089.00  1738780310.00
    Ops NUMA alloc local              1738834448.00  1738779711.00
    Ops NUMA base-page range updates      477310.00      392566.00
    Ops NUMA PTE updates                  477310.00      392566.00
    Ops NUMA hint faults                   96817.00       87555.00
    Ops NUMA hint local faults %           10150.00        2192.00
    Ops NUMA hint local percent               10.48           2.50
    Ops NUMA pages migrated                86660.00       85363.00
    Ops AutoNUMA cost                        489.07         442.14
    
    autonumabench
    ===============
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Amean     syst-NUMA01                  399.50 (   0.00%)       52.05 *  86.97%*
    Amean     syst-NUMA01_THREADLOCAL        0.21 (   0.00%)        0.22 *  -5.41%*
    Amean     syst-NUMA02                    0.80 (   0.00%)        0.78 *   2.68%*
    Amean     syst-NUMA02_SMT                0.65 (   0.00%)        0.68 *  -3.95%*
    Amean     elsp-NUMA01                  313.26 (   0.00%)      313.11 *   0.05%*
    Amean     elsp-NUMA01_THREADLOCAL        1.06 (   0.00%)        1.08 *  -1.76%*
    Amean     elsp-NUMA02                    3.19 (   0.00%)        3.24 *  -1.52%*
    Amean     elsp-NUMA02_SMT                3.72 (   0.00%)        3.61 *   2.92%*
    
    Duration User      396433.47   324835.96
    Duration System      2808.70      376.66
    Duration Elapsed     2258.61     2258.12
    
                          6.2.0-mmunstable-base  6.2.0-mmunstable-patched
    Ops NUMA alloc hit                  59921806.00    49623489.00
    Ops NUMA alloc miss                        0.00           0.00
    Ops NUMA interleave hit                    0.00           0.00
    Ops NUMA alloc local                59920880.00    49622594.00
    Ops NUMA base-page range updates   152259275.00       50075.00
    Ops NUMA PTE updates               152259275.00       50075.00
    Ops NUMA PMD updates                       0.00           0.00
    Ops NUMA hint faults               154660352.00       39014.00
    Ops NUMA hint local faults %       138550501.00       23139.00
    Ops NUMA hint local percent               89.58          59.31
    Ops NUMA pages migrated              8179067.00       14147.00
    Ops AutoNUMA cost                     774522.98         195.69
    
    
    This patch (of 4):
    
    Currently, whenever a new task is created, we wait for
    sysctl_numa_balancing_scan_delay to avoid unnecessary scanning
    overhead.
    Extend the same logic to new or very short-lived VMAs.
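
    A rough sketch of the change (structure and field names follow the
    patch description but may be simplified here): task_numa_work()
    allocates per-VMA NUMA-balancing state the first time it sees a VMA
    and skips the VMA until the scan delay has elapsed.

    struct vma_numab_state {
            unsigned long next_scan; /* jiffies when scanning may begin */
    };

    /* In the task_numa_work() VMA walk, with now = jiffies: */
    if (!vma->numab_state) {
            vma->numab_state = kzalloc(sizeof(struct vma_numab_state),
                                       GFP_KERNEL);
            if (!vma->numab_state)
                    continue;

            vma->numab_state->next_scan = now +
                    msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
    }

    /* Delay scanning of new or very short-lived VMAs. */
    if (time_before(jiffies, vma->numab_state->next_scan))
            continue;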
    
    [raghavendra.kt@amd.com: add initialization in vm_area_dup()]
    Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com
    Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Disha Talreja <dishaa.talreja@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>