    mm/mempolicy: implement the sysfs-based weighted_interleave interface · dce41f5a
    Rakie Kim authored
    Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs
    extension", v5.
    
    Weighted interleave is a new interleave policy intended to make use of
    heterogeneous memory environments appearing with CXL.
    
    The existing interleave mechanism does an even round-robin distribution of
    memory across all nodes in a nodemask, while weighted interleave
    distributes memory across nodes according to a provided weight.  (Weight =
    # of page allocations per round)
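
    The "weight = page allocations per round" rule can be sketched as a
    small userspace model.  This is an illustration only, not the kernel
    implementation: struct wi_state, wi_init() and wi_next_node() are
    hypothetical names, and a weight of 0 is treated as 1 to mirror the
    system default described for this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's data structures. */
struct wi_state {
    const unsigned char *weights; /* weight per node; 0 is treated as 1 */
    size_t nr_nodes;              /* number of nodes in the interleave set */
    size_t cur_node;              /* node currently receiving allocations */
    unsigned int left;            /* allocations left on cur_node this round */
};

void wi_init(struct wi_state *s, const unsigned char *weights, size_t nr_nodes)
{
    assert(nr_nodes > 0);
    s->weights = weights;
    s->nr_nodes = nr_nodes;
    s->cur_node = 0;
    s->left = weights[0] ? weights[0] : 1;
}

/* Return the node that should receive the next page allocation. */
size_t wi_next_node(struct wi_state *s)
{
    if (s->left == 0) {
        /* This node's share of the round is used up: move to the next node. */
        s->cur_node = (s->cur_node + 1) % s->nr_nodes;
        s->left = s->weights[s->cur_node] ? s->weights[s->cur_node] : 1;
    }
    s->left--;
    return s->cur_node;
}
```

    With weights {5, 1}, each round in this model places five consecutive
    pages on the first node and one page on the second.  With all weights
    equal to 1 it degenerates to the plain round-robin interleave.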
    
    Weighted interleave is intended to reduce average latency when bandwidth
    is pressured - therefore increasing total throughput.
    
    In other words: It allows greater use of the total available bandwidth in
    a heterogeneous hardware environment (different hardware provides
    different bandwidth capacity).
    
    As bandwidth is pressured, latency increases - first linearly and then
    exponentially.  By keeping bandwidth usage distributed according to
    available bandwidth, we therefore can reduce the average latency of a
    cacheline fetch.
    
    A good explanation of the bandwidth vs latency response curve:
    https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
    
    From the article:
    ```
    Constant region:
        The latency response is fairly constant for the first 40%
        of the sustained bandwidth.
    Linear region:
        In between 40% to 80% of the sustained bandwidth, the
        latency response increases almost linearly with the bandwidth
        demand of the system due to contention overhead by numerous
        memory requests.
    Exponential region:
        Between 80% to 100% of the sustained bandwidth, the memory
        latency is dominated by the contention latency which can be
        as much as twice the idle latency or more.
    Maximum sustained bandwidth :
        Is 65% to 75% of the theoretical maximum bandwidth.
    ```
    
    As a general rule of thumb:
    * If bandwidth usage is low, latency does not increase. It is
      optimal to place data in the nearest (lowest latency) device.
    * If bandwidth usage is high, latency increases. It is optimal
      to place data such that bandwidth use is optimized per-device.
    
    This is the top-line goal: provide users with a mechanism to target the
    "maximum sustained bandwidth" of each hardware component in a
    heterogeneous memory system.
    
    
    For example, the stream benchmark demonstrates that 1:1 (default)
    interleave is actively harmful, while weighted interleave can be
    beneficial.  Default interleave distributes data such that too much
    pressure is placed on devices with lower available bandwidth.
    
    Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device)
    Default interleave : -78% (slower than DRAM)
    Global weighting   : -6% to +4% (workload dependent)
    Targeted weights   : +2.5% to +4% (consistently better than DRAM)
    
    Global means the task-policy was set (set_mempolicy), while targeted means
    VMA policies were set (mbind2).  We see weighted interleave is not always
    beneficial when applied globally, but is always beneficial when applied to
    bandwidth-driving memory regions.
    
    
    There are 4 patches in this set:
    1) Implement system-global interleave weights as sysfs extension
       in mm/mempolicy.c.  These weights are RCU protected, and a
       default weight set is provided (all weights are 1 by default).
    
       In future work, we intend to expose an interface for HMAT/CDAT
       code to set reasonable default values based on the memory
       configuration of the system discovered at boot/hotplug.
    
    2) A mild refactor of some interleave-logic for re-use in the
       new weighted interleave logic.
    
    3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind
    
    4) Protect interleave logic (weighted and normal) with the
       mems_allowed seq cookie.  If the nodemask changes while
       accessing it during a rebind, just retry the access.
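
    The retry-on-change idea in item 4 can be modelled in userspace with
    a sequence counter.  This is a simplified sketch, not the kernel code:
    the struct and the model_*() functions are hypothetical names, and C11
    sequentially consistent atomics stand in for the kernel's seqcount
    primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a nodemask guarded by a sequence counter. */
struct mems_model {
    atomic_uint seq;       /* even: stable, odd: update in progress */
    atomic_ulong nodemask; /* bit N set means node N is allowed */
};

/* Wait out any in-progress update, then return the sequence "cookie". */
unsigned int model_read_begin(struct mems_model *m)
{
    unsigned int seq;

    while ((seq = atomic_load(&m->seq)) & 1u)
        ;
    return seq;
}

/* True if the nodemask may have changed since the cookie was taken. */
bool model_read_retry(struct mems_model *m, unsigned int cookie)
{
    return atomic_load(&m->seq) != cookie;
}

/* Writer side: bump the counter before and after changing the mask. */
void model_rebind(struct mems_model *m, unsigned long new_mask)
{
    atomic_fetch_add(&m->seq, 1u); /* now odd: readers wait or retry */
    atomic_store(&m->nodemask, new_mask);
    atomic_fetch_add(&m->seq, 1u); /* even again: update complete */
}

/* Reader side: pick the lowest allowed node, retrying if a rebind raced. */
int model_first_node(struct mems_model *m)
{
    unsigned int cookie;
    unsigned long mask;
    int n;

    do {
        cookie = model_read_begin(m);
        mask = atomic_load(&m->nodemask);
    } while (model_read_retry(m, cookie));

    for (n = 0; n < (int)(sizeof(mask) * 8); n++)
        if (mask & (1UL << n))
            return n;
    return -1; /* empty mask */
}
```

    The reader never takes a lock.  It only repeats its access when the
    counter shows that a rebind happened in the meantime, which is the
    "just retry the access" behaviour described above.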
    
    Included below is some performance and LTP test information,
    along with a sample numactl branch which can be used for testing.
    
    = Performance summary =
    (tests may have different configurations, see extended info below)
    1) MLC (W2) : +38% over DRAM. +264% over default interleave.
       MLC (W5) : +40% over DRAM. +226% over default interleave.
    2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
    3) XSBench  : +19% over DRAM. +47% over default interleave.
    
    = LTP Testing Summary =
    existing mempolicy & mbind tests: pass
    mempolicy & mbind + weighted interleave (global weights): pass
    
    = Version history =
    v5:
    - style fixes
    - mems_allowed cookie protection to detect rebind issues,
      prevents spurious allocation failures and/or mis-allocations
    - sparse warning fixes related to __rcu on local variables
    
    =====================================================================
    Performance tests - MLC
    From - Ravi Jonnalagadda <ravis.opensrc@micron.com>
    
    Hardware: Single-socket, multiple CXL memory expanders.
    
    Workload:                               W2
    Data Signature:                         2:1 read:write
    DRAM only bandwidth (GBps):             298.8
    DRAM + CXL (default interleave) (GBps): 113.04
    DRAM + CXL (weighted interleave)(GBps): 412.5
    Gain over DRAM only:                    1.38x
    Gain over default interleave:           3.64x
    
    Workload:                               W5
    Data Signature:                         1:1 read:write
    DRAM only bandwidth (GBps):             273.2
    DRAM + CXL (default interleave) (GBps): 117.23
    DRAM + CXL (weighted interleave)(GBps): 382.7
    Gain over DRAM only:                    1.4x
    Gain over default interleave:           3.26x
    
    =====================================================================
    Performance test - Stream
    From - Gregory Price <gregory.price@memverge.com>
    
    Hardware: Single socket, single CXL expander
    numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master
    
    Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
    Default interleave : -78% (slower than DRAM)
    Global weighting   : -6% to +4% (workload dependent)
    mbind2 weights     : +2.5% to +4% (consistently better than DRAM)
    
    dram only:
    numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
    Function     Direction    BestRateMBs     AvgTime      MinTime      MaxTime
    Copy:        0->0            200923.2     0.032662     0.031853     0.033301
    Scale:       0->0            202123.0     0.032526     0.031664     0.032970
    Add:         0->0            208873.2     0.047322     0.045961     0.047884
    Triad:       0->0            208523.8     0.047262     0.046038     0.048414
    
    CXL-only:
    numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
    Copy:        0->0             22209.7     0.288661     0.288162     0.289342
    Scale:       0->0             22288.2     0.287549     0.287147     0.288291
    Add:         0->0             24419.1     0.393372     0.393135     0.393735
    Triad:       0->0             24484.6     0.392337     0.392083     0.394331
    
    Based on the above, the optimal weights are ~9:1
    echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
    echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
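
    One way to arrive at the ~9:1 ratio is to divide each node's measured
    bandwidth by the slowest node's bandwidth and round to the nearest
    integer.  The sketch below assumes that heuristic; bw_to_weights() is
    a hypothetical helper, not part of the patch set or of numactl.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Derive integer weights from measured per-node bandwidth (MB/s).
 * Each weight is the node's bandwidth relative to the slowest node,
 * rounded to the nearest integer, so the slowest node gets weight 1.
 */
void bw_to_weights(const double *bw_mbs, unsigned int *weights, size_t n)
{
    double min_bw;
    size_t i;

    assert(n > 0);
    min_bw = bw_mbs[0];
    for (i = 1; i < n; i++)
        if (bw_mbs[i] < min_bw)
            min_bw = bw_mbs[i];
    assert(min_bw > 0.0);

    for (i = 0; i < n; i++)
        weights[i] = (unsigned int)(bw_mbs[i] / min_bw + 0.5);
}
```

    Fed with the Copy rates above (200923.2 MB/s for DRAM, 22209.7 MB/s
    for CXL), this yields weights 9 and 1.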
    
    default interleave:
    numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
    Copy:        0->0             44666.2     0.143671     0.143285     0.144174
    Scale:       0->0             44781.6     0.143256     0.142916     0.143713
    Add:         0->0             48600.7     0.197719     0.197528     0.197858
    Triad:       0->0             48727.5     0.197204     0.197014     0.197439
    
    global weighted interleave:
    numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
    Copy:        0->0            190085.9     0.034289     0.033669     0.034645
    Scale:       0->0            207677.4     0.031909     0.030817     0.033061
    Add:         0->0            202036.8     0.048737     0.047516     0.053409
    Triad:       0->0            217671.5     0.045819     0.044103     0.046755
    
    targeted regions w/ global weights (modified stream to mbind2 malloc'd regions)
    numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
    Copy:        0->0            205827.0     0.031445     0.031094     0.031984
    Scale:       0->0            208171.8     0.031320     0.030744     0.032505
    Add:         0->0            217352.0     0.045087     0.044168     0.046515
    Triad:       0->0            216884.8     0.045062     0.044263     0.046982
    
    =====================================================================
    Performance tests - XSBench
    From - Hyeongtak Ji <hyeongtak.ji@sk.com>
    
    Hardware: Single socket, Single CXL memory Expander
    
    NUMA node 0: 56 logical cores, 128 GB memory
    NUMA node 2: 96 GB CXL memory
    Threads:     56
    Lookups:     170,000,000
    
    Summary: +19% over DRAM. +47% over default interleave.
    
    1. dram only
    $ numactl -m 0 ./XSBench -s XL -p 5000000
    Runtime:     36.235 seconds
    Lookups/s:   4,691,618
    
    2. default interleave
    $ numactl -i 0,2 ./XSBench -s XL -p 5000000
    Runtime:     55.243 seconds
    Lookups/s:   3,077,293
    
    3. weighted interleave
    $ numactl -w -i 0,2 ./XSBench -s XL -p 5000000
    Runtime:     29.262 seconds
    Lookups/s:   5,809,513
    
    =====================================================================
    LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2
    
    = Existing tests
    set_mempolicy, get_mempolicy, mbind
    
    MPOL_WEIGHTED_INTERLEAVE was added manually to test basic functionality,
    but the tests were not adjusted for weighting.  The weights were left at
    1, which is the default, so the policy should behave the same as
    MPOL_INTERLEAVE if the logic is correct.
    
    == set_mempolicy01 : passed   18, failed   0
    == set_mempolicy02 : passed   10, failed   0
    == set_mempolicy03 : passed   64, failed   0
    == set_mempolicy04 : passed   32, failed   0
    == set_mempolicy05 - n/a on non-x86
    == set_mempolicy06 : passed   10, failed   0
       this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
    == set_mempolicy07 : passed   32, failed   0
       set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
    == get_mempolicy01 : passed   12, failed   0
       change: added MPOL_WEIGHTED_INTERLEAVE
    == get_mempolicy02 : passed   2, failed   0
    == mbind01 : passed   15, failed   0
       added MPOL_WEIGHTED_INTERLEAVE
    == mbind02 : passed   4, failed   0
       added MPOL_WEIGHTED_INTERLEAVE
    == mbind03 : passed   16, failed   0
       added MPOL_WEIGHTED_INTERLEAVE
    == mbind04 : passed   48, failed   0
       added MPOL_WEIGHTED_INTERLEAVE
    
    =====================================================================
    numactl (set_mempolicy) w/ global weighting test
    numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master
    
    command: numactl -w --interleave=0,1 ./eatmem
    
    result (weights 1:1):
    0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
    7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
    50% distribution is correct
    
    result (weights 5:1):
    01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
    7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
    16.666% distribution is correct
    
    result (weights 1:5):
    01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
    7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
    16.666% distribution is correct
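
    The "distribution is correct" checks above follow from simple
    arithmetic: a node's expected share is its weight divided by the sum
    of all weights.  A minimal sketch of that check, where
    expected_pages() is a hypothetical helper written for this
    illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Expected number of pages on `node` when total_pages are spread by weight. */
double expected_pages(unsigned long total_pages, const unsigned int *weights,
                      size_t nr_nodes, size_t node)
{
    unsigned long sum = 0;
    size_t i;

    assert(node < nr_nodes);
    for (i = 0; i < nr_nodes; i++)
        sum += weights[i];
    assert(sum > 0);

    return (double)total_pages * weights[node] / (double)sum;
}
```

    For the 5:1 heap mapping (anon=65793), the lighter node is expected to
    hold 65793 / 6 = 10965.5 pages, and the listing reports N1=10965.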
    
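    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BIG_SIZE   (1024UL * 1024 * 256)  /* one 256MB region */
    #define CHUNK_SIZE 4096UL                 /* one 4kB page */

    int main(void)
    {
            /* Allocate and touch one large region. */
            char *mem = malloc(BIG_SIZE);

            if (!mem)
                    return 1;
            memset(mem, 1, BIG_SIZE);

            /* Then allocate and touch page-sized chunks (leaked on purpose). */
            for (unsigned long i = 0; i < BIG_SIZE / CHUNK_SIZE; i++) {
                    mem = malloc(CHUNK_SIZE);
                    if (!mem)
                            return 1;
                    mem[0] = 1;
            }
            printf("done\n");
            /* Block so the process's numa_maps can be inspected. */
            getchar();
            return 0;
    }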
    
    
    This patch (of 4):
    
    This patch provides a way to set interleave weight information under sysfs
    at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
    
    The sysfs structure is designed as follows.
    
      $ tree /sys/kernel/mm/mempolicy/
      /sys/kernel/mm/mempolicy/ [1]
      └── weighted_interleave [2]
          ├── node0 [3]
          └── node1
    
    Each file above can be explained as follows.
    
    [1] mm/mempolicy: configuration interface for mempolicy subsystem
    
    [2] weighted_interleave/: config interface for weighted interleave policy
    
    [3] weighted_interleave/nodeN: weight for nodeN
    
    If a node value is set to `0`, the system-default value will be used.
    As of this patch, the system-default for all nodes is always 1.
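
    Setting a weight from a program amounts to writing a decimal number
    to the nodeN file.  The sketch below assumes exactly that and nothing
    more: wi_node_path() and wi_set_weight() are hypothetical helpers, and
    the base directory is a parameter so the code can be exercised against
    a scratch directory.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define WI_SYSFS_DIR "/sys/kernel/mm/mempolicy/weighted_interleave"

/* Build "<base>/node<N>" into buf; returns 0, or -1 if it does not fit. */
int wi_node_path(char *buf, size_t len, const char *base, int node)
{
    int n = snprintf(buf, len, "%s/node%d", base, node);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write `weight` to the node's file; returns 0 on success, -1 on error. */
int wi_set_weight(const char *base, int node, unsigned int weight)
{
    char path[256];
    FILE *f;
    int ok;

    if (wi_node_path(path, sizeof(path), base, node) != 0)
        return -1;

    f = fopen(path, "w");
    if (!f) {
        fprintf(stderr, "open %s: %s\n", path, strerror(errno));
        return -1;
    }

    /* Writing 0 selects the system default, per the description above. */
    ok = fprintf(f, "%u\n", weight) > 0;

    /* Buffered writes may only fail at close time, so check fclose() too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

    Calling wi_set_weight(WI_SYSFS_DIR, 1, 9) is intended to have the same
    effect as "echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1"
    and is expected to need the same privileges.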
    
    Link: https://lkml.kernel.org/r/20240202170238.90004-1-gregory.price@memverge.com
    Link: https://lkml.kernel.org/r/20240202170238.90004-2-gregory.price@memverge.com
    Suggested-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Rakie Kim <rakie.kim@sk.com>
    Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
    Co-developed-by: Gregory Price <gregory.price@memverge.com>
    Signed-off-by: Gregory Price <gregory.price@memverge.com>
    Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
    Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Gregory Price <gourry.memverge@gmail.com>
    Cc: Hasan Al Maruf <Hasan.Maruf@amd.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>