    mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving · fa3bea4e
    Gregory Price authored
    When a system has multiple NUMA nodes and becomes bandwidth hungry,
    using the current MPOL_INTERLEAVE can be a wise option.
    
    However, if those NUMA nodes consist of different types of memory such as
    socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based
    interleave policy does not optimally distribute data to make use of their
    different bandwidth characteristics.
    
    Instead, interleave is more effective when the allocation policy follows
    each NUMA node's bandwidth weight rather than a simple 1:1 distribution.
    
    This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
    enabling weighted interleave between NUMA nodes.  Weighted interleave
    allows for proportional distribution of memory across multiple NUMA nodes,
    preferably apportioned to match the bandwidth of each node.
    
    For example, if a system has 1 CPU node (0) and 2 memory nodes (0,1)
    with bandwidths of (100 GB/s, 50 GB/s) respectively, the appropriate
    weight distribution is (2:1).
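
    As an illustration of the arithmetic only, the ratio above can be
    obtained by dividing each bandwidth by their greatest common divisor.
    The helper name bandwidth_to_weights() is hypothetical and is not part
    of this patch, which leaves the choice of weights to the administrator.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid's algorithm). */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b != 0) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper: reduce per-node bandwidths (any common unit) to the
 * smallest integer ratio, e.g. (100, 50) -> (2, 1).
 */
void bandwidth_to_weights(const unsigned int *bw, unsigned int *weights,
			  size_t n)
{
	unsigned int g = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		g = gcd_u(g, bw[i]);
	assert(g > 0);	/* at least one bandwidth must be nonzero */

	for (size_t i = 0; i < n; i++)
		weights[i] = bw[i] / g;
}
```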
    
    Weights for each node can be assigned via the new sysfs extension:
    /sys/kernel/mm/mempolicy/weighted_interleave/
    
    For now, the default value of all nodes will be `1`, which matches the
    behavior of standard 1:1 round-robin interleave.  An extension will be
    added in the future to allow default values to be registered at kernel and
    device bringup time.
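
    A minimal userspace sketch of setting a weight through this directory
    follows.  It assumes one file per node named "node<N>" under the
    directory above; that file name is not spelled out in this message, so
    treat it as an assumption.  The helper names wi_node_path() and
    wi_set_weight() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define WI_SYSFS_DIR "/sys/kernel/mm/mempolicy/weighted_interleave"

/* Build the per-node file path; the "node<N>" file name is an assumption. */
int wi_node_path(char *buf, size_t len, int nid)
{
	assert(nid >= 0);
	return snprintf(buf, len, WI_SYSFS_DIR "/node%d", nid);
}

/*
 * Write a weight for one node.  Returns 0 on success and -1 on failure
 * (missing file, insufficient privileges, or a rejected value).
 */
int wi_set_weight(int nid, unsigned int weight)
{
	char path[128];
	FILE *f;
	int ok;

	if (wi_node_path(path, sizeof(path), nid) >= (int)sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%u\n", weight) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

    For the (2:1) example, a privileged caller would use
    wi_set_weight(0, 2) followed by wi_set_weight(1, 1).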
    
    On each node in turn, the policy allocates a number of pages equal to
    that node's weight.  For example, if the weights are (2,1), then 2 pages
    will be allocated on node0 for every 1 page allocated on node1.
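
    The (2,1) pattern can be reproduced with a small userspace simulation.
    This is a sketch, not the kernel code: struct wi_task and wi_next_node()
    are hypothetical stand-ins whose fields mirror the il_prev and il_weight
    task fields introduced by this patch, and nodes are assumed to be
    numbered densely from 0.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-task interleave state. */
struct wi_task {
	int il_prev;		/* node most recently allocated from */
	unsigned int il_weight;	/* pages left before moving to the next node */
};

/*
 * Pick the node for the next single-page allocation.  weights[] holds one
 * nonzero weight per node.  Start with il_prev = -1 and il_weight = 0.
 */
int wi_next_node(struct wi_task *t, const unsigned int *weights, int nr_nodes)
{
	assert(nr_nodes > 0);

	if (t->il_weight == 0) {
		/* Current node is used up: advance and reload the weight. */
		t->il_prev = (t->il_prev + 1) % nr_nodes;
		t->il_weight = weights[t->il_prev];
		assert(t->il_weight > 0);
	}
	t->il_weight--;
	return t->il_prev;
}
```

    With weights (2,1), six consecutive calls return nodes 0, 0, 1, 0, 0, 1.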
    
    The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
    and mbind(2).
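
    A hedged sketch of selecting the policy for the calling task through
    the raw system call follows.  The numeric mode value 6 is taken from the
    position of MPOL_WEIGHTED_INTERLEAVE in the uapi mode enum and is an
    assumption here, as installed userspace headers may predate it.  The
    helper names are hypothetical, and the single-word nodemask limits the
    sketch to node ids below the bit width of unsigned long.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed mode value; older userspace headers may not define it. */
#define WI_MPOL_WEIGHTED_INTERLEAVE 6

/* Build a single-word nodemask with one bit set per listed node id. */
unsigned long wi_nodemask(const int *nodes, int count)
{
	unsigned long mask = 0;

	for (int i = 0; i < count; i++) {
		assert(nodes[i] >= 0 &&
		       nodes[i] < (int)(8 * sizeof(mask)));
		mask |= 1UL << nodes[i];
	}
	return mask;
}

/*
 * Apply weighted interleave to the calling task.  Returns 0 on success,
 * or -1 with errno set (for example EINVAL on kernels without this patch).
 */
long wi_set_task_policy(unsigned long mask)
{
	/* maxnode is passed as bits-in-mask plus one, as is customary. */
	return syscall(SYS_set_mempolicy, WI_MPOL_WEIGHTED_INTERLEAVE,
		       &mask, 8 * sizeof(mask) + 1);
}
```

    A caller interleaving across node0 and node1 would pass
    wi_nodemask((int[]){0, 1}, 2) to wi_set_task_policy() and check the
    return value before relying on the policy.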
    
    Some high level notes about the pieces of weighted interleave:
    
    current->il_prev:
        Tracks the node previously allocated from.
    
    current->il_weight:
        The active weight of the current node (current->il_prev).
        When this reaches 0, current->il_prev is set to the next node
        and current->il_weight is set to the next weight.
    
    weighted_interleave_nodes:
        Counts the number of allocations as they occur, and applies the
        weight for the current node.  When the weight reaches 0, switch
        to the next node.  Operates only on task->mempolicy.
    
    weighted_interleave_nid:
        Gets the total weight of the nodemask as well as each individual
        node weight, then calculates the node based on the given index.
        Operates on VMA policies.
    
    bulk_array_weighted_interleave:
        Gets the total weight of the nodemask as well as each individual
        node weight, then calculates the number of "interleave rounds" as
        well as any delta ("partial round").  Calculates the number of
        pages for each node and allocates them.
    
        If a node was scheduled for interleave via interleave_nodes, the
        current weight will be allocated first.
    
        Operates only on the task->mempolicy.
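
    The index-based lookup and the bulk split described above can be
    sketched in userspace as follows.  This is an illustration under
    simplifying assumptions, not the kernel implementation: nodes are
    numbered densely from 0, the wi_* names are hypothetical, and the bulk
    helper starts from node 0 with no leftover weight, whereas the kernel
    first consumes the remaining weight of the current node.

```c
#include <assert.h>

/* Sum of the weights of all nodes in the (dense) node array. */
unsigned int wi_total_weight(const unsigned int *weights, int nr_nodes)
{
	unsigned int total = 0;

	assert(nr_nodes > 0);
	for (int nid = 0; nid < nr_nodes; nid++)
		total += weights[nid];
	return total;
}

/* Map an interleave index to a node by walking the cumulative weights. */
int wi_node_for_index(unsigned long ilx, const unsigned int *weights,
		      int nr_nodes)
{
	unsigned int total = wi_total_weight(weights, nr_nodes);
	unsigned int target;

	assert(total > 0);
	target = (unsigned int)(ilx % total);
	for (int nid = 0; nid < nr_nodes; nid++) {
		if (target < weights[nid])
			return nid;
		target -= weights[nid];
	}
	return nr_nodes - 1;	/* not reached when total > 0 */
}

/* Split nr_pages into full interleave rounds plus one partial round. */
void wi_bulk_counts(unsigned long nr_pages, const unsigned int *weights,
		    int nr_nodes, unsigned long *per_node)
{
	unsigned int total = wi_total_weight(weights, nr_nodes);
	unsigned long rounds, delta;

	assert(total > 0);
	rounds = nr_pages / total;	/* complete rounds */
	delta = nr_pages % total;	/* pages in the partial round */

	for (int nid = 0; nid < nr_nodes; nid++) {
		unsigned long extra = delta < weights[nid] ?
				      delta : weights[nid];

		per_node[nid] = rounds * weights[nid] + extra;
		delta -= extra;
	}
}
```

    With weights (2,1), indices 0, 1, 2, 3 map to nodes 0, 0, 1, 0, and a
    bulk request for 10 pages is split into 7 pages on node0 and 3 pages on
    node1 (three full rounds plus a one-page partial round).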
    
    One piece of complexity is the interaction with a recent refactor, which
    split the logic that acquires the "ilx" (interleave index) of an
    allocation from the actual application of the interleave.  If a call to
    alloc_pages_mpol() were made with a weighted-interleave policy and ilx set
    to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA
    policy - violating the description above.
    
    An inspection of all callers of alloc_pages_mpol() shows that all external
    callers set ilx to `0`, an index value, or will call get_vma_policy() to
    acquire the ilx.
    
    For example, mm/shmem.c may call into alloc_pages_mpol().  The call stacks
    all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This enforces the
    `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy
    requirements (task/vma respectively).
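
    The invariant can be written down as a small sketch.  The names below
    are hypothetical, and the value used for WI_NO_INTERLEAVE_INDEX (all
    bits set) is an assumed stand-in for the kernel's NO_INTERLEAVE_INDEX.
    The assertion encodes the requirement that only task policies may
    reach the stateful path.

```c
#include <assert.h>

/* Assumed stand-in for the kernel's NO_INTERLEAVE_INDEX sentinel. */
#define WI_NO_INTERLEAVE_INDEX (~0UL)

enum wi_path {
	WI_PATH_TASK,	/* stateful path, like weighted_interleave_nodes() */
	WI_PATH_INDEX,	/* index path, like weighted_interleave_nid() */
};

/*
 * Choose which helper would serve an allocation.  policy_is_vma is nonzero
 * when the policy came from a VMA rather than from the task.
 */
enum wi_path wi_select_path(unsigned long ilx, int policy_is_vma)
{
	if (ilx == WI_NO_INTERLEAVE_INDEX) {
		/* Only task policies may use the per-task counters. */
		assert(!policy_is_vma);
		return WI_PATH_TASK;
	}
	return WI_PATH_INDEX;
}
```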
    
    Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com
    Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
    Signed-off-by: Gregory Price <gregory.price@memverge.com>
    Co-developed-by: Rakie Kim <rakie.kim@sk.com>
    Signed-off-by: Rakie Kim <rakie.kim@sk.com>
    Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
    Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
    Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
    Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
    Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
    Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
    Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
    Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>