    mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy · 133d04b1
    Donet Tom authored
    Commit bda420b9 ("numa balancing: migrate on fault among multiple
    bound nodes") added support for migrate on protnone reference with the
    MPOL_BIND memory policy.  This allowed NUMA fault migration when the
    executing node is part of the policy mask for MPOL_BIND.  This patch
    extends migration support to the MPOL_PREFERRED_MANY policy.
    
    Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
    MPOL_F_NUMA_BALANCING.  This causes issues when we want to use
    NUMA_BALANCING_MEMORY_TIERING.  To effectively use the slow memory tier,
    the kernel should not allocate pages from the slower memory tier via
    allocation control zonelist fallback.  Instead, we should move cold pages
    from the faster memory node via memory demotion.  For a page allocation,
    kswapd is only woken up after we try to allocate pages from all nodes in
    the allocation zone list.  This implies that, without using memory
    policies, we will end up allocating hot pages in the slower memory tier.
    
    MPOL_PREFERRED_MANY was added by commit b27abacc ("mm/mempolicy: add
    MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
    allocation control when we have memory tiers in the system.  With
    MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
    of faster memory nodes.  When we fail to allocate pages from the faster
    memory node, kswapd would be woken up, allowing demotion of cold pages to
    slower memory nodes.
    
    With the current kernel, such usage of memory policies implies we can't do
    page promotion from a slower memory tier to a faster memory tier using
    numa fault.  This patch fixes this issue.
    
    For MPOL_PREFERRED_MANY, if the executing node is in the policy node mask,
    we allow NUMA migration to the executing node.  If the executing node is
    not in the policy node mask, we do not allow NUMA migration.
    
    Example:
    On a 2-socket system, NUMA nodes N0, N1 and N2 are in socket 0 and
    N3 is in socket 1.  N0, N1 and N3 have fast memory and CPUs, while
    N2 has slow memory and no CPU.  For a workload, we may use
    MPOL_PREFERRED_MANY with nodemask N0 and N1 set because the workload
    runs on the CPUs of socket 0 most of the time.  Then, even if the
    workload occasionally runs on the CPUs of N3, we will not try to migrate
    the workload's pages from N2 to N3, because users may want to avoid
    cross-socket access as much as possible in the long term.
    
    In the table below, Process is the node the process is executing on and
    Curr Loc Pgs is the NUMA node where the page currently resides (folio node).
    ============================================================
    Process  Policy  Curr Loc Pgs     Observation
    ------------------------------------------------------------
    N0       N0 N1      N1         Pages Migrated from N1 to N0
    N0       N0 N1      N2         Pages Migrated from N2 to N0
    N0       N0 N1      N3         Pages Migrated from N3 to N0
    
    N3       N0 N1      N0         Pages NOT Migrated to N3
    N3       N0 N1      N1         Pages NOT Migrated to N3
    N3       N0 N1      N2         Pages NOT Migrated to N3
    ------------------------------------------------------------
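    
    The rule and the table above can be sketched as a small, self-contained
    C function.  This is an illustration only, not the code changed by this
    patch: the type nodemask_sketch_t and the helpers node_bit() and
    sketch_migrate_target() are hypothetical names, the node mask is modelled
    as a plain 64-bit bitmap, and all other checks the kernel performs before
    migrating a page on a NUMA hint fault are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a policy node mask (not the kernel's
 * nodemask_t): bit n set means node n is part of the policy.
 */
typedef uint64_t nodemask_sketch_t;

/* Build a mask with only node 'nid' set (assumes 0 <= nid < 64). */
static inline nodemask_sketch_t node_bit(int nid)
{
	return (nodemask_sketch_t)1 << nid;
}

/*
 * Simplified model of the MPOL_PREFERRED_MANY rule described above.
 *
 * policy_nodes: nodes listed in the policy node mask
 * exec_nid:     node the faulting task is executing on
 * page_nid:     node where the page (folio) currently resides
 *
 * Returns the node the page may migrate to, or -1 if no migration
 * should happen.
 */
static int sketch_migrate_target(nodemask_sketch_t policy_nodes,
				 int exec_nid, int page_nid)
{
	/* Executing node outside the policy mask: do not migrate. */
	if (!(policy_nodes & node_bit(exec_nid)))
		return -1;

	/* Page is already on the executing node: nothing to do. */
	if (page_nid == exec_nid)
		return -1;

	/* Executing node is in the mask: migrate towards it. */
	return exec_nid;
}
```

    With policy_nodes = node_bit(0) | node_bit(1), a call such as
    sketch_migrate_target(policy_nodes, 0, 2) models the second row of the
    table (task on N0, page on N2), and
    sketch_migrate_target(policy_nodes, 3, 2) models the last row.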
    
    Link: https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
    Link: https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
    Signed-off-by: Donet Tom <donettom@linux.ibm.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Huang, Ying <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>