    sched/numa: Spread memory according to CPU and memory use · 4142c3eb
    Rik van Riel authored
    The pseudo-interleaving in NUMA placement has a fundamental problem:
    using hard usage thresholds to spread memory equally between nodes
    can prevent workloads from converging, or keep memory "trapped" on
    nodes where the workload is barely running any more.
    
    In order for workloads to properly converge, the memory migration
    should not be stopped when nodes reach parity, but instead be
    distributed according to how heavily memory is used from each node.
    This way memory migration and task migration reinforce each other,
    instead of one putting the brakes on the other.
    
    Remove the hard thresholds from the pseudo-interleaving code, and
    instead use a more gradual policy on memory placement. This also
    seems to improve convergence of workloads that do not run flat out,
    but sleep in between bursts of activity.
    
    We still want to slow down NUMA scanning and migration once a workload
    has settled on a few actively used nodes, so keep the 3/4 hysteresis
    in place. Keep track of whether a workload is actively running on
    multiple nodes, so task_numa_migrate does a full scan of the system
    for better task placement.
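As a rough illustration of the policy described above, the following C sketch compares the CPU-to-memory fault ratio of a destination node against that of a source node, with the 3/4 hysteresis applied, and counts how many nodes a workload is actively running on. The struct, the function names, and the "one third of the busiest node" activity cutoff are assumptions made for this example, not the actual code in fair.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node fault counters for one workload (illustrative only). */
struct node_stats {
	uint64_t faults_cpu;	/* faults taken by tasks running on this node */
	uint64_t faults_mem;	/* faults against memory located on this node */
};

/* Assumed cutoff: a node is "active" above 1/3 of the busiest node's CPU use. */
#define ACTIVE_NODE_FRACTION 3

/*
 * Allow moving memory from src to dst only when dst is used more heavily
 * relative to the memory it already holds, with 3/4 hysteresis:
 *
 *   faults_cpu(dst)   3   faults_cpu(src)
 *   --------------- * - > ---------------
 *   faults_mem(dst)   4   faults_mem(src)
 *
 * Cross-multiplied to avoid division (and division by zero). Overflow is
 * ignored here; the counters are assumed to be small.
 */
static bool should_migrate_memory(const struct node_stats *src,
				  const struct node_stats *dst)
{
	return dst->faults_cpu * src->faults_mem * 3 >
	       src->faults_cpu * dst->faults_mem * 4;
}

/* Count the nodes whose CPU use exceeds the assumed fraction of the maximum. */
static int count_active_nodes(const struct node_stats *nodes, int nr_nodes)
{
	uint64_t max_faults = 0;
	int active = 0;
	int i;

	for (i = 0; i < nr_nodes; i++)
		if (nodes[i].faults_cpu > max_faults)
			max_faults = nodes[i].faults_cpu;

	for (i = 0; i < nr_nodes; i++)
		if (nodes[i].faults_cpu * ACTIVE_NODE_FRACTION > max_faults)
			active++;

	return active;
}
```

With this shape there is no point at which migration stops merely because two nodes hold equal amounts of memory: the comparison keeps favoring whichever node is doing more of the accessing, and the 3/4 factor only suppresses migrations between nodes whose ratios are already close.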
    
    In the case of running 3 SPECjbb2005 instances on a 4 node system,
    this code seems to result in fairer distribution of memory between
    nodes, with more memory bandwidth for each instance.
    
    Signed-off-by: Rik van Riel <riel@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/20160125170739.2fc9a641@annuminas.surriel.com
    [ Minor readability tweaks. ]
    Signed-off-by: Ingo Molnar <mingo@kernel.org>