• Rik van Riel's avatar
    sched/numa: Cap PTE scanning overhead to 3% of run time · 51170840
    Rik van Riel authored
    There is a fundamental mismatch between the runtime based NUMA scanning
    at the task level, and the wall clock time NUMA scanning at the mm level.
    On a severely overloaded system, with very large processes, this mismatch
    can cause the system to spend all of its time in change_prot_numa().
    
    This can happen if the task spends at least two ticks in change_prot_numa(),
    and only gets two ticks of CPU time in the real time between two scan
    intervals of the mm.
    
    This patch ensures that a task never spends more than 3% of run
    time scanning PTEs. It does that by ensuring that in-between
    task_numa_work() runs, the task spends at least 32x as much time on
    other things than it did on task_numa_work().
    
    This is done stochastically: if a timer tick happens, or the task
    gets rescheduled during task_numa_work(), we delay a future run of
    task_numa_work() until the task has spent at least 32x the amount of
    CPU time doing something else, as it spent inside task_numa_work().
    The longer task_numa_work() takes, the more likely it is this happens.
    
    If task_numa_work() takes very little time, chances are low that that
    code will do anything, but we will not care.
    Reported-and-tested-by: default avatarJan Stancek <jstancek@redhat.com>
    Signed-off-by: default avatarRik van Riel <riel@redhat.com>
    Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    51170840
fair.c 219 KB