    mm, reclaim: make should_continue_reclaim perform dryrun detection · 1c6c1597
    Hillf Danton authored
    Patch series "address hugetlb page allocation stalls", v2.
    
    Allocation of hugetlb pages via sysctl or procfs can stall for minutes or
    hours.  A simple example on a two node system with 8GB of memory is as
    follows:
    
    echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
    echo 4096 > /proc/sys/vm/nr_hugepages
    
    Obviously, both allocation attempts will fall short of their 8GB goal.
    However, one or both of these commands may stall and not be interruptible.
    The issues were initially discussed in mail thread [1] and RFC code at
    [2].
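
    The size of each request above can be checked with a little arithmetic.
    The following is a minimal sketch, not part of the series; the helper name
    hugetlb_request_bytes is hypothetical and exists only for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes requested when nr_pages is written to an
 * nr_hugepages file for huge pages of page_size_kb kilobytes each.
 */
static uint64_t hugetlb_request_bytes(uint64_t nr_pages, uint64_t page_size_kb)
{
	return nr_pages * page_size_kb * 1024;
}
```

    With 4096 pages of 2048kB each, a single request already equals the 8GB
    of memory in the whole system, so neither command can be fully satisfied.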
    
    This series addresses the issues causing the stalls.  There are two
    distinct fixes, a cleanup, and an optimization.  The reclaim patch by
    Hillf and the compaction patch by Vlastimil address corner cases in their
    respective areas.  hugetlb page allocation could stall due to either of
    these issues.  Vlastimil added a cleanup patch after Hillf's
    modifications.  The hugetlb patch by Mike is an optimization suggested
    during the debug and development process.
    
    [1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com
    [2] http://lkml.kernel.org/r/20190724175014.9935-1-mike.kravetz@oracle.com
    
    This patch (of 4):
    
    Address the issue of should_continue_reclaim returning true too often for
    __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.  This was
    observed during hugetlb page allocation causing stalls for minutes or
    hours.
    
    We can stop reclaiming pages if compaction reports it can make progress.
    There might be side-effects for other high-order allocations that would
    potentially benefit from reclaiming more before compaction so that they
    would be faster and less likely to stall.  However, the consequences of
    premature/over-reclaim are considered worse.
    
    We can also bail out of reclaiming pages if we know that there are not
    enough inactive lru pages left to satisfy the costly allocation.
    
    We can also give up reclaiming pages if a dryrun occurs (pages were
    scanned but none were reclaimed), even when we are certain that plenty of
    inactive pages remain.  IOW once a dryrun is detected, we can be sure we
    have reclaimed as many pages as we could.
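
    The three stop conditions above can be sketched as a small userspace
    model.  This is an illustration under stated assumptions, not the kernel
    code: struct reclaim_model, compact_gap_model() and
    should_continue_reclaim_model() are hypothetical names, the order of the
    checks is chosen for readability, and the threshold of twice the request
    size is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the state the decision consults. */
struct reclaim_model {
	bool reclaim_compaction_mode;     /* costly high-order request, compaction usable */
	bool compaction_can_proceed;      /* compaction reports it can make progress */
	unsigned int order;               /* allocation order of the request */
	unsigned long inactive_lru_pages; /* inactive LRU pages left on the node */
};

/* Assumed threshold: compaction wants about twice the request size free. */
static unsigned long compact_gap_model(unsigned int order)
{
	return 2UL << order;
}

/* Returns true if another reclaim pass looks worthwhile in this model. */
static bool should_continue_reclaim_model(const struct reclaim_model *m,
					  unsigned long nr_reclaimed)
{
	if (!m->reclaim_compaction_mode)
		return false;

	/* Dryrun: the last pass reclaimed nothing, so stop. */
	if (!nr_reclaimed)
		return false;

	/* Compaction can already make progress, so stop reclaiming. */
	if (m->compaction_can_proceed)
		return false;

	/* Continue only while enough inactive pages remain to be worth it. */
	return m->inactive_lru_pages > compact_gap_model(m->order);
}
```

    In this model a pass that reclaims nothing always ends the loop, whether
    or not pages were scanned, which is the behavior change described for
    __GFP_RETRY_MAYFAIL attempts.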
    
    Link: http://lkml.kernel.org/r/20190806014744.15446-2-mike.kravetz@oracle.com
    Signed-off-by: Hillf Danton <hdanton@sina.com>
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmscan.c 122 KB