    mm, hugetlb: allow hugepage allocations to reclaim as needed · 3f36d866
    David Rientjes authored
    Commit b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when
    compaction may not succeed") has changed the allocator to bail out
    early in order to prevent potentially excessive memory reclaim.
    __GFP_RETRY_MAYFAIL is designed to retry the allocation,
    reclaim and compaction loop as long as there is a reasonable chance to
    make forward progress.  Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
    the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
    
    The most obvious affected subsystem is hugetlbfs which allocates huge
    pages based on an admin request (or via admin configured overcommit).  I
    have done a simple test which tries to allocate half of the memory for
    hugetlb pages while the memory is full of a clean page cache.  This is
    not an unusual situation because we try to cache as much of the memory
    as possible, and the sysctl/sysfs interface to allocate huge pages is
    there for flexibility to allocate hugetlb pages at any time.
    
    System has 1GB of RAM and we are requesting 512MB worth of hugetlb pages
    after the memory is prefilled by a clean page cache:
    
      root@test1:~# cat hugetlb_test.sh
    
      set -x
      echo 0 > /proc/sys/vm/nr_hugepages
      echo 3 > /proc/sys/vm/drop_caches
      echo 1 > /proc/sys/vm/compact_memory
      dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
      TS=$(date +%s)
      echo 256 > /proc/sys/vm/nr_hugepages
      cat /proc/sys/vm/nr_hugepages
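
    As a quick sanity check of the request size, the arithmetic behind
    "echo 256 > /proc/sys/vm/nr_hugepages" can be sketched as below.  The
    2MB huge page size is an assumption (the changelog does not state it)
    and the function name is purely illustrative:

```c
#include <stdint.h>

/* Assumption: 2MB huge pages; the changelog does not state the size. */
#define SKETCH_HUGEPAGE_SIZE_MB 2u

/* Hypothetical helper: total memory requested for the hugetlb pool, in MB. */
uint64_t sketch_hugetlb_request_mb(uint64_t nr_hugepages,
                                   uint64_t hugepage_size_mb)
{
    return nr_hugepages * hugepage_size_mb;
}
```

    Under that assumption, 256 pages amount to half of the 1GB of RAM.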
    
    The results for 2 consecutive runs on clean 5.3
    
      root@test1:~# sh hugetlb_test.sh
      + echo 0
      + echo 3
      + echo 1
      + dd if=/mnt/data/file-1G of=/dev/null bs=4096
      262144+0 records in
      262144+0 records out
      1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
      + date +%s
      + TS=1569905284
      + echo 256
      + cat /proc/sys/vm/nr_hugepages
      256
      root@test1:~# sh hugetlb_test.sh
      + echo 0
      + echo 3
      + echo 1
      + dd if=/mnt/data/file-1G of=/dev/null bs=4096
      262144+0 records in
      262144+0 records out
      1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
      + date +%s
      + TS=1569905311
      + echo 256
      + cat /proc/sys/vm/nr_hugepages
      256
    
    Now with b39d0ee2 applied
    
      root@test1:~# sh hugetlb_test.sh
      + echo 0
      + echo 3
      + echo 1
      + dd if=/mnt/data/file-1G of=/dev/null bs=4096
      262144+0 records in
      262144+0 records out
      1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
      + date +%s
      + TS=1569905516
      + echo 256
      + cat /proc/sys/vm/nr_hugepages
      11
      root@test1:~# sh hugetlb_test.sh
      + echo 0
      + echo 3
      + echo 1
      + dd if=/mnt/data/file-1G of=/dev/null bs=4096
      262144+0 records in
      262144+0 records out
      1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
      + date +%s
      + TS=1569905541
      + echo 256
      + cat /proc/sys/vm/nr_hugepages
      12
    
    The success rate went down by a factor of 20!
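
    The figure can be checked against the runs above: 256 of 256 pages
    were allocated on clean 5.3, but only 11 and 12 of 256 with b39d0ee2
    applied.  A minimal sketch of that calculation, with a hypothetical
    helper name and integer rounding towards zero:

```c
/*
 * Hypothetical helper: percentage of the requested hugetlb pool that was
 * actually allocated.  Integer division truncates the result.
 */
unsigned int sketch_pool_fill_percent(unsigned int obtained,
                                      unsigned int requested)
{
    if (requested == 0)
        return 0;    /* nothing requested, nothing to report */
    return obtained * 100u / requested;
}
```

    With the numbers from the runs above, the pool is filled to 100% on
    clean 5.3 and to about 4% with b39d0ee2 applied.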
    
    Hugetlb allocation requests might fail, and it is reasonable to
    expect them to under extremely fragmented memory or when the memory
    is under heavy pressure, but the above situation is not such a case.
    
    Fix the regression by reverting to the previous behavior for
    __GFP_RETRY_MAYFAIL requests and disabling the bail out heuristic for
    those requests.
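
    The intended decision can be sketched roughly as follows.  This is a
    simplified userspace model, not the actual page_alloc.c code: the
    SKETCH_* names and flag values are hypothetical stand-ins, and any
    further conditions the real heuristic checks are deliberately left
    out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for __GFP_RETRY_MAYFAIL; the value is illustrative. */
#define SKETCH_GFP_RETRY_MAYFAIL (1u << 0)

/* Hypothetical stand-ins for the compaction results named in the changelog. */
enum sketch_compact_result {
    SKETCH_COMPACT_SKIPPED,
    SKETCH_COMPACT_DEFERRED,
    SKETCH_COMPACT_OTHER,    /* any result that does give feedback */
};

/*
 * Returns true when the allocation should give up early instead of
 * entering the reclaim and compaction retry loop.
 */
bool sketch_bail_out_early(unsigned int gfp_mask,
                           enum sketch_compact_result result)
{
    /* Only SKIPPED and DEFERRED trigger the early bail out. */
    if (result != SKETCH_COMPACT_SKIPPED &&
        result != SKETCH_COMPACT_DEFERRED)
        return false;

    /* The fix: requests that asked to retry are exempt from the heuristic. */
    if (gfp_mask & SKETCH_GFP_RETRY_MAYFAIL)
        return false;

    return true;
}
```

    In this model a hugetlb pool allocation, which the changelog describes
    as a __GFP_RETRY_MAYFAIL request, never takes the early exit, while
    other requests keep the behavior introduced by b39d0ee2.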
    
    Mike said:
    
    : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
    : boot where this may not be as much of an issue.  However, I am aware of at
    : least three use cases where allocations are made after the system has been
    : up and running for quite some time:
    :
    : - DB reconfiguration.  If sysctl/sysfs fails to get required number of
    :   huge pages, system is rebooted to perform allocation after boot.
    :
    : - VM provisioning.  If unable to get required number of huge pages, fall
    :   back to base pages.
    :
    : - An application that does not preallocate pool, but rather allocates
    :   pages at fault time for optimal NUMA locality.
    :
    : In all cases, I would expect b39d0ee2 to cause regressions and
    : noticeable behavior changes.
    :
    : My quick/limited testing in
    : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
    : was insufficient.  It was also mentioned that if something like
    : b39d0ee2 went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
    : requests as in this patch.
    
    [mhocko@suse.com: reworded changelog]
    Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
    Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
    Signed-off-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>