    hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles · d2cf88c2
    Mike Kravetz authored
    Patch series "Batch hugetlb vmemmap modification operations", v8.
    
    When hugetlb vmemmap optimization was introduced, the overhead of enabling
    the option was measured as described in commit 426e5c42 [1].  The
    summary states that allocating a hugetlb page should be ~2x slower with
    optimization and freeing a hugetlb page should be ~2-3x slower.  Such
    overhead was deemed an acceptable trade off for the memory savings
    obtained by freeing vmemmap pages.
    
    It was recently reported that the overhead associated with enabling
    vmemmap optimization could be as high as 190x for hugetlb page
    allocations.  Yes, 190x!  Some actual numbers from other environments are:
    
    Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
    ------------------------------------------------
    Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
    time echo 500000 > .../hugepages-2048kB/nr_hugepages
    real    0m4.119s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    0m4.477s
    
    Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
    time echo 500000 > .../hugepages-2048kB/nr_hugepages
    real    0m28.973s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    0m36.748s
    
    VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
    -----------------------------------------------------------
    Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
    time echo 524288 > .../hugepages-2048kB/nr_hugepages
    real    0m2.463s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    0m2.931s
    
    Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
    time echo 524288 > .../hugepages-2048kB/nr_hugepages
    real    2m27.609s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    2m29.924s
    
    In the VM environment, enabling hugetlb vmemmap optimization resulted in
    allocation times that were 61x slower.
    
    A quick profile showed that the vast majority of this overhead was due
    to TLB flushing.  Each time the kernel pagetable is modified, the TLB
    must be flushed.  For each hugetlb page that is optimized, up to two
    TLB flushes may be performed: one for the vmemmap pages associated with
    the hugetlb page, and potentially another if the vmemmap pages are
    mapped at the PMD level and must be split.  TLB flushes of the kernel
    pagetable result in a broadcast IPI, with each CPU having to flush a
    range of pages or perform a global flush if a threshold is exceeded, so
    flush time increases with the number of CPUs.  In addition, in virtual
    environments the broadcast IPI cannot be accelerated by hypervisor
    hardware; it leads to traps that must wake up/IPI all vCPUs, which is
    very expensive.  Because of this, the slowdown in virtual environments
    grows even worse than on bare metal as the number of vCPUs/CPUs
    increases.
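
    For illustration, before this series the flushes are issued from inside
    the per-page optimize call, roughly like this (hugetlb_vmemmap_optimize()
    is the real entry point in mm/hugetlb_vmemmap.c; the loop and comments
    are a paraphrase, not the exact code):

    list_for_each_entry(folio, folio_list, lru) {
            /*
             * May first split a PMD-mapped vmemmap area (kernel TLB
             * flush #1), then remaps this folio's vmemmap to a single
             * shared page (kernel TLB flush #2).  Each flush is a
             * flush_tlb_kernel_range(), a broadcast IPI to all CPUs.
             */
            hugetlb_vmemmap_optimize(h, &folio->page);
    }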
    
    The following series attempts to reduce the amount of time spent in TLB
    flushing by batching the vmemmap modification operations for multiple
    hugetlb pages.  Instead of doing one or two TLB flushes for each page,
    we do two TLB flushes for each batch of pages: one flush after splitting
    pages mapped at the PMD level, and another after remapping the vmemmap
    associated with all hugetlb pages (a sketch of this two-flush flow
    follows the results below).  Results of such batching are as follows:
    
    Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
    ------------------------------------------------
    next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
    time echo 500000 > .../hugepages-2048kB/nr_hugepages
    real    0m4.719s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    0m4.245s
    
    next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
    time echo 500000 > .../hugepages-2048kB/nr_hugepages
    real    0m7.267s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    0m13.199s
    
    VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
    -----------------------------------------------------------
    next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
    time echo 524288 > .../hugepages-2048kB/nr_hugepages
    real    0m2.715s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    0m3.186s
    
    next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
    time echo 524288 > .../hugepages-2048kB/nr_hugepages
    real    0m4.799s
    time echo 0 > .../hugepages-2048kB/nr_hugepages
    real    0m5.273s
    
    With batching, results are back in the 2-3x slowdown range.
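
    Conceptually, the batched path added later in this series hoists the
    flushes out of the per-page loop.  A simplified sketch follows; the
    helper names and the NO_TLB_FLUSH flag are paraphrased from the series,
    and the real code also handles the case where optimization is disabled:

    void hugetlb_vmemmap_optimize_folios(struct hstate *h,
                                         struct list_head *folio_list)
    {
            struct folio *folio;

            /* Pass 1: split PMD-mapped vmemmap areas, deferring flushes. */
            list_for_each_entry(folio, folio_list, lru)
                    hugetlb_vmemmap_split(h, &folio->page);

            flush_tlb_all();        /* one flush for all the splits */

            /* Pass 2: remap each folio's vmemmap, again deferring flushes. */
            list_for_each_entry(folio, folio_list, lru)
                    __hugetlb_vmemmap_optimize(h, &folio->page,
                                               VMEMMAP_REMAP_NO_TLB_FLUSH);

            flush_tlb_all();        /* one flush for all the remaps */
    }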
    
    
    This patch (of 8):
    
    update_and_free_pages_bulk is designed to free a list of hugetlb pages
    back to their associated lower level allocators.  This may require
    allocating vmemmap pages associated with each hugetlb page.  The hugetlb
    page destructor must be changed before pages are freed to the lower
    level allocators, and the destructor can only be changed while holding
    the hugetlb lock.  This means there is potentially one lock cycle per
    page.
    
    Minimize the number of lock cycles in update_and_free_pages_bulk by:
    1) allocating the necessary vmemmap for all hugetlb pages on the list
    2) taking the hugetlb lock once and clearing the destructor for all
       pages on the list
    3) freeing all pages on the list back to the low level allocators
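
    A condensed sketch of the restructured function (abridged from the
    patch; the error path that returns a page to the surplus pool when
    vmemmap allocation fails is noted in a comment rather than spelled out):

    static void update_and_free_pages_bulk(struct hstate *h,
                                           struct list_head *list)
    {
            struct page *page, *t_page;
            bool clear_dtor = false;

            /*
             * 1) Allocate any needed vmemmap for every page on the list
             *    up front, without holding the hugetlb lock.  (In the
             *    full patch, a page whose vmemmap allocation fails is
             *    added back as a hugetlb surplus page.)
             */
            list_for_each_entry(page, list, lru) {
                    if (HPageVmemmapOptimized(page)) {
                            clear_dtor = true;
                            hugetlb_vmemmap_restore(h, page);
                            cond_resched();
                    }
            }

            /*
             * 2) If any vmemmap was allocated, take the hugetlb lock
             *    once and clear the destructor of all pages on the list.
             */
            if (clear_dtor) {
                    spin_lock_irq(&hugetlb_lock);
                    list_for_each_entry(page, list, lru)
                            __clear_hugetlb_destructor(h, page_folio(page));
                    spin_unlock_irq(&hugetlb_lock);
            }

            /*
             * 3) Free all pages back to the low level allocators; vmemmap
             *    and destructors were handled above, so no further lock
             *    cycles are needed here.
             */
            list_for_each_entry_safe(page, t_page, list, lru) {
                    update_and_free_hugetlb_folio(h, page_folio(page), false);
                    cond_resched();
            }
    }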
    
    Link: https://lkml.kernel.org/r/20231019023113.345257-1-mike.kravetz@oracle.com
    Link: https://lkml.kernel.org/r/20231019023113.345257-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Acked-by: James Houghton <jthoughton@google.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Barry Song <21cnbao@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joao Martins <joao.m.martins@oracle.com>
    Cc: Konrad Dybcio <konradybcio@kernel.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Usama Arif <usama.arif@bytedance.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>