    memory tiering: hot page selection with hint page fault latency
    Huang Ying authored
    Patch series "memory tiering: hot page selection", v4.
    
    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory nodes need to be identified. 
    Essentially, the original NUMA balancing implementation selects the most
    recently accessed (MRU) pages to promote.  But this isn't a perfect
    algorithm to identify the hot pages, because pages with quite low access
    frequency may be accessed eventually, given that the NUMA balancing page
    table scanning period can be quite long (e.g.  60 seconds).  So in this
    patchset, we implement a new hot page identification algorithm based on
    the latency between NUMA balancing page table scanning and the hint page
    fault, which is a kind of most frequently accessed (MFU) algorithm.
    
    In NUMA balancing memory tiering mode, if there are hot pages in the slow
    memory node and cold pages in the fast memory node, we need to
    promote/demote the hot/cold pages between the fast and slow memory nodes.
    
    One choice is to promote/demote as fast as possible.  But the CPU cycles
    and memory bandwidth consumed by the high promoting/demoting throughput
    will hurt the latency of some workloads, because of access latency
    inflation and contention for the slow memory bandwidth.
    
    One way to resolve this issue is to restrict the maximum
    promoting/demoting throughput.  The promoting/demoting will take longer
    to finish, but the workload latency will be better.  This is implemented
    in this patchset as the page promotion rate limit mechanism.
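
    As a rough illustration of the rate limit idea (a minimal sketch, not
    the kernel implementation; the struct and function names below are
    made up for this example), promotion can be accounted per node in
    one-second windows and skipped once the window budget is exhausted:

        /* Hypothetical per-node accounting state for this sketch. */
        struct promo_rate_limit {
                unsigned long window_start;     /* jiffies at window start */
                unsigned long nr_promoted;      /* pages promoted in this window */
        };

        /* Return true if promoting @nr_pages now would exceed the limit. */
        static bool promotion_rate_limited(struct promo_rate_limit *rl,
                                           unsigned long nr_pages,
                                           unsigned long limit_pages_per_sec)
        {
                unsigned long now = jiffies;

                /* Open a new accounting window every second. */
                if (time_after(now, rl->window_start + HZ)) {
                        rl->window_start = now;
                        rl->nr_promoted = 0;
                }

                if (rl->nr_promoted + nr_pages > limit_pages_per_sec)
                        return true;    /* over budget: skip promotion */

                rl->nr_promoted += nr_pages;
                return false;
        }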
    
    The promotion hot threshold is workload and system configuration
    dependent.  So in this patchset, a method to adjust the hot threshold
    automatically is implemented.  The basic idea is to control the number of
    candidate promotion pages to match the promotion rate limit.
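
    A minimal sketch of that adjustment (illustrative only; the 10%
    margins and the clamping bounds are arbitrary choices for this
    example, not taken from the kernel code): once per adjustment period,
    compare the number of candidate promotion pages observed with the
    rate limit, then tighten or relax the threshold.

        static void adjust_hot_threshold(unsigned long nr_candidates,
                                         unsigned long rate_limit_pages,
                                         unsigned int *hot_threshold_ms)
        {
                /* Too many candidates: require a shorter fault latency. */
                if (nr_candidates > rate_limit_pages + rate_limit_pages / 10)
                        *hot_threshold_ms = max(*hot_threshold_ms / 2, 1U);
                /* Too few candidates: accept a longer fault latency. */
                else if (nr_candidates < rate_limit_pages - rate_limit_pages / 10)
                        *hot_threshold_ms = min(*hot_threshold_ms * 2, 60000U);
        }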
    
    We used the pmbench memory accessing benchmark to test the patchset on a
    2-socket server system with DRAM and PMEM installed.  The test results
    are as follows,
    
                          pmbench score   promote rate
                           (accesses/s)         (MB/s)
                          -------------   ------------
    base                    146887704.1          725.6
    hot selection           165695601.2          544.0
    rate limit              162814569.8          165.2
    auto adjustment         170495294.0          136.9
    
    From the results above,
    
    With hot page selection patch [1/3], the pmbench score increases about
    12.8%, and promote rate (overhead) decreases about 25.0%, compared with
    base kernel.
    
    With rate limit patch [2/3], pmbench score decreases about 1.7%, and
    promote rate decreases about 69.6%, compared with hot page selection
    patch.
    
    With threshold auto adjustment patch [3/3], pmbench score increases about
    4.7%, and promote rate decreases about 17.1%, compared with rate limit
    patch.
    
    Baolin helped to test the patchset with MySQL on a machine which contains
    1 DRAM node (30G) and 1 PMEM node (126G).
    
    sysbench /usr/share/sysbench/oltp_read_write.lua \
    ......
    --tables=200 \
    --table-size=1000000 \
    --report-interval=10 \
    --threads=16 \
    --time=120
    
    The TPS can be improved by about 5%.
    
    
    This patch (of 3):
    
    To optimize page placement in a memory tiering system with NUMA balancing,
    the hot pages in the slow memory node need to be identified.  Essentially,
    the original NUMA balancing implementation selects the most recently
    accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
    identify the hot pages, because pages with quite low access frequency may
    be accessed eventually, given that the NUMA balancing page table scanning
    period can be quite long (e.g.  60 seconds).  The most frequently
    accessed (MFU) algorithm is better.
    
    So, in this patch we implement a better hot page selection algorithm,
    which is based on NUMA balancing page table scanning and the hint page
    fault, as follows,
    
    - When the page tables of the processes are scanned to change PTE/PMD
      to be PROT_NONE, the current time is recorded in struct page as scan
      time.
    
    - When the page is accessed, a hint page fault will occur.  The scan
      time is read from the struct page, and the hint page fault
      latency is defined as
    
        hint page fault time - scan time
    
    The shorter the hint page fault latency of a page is, the higher the
    probability that its access frequency is high.  So the hint page fault
    latency is a better estimation of whether a page is hot or cold.
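
    In simplified C, the calculation looks roughly like the following
    (an illustrative sketch; both timestamps are in milliseconds derived
    from jiffies):

        /* Latency between the page table scan and the hint page fault. */
        static unsigned int hint_fault_latency_ms(unsigned int scan_time_ms)
        {
                unsigned int fault_time_ms = jiffies_to_msecs(jiffies);

                /* Unsigned subtraction stays correct across wrap-around. */
                return fault_time_ms - scan_time_ms;
        }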
    
    It's hard to find some extra space in struct page to hold the scan time. 
    Fortunately, we can reuse some bits used by the original NUMA balancing.
    
    NUMA balancing uses some bits in struct page to store the last accessing
    CPU and PID (refer to page_cpupid_xchg_last()), which are used by the
    multi-stage node selection algorithm to avoid migrating pages that are
    shared-accessed by multiple NUMA nodes back and forth.  But for pages in
    the slow memory node, even if they are shared-accessed by multiple NUMA
    nodes, as long as the pages are hot, they need to be promoted to the fast
    memory node.  So the accessing CPU and PID information is unnecessary for
    the slow memory pages.  We can reuse these bits in struct page to record
    the scan time.  For the fast memory pages, these bits are used as before.
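
    A sketch of that reuse (illustrative only; the field width and the
    helper names record_scan_time()/scan_fault_latency_ms() are
    assumptions made for this example, built on the existing
    page_cpupid_* accessors):

        /* Assumed width of the reused CPU/PID field, for illustration. */
        #define SCAN_TIME_BITS  21
        #define SCAN_TIME_MASK  ((1U << SCAN_TIME_BITS) - 1)

        /* Scanning side: record the (truncated) current time in the page. */
        static void record_scan_time(struct page *page)
        {
                unsigned int now = jiffies_to_msecs(jiffies) & SCAN_TIME_MASK;

                page_cpupid_xchg_last(page, now);
        }

        /* Fault side: read the scan time back and compute the latency. */
        static unsigned int scan_fault_latency_ms(struct page *page)
        {
                unsigned int now = jiffies_to_msecs(jiffies) & SCAN_TIME_MASK;
                unsigned int scan_time = page_cpupid_last(page) & SCAN_TIME_MASK;

                return (now - scan_time) & SCAN_TIME_MASK;
        }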
    
    For the hot threshold, the default value is 1 second, which works well in
    our performance test.  All pages with hint page fault latency < hot
    threshold will be considered hot.
    
    It's hard for users to determine the hot threshold.  So we don't provide a
    kernel ABI to set it, just provide a debugfs interface for advanced users
    to experiment.  We will continue to work on a hot threshold automatic
    adjustment mechanism.
    
    The downside of the above method is that the response time to workload
    hot spot changes may be much longer.  For example,
    
    - A previous cold memory area becomes hot
    
    - The hint page fault will be triggered.  But the hint page fault
      latency isn't shorter than the hot threshold.  So the pages will
      not be promoted.
    
    - When the memory area is scanned again, maybe after a scan period,
      the hint page fault latency measured will be shorter than the hot
      threshold and the pages will be promoted.
    
    To mitigate this, if there is enough free space in the fast memory node,
    the hot threshold will not be used and all pages will be promoted upon
    the hint page fault, for fast response.
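
    Putting the pieces together, the promotion decision looks roughly
    like the following (an illustrative sketch, not the exact kernel
    logic; the free-space flag stands in for "the fast memory node has
    plenty of free pages"):

        static bool should_promote(bool fast_node_has_free_space,
                                   unsigned int fault_latency_ms,
                                   unsigned int hot_threshold_ms)
        {
                /* Plenty of room on the fast node: promote for fast response. */
                if (fast_node_has_free_space)
                        return true;

                /* Otherwise only promote pages that look hot enough. */
                return fault_latency_ms < hot_threshold_ms;
        }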
    
    Thanks to Zhong Jiang, who reported and tested the fix for a bug when
    disabling memory tiering mode dynamically.
    
    Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: osalvador <osalvador@suse.de>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>