    [PATCH] sparsemem swiss cheese numa layouts · 641c7673
    Andy Whitcroft authored
    The part of the sparsemem patch which modifies memmap_init_zone() has recently
    become a problem.  It changes behavior so that there is a call to
    pfn_to_page() for each individual page inside of a node's range:
    node_start_pfn through node_end_pfn.  It used to simply do this once, at the
    beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
    of a node made it necessary to change.
    
    Mike Kravetz recently wrote a patch which made the NUMA code accept some new
    kinds of layouts.  The system's memory was laid out like this, with node 0's
    memory in two pieces: one before and one after node 1's memory:
    
    	Node 0: +++++     +++++
    	Node 1:      +++++
    
    Behavior before Mike's patch was to assign nodes like this:
    
    	Node 0: 00000     XXXXX
    	Node 1:      11111
    
    Where the 'X' areas were simply thrown away.  The new behavior was to make the
    pg_data_t span node 0 across all of its areas, including areas that are really
    node 1's:
    
    	Node 0: 000000000000000
    	Node 1:      11111
    
    This wastes a little bit of mem_map space, but ends up being OK, and more
    fully utilizes the system's memory.  memmap_init_zone() initializes all of the
    "struct page"s for node 0, even for the "hole", but those never get used,
    because there is no pfn_to_page() that resolves to those pages.  However,
    because it calls pfn_to_page() only once, memmap_init_zone() always uses the
    pages that were allocated for node0->node_mem_map:
    
    	struct page *start = pfn_to_page(start_pfn);
    	// effectively start = &node->node_mem_map[0]
    	for (page = start; page < (start + size); page++) {
    		init_page_here();...
    	}
    
    Slow, and wasteful, but generally harmless.
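    
    As a rough userspace illustration (not kernel code), the single-lookup walk
    can be modelled with two toy mem_map arrays.  Every toy_* name, the 15-pfn
    layout, and the nid stamp are assumptions made up for this sketch; only the
    shape of the loop comes from the description above.
    
    ```c
    #include <assert.h>
    
    /* Assumed toy layout: node 0 spans pfns 0..14, node 1 owns pfns 5..9. */
    #define NODE0_START_PFN 0
    #define NODE0_SPAN      15
    #define NODE1_START_PFN 5
    #define NODE1_SPAN      5
    #define TOY_UNTOUCHED   (-1)
    
    struct toy_page { int nid; };	/* stand-in for struct page */
    
    struct toy_page node0_map[NODE0_SPAN];	/* covers the hole too */
    struct toy_page node1_map[NODE1_SPAN];
    
    /* Which node really owns this pfn in the toy layout. */
    int toy_pfn_to_nid(int pfn)
    {
    	return (pfn >= NODE1_START_PFN &&
    		pfn < NODE1_START_PFN + NODE1_SPAN) ? 1 : 0;
    }
    
    /* Resolve a pfn to the owning node's toy mem_map entry. */
    struct toy_page *toy_pfn_to_page(int pfn)
    {
    	assert(pfn >= NODE0_START_PFN && pfn < NODE0_START_PFN + NODE0_SPAN);
    	if (toy_pfn_to_nid(pfn) == 1)
    		return &node1_map[pfn - NODE1_START_PFN];
    	return &node0_map[pfn - NODE0_START_PFN];
    }
    
    void toy_reset_maps(void)
    {
    	int i;
    
    	for (i = 0; i < NODE0_SPAN; i++)
    		node0_map[i].nid = TOY_UNTOUCHED;
    	for (i = 0; i < NODE1_SPAN; i++)
    		node1_map[i].nid = TOY_UNTOUCHED;
    }
    
    /* One lookup at the start, then a linear walk over node 0's own array. */
    void toy_init_node0_single_lookup(void)
    {
    	struct toy_page *start = toy_pfn_to_page(NODE0_START_PFN);
    	struct toy_page *page;
    
    	for (page = start; page < start + NODE0_SPAN; page++)
    		page->nid = 0;
    }
    
    /* Count entries in a map that carry the given nid stamp. */
    int toy_count_stamped(const struct toy_page *map, int len, int nid)
    {
    	int i, n = 0;
    
    	for (i = 0; i < len; i++)
    		if (map[i].nid == nid)
    			n++;
    	return n;
    }
    
    /* Entries of node 0's array that no pfn lookup ever resolves to. */
    int toy_unreachable_node0_entries(void)
    {
    	int pfn, n = 0;
    
    	for (pfn = NODE0_START_PFN; pfn < NODE0_START_PFN + NODE0_SPAN; pfn++)
    		if (toy_pfn_to_page(pfn) != &node0_map[pfn - NODE0_START_PFN])
    			n++;
    	return n;
    }
    ```
    
    In this toy layout the walk stamps all 15 entries of node 0's array, five of
    which sit in the hole and are never reachable through a lookup, while node
    1's array is left alone.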
    
    But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
    does):
    
    	for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
    		page = pfn_to_page(pfn);
    	}
    
    And you end up trying to initialize node 1's pages too early, along with bogus
    data from node 0.  This patch checks for those weird layouts and declines to
    touch the pages, making the more frequent pfn_to_page() calls OK to do.
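    
    The effect of such a check can be sketched in userspace as follows.  This is
    only a model under stated assumptions: the toy_* helpers, the check_owner
    flag, and the 15-pfn layout are hypothetical and are not the names or the
    exact test used by the patch.
    
    ```c
    #include <assert.h>
    
    /* Assumed toy layout: node 0 spans pfns 0..14, node 1 owns pfns 5..9. */
    #define NODE0_START_PFN 0
    #define NODE0_SPAN      15
    #define NODE1_START_PFN 5
    #define NODE1_SPAN      5
    #define TOY_UNTOUCHED   (-1)
    
    struct toy_page { int nid; };	/* stand-in for struct page */
    
    struct toy_page node0_map[NODE0_SPAN];
    struct toy_page node1_map[NODE1_SPAN];
    
    /* Which node really owns this pfn in the toy layout. */
    int toy_pfn_to_nid(int pfn)
    {
    	return (pfn >= NODE1_START_PFN &&
    		pfn < NODE1_START_PFN + NODE1_SPAN) ? 1 : 0;
    }
    
    /* Resolve a pfn to the owning node's toy mem_map entry. */
    struct toy_page *toy_pfn_to_page(int pfn)
    {
    	assert(pfn >= NODE0_START_PFN && pfn < NODE0_START_PFN + NODE0_SPAN);
    	if (toy_pfn_to_nid(pfn) == 1)
    		return &node1_map[pfn - NODE1_START_PFN];
    	return &node0_map[pfn - NODE0_START_PFN];
    }
    
    void toy_reset_maps(void)
    {
    	int i;
    
    	for (i = 0; i < NODE0_SPAN; i++)
    		node0_map[i].nid = TOY_UNTOUCHED;
    	for (i = 0; i < NODE1_SPAN; i++)
    		node1_map[i].nid = TOY_UNTOUCHED;
    }
    
    /*
     * Per-pfn initialization of one node's span.  With check_owner set,
     * pfns that belong to another node are skipped (hypothetical check).
     * Returns the number of pages touched.
     */
    int toy_init_node_per_pfn(int nid, int check_owner)
    {
    	int start = nid ? NODE1_START_PFN : NODE0_START_PFN;
    	int span = nid ? NODE1_SPAN : NODE0_SPAN;
    	int pfn, touched = 0;
    
    	for (pfn = start; pfn < start + span; pfn++) {
    		if (check_owner && toy_pfn_to_nid(pfn) != nid)
    			continue;	/* decline to touch foreign pages */
    		toy_pfn_to_page(pfn)->nid = nid;
    		touched++;
    	}
    	return touched;
    }
    
    /* Count entries in a map that carry the given nid stamp. */
    int toy_count_stamped(const struct toy_page *map, int len, int nid)
    {
    	int i, n = 0;
    
    	for (i = 0; i < len; i++)
    		if (map[i].nid == nid)
    			n++;
    	return n;
    }
    ```
    
    With this layout, the unchecked pass over node 0 stamps all five of node 1's
    entries with node 0's data; the checked pass touches only the ten pfns node 0
    actually owns and leaves node 1's entries for node 1's own pass.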
    
    Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
    Signed-off-by: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>