    mm/page_alloc: fix race condition between build_all_zonelists and page allocation · 3d36424b
    Mel Gorman authored
    Patrick Daly reported the following problem:
    
    	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
    	[0] - ZONE_MOVABLE
    	[1] - ZONE_NORMAL
    	[2] - NULL
    
    	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
    	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
    	memory_offline operation removes the last page from ZONE_MOVABLE,
    	build_all_zonelists() & build_zonerefs_node() will update
    	node_zonelists as shown below. Only populated zones are added.
    
    	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
    	[0] - ZONE_NORMAL
    	[1] - NULL
    	[2] - NULL
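
    For reference, a zonelist is an array of zonerefs terminated by an entry
    whose zone pointer is NULL; the definitions involved (abridged from
    include/linux/mmzone.h) are:

        struct zoneref {
                struct zone *zone;      /* Pointer to the actual zone */
                int zone_idx;           /* zone_idx(zoneref->zone) */
        };

        struct zonelist {
                struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
        };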
    
    The race is simple -- page allocation could be in progress when a memory
    hot-remove operation triggers a zonelist rebuild that removes zones.  The
    allocation request still holds an ac->preferred_zoneref that now points
    at a NULL zoneref entry, so the zonelist walk finds no usable zone and
    an OOM kill is triggered.
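
    Schematically (an illustrative interleaving, not verbatim code; the
    function names are those in mm/page_alloc.c):

        CPU0 (allocation)                    CPU1 (memory offline)
        -----------------                    ---------------------
        __alloc_pages_slowpath()
          saves offset of ZONE_NORMAL in
          ac->preferred_zoneref
                                             offline_pages()
                                               /* last ZONE_MOVABLE page gone */
                                               build_all_zonelists()
                                                 /* entries shift; the saved
                                                    offset now lands on a
                                                    NULL zoneref */
          retries from ac->preferred_zoneref,
          finds no usable zone, OOM kills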
    
    This problem probably always existed but may be slightly easier to trigger
    due to 6aa303de ("mm, vmscan: only allocate and reclaim from zones
    with pages managed by the buddy allocator"), which distinguishes between
    zones that are completely unpopulated and zones that have valid pages
    not managed by the buddy allocator (e.g.  reserved, memblock, ballooning
    etc).  Memory hotplug has multiple stages with timing considerations
    around managed/present page updates, the zonelist rebuild and the zone
    span updates.  As David Hildenbrand puts it:
    
    	memory offlining adjusts managed+present pages of the zone
    	essentially in one go. If after the adjustments, the zone is no
    	longer populated (present==0), we rebuild the zone lists.
    
    	Once that's done, we try shrinking the zone (start+spanned
    	pages) -- which results in zone_start_pfn == 0 if there are no
    	more pages. That happens *after* rebuilding the zonelists via
    	remove_pfn_range_from_zone().
    
    The only requirement to fix the race is that a page allocation request
    can detect that a zonelist rebuild has happened since the allocation
    started, while no page has yet been allocated.  Use a seqlock_t to track
    zonelist updates: the allocation side reads the zonelist locklessly,
    while the rebuild and the update of the counter are protected by a
    spinlock.
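
    A minimal sketch of the shape of the fix (helper names as added by the
    patch; bodies abridged and unrelated context omitted):

        /* mm/page_alloc.c */
        static DEFINE_SEQLOCK(zonelist_update_seq);

        /* Read side: sample the sequence count before walking the zonelist. */
        static unsigned int zonelist_iter_begin(void)
        {
                if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
                        return read_seqbegin(&zonelist_update_seq);
                return 0;
        }

        /* Non-zero if the zonelists were rebuilt since zonelist_iter_begin(). */
        static unsigned int check_retry_zonelist(unsigned int seq)
        {
                if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
                        return read_seqretry(&zonelist_update_seq, seq);
                return seq;
        }

        /* Write side: the rebuild and the counter bump share the spinlock. */
        static void __build_all_zonelists(void *data)
        {
                write_seqlock(&zonelist_update_seq);
                /* ... rebuild node_zonelists ... */
                write_sequnlock(&zonelist_update_seq);
        }

    __alloc_pages_slowpath() samples the cookie when it (re)starts and, before
    failing the allocation, retries the whole slowpath if a rebuild raced with
    it:

        restart:
                zonelist_iter_cookie = zonelist_iter_begin();
                /* ... allocation attempts ... */
                if (check_retry_zonelist(zonelist_iter_cookie))
                        goto restart;

    Because no page had been allocated yet, the retry is cheap and is only
    taken when a rebuild actually overlapped the failing request.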
    
    [akpm@linux-foundation.org: make zonelist_update_seq static]
    Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net
    Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Cc: <stable@vger.kernel.org>	[4.9+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>