    [PATCH] Fix page double-freeing race
    Andrew Morton authored
    This has been there for nearly two years.  See bugzilla #1403
    
    vmscan.c does, in two places:
    
    	spin_lock(zone->lru_lock)
    	page = lru_to_page(&zone->inactive_list);
    	if (page_count(page) == 0) {
    		/* erk, it's being freed by __page_cache_release() or
    		 * release_pages()
    		 */
    		put_it_back_on_the_lru();
    
    	} else {
    
    	--> window 1 <--
    
    		page_cache_get(page);
    		put_it_on_private_list();
    	}
    	spin_unlock(zone->lru_lock)
    
    	use_the_private_list();
    
    	page_cache_release(page);
    
    
    
    whereas __page_cache_release() and release_pages() do:
    
    	if (put_page_testzero(page)) {
    
    	--> window 2 <--
    
    	spin_lock(zone->lru_lock)
    		if (page_count(page) == 0) {
    			remove_it_from_the_lru();
    			really_free_the_page()
    		}
    		spin_unlock(zone->lru_lock)
    	}
    
    
    The race occurs if the vmscan.c path sees page_count()==1 and then the
    page_cache_release() path happens in that few-instruction "window 1" before
    vmscan's page_cache_get().
    
    The page_cache_release() path does put_page_testzero(), which returns true.
    Then this CPU takes an interrupt...
    
    The vmscan.c path then does page_cache_get(), taking the refcount to one.
    Then it uses the page and does page_cache_release(), taking the refcount to
    zero and the page is really freed.
    
    Now, the CPU running page_cache_release() returns from the interrupt, takes
    the LRU lock, sees the page still has a refcount of zero and frees it again.
    Boom.
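    
    Laid out as an interleaving of the two CPUs (this is just the sequence
    above, written as a timeline):
    
    	CPU0 (__page_cache_release)        CPU1 (vmscan.c)
    	---------------------------        ---------------
    	                                   page_count() == 1, take the
    	                                   "else" branch
    	put_page_testzero()
    	  -> count now 0, returns true
    	<takes an interrupt>
    	                                   page_cache_get()      -> count 1
    	                                   use_the_private_list()
    	                                   page_cache_release()  -> count 0,
    	                                   page is really freed
    	<returns from interrupt>
    	spin_lock(zone->lru_lock)
    	page_count() == 0, so the page is
    	freed a second time.  Boom.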
    
    
    The patch fixes this by closing "window 1".  We provide a
    "get_page_testone()" which grabs a ref on the page and returns true if the
    refcount was previously zero.  If that happens the vmscan.c code simply drops
    the page's refcount again and leaves the page on the LRU.
    
    All this happens under the zone->lru_lock, which is also taken by
    __page_cache_release() and release_pages(), so the vmscan code knows that the
    page has not been returned to the page allocator yet.
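    
    In the same pseudocode style, the fixed vmscan.c path becomes roughly the
    following (a sketch; the __put_page() used to drop the speculative
    reference is assumed here to be a plain decrement with no testzero):
    
    	spin_lock(zone->lru_lock)
    	page = lru_to_page(&zone->inactive_list);
    	if (get_page_testone(page)) {
    		/* The count was zero: __page_cache_release() or
    		 * release_pages() is already freeing this page.
    		 * Drop the reference we just took and leave the
    		 * page on the LRU for that path to remove.
    		 */
    		__put_page(page);	/* plain decrement, no testzero */
    		put_it_back_on_the_lru();
    	} else {
    		/* the count was nonzero: we hold a stable reference */
    		put_it_on_private_list();
    	}
    	spin_unlock(zone->lru_lock)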
    
    
    In terms of implementation, the page counts are now offset by one: a free page
    has page->_count of -1.  This is so that we can use atomic_add_negative() and
    atomic_inc_and_test() to provide put_page_testzero() and get_page_testone().
    
    The macros hide all of this so the public interpretation of page_count() and
    set_page_count() remains unaltered.
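    
    Concretely, with the offset-by-one representation the macros can be
    arranged along these lines (a sketch of the idea, not necessarily the
    exact definitions in the tree):
    
    	#define page_count(p)		(atomic_read(&(p)->_count) + 1)
    	#define set_page_count(p, v)	atomic_set(&(p)->_count, (v) - 1)
    	#define put_page_testzero(p)	atomic_add_negative(-1, &(p)->_count)
    	#define get_page_testone(p)	atomic_inc_and_test(&(p)->_count)
    	#define __put_page(p)		atomic_dec(&(p)->_count)
    
    put_page_testzero() is then true exactly when the public count drops from
    one to zero (_count goes to -1), and get_page_testone() is true exactly
    when the public count goes from zero to one (_count reaches 0), i.e. when
    the page was already on its way to being freed.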
    
    The compiler can usually constant-fold the offsetting of page->_count.  This
    patch increases an x86 SMP kernel's text by 32 bytes.
    
    The patch renames page->count to page->_count to break callers who aren't
    using the macros.
    
    This patch requires that the architecture implement atomic_add_negative().  It
    is currently present on
    
    	arm
    	arm26
    	i386
    	ia64
    	mips
    	ppc
    	s390
    	v850
    	x86_64
    
    ppc implements this as
    
    #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
    
    and atomic_add_return() is implemented on
    
    	alpha
    	cris
    	h8300
    	ia64
    	m68knommu
    	mips
    	parisc
    	ppc
    	ppc64
    	s390
    	sh
    	sparc
    	v850
    
    so we're looking pretty good.