Commit 49c7ca7c authored by Andrew Morton, committed by Linus Torvalds

[PATCH] don't allocate ratnodes under PF_MEMALLOC

On the swap_out() path, the radix-tree pagecache is allocating its
nodes with PF_MEMALLOC set, which allows it to completely exhaust the
free page lists (*).  This is fairly easy to trigger with swap-intensive
loads.

It would be better to make those node allocations fail earlier.  When
they do, the radix-tree can still obtain nodes from its mempool, and we
leave some memory available for the I/O layer.  (This assumes the I/O
itself is performed under PF_MEMALLOC, which it is.)
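
The fallback works roughly like this - a simplified sketch of
mempool_alloc(), not the real mm/mempool.c (the field names are
assumptions, and the real function can also sleep here until an
element is freed back to the pool):

	void *mempool_alloc(mempool_t *pool, int gfp_mask)
	{
		void *element;
		unsigned long flags;

		/* Try the underlying allocator (slab) first */
		element = pool->alloc(gfp_mask, pool->pool_data);
		if (element)
			return element;

		/* Slab failed: fall back to the preallocated reserve */
		spin_lock_irqsave(&pool->lock, flags);
		if (pool->curr_nr) {
			struct list_head *tmp = pool->elements.next;

			list_del(tmp);
			pool->curr_nr--;
			/* the list_head sits at the start of the
			   element itself */
			element = (void *)tmp;
		}
		spin_unlock_irqrestore(&pool->lock, flags);
		return element;	/* may be NULL for an atomic caller */
	}

With PF_MEMALLOC dropped, pool->alloc() starts failing while there are
still free pages, so the reserve kicks in and the remaining pages stay
available for the I/O layer.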

So the patch simply drops PF_MEMALLOC while adding nodes to the
swapcache's tree.

We're still performing atomic allocations, so the rat (the radix tree)
is still biting pretty deeply into the page reserves - under heavy load
the amount of free memory is less than half of what it was pre-rat.

It is unfortunate that the page allocator overloads !__GFP_WAIT to also
mean "try harder".  It would be better to separate these concepts, and
to allow the radix-tree code (at least) to perform atomic allocations,
but to not go below pages_min.  It seems that __GFP_TRY_HARDER will be
pretty straightforward to implement.  Later.
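
Sketched out, the separation would be just one more watermark test in
the page allocator.  This is hypothetical code: __GFP_TRY_HARDER does
not exist, and the watermark handling below is illustrative rather
than __alloc_pages() as written:

	min = zone->pages_min;
	if (gfp_mask & __GFP_TRY_HARDER)	/* hypothetical flag */
		min -= min / 2;		/* may dip below pages_min */
	if (zone->free_pages > min)
		page = rmqueue(zone, order);

A plain GFP_ATOMIC allocation would then stop at pages_min, and only
callers which ask to try harder (the I/O layer, say) would eat into
the reserve below it.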

The patch also implements a workaround for the mempool list_head
problem, until that is sorted out.
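
(The problem is visible in the mempool sketch above: a free element is
kept on the pool's list by embedding a struct list_head in the
element's own first bytes, so a node handed back from the reserve has
its first two words scribbled on.  Illustratively, again not the real
mm/mempool.c, the freeing side looks like:

	static void add_element(mempool_t *pool, void *element)
	{
		/* overwrites the element's first
		   sizeof(struct list_head) bytes */
		list_add((struct list_head *)element, &pool->elements);
		pool->curr_nr++;
	}

The workaround in the lib/radix-tree.c hunk below simply memsets those
bytes back to zero, since the node is expected to start out zeroed.)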



(*) The usual result is that the SCSI layer dies at scsi_merge.c:82.
It would be nice to have a fix for that - it's going BUG if 1-order
allocations fail at interrupt time.  That happens pretty easily.
parent de6ca58c
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -49,11 +49,27 @@ struct radix_tree_path {
 
 static kmem_cache_t *radix_tree_node_cachep;
 static mempool_t *radix_tree_node_pool;
-#define radix_tree_node_alloc(root) \
-	mempool_alloc(radix_tree_node_pool, (root)->gfp_mask)
-#define radix_tree_node_free(node) \
-	mempool_free((node), radix_tree_node_pool);
+
+/*
+ * mempool scribbles on the first eight bytes of the managed
+ * memory.  Here we implement a temp workaround for that.
+ */
+#include <linux/list.h>
+
+static inline struct radix_tree_node *
+radix_tree_node_alloc(struct radix_tree_root *root)
+{
+	struct radix_tree_node *ret;
+
+	ret = mempool_alloc(radix_tree_node_pool, root->gfp_mask);
+	if (ret)
+		memset(ret, 0, sizeof(struct list_head));
+	return ret;
+}
+
+static inline void
+radix_tree_node_free(struct radix_tree_node *node)
+{
+	mempool_free(node, radix_tree_node_pool);
+}
 
 /*
  * Return the maximum key which can be store into a
...
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -34,6 +34,29 @@
  */
 #define DEF_PRIORITY (6)
 
+/*
+ * On the swap_out path, the radix-tree node allocations are performing
+ * GFP_ATOMIC allocations under PF_MEMALLOC.  They can completely
+ * exhaust the page allocator.  This is bad; some pages should be left
+ * available for the I/O system to start sending the swapcache contents
+ * to disk.
+ *
+ * So PF_MEMALLOC is dropped here.  This causes the slab allocations to
+ * fail earlier, so radix-tree nodes will then be allocated from the
+ * mempool reserves.
+ */
+static inline int
+swap_out_add_to_swap_cache(struct page *page, swp_entry_t entry)
+{
+	int flags = current->flags;
+	int ret;
+
+	current->flags &= ~PF_MEMALLOC;
+	ret = add_to_swap_cache(page, entry);
+	current->flags = flags;
+	return ret;
+}
+
 /*
  * The swap-out function returns 1 if it successfully
  * scanned all the pages it was asked to (`count').
@@ -139,7 +162,7 @@ static inline int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct*
 	 * (adding to the page cache will clear the dirty
 	 * and uptodate bits, so we need to do it again)
 	 */
-	switch (add_to_swap_cache(page, entry)) {
+	switch (swap_out_add_to_swap_cache(page, entry)) {
 	case 0:				/* Success */
 		SetPageUptodate(page);
 		set_page_dirty(page);
...