1. 06 Jan, 2009 40 commits
    • vmscan: bail out of direct reclaim after swap_cluster_max pages · a79311c1
      Rik van Riel authored
      When the VM is under pressure, it can happen that several direct reclaim
      processes are in the pageout code simultaneously.  It also happens that
      the reclaiming processes run into mostly referenced, mapped and dirty
      pages in the first round.
      
      This results in multiple direct reclaim processes having a lower
      pageout priority, which corresponds to a higher target of pages to
      scan.
      
      This in turn can result in each direct reclaim process freeing
      many pages.  Together, they can end up freeing way too many pages.
      
      This kicks useful data out of memory (in some cases more than half
      of all memory is swapped out).  It also impacts performance by
      keeping tasks stuck in the pageout code for too long.
      
      A 30% improvement in hackbench has been observed with this patch.
      
      The fix is relatively simple: in shrink_zone() we can check how many
      pages we have already freed; direct reclaim tasks break out of the
      scanning loop if they have already freed enough pages and have reached
      a lower priority level.
      
      We do not break out of shrink_zone() when priority == DEF_PRIORITY,
      to ensure that equal pressure is applied to every zone in the common
      case.
      
      However, in order to do this we do need to know how many pages we already
      freed, so move nr_reclaimed into scan_control.
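
The break-out condition described above can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the trimmed-down struct, the helper name should_bail_out(), the is_kswapd flag and the strict ">" comparison are assumptions made for the example. DEF_PRIORITY is 12 in the kernel, and a numerically lower priority means harder scanning.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12   /* kernel default; lower numbers scan harder */

/* Hypothetical, trimmed-down stand-in for the kernel's struct scan_control. */
struct scan_control {
    unsigned long nr_reclaimed;      /* pages freed so far in this reclaim pass */
    unsigned long swap_cluster_max;  /* how many freed pages count as "enough" */
};

/*
 * Returns true when a direct reclaimer may stop scanning the current zone:
 * it has freed enough pages AND has already dropped below DEF_PRIORITY.
 * At priority == DEF_PRIORITY we never bail out, so every zone sees equal
 * pressure in the common case. The is_kswapd flag models "direct reclaim
 * tasks only" and is an assumption of this sketch.
 */
static bool should_bail_out(const struct scan_control *sc, int priority,
                            bool is_kswapd)
{
    return sc->nr_reclaimed > sc->swap_cluster_max &&
           priority < DEF_PRIORITY &&
           !is_kswapd;
}
```

Keeping nr_reclaimed inside the control structure is what lets the inner scanning loop see the running total without extra parameters.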
      
      akpm: a historical interlude...
      
      We tried this in 2004:
      
      :commit e468e46a9bea3297011d5918663ce6d19094cf87
      :Author: akpm <akpm>
      :Date:   Thu Jun 24 15:53:52 2004 +0000
      :
      :[PATCH] vmscan.c: dont reclaim too many pages
      :
      :    The shrink_zone() logic can, under some circumstances, cause far too many
      :    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
      :    a large number of reclaimable pages on the LRU.
      :    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
      
      And we reverted it in 2006:
      
      :commit 210fe530
      :Author: Andrew Morton <akpm@osdl.org>
      :Date:   Fri Jan 6 00:11:14 2006 -0800
      :
      :    [PATCH] vmscan: balancing fix
      :
      :    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
      :
      :      The shrink_zone() logic can, under some circumstances, cause far too many
      :      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
      :      hit a large number of reclaimable pages on the LRU.
      :
      :      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
      :      reclaimed.
      :
      :    Problem is, this change caused significant imbalance in inter-zone scan
      :    balancing by truncating scans of larger zones.
      :
      :    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
      :    balancing algorithm would require that if we're scanning 100 pages of
      :    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
      :    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
      :    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
      :    harder than large ones.
      :
      :    Now I need to remember what the workload was which caused me to write this
      :    patch originally, then fix it up in a different way...
      
      And we haven't demonstrated that whatever problem caused that reversion is
      not being reintroduced by this change in 2008.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: fix sparse warnings · ebdd4aea
      Hannes Eder authored
      Fix the following sparse warnings:
      
        mm/hugetlb.c:375:3: warning: returning void-valued expression
        mm/hugetlb.c:408:3: warning: returning void-valued expression
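
For readers unfamiliar with this warning, the pattern sparse complains about looks roughly like the sketch below. The function names are hypothetical and are not the code in mm/hugetlb.c; the example only shows the shape of the warning and one way to silence it.

```c
static int calls;   /* counts invocations so the effect is observable */

static void release_one(void)
{
    calls++;
}

/*
 * Sparse warns about "return release_one();" inside a void function:
 * the expression has type void, so there is nothing to return.
 * The usual fix is to split the call from the return.
 */
static void release_fixed(void)
{
    release_one();   /* call for its side effect */
    return;          /* plain return, no void-valued expression */
}
```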
      Signed-off-by: Hannes Eder <hannes@hanneseder.net>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: let others seed random · f0d7a4b3
      Hugh Dickins authored
      Remove the srandom32((u32)get_seconds()) from non-rotational swapon:
      there's been a coincidental discussion of earlier randomization; assuming
      that goes ahead, let swapon be a client rather than stirring for itself.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: change discard pgoff_t to sector_t · 858a2990
      Hugh Dickins authored
      Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to
      sector_t: given the constraints on swap offsets (in particular, the 5 bits
      of swap type accommodated in the same unsigned long), pgoff_t was actually
      safe as is, but it certainly looked worrying when shifted left.
      
      [akpm@linux-foundation.org: fix shift overflow]
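
Why a left shift in a narrow type "looks worrying" can be shown with a small sketch. The typedefs below are stand-ins chosen for the example (a 32-bit page offset and a 64-bit sector count), and 4 kB pages with 512-byte sectors are assumed; the commit message notes that the real swap offset constraints kept the original code safe.

```c
#include <stdint.h>

#define PAGE_SHIFT 12             /* assumes 4 kB pages */

typedef uint32_t narrow_pgoff_t;  /* stand-in for a 32-bit pgoff_t */
typedef uint64_t sector_t;        /* stand-in for a 64-bit sector_t */

/* Shift done in the narrow type: high bits are lost before widening. */
static sector_t pages_to_sectors_narrow(narrow_pgoff_t nr_pages)
{
    return (narrow_pgoff_t)(nr_pages << (PAGE_SHIFT - 9));
}

/* Widen first, then shift: the full result is preserved. */
static sector_t pages_to_sectors_wide(narrow_pgoff_t nr_pages)
{
    return (sector_t)nr_pages << (PAGE_SHIFT - 9);
}
```

With 2^29 pages, the narrow version wraps to zero sectors while the widened version yields 2^32.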
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: swap allocation cycle if nonrot · c60aa176
      Hugh Dickins authored
      Though attempting to find free clusters (Andrea), swap allocation has
      always restarted its searches from the beginning of the swap area (sct),
      to reduce seek times between swap pages, by not scattering them all over
      the partition.
      
      But on a solidstate swap device, seeks are cheap, and block remapping to
      level the wear may be limited by zones: in that case it's better to cycle
      around the whole partition.
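
The difference between the two search policies can be illustrated with a toy allocator. Everything here (struct toy_swap, the eight-slot map, both function names) is hypothetical and far simpler than scan_swap_map(); it only contrasts "restart from the beginning" with "cycle around the area".

```c
#include <stddef.h>

#define NSLOTS 8

/* Hypothetical toy swap area: one byte per slot, non-zero means in use. */
struct toy_swap {
    unsigned char used[NSLOTS];
    size_t next;   /* where the next cyclic search starts */
};

/* Rotational policy: always restart from the beginning of the area. */
static long alloc_from_start(struct toy_swap *s)
{
    for (size_t i = 0; i < NSLOTS; i++) {
        if (!s->used[i]) {
            s->used[i] = 1;
            return (long)i;
        }
    }
    return -1;   /* area full */
}

/* Non-rotational policy: continue after the last allocation, wrapping. */
static long alloc_cyclic(struct toy_swap *s)
{
    for (size_t n = 0; n < NSLOTS; n++) {
        size_t i = (s->next + n) % NSLOTS;
        if (!s->used[i]) {
            s->used[i] = 1;
            s->next = (i + 1) % NSLOTS;
            return (long)i;
        }
    }
    return -1;   /* area full */
}
```

If a slot is freed right after being allocated, the first policy hands out the same slot again, whereas the cyclic policy moves on and spreads writes across the whole area.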
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: swapon randomize if nonrot · 20137a49
      Hugh Dickins authored
      Swap allocation has always started from the beginning of the swap area;
      but if we're dealing with a solidstate swap device which can only remap
      blocks within limited zones, that would sooner wear out the first zone.
      
      Therefore sys_swapon() now tests whether blk_queue is non-rotational, and
      if so randomizes the cluster_next starting position for allocation.
      
      If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
      with an "SS" at the right end of the kernel's "Adding ...  swap" message
      (so that if it's both nonrot and discardable, "SSD" will be shown there).
      Perhaps something should be shown in /proc/swaps (swapon -s), but we have
      to be more cautious before making any addition to that format.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: swap allocation use discard · 7992fde7
      Hugh Dickins authored
      When scan_swap_map() finds a free cluster of swap pages to allocate,
      discard the old contents of the cluster if the device supports discard.
      But don't bother when swap is so fragmented that we allocate single pages.
      
      Be careful about racing allocations made while we're scanning for a
      cluster; and hold up allocations made while we're discarding.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: swapon use discard (trim) · 6a6ba831
      Hugh Dickins authored
      When adding swap, all the old data on swap can be forgotten: sys_swapon()
      now discards all but the header page of the swap partition (or every
      extent but the header of the swap file), to give a solidstate swap device
      the opportunity to optimize its wear-levelling.
      
      If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
      "D" at the right end of the kernel's "Adding ...  swap" message.  Perhaps
      something should be shown in /proc/swaps (swapon -s), but we have to be
      more cautious before making any addition to that format.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Joern Engel <joern@logfs.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Donjun Shin <djshin90@gmail.com>
      Cc: Tejun Heo <teheo@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: rearrange scan and swap_info · ebebbbe9
      Hugh Dickins authored
      Before making functional changes, rearrange scan_swap_map() to simplify
      subsequent diffs.  Actually, there is one functional change in there:
      leave cluster_nr negative while scanning for a new cluster - resetting it
      early increased the likelihood that when we have difficulty finding a free
      cluster, another task may come in and try doing exactly the same - just a
      waste of cpu.
      
      Before making functional changes, rearrange struct swap_info_struct
      slightly: flags will be needed as an unsigned long (for wait_on_bit), next
      is a good int to pair with prio, old_block_size is uninteresting so shift
      it to the end.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: remove v0 SWAP-SPACE message · 81e33971
      Hugh Dickins authored
      The kernel has not supported v0 SWAP-SPACE since 2.5.22: I think we can
      now safely drop its "version 0 swap is no longer supported" message - just
      say "Unable to find swap-space signature" as usual.  This removes one
      level of indentation from a stretch of sys_swapon().
      
      I'd have liked to be specific, saying "Unable to find SWAPSPACE2
      signature", but it's just too confusing that the version 1 signature shows
      the number 2.
      
      Irrelevant nearby cleanup: kmap(page) already gives page_address(page).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: remove surplus whitespace · 886bb7e9
      Hugh Dickins authored
      Remove trailing whitespace from swapfile.c, and odd swap_show() alignment.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: remove SWP_ACTIVE mask · 22c6f8fd
      Hugh Dickins authored
      Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapfile: swapon needs larger size type · 73fd8748
      Hugh Dickins authored
      sys_swapon()'s swapfilesize (better renamed swapfilepages) is declared as
      an int, but should be an unsigned long like the maxpages it's compared
      against: on 64-bit (with 4kB pages) a swapfile of 2^44 bytes was rejected
      with "Swap area shorter than signature indicates".
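
The arithmetic behind that rejection can be checked with a short sketch: 2^44 bytes at 4 kB per page is 2^32 pages, which no longer fits in 32 bits. The helper names are hypothetical, and uint32_t is used in place of int so that the truncation is well defined in portable C.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 kB pages, as in the commit message */

/* Page count held in 64 bits, like an unsigned long on a 64-bit kernel. */
static uint64_t swapfile_pages(uint64_t bytes)
{
    return bytes >> PAGE_SHIFT;
}

/* The same count squeezed through 32 bits, modelling an int-sized variable. */
static uint32_t swapfile_pages_32(uint64_t bytes)
{
    return (uint32_t)(bytes >> PAGE_SHIFT);
}
```

For a 2^44-byte swapfile the 64-bit count is 2^32 pages, while the 32-bit count wraps to 0, so any comparison against maxpages sees a file that appears far too short.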
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make vread() and vwrite() declaration · 69beeb1d
      KOSAKI Motohiro authored
      Sparse reports the following warnings:
      
      mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static?
      mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static?
      
      However, both functions are used by /dev/kmem, so they cannot be made
      static; add the missing declarations instead.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make setup_per_zone_inactive_ratio() static · efab8186
      KOSAKI Motohiro authored
      Sparse reports the following warning:
      
      mm/page_alloc.c:4301:6: warning: symbol 'setup_per_zone_inactive_ratio' was not declared. Should it be static?
      
      Clean up by making it static.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make scan_zone_unevictable_pages() static · 14b90b22
      KOSAKI Motohiro authored
      Sparse reports the following warning:
      
      mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static?
      
      Clean up by making it static.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make scan_all_zones_unevictable_pages() static · ff30153b
      KOSAKI Motohiro authored
      Sparse reports the following warning:
      
      mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static?
      
      Clean up by making it static.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make mem_cgroup_resize_limit() static · d38d2a75
      KOSAKI Motohiro authored
      Sparse reports the following warning:
      
      mm/memcontrol.c:782:5: warning: symbol 'mem_cgroup_resize_limit' was not
      declared.  Should it be static?
      
      Clean up by making it static.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make maddr __iomem · 2bc7273b
      KOSAKI Motohiro authored
      Sparse reports the following warnings:
      
      mm/memory.c:2936:8: warning: incorrect type in assignment (different address spaces)
      mm/memory.c:2936:8:    expected void *maddr
      mm/memory.c:2936:8:    got void [noderef] <asn:2>
      
      Clean up by annotating maddr as __iomem.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make init_section_page_cgroup() static · feb16694
      KOSAKI Motohiro authored
      Sparse reports the following warning:
      
      mm/page_cgroup.c:100:15: warning: symbol 'init_section_page_cgroup' was
      not declared.  Should it be static?
      
      Clean up by making it static.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: reclaim shouldn't change zone->recent_rotated statistics · 077cbc58
      KOSAKI Motohiro authored
      memcg reclaim shouldn't change zone->recent_rotated statistics.  If
      memcgroup reclaim changes zone statistics, global reclaim can get a bit
      confused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: optimize get_scan_ratio for no swap · b962716b
      Hugh Dickins authored
      Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes, the gcc
      optimizer gives us that, when nr_swap_pages is #defined as 0L.  Move usual
      declaration to swapfile.c: it never belonged in page_alloc.c.
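
How a constant #define lets the compiler simplify the no-swap case can be sketched as below. The CONFIG_SWAP_DISABLED switch, the function toy_scan_ratio() and the 50/50 placeholder split are assumptions for illustration; only the idea of defining nr_swap_pages as 0L comes from the commit message.

```c
#define CONFIG_SWAP_DISABLED 1   /* hypothetical stand-in for !CONFIG_SWAP */

#if CONFIG_SWAP_DISABLED
#define nr_swap_pages 0L           /* constant: the test below is always true */
#else
static long nr_swap_pages = 1024;  /* ordinary variable when swap is built in */
#endif

/* Toy scan ratio: percent[0] is the anon share, percent[1] the file share. */
static void toy_scan_ratio(unsigned long percent[2])
{
    if (nr_swap_pages <= 0) {
        /* No swap space: there is no point scanning anonymous pages. */
        percent[0] = 0;
        percent[1] = 100;
        return;
    }
    /* Placeholder split, not the kernel's formula; this part becomes dead
     * code when nr_swap_pages is the constant 0L, so the optimizer can
     * drop it without any #ifdef in the function body. */
    percent[0] = 50;
    percent[1] = 50;
}
```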
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add add_to_swap stub · 60371d97
      Hugh Dickins authored
      If we add a failing stub for add_to_swap(), then we can remove the #ifdef
      CONFIG_SWAP from mm/vmscan.c.
      
      This was intended as a source cleanup, but looking more closely, it turns
      out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
      page, whereas now it goes to the more suitable activate_locked, like the
      CONFIG_SWAP nr_swap_pages 0 case.
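
The stub pattern can be sketched as follows. The HAVE_SWAP switch, the minimal struct page, the outcome enum and try_swap_out() are all hypothetical; the sketch only shows how a failing stub lets the caller share one code path whether or not swap is configured.

```c
#include <stdbool.h>

struct page { bool in_swap_cache; };   /* hypothetical minimal page */

#define HAVE_SWAP 0   /* hypothetical stand-in for CONFIG_SWAP */

#if HAVE_SWAP
/* Toy success path: pretend the page was given a swap slot. */
static int add_to_swap(struct page *page)
{
    page->in_swap_cache = true;
    return 1;
}
#else
/* Failing stub: with no swap configured, allocation can never succeed. */
static inline int add_to_swap(struct page *page)
{
    (void)page;
    return 0;
}
#endif

enum outcome { SWAPPED, ACTIVATE_LOCKED };

/* The caller needs no #ifdef: a failed add_to_swap() takes the same branch
 * whether swap is compiled out or merely has no space left. */
static enum outcome try_swap_out(struct page *page)
{
    if (!add_to_swap(page))
        return ACTIVATE_LOCKED;
    return SWAPPED;
}
```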
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove gfp_mask from add_to_swap · ac47b003
      Hugh Dickins authored
      Remove gfp_mask argument from add_to_swap(): it's misleading because its
      only caller, shrink_page_list(), is not atomic at that point; and in due
      course (implementing discard) we'll sometimes want to allocate some memory
      with GFP_NOIO (as is used in swap_writepage) when allocating swap.
      
      No change to the gfp_mask passed down to add_to_swap_cache(): still use
      __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
      though it's not obvious if that's the best combination to ask for here.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove try_to_munlock from vmscan · 63d6c5ad
      Hugh Dickins authored
      An unfortunate feature of the Unevictable LRU work was that reclaiming an
      anonymous page involved an extra scan through the anon_vma: to check that
      the page is evictable before allocating swap, because the swap could not
      be freed reliably soon afterwards.
      
      Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
      not an issue any more: remove try_to_munlock() call from
      shrink_page_list(), leaving it to try_to_unmap() to discover if the page
      is one to be culled to the unevictable list - in which case then
      try_to_free_swap().
      
      Update unevictable-lru.txt to remove comments on the try_to_munlock() in
      shrink_page_list(), and shorten some lines over 80 columns.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: try_to_unuse check removing right swap · 68bdc8d6
      Hugh Dickins authored
      There's a possible race in try_to_unuse() which Nick Piggin led me to two
      years ago.  Where it does lock_page() after read_swap_cache_async(), what
      if another task removed that page from swapcache just before we locked it?
      
      It would sail through the (*swap_map > 1) tests doing nothing (because it
      could not have been removed from swapcache before its swap references were
      gone), until it reaches the delete_from_swap_cache(page) near the bottom.
      
      Now imagine that this page has been allocated to swap on a different swap
      area while we dropped page lock (perhaps at the top, perhaps in unuse_mm):
      we could wrongly remove from swap cache before the page has been written
      to swap, so a subsequent do_swap_page() would read in stale data from
      swap.
      
      I think this case could not happen before: remove_exclusive_swap_page()
      refused while page count was raised.  But now with reuse_swap_page() and
      try_to_free_swap() removing from swap cache without minding page count, I
      think it could happen - the previous patch argued that it was safe because
      try_to_unuse() already ignored page count, but overlooked that it might be
      breaking the assumptions in try_to_unuse() itself.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: try_to_free_swap replaces remove_exclusive_swap_page · a2c43eed
      Hugh Dickins authored
      remove_exclusive_swap_page(): its problem is in living up to its name.
      
      It doesn't matter if someone else has a reference to the page (raised
      page_count); it doesn't matter if the page is mapped into userspace
      (raised page_mapcount - though that hints it may be worth keeping the
      swap): all that matters is that there be no more references to the swap
      (and no writeback in progress).
      
      swapoff (try_to_unuse) has been removing pages from swapcache for years,
      with no concern for page count or page mapcount, and we used to have a
      comment in lookup_swap_cache() recognizing that: if you go for a page of
      swapcache, you'll get the right page, but it could have been removed from
      swapcache by the time you get page lock.
      
      So, give up asking for exclusivity: get rid of
      remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
      remove_exclusive_swap_page_count() which were spawned for the recent LRU
      work: replace them by the simpler try_to_free_swap() which just checks
      page_swapcount().
      
      Similarly, remove the page_count limitation from free_swap_and_count(),
      but assume that it's worth holding on to the swap if page is mapped and
      swap nowhere near full.  Add a vm_swap_full() test in free_swap_cache()?
      It would be consistent, but I think we probably have enough for now.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: reuse_swap_page replaces can_share_swap_page · 7b1fe597
      Hugh Dickins authored
      A good place to free up old swap is where do_wp_page(), or do_swap_page(),
      is about to redirty the page: the data on disk is then stale and won't be
      read again; and if we do decide to write the page out later, using the
      previous swap location makes an unnecessary disk seek very likely.
      
      So give can_share_swap_page() the side-effect of delete_from_swap_cache()
      when it safely can.  And can_share_swap_page() was always a misleading
      name, the more so if it has a side-effect: rename it reuse_swap_page().
      
      Irrelevant cleanup nearby: remove swap_token_default_timeout definition
      from swap.h: it's used nowhere.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: wp lock page before deciding cow · ab967d86
      Hugh Dickins authored
      An application may rely on get_user_pages() to give it pages writable from
      userspace and shared with a driver, GUP breaking COW if necessary.  It may
      mprotect() the pages' writability, off and on, from time to time.
      
      Normally this works fine (so long as the app does not fork); but just
      occasionally, under memory pressure, a readonly pte in a newly writable
      area is COWed unnecessarily, breaking the link with the driver: because
      do_wp_page() does trylock_page, and falls back to COW whenever that fails.
      
      For reliable behaviour in the unshared case, when the trylock_page fails,
      now unlock pagetable, lock page and relock pagetable, before deciding
      whether Copy-On-Write is really necessary.
      
      Reported-by: Zhou Yingchao
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • mm: gup persist for write permission · 878b63ac
      Hugh Dickins authored
      do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that
      COW has been done if necessary, though it may be leaving the pte without
      write permission - for the odd case of forced writing to a readonly vma
      for ptrace.  At present GUP then retries the follow_page() without asking
      for write permission, to escape an endless loop when forced.
      
      But an application may be relying on GUP to guarantee a writable page
      which won't be COWed again when written from userspace, whereas a race
      here might leave a readonly pte in place?  Change the VM_FAULT_WRITE
      handling to ask follow_page() for write permission again, except in that
      odd case of forced writing to a readonly vma.
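
      The changed retry rule can be written down as a small flag computation.
      This is a hedged Python sketch of the rule as described above, not the
      kernel source: next_foll_flags() is a hypothetical helper, and the numeric
      flag values are illustrative assumptions.

```python
# Illustrative flag values; they are assumptions made for this sketch.
FOLL_WRITE = 0x01       # caller asks follow_page() for a writable pte
VM_WRITE = 0x02         # the vma itself permits writing
VM_FAULT_WRITE = 0x08   # the fault handler reports that COW was done if needed


def next_foll_flags(foll_flags, fault_ret, vm_flags):
    """Return the flags to use when retrying follow_page() after a fault."""
    forced_write_to_readonly_vma = not (vm_flags & VM_WRITE)
    if (fault_ret & VM_FAULT_WRITE) and forced_write_to_readonly_vma:
        # Only the forced-write (ptrace) case stops asking for write
        # permission, which is what avoids the endless loop.
        foll_flags &= ~FOLL_WRITE
    return foll_flags
```

      For a writable vma the write request is kept, so a pte left readonly by a
      race is faulted again instead of being returned to the caller.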
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add dirty_background_bytes and dirty_bytes sysctls · 2da02997
      David Rientjes authored
      This change introduces two new sysctls to /proc/sys/vm:
      dirty_background_bytes and dirty_bytes.
      
      dirty_background_bytes is the counterpart to dirty_background_ratio and
      dirty_bytes is the counterpart to dirty_ratio.
      
      With growing memory capacities of individual machines, it's no longer
      sufficient to specify dirty thresholds as a percentage of the amount of
      dirtyable memory over the entire system.
      
      dirty_background_bytes and dirty_bytes specify quantities of memory, in
      bytes, that represent the dirty limits for the entire system.  If either
      of these values is set, its value represents the amount of dirty memory
      that is needed to commence either background or direct writeback.
      
      When a `bytes' or `ratio' file is written, its counterpart becomes a
      function of the written value.  For example, if dirty_bytes is written to
      be 8192, 8K of memory is required to commence direct writeback.
      dirty_ratio is then functionally equivalent to 8K / the amount of
      dirtyable memory:
      
      	dirtyable_memory = free pages + mapped pages + file cache
      
      	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
      		-or-
      	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
      
      		AND
      
      	dirty_bytes = dirty_ratio * dirtyable_memory
      		-or-
      	dirty_ratio = dirty_bytes / dirtyable_memory
      
      Only one of dirty_background_bytes and dirty_background_ratio may be
      specified at a time, and only one of dirty_bytes and dirty_ratio may be
      specified.  When one sysctl is written, the other appears as 0 when read.
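
      The mutual exclusion of each `bytes'/`ratio' pair can be illustrated with
      a small model.  This is a Python sketch under stated assumptions, not the
      sysctl handlers themselves: the DirtySysctlPair class, its method names and
      its default ratio are hypothetical.

```python
class DirtySysctlPair:
    """Model of one `ratio'/`bytes' pair: writing one zeroes the other."""

    def __init__(self, ratio=10):   # the default here is illustrative only
        self.ratio = ratio
        self.bytes = 0

    def write_ratio(self, value):
        if not 0 <= value <= 100:
            raise ValueError("ratio must be between 0 and 100")
        self.ratio = value
        self.bytes = 0              # the counterpart now reads back as 0

    def write_bytes(self, value):
        if value <= 0:
            raise ValueError("bytes must be positive")
        self.bytes = value
        self.ratio = 0              # the counterpart now reads back as 0
```

      Reading both fields after a write shows which of the two is currently in
      effect: the one that is non-zero.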
      
      The `bytes' files operate on a page size granularity since dirty limits
      are compared with ZVC values, which are in page units.
      
      Prior to this change, the minimum dirty_ratio was 5 as implemented by
      get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
      written value between 0 and 100.  This restriction is maintained, but
      dirty_bytes has a lower limit of only one page.
      
      Also prior to this change, the dirty_background_ratio could not equal or
      exceed dirty_ratio.  This restriction is maintained in addition to
      restricting dirty_background_bytes.  If either background threshold equals
      or exceeds that of the dirty threshold, it is implicitly set to half the
      dirty threshold.
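
      The rules above can be combined into one threshold calculation.  The
      following Python sketch follows the description in this message rather
      than the kernel source: dirty_limits() is a hypothetical function, the
      4096-byte page size is an assumption, ratios are treated as percentages,
      and rounding byte values up to whole pages is also an assumption.

```python
PAGE_SIZE = 4096    # assumed page size in bytes


def dirty_limits(dirtyable_pages, dirty_ratio=0, dirty_bytes=0,
                 background_ratio=0, background_bytes=0):
    """Return (background, dirty) thresholds in pages."""
    if dirty_bytes:
        # `bytes' values work at page granularity (rounded up here).
        dirty = -(-dirty_bytes // PAGE_SIZE)
    else:
        # The minimum dirty_ratio of 5 is kept for the ratio form.
        dirty = max(dirty_ratio, 5) * dirtyable_pages // 100

    if background_bytes:
        background = -(-background_bytes // PAGE_SIZE)
    else:
        background = background_ratio * dirtyable_pages // 100

    # A background threshold at or above the dirty threshold is halved.
    if background >= dirty:
        background = dirty // 2
    return background, dirty
```

      For example, with 1,000,000 dirtyable pages, a dirty_ratio of 1 is treated
      as 5 and yields a dirty threshold of 50,000 pages, while a dirty_bytes of
      8192 yields a threshold of two pages.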
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change dirty limit type specifiers to unsigned long · 364aeb28
      David Rientjes authored
      The background dirty and dirty limits are better defined with type
      specifiers of unsigned long since negative writeback thresholds are not
      possible.
      
      These values, as returned by get_dirty_limits(), are normally compared
      with ZVC values to determine whether writeback shall commence or be
      throttled.  Such page counts cannot be negative, so declaring the page
      limits as signed is unnecessary.
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: eliminate NULL test and memset after alloc_bootmem · 58a01a45
      Julia Lawall authored
      As noted by Akinobu Mita in patch b1fceac2,
      alloc_bootmem and related functions never return NULL and always return a
      zeroed region of memory.  Thus a NULL test or memset after calls to these
      functions is unnecessary.
      
      This was fixed using the following semantic patch.
      (http://www.emn.fr/x-info/coccinelle/)
      
      // <smpl>
      @@
      expression E;
      statement S;
      @@
      
      E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
      ... when != E
      (
      - BUG_ON (E == NULL);
      |
      - if (E == NULL) S
      )
      
      @@
      expression E,E1;
      @@
      
      E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
      ... when != E
      - memset(E,0,E1);
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: further cleanup page_add_new_anon_rmap · cbf84b7a
      Hugh Dickins authored
      Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
      was good but stupid: we can and should SetPageSwapBacked() there too; and
      we know for sure that this anonymous, swap-backed page is not file cache.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make page_lock_anon_vma() static · 2afd1c92
      Hugh Dickins authored
      page_lock_anon_vma() and page_unlock_anon_vma() were made available to
      show_page_path() in vmscan.c; but now that has been removed, make them
      static in rmap.c again: they're better kept private if possible.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add_active_or_unevictable into rmap · b5934c53
      Hugh Dickins authored
      lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
      appear together.  Save some symbol table space and some jumping around by
      removing lru_cache_add_active_or_unevictable(), folding its code into
      page_add_new_anon_rmap(): like how we add file pages to lru just after
      adding them to page cache.
      
      Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
      change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
      originally intended.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace some BUG_ONs by VM_BUG_ONs · 51726b12
      Hugh Dickins authored
      The swap code is over-provisioned with BUG_ONs on assorted page flags,
      mostly dating back to 2.3.  They're good documentation, and guard against
      developer error, but a waste of space on most systems: change them to
      VM_BUG_ONs, conditional on CONFIG_DEBUG_VM.  Just delete the PagePrivate
      ones: they're later, from 2.5.69, but even less interesting now.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add Set,ClearPageSwapCache stubs · 6d91add0
      Hugh Dickins authored
      If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
      we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove GFP_HIGHUSER_PAGECACHE · 3c1d4378
      Hugh Dickins authored
      GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
      that harder to track down: remove it, and its out-of-work brothers
      GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.
      
      Since we're making that improvement to hotremove_migrate_alloc(), I think
      we can now also remove one of the "o"s from its comment.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove cgroup_mm_owner_callbacks · e5991371
      Hugh Dickins authored
      cgroup_mm_owner_callbacks() was brought in to support the memrlimit
      controller, but sneaked into mainline ahead of it.  That controller has
      now been shelved, and the mm_owner_changed() args were inadequate for it
      anyway (they needed an mm pointer instead of a task pointer).
      
      Remove the dead code, and restore mm_update_next_owner() locking to how it
      was before: taking mmap_sem there does nothing for memcontrol.c, now the
      only user of mm->owner.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>