1. 09 Oct, 2012 40 commits
    • Minchan Kim's avatar
      memory-hotplug: don't replace lowmem pages with highmem · 74c08f98
      Minchan Kim authored
      The changelog for commit 6a6dccba ("mm: cma: don't replace lowmem
      pages with highmem") mentioned that lowmem pages can be replaced by
      highmem pages during CMA migration.  6a6dccba fixed that issue.
      
      Quote from that changelog:
      
      :   The filesystem layer expects pages in the block device's mapping to not
      :   be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
      :   currently replace lowmem pages with highmem pages, leading to crashes in
      :   filesystem code such as the one below:
      :
      :     Unable to handle kernel NULL pointer dereference at virtual address 00000400
      :     pgd = c0c98000
      :     [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
      :     Internal error: Oops: 817 [#1] PREEMPT SMP ARM
      :     CPU: 0    Not tainted  (3.5.0-rc5+ #80)
      :     PC is at __memzero+0x24/0x80
      :     ...
      :     Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
      :     Backtrace:
      :     [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
      :     [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
      :      r4:c15337f0
      :     [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
      :     [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
      :      r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
      :     [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
      :      r6:beccdcf0 r5:00074000 r4:beccdbbc
      :     [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
      
      Memory-hotplug has same problem as CMA has so the same fix can be applied
      to memory-hotplug as well.
      
      Fix it by reusing.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74c08f98
    • Minchan Kim's avatar
      mm/page_alloc: refactor out __alloc_contig_migrate_alloc() · 723a0644
      Minchan Kim authored
      __alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
      it out (move + rename as a common name) into page_isolation.c.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      723a0644
    • Sachin Kamat's avatar
    • Mel Gorman's avatar
      mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Mel Gorman authored
      Compaction caches if a pageblock was scanned and no pages were isolated so
      that the pageblocks can be skipped in the future to reduce scanning.  This
      information is not cleared by the page allocator based on activity due to
      the impact it would have to the page allocator fast paths.  Hence there is
      a requirement that something clear the cache or pageblocks will be skipped
      forever.  Currently the cache is cleared if there were a number of recent
      allocation failures and it has not been cleared within the last 5 seconds.
      Time-based decisions like this are terrible as they have no relationship
      to VM activity and is basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62997027
    • Mel Gorman's avatar
      mm: compaction: Restart compaction from near where it left off · c89511ab
      Mel Gorman authored
      This is almost entirely based on Rik's previous patches and discussions
      with him about how this might be implemented.
      
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
       This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      This patch caches where the migration and free scanner should start from
      on subsequent compaction invocations using the pageblock-skip information.
       When compaction starts it begins from the cached restart points and will
      update the cached restart points until a page is isolated or a pageblock
      is skipped that would have been scanned by synchronous compaction.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c89511ab
    • Mel Gorman's avatar
      mm: compaction: cache if a pageblock was scanned and no pages were isolated · bb13ffeb
      Mel Gorman authored
      When compaction was implemented it was known that scanning could
      potentially be excessive.  The ideal was that a counter be maintained for
      each pageblock but maintaining this information would incur a severe
      penalty due to a shared writable cache line.  It has reached the point
      where the scanning costs are a serious problem, particularly on
      long-lived systems where a large process starts and allocates a large
      number of THPs at the same time.
      
      Instead of using a shared counter, this patch adds another bit to the
      pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
      either migrate or free scanner and 0 pages were isolated, the pageblock is
      marked to be skipped in the future.  When scanning, this bit is checked
      before any scanning takes place and the block skipped if set.
      
      The main difficulty with a patch like this is "when to ignore the cached
      information?" If it's ignored too often, the scanning rates will still be
      excessive.  If the information is too stale then allocations will fail
      that might have otherwise succeeded.  In this patch
      
      o CMA always ignores the information
      o If the migrate and free scanner meet then the cached information will
        be discarded if it's at least 5 seconds since the last time the cache
        was discarded
      o If there are a large number of allocation failures, discard the cache.
      
      The time-based heuristic is very clumsy but there are few choices for a
      better event.  Depending solely on multiple allocation failures still
      allows excessive scanning when THP allocations are failing in quick
      succession due to memory pressure.  Waiting until memory pressure is
      relieved would cause compaction to continually fail instead of using
      reclaim/compaction to try allocate the page.  The time-based mechanism is
      clumsy but a better option is not obvious.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb13ffeb
    • Mel Gorman's avatar
      revert "mm: have order > 0 compaction start off where it left" · 753341a4
      Mel Gorman authored
      This reverts commit 7db8889a ("mm: have order > 0 compaction start
      off where it left") and commit de74f1cc ("mm: have order > 0 compaction
      start near a pageblock with free pages").  These patches were a good
      idea and tests confirmed that they massively reduced the amount of
      scanning but the implementation is complex and tricky to understand.  A
      later patch will cache what pageblocks should be skipped and
      reimplements the concept of compact_cached_free_pfn on top for both
      migration and free scanners.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      753341a4
    • Mel Gorman's avatar
      mm: compaction: acquire the zone->lock as late as possible · f40d1e42
      Mel Gorman authored
      Compaction's free scanner acquires the zone->lock when checking for
      PageBuddy pages and isolating them.  It does this even if there are no
      PageBuddy pages in the range.
      
      This patch defers acquiring the zone lock for as long as possible.  In the
      event there are no free pages in the pageblock then the lock will not be
      acquired at all which reduces contention on zone->lock.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Tested-by: default avatarPeter Ujfalusi <peter.ujfalusi@ti.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f40d1e42
    • Mel Gorman's avatar
      mm: compaction: acquire the zone->lru_lock as late as possible · 2a1402aa
      Mel Gorman authored
      Richard Davies and Shaohua Li have both reported lock contention problems
      in compaction on the zone and LRU locks as well as significant amounts of
      time being spent in compaction.  This series aims to reduce lock
      contention and scanning rates to reduce that CPU usage.  Richard reported
      at https://lkml.org/lkml/2012/9/21/91 that this series made a big
      different to a problem he reported in August:
      
         http://marc.info/?l=kvm&m=134511507015614&w=2
      
      Patch 1 defers acquiring the zone->lru_lock as long as possible.
      
      Patch 2 defers acquiring the zone->lock as lock as possible.
      
      Patch 3 reverts Rik's "skip-free" patches as the core concept gets
      	reimplemented later and the remaining patches are easier to
      	understand if this is reverted first.
      
      Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
      	pageblocks should be skipped by the migrate and free scanners.
      	This drastically reduces the amount of scanning compaction has
      	to do.
      
      Patch 5 reimplements something similar to Rik's idea except it uses the
      	pageblock-skip information to decide where the scanners should
      	restart from and does not need to wrap around.
      
      I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were
      
      akpm-20120920	3.6-rc6 + linux-next/akpm as of Septeber 20th, 2012
      lesslock	Patches 1-6
      revert		Patches 1-7
      cachefail	Patches 1-8
      skipuseless	Patches 1-9
      
      Stress high-order allocation tests looked ok.  Success rates are more or
      less the same with the full series applied but there is an expectation
      that there is less opportunity to race with other allocation requests if
      there is less scanning.  The time to complete the tests did not vary that
      much and are uninteresting as were the vmstat statistics so I will not
      present them here.
      
      Using ftrace I recorded how much scanning was done by compaction and got this
      
                                  3.6.0-rc6     3.6.0-rc6   3.6.0-rc6  3.6.0-rc6 3.6.0-rc6
                                  akpm-20120920 lockless  revert-v2r2  cachefail skipuseless
      
      Total   free    scanned         360753976  515414028  565479007   17103281   18916589
      Total   free    isolated          2852429    3597369    4048601     670493     727840
      Total   free    efficiency        0.0079%    0.0070%    0.0072%    0.0392%    0.0385%
      Total   migrate scanned         247728664  822729112 1004645830   17946827   14118903
      Total   migrate isolated          2555324    3245937    3437501     616359     658616
      Total   migrate efficiency        0.0103%    0.0039%    0.0034%    0.0343%    0.0466%
      
      The efficiency is worthless because of the nature of the test and the
      number of failures.  The really interesting point as far as this patch
      series is concerned is the number of pages scanned.  Note that reverting
      Rik's patches massively increases the number of pages scanned indicating
      that those patches really did make a difference to CPU usage.
      
      However, caching what pageblocks should be skipped has a much higher
      impact.  With patches 1-8 applied, free page and migrate page scanning are
      both reduced by 95% in comparison to the akpm kernel.  If the basic
      concept of Rik's patches are implemened on top then scanning then the free
      scanner barely changed but migrate scanning was further reduced.  That
      said, tests on 3.6-rc5 indicated that the last patch had greater impact
      than what was measured here so it is a bit variable.
      
      One way or the other, this series has a large impact on the amount of
      scanning compaction does when there is a storm of THP allocations.
      
      This patch:
      
      Compaction's migrate scanner acquires the zone->lru_lock when scanning a
      range of pages looking for LRU pages to acquire.  It does this even if
      there are no LRU pages in the range.  If multiple processes are compacting
      then this can cause severe locking contention.  To make matters worse
      commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration") releases the lru_lock every
      SWAP_CLUSTER_MAX pages that are scanned.
      
      This patch makes two changes to how the migrate scanner acquires the LRU
      lock.  First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
      if the lock is contended.  This reduces the number of times it
      unnecessarily disables and re-enables IRQs.  The second is that it defers
      acquiring the LRU lock for as long as possible.  If there are no LRU pages
      or the only LRU pages are transhuge then the LRU lock will not be acquired
      at all which reduces contention on zone->lru_lock.
      
      [minchan@kernel.org: augment comment]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a1402aa
    • Mel Gorman's avatar
      mm: compaction: Update try_to_compact_pages()kerneldoc comment · 661c4cb9
      Mel Gorman authored
      Parameters were added without documentation, tut tut.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      661c4cb9
    • Mel Gorman's avatar
      mm: compaction: move fatal signal check out of compact_checklock_irqsave · 3cc668f4
      Mel Gorman authored
      Commit c67fe375 ("mm: compaction: Abort async compaction if locks
      are contended or taking too long") addressed a lock contention problem
      in compaction by introducing compact_checklock_irqsave() that effecively
      aborting async compaction in the event of compaction.
      
      To preserve existing behaviour it also moved a fatal_signal_pending()
      check into compact_checklock_irqsave() but that is very misleading.  It
      "hides" the check within a locking function but has nothing to do with
      locking as such.  It just happens to work in a desirable fashion.
      
      This patch moves the fatal_signal_pending() check to
      isolate_migratepages_range() where it belongs.  Arguably the same check
      should also happen when isolating pages for freeing but it's overkill.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3cc668f4
    • Shaohua Li's avatar
      mm: compaction: abort compaction loop if lock is contended or run too long · e64c5237
      Shaohua Li authored
      isolate_migratepages_range() might isolate no pages if for example when
      zone->lru_lock is contended and running asynchronous compaction. In this
      case, we should abort compaction, otherwise, compact_zone will run a
      useless loop and make zone->lru_lock is even contended.
      
      An additional check is added to ensure that cc.migratepages and
      cc.freepages get properly drained whan compaction is aborted.
      
      [minchan@kernel.org: Putback pages isolated for migration if aborting]
      [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
      [akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
      [minchan@kernel.org: Putback pages isolated for migration if aborting]
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e64c5237
    • Wanpeng Li's avatar
      mm/memblock: cleanup early_node_map[] related comments · f2d52fe5
      Wanpeng Li authored
      Commit 0ee332c1 ("memblock: Kill early_node_map[]") removed
      early_node_map[].  Clean up the comments to comply with that change.
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f2d52fe5
    • Wanpeng Li's avatar
      mm/memblock: use existing interface to set nid · e9d24ad3
      Wanpeng Li authored
      Use the existing interface function to set the NUMA node ID (NID) for the
      regions, either memory or reserved region.
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e9d24ad3
    • Shaohua Li's avatar
      readahead: fault retry breaks mmap file read random detection · 45cac65b
      Shaohua Li authored
      .fault now can retry.  The retry can break state machine of .fault.  In
      filemap_fault, if page is miss, ra->mmap_miss is increased.  In the second
      try, since the page is in page cache now, ra->mmap_miss is decreased.  And
      these are done in one fault, so we can't detect random mmap file access.
      
      Add a new flag to indicate .fault is tried once.  In the second try, skip
      ra->mmap_miss decreasing.  The filemap_fault state machine is ok with it.
      
      I only tested x86, didn't test other archs, but looks the change for other
      archs is obvious, but who knows :)
      Signed-off-by: default avatarShaohua Li <shaohua.li@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      45cac65b
    • Shaohua Li's avatar
      atomic: implement generic atomic_dec_if_positive() · e79bee24
      Shaohua Li authored
      The x86 implementation of atomic_dec_if_positive is quite generic, so make
      it available to all architectures.
      
      This is needed for "swap: add a simple detector for inappropriate swapin
      readahead".
      
      [akpm@linux-foundation.org: do the "#define foo foo" trick in the conventional manner]
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e79bee24
    • Minchan Kim's avatar
      memory-hotplug: fix pages missed by race rather than failing · 435b405c
      Minchan Kim authored
      If race between allocation and isolation in memory-hotplug offline
      happens, some pages could be in MIGRATE_MOVABLE of free_list although the
      pageblock's migratetype of the page is MIGRATE_ISOLATE.
      
      The race could be detected by get_freepage_migratetype in
      __test_page_isolated_in_pageblock.  If it is detected, now EBUSY gets
      bubbled all the way up and the hotplug operations fails.
      
      But better idea is instead of returning and failing memory-hotremove, move
      the free page to the correct list at the time it is detected.  It could
      enhance memory-hotremove operation success ratio although the race is
      really rare.
      
      Suggested by Mel Gorman.
      
      [akpm@linux-foundation.org: small cleanup]
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      435b405c
    • Minchan Kim's avatar
      memory-hotplug: bug fix race between isolation and allocation · 41d575ad
      Minchan Kim authored
      Like below, memory-hotplug makes race between page-isolation
      and page-allocation so it can hit BUG_ON in __offline_isolated_pages.
      
      	CPU A					CPU B
      
      start_isolate_page_range
      set_migratetype_isolate
      spin_lock_irqsave(zone->lock)
      
      				free_hot_cold_page(Page A)
      				/* without zone->lock */
      				migratetype = get_pageblock_migratetype(Page A);
      				/*
      				 * Page could be moved into MIGRATE_MOVABLE
      				 * of per_cpu_pages
      				 */
      				list_add_tail(&page->lru, &pcp->lists[migratetype]);
      
      set_pageblock_isolate
      move_freepages_block
      drain_all_pages
      
      				/* Page A could be in MIGRATE_MOVABLE of free_list. */
      
      check_pages_isolated
      __test_page_isolated_in_pageblock
      /*
       * We can't catch freed page which
       * is free_list[MIGRATE_MOVABLE]
       */
      if (PageBuddy(page A))
      	pfn += 1 << page_order(page A);
      
      				/* So, Page A could be allocated */
      
      __offline_isolated_pages
      /*
       * BUG_ON hit or offline page
       * which is used by someone
       */
      BUG_ON(!PageBuddy(page A));
      
      This patch checks page's migratetype in freelist in
      __test_page_isolated_in_pageblock.  So now
      __test_page_isolated_in_pageblock can check the page caused by above race
      and can fail of memory offlining.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      41d575ad
    • Minchan Kim's avatar
      mm: remain migratetype in freed page · 95e34412
      Minchan Kim authored
      The page allocator caches the pageblock information in page->private while
      it is in the PCP freelists but this is overwritten with the order of the
      page when freed to the buddy allocator.  This patch stores the migratetype
      of the page in the page->index field so that it is available at all times
      when the page remain in free_list.
      
      This patch adds a new call site in __free_pages_ok so it might be overhead
      a bit but it's for high order allocation.  So I believe damage isn't hurt.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95e34412
    • Minchan Kim's avatar
      mm: page_alloc: use get_freepage_migratetype() instead of page_private() · b12c4ad1
      Minchan Kim authored
      The page allocator uses set_page_private and page_private for handling
      migratetype when it frees page.  Let's replace them with [set|get]
      _freepage_migratetype to make it more clear.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b12c4ad1
    • Bartlomiej Zolnierkiewicz's avatar
      cma: fix watermark checking · d95ea5d1
      Bartlomiej Zolnierkiewicz authored
      * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
        (from Minchan Kim).
      
      * During watermark check decrease available free pages number by
        free CMA pages number if necessary (unmovable allocations cannot
        use pages from CMA areas).
      Signed-off-by: default avatarBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarKyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d95ea5d1
    • Bartlomiej Zolnierkiewicz's avatar
      cma: count free CMA pages · d1ce749a
      Bartlomiej Zolnierkiewicz authored
      Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
      __zone_watermark_ok().  For simplicity and to avoid #ifdef hell make this
      counter always available (not only when CONFIG_CMA=y).
      
      [akpm@linux-foundation.org: use conventional migratetype naming]
      Signed-off-by: default avatarBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarKyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1ce749a
    • Bartlomiej Zolnierkiewicz's avatar
      cma: fix counting of isolated pages · 2139cbe6
      Bartlomiej Zolnierkiewicz authored
      Isolated free pages shouldn't be accounted to NR_FREE_PAGES counter.  Fix
      it by properly decreasing/increasing NR_FREE_PAGES counter in
      set_migratetype_isolate()/unset_migratetype_isolate() and removing counter
      adjustment for isolated pages from free_one_page() and split_free_page().
      Signed-off-by: default avatarBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarKyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2139cbe6
    • Bartlomiej Zolnierkiewicz's avatar
      mm: fix tracing in free_pcppages_bulk() · 770c8aaa
      Bartlomiej Zolnierkiewicz authored
      page->private gets re-used in __free_one_page() to store page order
      (so trace_mm_page_pcpu_drain() may print order instead of migratetype)
      thus migratetype value must be cached locally.
      
      Fixes regression introduced in commit a7016235 ("mm: fix migratetype
      bug which slowed swapping").  This caused incorrect data to be attached
      to the mm_page_pcpu_drain trace event.
      
      [akpm@linux-foundation.org: add comment]
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarKyungmin Park <kyungmin.park@samsung.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      770c8aaa
    • Minchan Kim's avatar
      mm: cma: discard clean pages during contiguous allocation instead of migration · 02c6de8d
      Minchan Kim authored
      Drop clean cache pages instead of migration during alloc_contig_range() to
      minimise allocation latency by reducing the amount of migration that is
      necessary.  It's useful for CMA because latency of migration is more
      important than evicting the background process's working set.  In
      addition, as pages are reclaimed then fewer free pages for migration
      targets are required so it avoids memory reclaiming to get free pages,
      which is a contributory factor to increased latency.
      
      I measured elapsed time of __alloc_contig_migrate_range() which migrates
      10M in 40M movable zone in QEMU machine.
      
      Before - 146ms, After - 7ms
      
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Rik van Riel <riel@redhat.com>
      Tested-by: default avatarKyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      02c6de8d
    • Andrea Arcangeli's avatar
      mm: mmu_notifier: make the mmu_notifier srcu static · 70400303
      Andrea Arcangeli authored
      The variable must be static especially given the variable name.
      
      s/RCU/SRCU/ over a few comments.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70400303
    • Xishi Qiu's avatar
      memory-hotplug: build zonelists when offlining pages · 1e8537ba
      Xishi Qiu authored
      online_pages() does build_all_zonelists() and zone_pcp_update(), I think
      offline_pages() should do it too.
      
      When the zone has no memory to allocate, remove it from other nodes'
      zonelists.  zone_batchsize() depends on zone's present pages, if zone's
      present pages are changed, zone's pcp should be updated.
      Signed-off-by: default avatarXishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e8537ba
    • Michel Lespinasse's avatar
      mm: avoid taking rmap locks in move_ptes() · 38a76013
      Michel Lespinasse authored
      During mremap(), the destination VMA is generally placed after the
      original vma in rmap traversal order: in move_vma(), we always have
      new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
      vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
      
      When the destination VMA is placed after the original in rmap traversal
      order, we can avoid taking the rmap locks in move_ptes().
      
      Essentially, this reintroduces the optimization that had been disabled in
      "mm anon rmap: remove anon_vma_moveto_tail".  The difference is that we
      don't try to impose the rmap traversal order; instead we just rely on
      things being in the desired order in the common case and fall back to
      taking locks in the uncommon case.  Also we skip the i_mmap_mutex in
      addition to the anon_vma lock: in both cases, the vmas are traversed in
      increasing vm_pgoff order with ties resolved in tree insertion order.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38a76013
    • Michel Lespinasse's avatar
      mm anon rmap: in mremap, set the new vma's position before anon_vma_clone() · 523d4e20
      Michel Lespinasse authored
      anon_vma_clone() expects new_vma->vm_{start,end,pgoff} to be correctly set
      so that the new vma can be indexed on the anon interval tree.
      
      copy_vma() was failing to do that, which broke mremap().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Tested-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      523d4e20
    • Michel Lespinasse's avatar
      mm: add CONFIG_DEBUG_VM_RB build option · ed8ea815
      Michel Lespinasse authored
      Add a CONFIG_DEBUG_VM_RB build option for the previously existing
      DEBUG_MM_RB code.  Now that Andi Kleen modified it to avoid using
      recursive algorithms, we can expose it a bit more.
      
      Also extend this code to validate_mm() after stack expansion, and to check
      that the vma's start and last pgoffs have not changed since the nodes were
      inserted on the anon vma interval tree (as it is important that the nodes
      be reindexed after each such update).
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed8ea815
    • Michel Lespinasse's avatar
      mm rmap: remove vma_address check for address inside vma · 86c2ad19
      Michel Lespinasse authored
      In file and anon rmap, we use interval trees to find potentially relevant
      vmas and then call vma_address() to find the virtual address the given
      page might be found at in these vmas.  vma_address() used to include a
      check that the returned address falls within the limits of the vma, but
      this check isn't necessary now that we always use interval trees in rmap:
      the interval tree just doesn't return any vmas which this check would find
      to be irrelevant.  As a result, we can replace the use of -EFAULT error
      code (which then needed to be checked in every call site) with a
      VM_BUG_ON().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86c2ad19
    • Michel Lespinasse's avatar
      mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
      Michel Lespinasse authored
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
      have to walk potentially thousands of irrelevent vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
       The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf181b9f
    • Michel Lespinasse's avatar
      mm anon rmap: remove anon_vma_moveto_tail · 108d6642
      Michel Lespinasse authored
      mremap() had a clever optimization where move_ptes() did not take the
      anon_vma lock to avoid a race with anon rmap users such as page migration.
       Instead, the avc's were ordered in such a way that the origin vma was
      always visited by rmap before the destination.  This ordering and the use
      of page table locks rmap usage safe.  However, we want to replace the use
      of linked lists in anon rmap with an interval tree, and this will make it
      harder to impose such ordering as the interval tree will always be sorted
      by the avc->vma->vm_pgoff value.  For now, let's replace the
      anon_vma_moveto_tail() ordering function with proper anon_vma locking in
      move_ptes().  Once we have the anon interval tree in place, we will
      re-introduce an optimization to avoid taking these locks in the most
      common cases.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      108d6642
    • Michel Lespinasse's avatar
      mm: interval tree updates · 9826a516
      Michel Lespinasse authored
      Update the generic interval tree code that was introduced in "mm: replace
      vma prio_tree with an interval tree".
      
      Changes:
      
      - fixed 'endpoing' typo noticed by Andrew Morton
      
      - replaced include/linux/interval_tree_tmpl.h, which was used as a
        template (including it automatically defined the interval tree
        functions) with include/linux/interval_tree_generic.h, which only
        defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
        defines the interval tree functions when invoked. Now that is a very
        long macro which is unfortunate, but it does make the usage sites
        (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.
      
      - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
        instead of duplicating that code in the interval tree template.
      
      - replaced vma_interval_tree_add(), which was actually handling the
        nonlinear and interval tree cases, with vma_interval_tree_insert_after()
        which handles only the interval tree case and has an API that is more
        consistent with the other interval tree handling functions.
        The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9826a516
    • Michel Lespinasse's avatar
      rbtree: move augmented rbtree functionality to rbtree_augmented.h · 9c079add
      Michel Lespinasse authored
      Provide rb_insert_augmented() and rb_erase_augmented() through a new
      rbtree_augmented.h include file.  rb_erase_augmented() is defined there as
      an __always_inline function, in order to allow inlining of augmented
      rbtree callbacks into it.  Since this generates a relatively large
      function, each augmented rbtree user should make sure to have a single
      call site.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c079add
    • Michel Lespinasse's avatar
      prio_tree: remove · 147e615f
      Michel Lespinasse authored
      After both prio_tree users have been converted to use red-black trees,
      there is no need to keep around the prio tree library anymore.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      147e615f
    • Michel Lespinasse's avatar
      kmemleak: use rbtree instead of prio tree · 85d3a316
      Michel Lespinasse authored
      kmemleak uses a tree where each node represents an allocated memory object
      in order to quickly find out what object a given address is part of.
      However, the objects don't overlap, so rbtrees are a better choice than
      prio tree for this use.  They are both faster and have lower memory
      overhead.
      
      Tested by booting a kernel with kmemleak enabled, loading the
      kmemleak_test module, and looking for the expected messages.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Tested-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85d3a316
    • Michel Lespinasse's avatar
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse authored
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • Michel Lespinasse's avatar
      rbtree: add prio tree and interval tree tests · fff3fd8a
      Michel Lespinasse authored
      Patch 1 implements support for interval trees, on top of the augmented
      rbtree API. It also adds synthetic tests to compare the performance of
      interval trees vs prio trees. Short answers is that interval trees are
      slightly faster (~25%) on insert/erase, and much faster (~2.4 - 3x)
      on search. It is debatable how realistic the synthetic test is, and I have
      not made such measurements yet, but my impression is that interval trees
      would still come out faster.
      
      Patch 2 uses a preprocessor template to make the interval tree generic,
      and uses it as a replacement for the vma prio_tree.
      
      Patch 3 takes the other prio_tree user, kmemleak, and converts it to use
      a basic rbtree. We don't actually need the augmented rbtree support here
      because the intervals are always non-overlapping.
      
      Patch 4 removes the now-unused prio tree library.
      
      Patch 5 proposes an additional optimization to rb_erase_augmented, now
      providing it as an inline function so that the augmented callbacks can be
      inlined in. This provides an additional 5-10% performance improvement
      for the interval tree insert/erase benchmark. There is a maintainance cost
      as it exposes augmented rbtree users to some of the rbtree library internals;
      however I think this cost shouldn't be too high as I expect the augmented
      rbtree will always have much less users than the base rbtree.
      
      I should probably add a quick summary of why I think it makes sense to
      replace prio trees with augmented rbtree based interval trees now.  One of
      the drivers is that we need augmented rbtrees for Rik's vma gap finding
      code, and once you have them, it just makes sense to use them for interval
      trees as well, as this is the simpler and more well known algorithm.  prio
      trees, in comparison, seem *too* clever: they impose an additional 'heap'
      constraint on the tree, which they use to guarantee a faster worst-case
      complexity of O(k+log N) for stabbing queries in a well-balanced prio
      tree, vs O(k*log N) for interval trees (where k=number of matches,
      N=number of intervals).  Now this sounds great, but in practice prio trees
      don't realize this theorical benefit.  First, the additional constraint
      makes them harder to update, so that the kernel implementation has to
      simplify things by balancing them like a radix tree, which is not always
      ideal.  Second, the fact that there are both index and heap properties
      makes both tree manipulation and search more complex, which results in a
      higher multiplicative time constant.  As it turns out, the simple interval
      tree algorithm ends up running faster than the more clever prio tree.
      
      This patch:
      
      Add two test modules:
      
      - prio_tree_test measures the performance of lib/prio_tree.c, both for
        insertion/removal and for stabbing searches
      
      - interval_tree_test measures the performance of a library of equivalent
        functionality, built using the augmented rbtree support.
      
      In order to support the second test module, lib/interval_tree.c is
      introduced. It is kept separate from the interval_tree_test main file
      for two reasons: first we don't want to provide an unfair advantage
      over prio_tree_test by having everything in a single compilation unit,
      and second there is the possibility that the interval tree functionality
      could get some non-test users in kernel over time.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fff3fd8a
    • Michel Lespinasse's avatar
      rbtree: add RB_DECLARE_CALLBACKS() macro · 3908836a
      Michel Lespinasse authored
      As proposed by Peter Zijlstra, this makes it easier to define the augmented
      rbtree callbacks.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3908836a