1. 26 Sep, 2014 40 commits
    • Shaohua Li's avatar
      x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB · c3206478
      Shaohua Li authored
      commit b13b1d2d upstream.
      
      We use the accessed bit to age a page at page reclaim time,
      and currently we also flush the TLB when doing so.
      
      But in some workloads TLB flush overhead is very heavy. In my
      simple multithreaded app with a lot of swap to several pcie
      SSDs, removing the tlb flush gives about 20% ~ 30% swapout
      speedup.
      
      Fortunately just removing the TLB flush is a valid optimization:
      on x86 CPUs, clearing the accessed bit without a TLB flush
      doesn't cause data corruption.
      
      It could cause incorrect page aging and the (mistaken) reclaim of
      hot pages, but the chance of that should be relatively low.
      
      So as a performance optimization don't flush the TLB when
      clearing the accessed bit, it will eventually be flushed by
      a context switch or a VM operation anyway. [ In the rare
      event of it not getting flushed for a long time the delay
      shouldn't really matter because there's no real memory
      pressure for swapout to react to. ]
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: linux-mm@kvack.org
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
      [ Rewrote the changelog and the code comments. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c3206478
    • Vlastimil Babka's avatar
      mm, compaction: properly signal and act upon lock and need_sched() contention · b25fd5de
      Vlastimil Babka authored
      commit be976572 upstream.
      
      Compaction uses compact_checklock_irqsave() function to periodically check
      for lock contention and need_resched() to either abort async compaction,
      or to free the lock, schedule and retake the lock.  When aborting,
      cc->contended is set to signal the contended state to the caller.  Two
      problems have been identified in this mechanism.
      
      First, compaction also calls directly cond_resched() in both scanners when
      no lock is yet taken.  This call either does not abort async compaction,
      or set cc->contended appropriately.  This patch introduces a new
      compact_should_abort() function to achieve both.  In isolate_freepages(),
      the check frequency is reduced to once by SWAP_CLUSTER_MAX pageblocks to
      match what the migration scanner does in the preliminary page checks.  In
      case a pageblock is found suitable for calling isolate_freepages_block(),
      the checks within there are done on higher frequency.
      
      Second, isolate_freepages() does not check if isolate_freepages_block()
      aborted due to contention, and advances to the next pageblock.  This
      violates the principle of aborting on contention, and might result in
      pageblocks not being scanned completely, since the scanning cursor is
      advanced.  This problem has been noticed in the code by Joonsoo Kim when
      reviewing related patches.  This patch makes isolate_freepages_block()
      check the cc->contended flag and abort.
      
      In case isolate_freepages() has already isolated some pages before
      aborting due to contention, page migration will proceed, which is OK since
      we do not want to waste the work that has been done, and page migration
      has own checks for contention.  However, we do not want another isolation
      attempt by either of the scanners, so cc->contended flag check is added
      also to compaction_alloc() and compact_finished() to make sure compaction
      is aborted right after the migration.
      
      The outcome of the patch should be reduced lock contention by async
      compaction and lower latencies for higher-order allocations where direct
      compaction is involved.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Reported-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Tested-by: default avatarShawn Guo <shawn.guo@linaro.org>
      Tested-by: default avatarKevin Hilman <khilman@linaro.org>
      Tested-by: default avatarStephen Warren <swarren@nvidia.com>
      Tested-by: default avatarFabio Estevam <fabio.estevam@freescale.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b25fd5de
    • Vlastimil Babka's avatar
      mm/compaction: avoid rescanning pageblocks in isolate_freepages · 71a5b801
      Vlastimil Babka authored
      commit e9ade569 upstream.
      
      The compaction free scanner in isolate_freepages() currently remembers PFN
      of the highest pageblock where it successfully isolates, to be used as the
      starting pageblock for the next invocation.  The rationale behind this is
      that page migration might return free pages to the allocator when
      migration fails and we don't want to skip them if the compaction
      continues.
      
      Since migration now returns free pages back to compaction code where they
      can be reused, this is no longer a concern.  This patch changes
      isolate_freepages() so that the PFN for restarting is updated with each
      pageblock where isolation is attempted.  Using stress-highalloc from
      mmtests, this resulted in 10% reduction of the pages scanned by the free
      scanner.
      
      Note that the somewhat similar functionality that records highest
      successful pageblock in zone->compact_cached_free_pfn, remains unchanged.
      This cache is used when the whole compaction is restarted, not for
      multiple invocations of the free scanner during single compaction.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      71a5b801
    • Vlastimil Babka's avatar
      mm/compaction: do not count migratepages when unnecessary · 5e4084a6
      Vlastimil Babka authored
      commit f8c9301f upstream.
      
      During compaction, update_nr_listpages() has been used to count remaining
      non-migrated and free pages after a call to migrage_pages().  The
      freepages counting has become unneccessary, and it turns out that
      migratepages counting is also unnecessary in most cases.
      
      The only situation when it's needed to count cc->migratepages is when
      migrate_pages() returns with a negative error code.  Otherwise, the
      non-negative return value is the number of pages that were not migrated,
      which is exactly the count of remaining pages in the cc->migratepages
      list.
      
      Furthermore, any non-zero count is only interesting for the tracepoint of
      mm_compaction_migratepages events, because after that all remaining
      unmigrated pages are put back and their count is set to 0.
      
      This patch therefore removes update_nr_listpages() completely, and changes
      the tracepoint definition so that the manual counting is done only when
      the tracepoint is enabled, and only when migrate_pages() returns a
      negative error code.
      
      Furthermore, migrate_pages() and the tracepoints won't be called when
      there's nothing to migrate.  This potentially avoids some wasted cycles
      and reduces the volume of uninteresting mm_compaction_migratepages events
      where "nr_migrated=0 nr_failed=0".  In the stress-highalloc mmtest, this
      was about 75% of the events.  The mm_compaction_isolate_migratepages event
      is better for determining that nothing was isolated for migration, and
      this one was just duplicating the info.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5e4084a6
    • David Rientjes's avatar
      mm, compaction: terminate async compaction when rescheduling · 0d9b7924
      David Rientjes authored
      commit aeef4b83 upstream.
      
      Async compaction terminates prematurely when need_resched(), see
      compact_checklock_irqsave().  This can never trigger, however, if the
      cond_resched() in isolate_migratepages_range() always takes care of the
      scheduling.
      
      If the cond_resched() actually triggers, then terminate this pageblock
      scan for async compaction as well.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0d9b7924
    • David Rientjes's avatar
      mm, compaction: embed migration mode in compact_control · 812fcdf3
      David Rientjes authored
      commit e0b9daeb upstream.
      
      We're going to want to manipulate the migration mode for compaction in the
      page allocator, and currently compact_control's sync field is only a bool.
      
      Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
      depending on the value of this bool.  Convert the bool to enum
      migrate_mode and pass the migration mode in directly.  Later, we'll want
      to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
      avoid unnecessary latency.
      
      This also alters compaction triggered from sysfs, either for the entire
      system or for a node, to force MIGRATE_SYNC.
      
      [akpm@linux-foundation.org: fix build]
      [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Suggested-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      812fcdf3
    • David Rientjes's avatar
      mm, compaction: add per-zone migration pfn cache for async compaction · 7e95430e
      David Rientjes authored
      commit 35979ef3 upstream.
      
      Each zone has a cached migration scanner pfn for memory compaction so that
      subsequent calls to memory compaction can start where the previous call
      left off.
      
      Currently, the compaction migration scanner only updates the per-zone
      cached pfn when pageblocks were not skipped for async compaction.  This
      creates a dependency on calling sync compaction to avoid having subsequent
      calls to async compaction from scanning an enormous amount of non-MOVABLE
      pageblocks each time it is called.  On large machines, this could be
      potentially very expensive.
      
      This patch adds a per-zone cached migration scanner pfn only for async
      compaction.  It is updated everytime a pageblock has been scanned in its
      entirety and when no pages from it were successfully isolated.  The cached
      migration scanner pfn for sync compaction is updated only when called for
      sync compaction.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7e95430e
    • David Rientjes's avatar
      mm, compaction: return failed migration target pages back to freelist · b264e9ab
      David Rientjes authored
      commit d53aea3d upstream.
      
      Greg reported that he found isolated free pages were returned back to the
      VM rather than the compaction freelist.  This will cause holes behind the
      free scanner and cause it to reallocate additional memory if necessary
      later.
      
      He detected the problem at runtime seeing that ext4 metadata pages (esp
      the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
      constantly visited by compaction calls of migrate_pages().  These pages
      had a non-zero b_count which caused fallback_migrate_page() ->
      try_to_release_page() -> try_to_free_buffers() to fail.
      
      Memory compaction works by having a "freeing scanner" scan from one end of
      a zone which isolates pages as migration targets while another "migrating
      scanner" scans from the other end of the same zone which isolates pages
      for migration.
      
      When page migration fails for an isolated page, the target page is
      returned to the system rather than the freelist built by the freeing
      scanner.  This may require the freeing scanner to continue scanning memory
      after suitable migration targets have already been returned to the system
      needlessly.
      
      This patch returns destination pages to the freeing scanner freelist when
      page migration fails.  This prevents unnecessary work done by the freeing
      scanner but also encourages memory to be as compacted as possible at the
      end of the zone.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b264e9ab
    • David Rientjes's avatar
      mm, migration: add destination page freeing callback · ee92d4d6
      David Rientjes authored
      commit 68711a74 upstream.
      
      Memory migration uses a callback defined by the caller to determine how to
      allocate destination pages.  When migration fails for a source page,
      however, it frees the destination page back to the system.
      
      This patch adds a memory migration callback defined by the caller to
      determine how to free destination pages.  If a caller, such as memory
      compaction, builds its own freelist for migration targets, this can reuse
      already freed memory instead of scanning additional memory.
      
      If the caller provides a function to handle freeing of destination pages,
      it is called when page migration fails.  If the caller passes NULL then
      freeing back to the system will be handled as usual.  This patch
      introduces no functional change.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      ee92d4d6
    • Vlastimil Babka's avatar
      mm/compaction: cleanup isolate_freepages() · e644c10b
      Vlastimil Babka authored
      commit c96b9e50 upstream.
      
      isolate_freepages() is currently somewhat hard to follow thanks to many
      looks like it is related to the 'low_pfn' variable, but in fact it is not.
      
      This patch renames the 'high_pfn' variable to a hopefully less confusing name,
      and slightly changes its handling without a functional change. A comment made
      obsolete by recent changes is also updated.
      
      [akpm@linux-foundation.org: comment fixes, per Minchan]
      [iamjoonsoo.kim@lge.com: cleanups]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e644c10b
    • Heesub Shin's avatar
      mm/compaction: clean up unused code lines · 87db4a8a
      Heesub Shin authored
      commit 13fb44e4 upstream.
      
      Remove code lines currently not in use or never called.
      Signed-off-by: default avatarHeesub Shin <heesub.shin@samsung.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      87db4a8a
    • Fabian Frederick's avatar
      mm/readahead.c: inline ra_submit · 32e8fcae
      Fabian Frederick authored
      commit 29f175d1 upstream.
      
      Commit f9acc8c7 ("readahead: sanify file_ra_state names") left
      ra_submit with a single function call.
      
      Move ra_submit to internal.h and inline it to save some stack.  Thanks
      to Andrew Morton for commenting different versions.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      32e8fcae
    • Al Viro's avatar
      callers of iov_copy_from_user_atomic() don't need pagecache_disable() · 72ef5b50
      Al Viro authored
      commit 9e8c2af9 upstream.
      
      ... it does that itself (via kmap_atomic())
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      72ef5b50
    • Sasha Levin's avatar
      mm: remove read_cache_page_async() · 4fb08e5a
      Sasha Levin authored
      commit 67f9fd91 upstream.
      
      This patch removes read_cache_page_async() which wasn't really needed
      anywhere and simplifies the code around it a bit.
      
      read_cache_page_async() is useful when we want to read a page into the
      cache without waiting for it to complete.  This happens when the
      appropriate callback 'filler' doesn't complete its read operation and
      releases the page lock immediately, and instead queues a different
      completion routine to do that.  This never actually happened anywhere in
      the code.
      
      read_cache_page_async() had 3 different callers:
      
      - read_cache_page() which is the sync version, it would just wait for
        the requested read to complete using wait_on_page_read().
      
      - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
        function it supplied doesn't do any async reads, and would complete
        before the filler function returns - making it actually a sync read.
      
      - CRAMFS would call it using the read_mapping_page_async() wrapper, with
        a similar story to JFFS2 - the filler function doesn't do anything that
        reminds async reads and would always complete before the filler function
        returns.
      
      To sum it up, the code in mm/filemap.c never took advantage of having
      read_cache_page_async().  While there are filler callbacks that do async
      reads (such as the block one), we always called it with the
      read_cache_page().
      
      This patch adds a mandatory wait for read to complete when adding a new
      page to the cache, and removes read_cache_page_async() and its wrappers.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4fb08e5a
    • Johannes Weiner's avatar
      mm: madvise: fix MADV_WILLNEED on shmem swapouts · 5c3ce5b2
      Johannes Weiner authored
      commit 55231e5c upstream.
      
      MADV_WILLNEED currently does not read swapped out shmem pages back in.
      
      Commit 0cd6144a ("mm + fs: prepare for non-page entries in page
      cache radix trees") made find_get_page() filter exceptional radix tree
      entries but failed to convert all find_get_page() callers that WANT
      exceptional entries over to find_get_entry().  One of them is shmem swap
      readahead in madvise, which now skips over any swap-out records.
      
      Convert it to find_get_entry().
      
      Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5c3ce5b2
    • Johannes Weiner's avatar
      mm + fs: prepare for non-page entries in page cache radix trees · e714f0cf
      Johannes Weiner authored
      commit 0cd6144a upstream.
      
      shmem mappings already contain exceptional entries where swap slot
      information is remembered.
      
      To be able to store eviction information for regular page cache, prepare
      every site dealing with the radix trees directly to handle entries other
      than pages.
      
      The common lookup functions will filter out non-page entries and return
      NULL for page cache holes, just as before.  But provide a raw version of
      the API which returns non-page entries as well, and switch shmem over to
      use it.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e714f0cf
    • Johannes Weiner's avatar
      mm: filemap: move radix tree hole searching here · 3721b421
      Johannes Weiner authored
      commit e7b563bb upstream.
      
      The radix tree hole searching code is only used for page cache, for
      example the readahead code trying to get a a picture of the area
      surrounding a fault.
      
      It sufficed to rely on the radix tree definition of holes, which is
      "empty tree slot".  But this is about to change, though, as shadow page
      descriptors will be stored in the page cache after the actual pages get
      evicted from memory.
      
      Move the functions over to mm/filemap.c and make them native page cache
      operations, where they can later be adapted to handle the new definition
      of "page cache hole".
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3721b421
    • Johannes Weiner's avatar
      mm: shmem: save one radix tree lookup when truncating swapped pages · a3d18e49
      Johannes Weiner authored
      commit 6dbaf22c upstream.
      
      Page cache radix tree slots are usually stabilized by the page lock, but
      shmem's swap cookies have no such thing.  Because the overall truncation
      loop is lockless, the swap entry is currently confirmed by a tree lookup
      and then deleted by another tree lookup under the same tree lock region.
      
      Use radix_tree_delete_item() instead, which does the verification and
      deletion with only one lookup.  This also allows removing the
      delete-only special case from shmem_radix_tree_replace().
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a3d18e49
    • Johannes Weiner's avatar
      lib: radix-tree: add radix_tree_delete_item() · 50c4613d
      Johannes Weiner authored
      commit 53c59f26 upstream.
      
      Provide a function that does not just delete an entry at a given index,
      but also allows passing in an expected item.  Delete only if that item
      is still located at the specified index.
      
      This is handy when lockless tree traversals want to delete entries as
      well because they don't have to do an second, locked lookup to verify
      the slot has not changed under them before deleting the entry.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      50c4613d
    • Linus Torvalds's avatar
      mm: don't pointlessly use BUG_ON() for sanity check · c05ac84a
      Linus Torvalds authored
      commit 50f5aa8a upstream.
      
      BUG_ON() is a big hammer, and should be used _only_ if there is some
      major corruption that you cannot possibly recover from, making it
      imperative that the current process (and possibly the whole machine) be
      terminated with extreme prejudice.
      
      The trivial sanity check in the vmacache code is *not* such a fatal
      error.  Recovering from it is absolutely trivial, and using BUG_ON()
      just makes it harder to debug for no actual advantage.
      
      To make matters worse, the placement of the BUG_ON() (only if the range
      check matched) actually makes it harder to hit the sanity check to begin
      with, so _if_ there is a bug (and we just got a report from Srivatsa
      Bhat that this can indeed trigger), it is harder to debug not just
      because the machine is possibly dead, but because we don't have better
      coverage.
      
      BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
      because it is simply just about the worst thing you can ever do if you
      hit some "this cannot happen" situation.
      Reported-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c05ac84a
    • Davidlohr Bueso's avatar
      mm: per-thread vma caching · 9c007307
      Davidlohr Bueso authored
      commit 615d6e87 upstream.
      
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9c007307
    • Christoph Lameter's avatar
      vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state() · edce92fc
      Christoph Lameter authored
      commit 83da7510 upstream.
      
      Seems to be called with preemption enabled.  Therefore it must use
      mod_zone_page_state instead.
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Reported-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Tested-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      edce92fc
    • Vladimir Davydov's avatar
      mm: vmscan: shrink_slab: rename max_pass -> freeable · 4b946426
      Vladimir Davydov authored
      commit d5bc5fd3 upstream.
      
      The name `max_pass' is misleading, because this variable actually keeps
      the estimate number of freeable objects, not the maximal number of
      objects we can scan in this pass, which can be twice that.  Rename it to
      reflect its actual meaning.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4b946426
    • Vladimir Davydov's avatar
      mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim · b3b0bd39
      Vladimir Davydov authored
      commit 99120b77 upstream.
      
      When direct reclaim is executed by a process bound to a set of NUMA
      nodes, we should scan only those nodes when possible, but currently we
      will scan kmem from all online nodes even if the kmem shrinker is NUMA
      aware.  That said, binding a process to a particular NUMA node won't
      prevent it from shrinking inode/dentry caches from other nodes, which is
      not good.  Fix this.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b3b0bd39
    • Jens Axboe's avatar
      mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT · 5a6ad555
      Jens Axboe authored
      commit 7fcbbaf1 upstream.
      
      In some testing I ran today (some fio jobs that spread over two nodes),
      we end up spending 40% of the time in filemap_check_errors().  That
      smells fishy.  Looking further, this is basically what happens:
      
      blkdev_aio_read()
          generic_file_aio_read()
              filemap_write_and_wait_range()
                  if (!mapping->nr_pages)
                      filemap_check_errors()
      
      and filemap_check_errors() always attempts two test_and_clear_bit() on
      the mapping flags, thus dirtying it for every single invocation.  The
      patch below tests each of these bits before clearing them, avoiding this
      issue.  In my test case (4-socket box), performance went from 1.7M IOPS
      to 4.0M IOPS.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5a6ad555
    • Mel Gorman's avatar
      mm: optimize put_mems_allowed() usage · 337c9823
      Mel Gorman authored
      commit d26914d1 upstream.
      
      Since put_mems_allowed() is strictly optional, its a seqcount retry, we
      don't need to evaluate the function if the allocation was in fact
      successful, saving a smp_rmb some loads and comparisons on some relative
      fast-paths.
      
      Since the naming, get/put_mems_allowed() does suggest a mandatory
      pairing, rename the interface, as suggested by Mel, to resemble the
      seqcount interface.
      
      This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
      where it is important to note that the return value of the latter call
      is inverted from its previous incarnation.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      337c9823
    • Raghavendra K T's avatar
      mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages · 8010da49
      Raghavendra K T authored
      commit 6d2be915 upstream.
      
      Currently max_sane_readahead() returns zero on the cpu whose NUMA node
      has no local memory which leads to readahead failure.  Fix this
      readahead failure by returning minimum of (requested pages, 512).  Users
      running applications on a memory-less cpu which needs readahead such as
      streaming application see considerable boost in the performance.
      
      Result:
      
      fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
      CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.
      
      fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
      testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
      the normal NUMA cases w/ patch.
      
        Kernel       Avg  Stddev
        base      7.4975   3.92%
        patched   7.4174   3.26%
      
      [Andrew: making return value PAGE_SIZE independent]
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8010da49
    • David Rientjes's avatar
      mm, compaction: ignore pageblock skip when manually invoking compaction · c4199ba1
      David Rientjes authored
      commit 91ca9186 upstream.
      
      The cached pageblock hint should be ignored when triggering compaction
      through /proc/sys/vm/compact_memory so all eligible memory is isolated.
      Manually invoking compaction is known to be expensive, there's no need
      to skip pageblocks based on heuristics (mainly for debugging).
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c4199ba1
    • David Rientjes's avatar
      mm, compaction: determine isolation mode only once · d36e7004
      David Rientjes authored
      commit da1c67a7 upstream.
      
      The conditions that control the isolation mode in
      isolate_migratepages_range() do not change during the iteration, so
      extract them out and only define the value once.
      
      This actually does have an effect, gcc doesn't optimize it itself because
      of cc->sync.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d36e7004
    • Joonsoo Kim's avatar
      mm/compaction: clean-up code on success of ballon isolation · 0a1802ea
      Joonsoo Kim authored
      commit b6c75016 upstream.
      
      It is just for clean-up to reduce code size and improve readability.
      There is no functional change.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0a1802ea
    • Joonsoo Kim's avatar
      mm/compaction: check pageblock suitability once per pageblock · fc8dc0a9
      Joonsoo Kim authored
      commit c122b208 upstream.
      
      isolation_suitable() and migrate_async_suitable() is used to be sure
      that this pageblock range is fine to be migragted.  It isn't needed to
      call it on every page.  Current code do well if not suitable, but, don't
      do well when suitable.
      
      1) It re-checks isolation_suitable() on each page of a pageblock that was
         already estabilished as suitable.
      2) It re-checks migrate_async_suitable() on each page of a pageblock that
         was not entered through the next_pageblock: label, because
         last_pageblock_nr is not otherwise updated.
      
      This patch fixes situation by 1) calling isolation_suitable() only once
      per pageblock and 2) always updating last_pageblock_nr to the pageblock
      that was just checked.
      
      Additionally, move PageBuddy() check after pageblock unit check, since
      pageblock check is the first thing we should do and makes things more
      simple.
      
      [vbabka@suse.cz: rephrase commit description]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fc8dc0a9
    • Joonsoo Kim's avatar
      mm/compaction: change the timing to check to drop the spinlock · e45dcd3d
      Joonsoo Kim authored
      commit be1aa03b upstream.
      
      It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th
      pfn page.  This may results in below situation while isolating
      migratepage.
      
      1. try isolate 0x0 ~ 0x200 pfn pages.
      2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
         the spinlock.
      3. Then, to complete isolating, retry to aquire the lock.
      
      I think that it is better to use SWAP_CLUSTER_MAX th pfn for checking the
      criteria about dropping the lock.  This has no harm 0x0 pfn, because, at
      this time, locked variable would be false.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e45dcd3d
    • Joonsoo Kim's avatar
      mm/compaction: do not call suitable_migration_target() on every page · 093b8ab7
      Joonsoo Kim authored
      commit 01ead534 upstream.
      
      suitable_migration_target() checks that pageblock is suitable for
      migration target.  In isolate_freepages_block(), it is called on every
      page and this is inefficient.  So make it called once per pageblock.
      
      suitable_migration_target() also checks if page is highorder or not, but
      it's criteria for highorder is pageblock order.  So calling it once
      within pageblock range has no problem.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      093b8ab7
    • Joonsoo Kim's avatar
      mm/compaction: disallow high-order page for migration target · d9c7a696
      Joonsoo Kim authored
      commit 7d348b9e upstream.
      
      Purpose of compaction is to get a high order page.  Currently, if we
      find high-order page while searching migration target page, we break it
      to order-0 pages and use them as migration target.  It is contrary to
      purpose of compaction, so disallow high-order page to be used for
      migration target.
      
      Additionally, clean-up logic in suitable_migration_target() to simplify
      the code.  There is no functional changes from this clean-up.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d9c7a696
    • David Rientjes's avatar
      mm, compaction: avoid isolating pinned pages · 5939eba4
      David Rientjes authored
      commit 119d6d59 upstream.
      
      Page migration will fail for memory that is pinned in memory with, for
      example, get_user_pages().  In this case, it is unnecessary to take
      zone->lru_lock or isolating the page and passing it to page migration
      which will ultimately fail.
      
      This is a racy check, the page can still change from under us, but in
      that case we'll just fail later when attempting to move the page.
      
      This avoids very expensive memory compaction when faulting transparent
      hugepages after pinning a lot of memory with a Mellanox driver.
      
      On a 128GB machine and pinning ~120GB of memory, before this patch we
      see the enormous disparity in the number of page migration failures
      because of the pinning (from /proc/vmstat):
      
      	compact_pages_moved 8450
      	compact_pagemigrate_failed 15614415
      
      0.05% of pages isolated are successfully migrated and explicitly
      triggering memory compaction takes 102 seconds.  After the patch:
      
      	compact_pages_moved 9197
      	compact_pagemigrate_failed 7
      
      99.9% of pages isolated are now successfully migrated in this
      configuration and memory compaction takes less than one second.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5939eba4
    • Yasuaki Ishimatsu's avatar
      mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve · c824c468
      Yasuaki Ishimatsu authored
      commit 943dca1a upstream.
      
      Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on
      9TB memory machine since onlining memory sections is too slow.  And we
      found out setup_zone_migrate_reserve spent >90% of the time.
      
      The problem is, setup_zone_migrate_reserve scans all pageblocks
      unconditionally, but it is only necessary if the number of reserved
      block was reduced (i.e.  memory hot remove).
      
      Moreover, maximum MIGRATE_RESERVE per zone is currently 2.  It means
      that the number of reserved pageblocks is almost always unchanged.
      
      This patch adds zone->nr_migrate_reserve_block to maintain the number of
      MIGRATE_RESERVE pageblocks and it reduces the overhead of
      setup_zone_migrate_reserve dramatically.  The following table shows time
      of onlining a memory section.
      
        Amount of memory     | 128GB | 192GB | 256GB|
        ---------------------------------------------
        linux-3.12           |  23.9 |  31.4 | 44.5 |
        This patch           |   8.3 |   8.3 |  8.6 |
        Mel's proposal patch |  10.9 |  19.2 | 31.3 |
        ---------------------------------------------
                                         (millisecond)
      
        128GB : 4 nodes and each node has 32GB of memory
        192GB : 6 nodes and each node has 32GB of memory
        256GB : 8 nodes and each node has 32GB of memory
      
        (*1) Mel proposed his idea by the following threads.
             https://lkml.org/lkml/2013/10/30/272
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reported-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c824c468
    • Vladimir Davydov's avatar
      mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask · 76026540
      Vladimir Davydov authored
      commit ec97097b upstream.
      
      If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
      once with nid=0, but currently it is not true: if node 0 is not set in
      the nodemask or if it is not online, we will not call such shrinkers at
      all.  As a result some slabs will be left untouched under some
      circumstances.  Let us fix it.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Reported-by: default avatarDave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      76026540
    • Vladimir Davydov's avatar
      mm: vmscan: shrink all slab objects if tight on memory · 671133cd
      Vladimir Davydov authored
      commit 0b1fb40a upstream.
      
      When reclaiming kmem, we currently don't scan slabs that have less than
      batch_size objects (see shrink_slab_node()):
      
              while (total_scan >= batch_size) {
                      shrinkctl->nr_to_scan = batch_size;
                      shrinker->scan_objects(shrinker, shrinkctl);
                      total_scan -= batch_size;
              }
      
      If there are only a few shrinkers available, such a behavior won't cause
      any problems, because the batch_size is usually small, but if we have a
      lot of slab shrinkers, which is perfectly possible since FS shrinkers
      are now per-superblock, we can end up with hundreds of megabytes of
      practically unreclaimable kmem objects.  For instance, mounting a
      thousand of ext2 FS images with a hundred of files in each and iterating
      over all the files using du(1) will result in about 200 Mb of FS caches
      that cannot be dropped even with the aid of the vm.drop_caches sysctl!
      
      This problem was initially pointed out by Glauber Costa [*].  Glauber
      proposed to fix it by making the shrink_slab() always take at least one
      pass, to put it simply, turning the scan loop above to a do{}while()
      loop.  However, this proposal was rejected, because it could result in
      more aggressive and frequent slab shrinking even under low memory
      pressure when total_scan is naturally very small.
      
      This patch is a slightly modified version of Glauber's approach.
      Similarly to Glauber's patch, it makes shrink_slab() scan less than
      batch_size objects, but only if the total number of objects we want to
      scan (total_scan) is greater than the total number of objects available
      (max_pass).  Since total_scan is biased as half max_pass if the current
      delta change is small:
      
              if (delta < max_pass / 4)
                      total_scan = min(total_scan, max_pass / 2);
      
      this is only possible if we are scanning at high prio.  That said, this
      patch shouldn't change the vmscan behaviour if the memory pressure is
      low, but if we are tight on memory, we will do our best by trying to
      reclaim all available objects, which sounds reasonable.
      
      [*] http://www.spinics.net/lists/cgroups/msg06913.htmlSigned-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      671133cd
    • Shaohua Li's avatar
      swap: add a simple detector for inappropriate swapin readahead · df94648c
      Shaohua Li authored
      commit 579f8290 upstream.
      
      This is a patch to improve swap readahead algorithm.  It's from Hugh and
      I slightly changed it.
      
      Hugh's original changelog:
      
      swapin readahead does a blind readahead, whether or not the swapin is
      sequential.  This may be ok on harddisk, because large reads have
      relatively small costs, and if the readahead pages are unneeded they can
      be reclaimed easily - though, what if their allocation forced reclaim of
      useful pages? But on SSD devices large reads are more expensive than
      small ones: if the readahead pages are unneeded, reading them in caused
      significant overhead.
      
      This patch adds very simplistic random read detection.  Stealing the
      PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
      vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
      simply looks at readahead's current success rate, and narrows or widens
      its readahead window accordingly.  There is little science to its
      heuristic: it's about as stupid as can be whilst remaining effective.
      
      The table below shows elapsed times (in centiseconds) when running a
      single repetitive swapping load across a 1000MB mapping in 900MB ram
      with 1GB swap (the harddisk tests had taken painfully too long when I
      used mem=500M, but SSD shows similar results for that).
      
      Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
      Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
      which Shaohua showed to be defective; HughNew this Nov 14 patch, with
      page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
      patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
      (1-page reads: no readahead).
      
      HDD for swapping to harddisk, SSD for swapping to VertexII SSD.  Seq for
      sequential access to the mapping, cycling five times around; Rand for
      the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
      Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
      
      One weakness of Shaohua's vma/anon_vma approach was that it did not
      optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
      50% slower on Seq: did not compete and is not shown below.
      
      HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     73921   76210   75611   76904   78191  121542
      Seq Shmem    73601   73176   73855   72947   74543  118322
      Rand Anon   895392  831243  871569  845197  846496  841680
      Rand Shmem 1058375 1053486  827935  764955  764376  756489
      
      SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     24634   24198   24673   25107   21614   70018
      Seq Shmem    24959   24932   25052   25703   22030   69678
      Rand Anon    43014   26146   28075   25989   26935   25901
      Rand Shmem   45349   45215   28249   24268   24138   24332
      
      These tests are, of course, two extremes of a very simple case: under
      heavier mixed loads I've not yet observed any consistent improvement or
      degradation, and wider testing would be welcome.
      
      Shaohua Li:
      
      Test shows Vanilla is slightly better in sequential workload than Hugh's
      patch.  I observed with Hugh's patch sometimes the readahead size is
      shrinked too fast (from 8 to 1 immediately) in sequential workload if
      there is no hit.  And in such case, continuing doing readahead is good
      actually.
      
      I don't prepare a sophisticated algorithm for the sequential workload
      because so far we can't guarantee sequential accessed pages are swap out
      sequentially.  So I slightly change Hugh's heuristic - don't shrink
      readahead size too fast.
      
      Here is my test result (unit second, 3 runs average):
      	Vanilla		Hugh		New
      Seq	356		370		360
      Random	4525		2447		2444
      
      Attached graph is the swapin/swapout throughput I collected with 'vmstat
      2'.  The first part is running a random workload (till around 1200 of
      the x-axis) and the second part is running a sequential workload.
      swapin and swapout throughput are almost identical in steady state in
      both workloads.  These are expected behavior.  while in Vanilla, swapin
      is much bigger than swapout especially in random workload (because wrong
      readahead).
      
      Original patches by: Shaohua Li and Konstantin Khlebnikov.
      
      [fengguang.wu@intel.com: swapin_nr_pages() can be static]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      df94648c
    • Vlastimil Babka's avatar
      mm: compaction: reset scanner positions immediately when they meet · 3d3de516
      Vlastimil Babka authored
      commit 55b7c4c9 upstream.
      
      Compaction used to start its migrate and free page scaners at the zone's
      lowest and highest pfn, respectively.  Later, caching was introduced to
      remember the scanners' progress across compaction attempts so that
      pageblocks are not re-scanned uselessly.  Additionally, pageblocks where
      isolation failed are marked to be quickly skipped when encountered again
      in future compactions.
      
      Currently, both the reset of cached pfn's and clearing of the pageblock
      skip information for a zone is done in __reset_isolation_suitable().
      This function gets called when:
      
       - compaction is restarting after being deferred
       - compact_blockskip_flush flag is set in compact_finished() when the scanners
         meet (and not again cleared when direct compaction succeeds in allocation)
         and kswapd acts upon this flag before going to sleep
      
      This behavior is suboptimal for several reasons:
      
       - when direct sync compaction is called after async compaction fails (in the
         allocation slowpath), it will effectively do nothing, unless kswapd
         happens to process the compact_blockskip_flush flag meanwhile. This is racy
         and goes against the purpose of sync compaction to more thoroughly retry
         the compaction of a zone where async compaction has failed.
         The restart-after-deferring path cannot help here as deferring happens only
         after the sync compaction fails. It is also done only for the preferred
         zone, while the compaction might be done for a fallback zone.
      
       - the mechanism of marking pageblock to be skipped has little value since the
         cached pfn's are reset only together with the pageblock skip flags. This
         effectively limits pageblock skip usage to parallel compactions.
      
      This patch changes compact_finished() so that cached pfn's are reset
      immediately when the scanners meet.  Clearing pageblock skip flags is
      unchanged, as well as the other situations where cached pfn's are reset.
      This allows the sync-after-async compaction to retry pageblocks not
      marked as skipped, such as blocks !MIGRATE_MOVABLE blocks that async
      compactions now skips without marking them.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3d3de516