1. 26 Sep, 2014 40 commits
    • mm: page_alloc: lookup pageblock migratetype with IRQs enabled during free · f161eedc
      Mel Gorman authored
      commit cfc47a28 upstream.
      
      get_pageblock_migratetype() is called during free with IRQs disabled.
      This is unnecessary and disables IRQs for longer than necessary.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      f161eedc
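      The reordering described in f161eedc can be illustrated with a small
      userspace sketch.  The fake_* helpers and the irqs_off flag are
      hypothetical stand-ins for local_irq_save(), local_irq_restore() and
      get_pageblock_migratetype(); this is not the kernel code, only the
      shape of the change.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: a flag instead of real interrupt masking. */
static bool irqs_off;
static unsigned int lookups_with_irqs_off;

static void fake_irq_save(void)    { irqs_off = true; }
static void fake_irq_restore(void) { irqs_off = false; }

/* Pretend lookup; records whether it ran inside the IRQ-off window. */
static int fake_get_migratetype(unsigned long pfn)
{
    if (irqs_off)
        lookups_with_irqs_off++;
    return (int)(pfn % 3);
}

/* Before: the lookup sits inside the IRQ-disabled section. */
static int free_page_before(unsigned long pfn)
{
    int migratetype;

    fake_irq_save();
    migratetype = fake_get_migratetype(pfn);
    /* ... free-list manipulation would happen here ... */
    fake_irq_restore();
    return migratetype;
}

/* After: look up first, then disable IRQs only for the list update. */
static int free_page_after(unsigned long pfn)
{
    int migratetype = fake_get_migratetype(pfn);

    fake_irq_save();
    /* ... free-list manipulation would happen here ... */
    fake_irq_restore();
    return migratetype;
}
```

      Both variants return the same migratetype; only the second keeps the
      lookup out of the window in which interrupts are masked.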
    • mm: page_alloc: convert hot/cold parameter and immediate callers to bool · 3e7379c0
      Mel Gorman authored
      commit b745bc85 upstream.
      
      cold is a bool, make it one.  Make the likely case the "if" part of the
      block instead of the else as according to the optimisation manual this is
      preferred.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      3e7379c0
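      A minimal sketch of what 3e7379c0 describes, assuming a simplified
      per-cpu list: cold is a real bool and the likely (hot) case is the
      "if" branch.  struct fake_page, list_insert() and free_hot_cold() are
      illustrative names, not the kernel's definitions.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *head)
{
    head->prev = head->next = head;
}

/* Link entry between two neighbouring nodes. */
static void list_insert(struct list_head *entry,
                        struct list_head *prev, struct list_head *next)
{
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
}

/* Hypothetical page stand-in; only the list linkage matters here. */
struct fake_page { struct list_head lru; };

/* cold is a genuine bool, and the likely (hot) case is the "if" branch. */
static void free_hot_cold(struct fake_page *page, struct list_head *pcp,
                          bool cold)
{
    if (!cold)
        list_insert(&page->lru, pcp, pcp->next);   /* hot: list head */
    else
        list_insert(&page->lru, pcp->prev, pcp);   /* cold: list tail */
}
```

      Hot pages end up at the head of the list, where the next allocation
      looks first; cold pages go to the tail.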
    • mm: page_alloc: reduce number of times page_to_pfn is called · c01947d6
      Mel Gorman authored
      commit dc4b0caf upstream.
      
      In the free path we calculate page_to_pfn multiple times. Reduce that.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      c01947d6
    • mm: page_alloc: use unsigned int for order in more places · da530fd8
      Mel Gorman authored
      commit 7aeb09f9 upstream.
      
      X86 prefers the use of unsigned types for iterators and there is a
      tendency to mix whether a signed or unsigned type is used for page order.
      This converts a number of sites in mm/page_alloc.c to use unsigned int for
      order where possible.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      da530fd8
    • mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path · 35515de9
      Mel Gorman authored
      commit 5dab2911 upstream.
      
      ALLOC_NO_WATERMARK is set in a few cases.  Always by kswapd, always for
      __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc.  Each of these
      cases is a relatively rare event but the ALLOC_NO_WATERMARK check is an
      unlikely branch in the fast path.  This patch moves the check out of the
      fast path and after it has been determined that the watermarks have not
      been met.  This helps the common fast path at the cost of making the slow
      path slower and hitting kswapd with a performance cost.  It's a reasonable
      tradeoff.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      35515de9
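      The trade-off in 35515de9 can be sketched as follows.  The flag value,
      struct fake_zone and the simplified watermark test are assumptions made
      for illustration; the kernel's zone_watermark_ok() does considerably
      more.

```c
#include <stdbool.h>

#define SKETCH_ALLOC_NO_WATERMARKS 0x04u   /* hypothetical flag value */

/* Hypothetical zone: just a free-page count and one watermark. */
struct fake_zone {
    unsigned long free_pages;
    unsigned long watermark;
};

static bool watermark_ok(const struct fake_zone *zone, unsigned int order)
{
    return zone->free_pages >= zone->watermark + (1UL << order);
}

/* Before: the rarely-set flag is tested first on every allocation. */
static bool zone_usable_before(const struct fake_zone *zone,
                               unsigned int order, unsigned int alloc_flags)
{
    if (alloc_flags & SKETCH_ALLOC_NO_WATERMARKS)
        return true;
    return watermark_ok(zone, order);
}

/* After: the flag is consulted only once the watermark check has failed. */
static bool zone_usable_after(const struct fake_zone *zone,
                              unsigned int order, unsigned int alloc_flags)
{
    if (watermark_ok(zone, order))
        return true;    /* common fast path, no flag test */
    return (alloc_flags & SKETCH_ALLOC_NO_WATERMARKS) != 0;
}
```

      Both functions give the same answer; the second simply pays for the
      flag test only on the path where the watermark was not met.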
    • mm: page_alloc: only check the alloc flags and gfp_mask for dirty once · d6217476
      Mel Gorman authored
      commit a6e21b14 upstream.
      
      Currently it's calculated once per zone in the zonelist.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d6217476
    • mm: page_alloc: only check the zone id check if pages are buddies · 6a85a5ad
      Mel Gorman authored
      commit d34c5fa0 upstream.
      
      A node/zone index is used to check if pages are compatible for merging
      but this happens unconditionally even if the buddy page is not free. Defer
      the calculation as long as possible. Ideally we would check the zone boundary
      but nodes can overlap.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      6a85a5ad
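      A hedged sketch of the deferral described in 6a85a5ad: the cheap "is
      the buddy free and of the same order" checks run first, and the zone id
      comparison only happens for candidates that could actually merge.
      struct fake_page, fake_page_zone_id() and the lookup counter are
      illustrative assumptions, not kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical page descriptor with only the fields the sketch needs. */
struct fake_page {
    bool is_free_buddy;     /* stands in for PageBuddy() */
    unsigned int order;
    int zone_id;
};

static unsigned int zone_id_lookups;   /* counts the "expensive" lookups */

static int fake_page_zone_id(const struct fake_page *page)
{
    zone_id_lookups++;
    return page->zone_id;
}

static bool page_is_buddy_sketch(const struct fake_page *page,
                                 const struct fake_page *buddy,
                                 unsigned int order)
{
    /* Cheap checks first: most candidates are rejected here. */
    if (!buddy->is_free_buddy || buddy->order != order)
        return false;

    /* Zone comparison is done late, only for pages that could merge. */
    return fake_page_zone_id(page) == fake_page_zone_id(buddy);
}
```

      When the buddy is not free, the function returns without touching the
      zone id at all, which is the saving the commit is after.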
    • mm: page_alloc: calculate classzone_idx once from the zonelist ref · dc2786f0
      Mel Gorman authored
      commit d8846374 upstream.
      
      There is no need to calculate zone_idx(preferred_zone) multiple times
      or use the pgdat to figure it out.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      dc2786f0
    • mm: page_alloc: use jump labels to avoid checking number_of_cpusets · ee1760b2
      Mel Gorman authored
      commit 664eedde upstream.
      
      If cpusets are not in use then we still check a global variable on every
      page allocation.  Use jump labels to avoid the overhead.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      ee1760b2
    • include/linux/jump_label.h: expose the reference count · f99bfd27
      Mel Gorman authored
      commit ea5e9539 upstream.
      
      This patch exposes the jump_label reference count in preparation for the
      next patch.  cpusets cares about both the jump_label being enabled and how
      many users of the cpusets there currently are.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      f99bfd27
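      As a userspace analogy for f99bfd27, the sketch below keeps one atomic
      counter that answers both questions a cpuset-like user asks: "is the
      feature enabled?" and "how many users are there?".  All sketch_* names
      are hypothetical, a real jump label additionally patches the branch
      sites in the code, and the +1 for a permanently present root user is an
      assumption of this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace analogy only: a real static key also patches branch sites. */
struct sketch_key { atomic_int enabled; };

/* The exposed reference count. */
static int sketch_key_count(struct sketch_key *key)
{
    return atomic_load(&key->enabled);
}

static bool sketch_key_enabled(struct sketch_key *key)
{
    return sketch_key_count(key) > 0;
}

static void sketch_key_inc(struct sketch_key *key)
{
    atomic_fetch_add(&key->enabled, 1);
}

static void sketch_key_dec(struct sketch_key *key)
{
    atomic_fetch_sub(&key->enabled, 1);
}

/* Number of users derived from the same counter. */
static int nr_users_sketch(struct sketch_key *key)
{
    return sketch_key_count(key) + 1;   /* +1: assumed always-present root */
}
```

      With the count exposed, a caller does not need a separate global
      variable to track how many users exist.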
    • mm: page_alloc: do not treat a zone that cannot be used for dirty pages as "full" · 605cf7a1
      Mel Gorman authored
      commit 800a1e75 upstream.
      
      If a zone cannot be used for a dirty page then it gets marked "full" which
      is cached in the zlc and later potentially skipped by allocation requests
      that have nothing to do with dirty zones.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      605cf7a1
    • mm: page_alloc: do not update zlc unless the zlc is active · 55dcadc2
      Mel Gorman authored
      commit 65bb3719 upstream.
      
      The zlc is used on NUMA machines to quickly skip over zones that are full.
      However, it is always updated, even for the first zone scanned when the
      zlc might not even be active.  As it's a write to a bitmap that
      potentially bounces a cache line, it's deceptively expensive and most
      machines will not care.  Only update the zlc if it was active.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      55dcadc2
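      The rule from 55dcadc2, "only update the zlc if it was active", can be
      sketched with a toy zone scan.  struct sketch_zlc, its bool array (in
      place of the real bitmap) and pick_zone() are assumptions made for
      illustration; callers are expected to pass at most SKETCH_MAX_ZONES
      zones.

```c
#include <stdbool.h>

#define SKETCH_MAX_ZONES 4

/* Toy zonelist cache. */
struct sketch_zlc {
    bool fullzones[SKETCH_MAX_ZONES];   /* stands in for the zlc bitmap */
    unsigned int writes;                /* counts cache updates */
};

static void zlc_mark_zone_full(struct sketch_zlc *zlc, int zone)
{
    zlc->fullzones[zone] = true;
    zlc->writes++;
}

/*
 * Returns the first zone with a free page, or -1.  The cache is written
 * only when zlc_active says it is actually being consulted.
 */
static int pick_zone(struct sketch_zlc *zlc, const unsigned long *free_pages,
                     int nr_zones, bool zlc_active)
{
    for (int zone = 0; zone < nr_zones; zone++) {
        if (zlc_active && zlc->fullzones[zone])
            continue;   /* cached as full, skip cheaply */
        if (free_pages[zone] > 0)
            return zone;
        if (zlc_active)
            zlc_mark_zone_full(zlc, zone);
    }
    return -1;
}
```

      A scan with the cache inactive performs no writes at all, so it cannot
      bounce the cache line that holds the bitmap.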
    • mm/swap.c: clean up *lru_cache_add* functions · 69aa12f2
      Jianyu Zhan authored
      commit 2329d375 upstream.
      
      In mm/swap.c, __lru_cache_add() is exported, but actually there are no
      users outside this file.
      
      This patch unexports __lru_cache_add(), and makes it static.  It also
      exports lru_cache_add_file(), as it is used by cifs and fuse, which can
      be loaded as modules.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      69aa12f2
    • mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced · 4cd64dce
      Vlastimil Babka authored
      commit 5bcc9f86 upstream.
      
      For the MIGRATE_RESERVE pages, it is useful when they do not get
      misplaced on free_list of other migratetype, otherwise they might get
      allocated prematurely and e.g. fragment the MIGRATE_RESERVE pageblocks.
      While this cannot be avoided completely when allocating new
      MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should
      prevent the misplacement where possible.
      
      Currently, it is possible for the misplacement to happen when a
      MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a
      fallback for other desired migratetype, and then later freed back
      through free_pcppages_bulk() without being actually used.  This happens
      because free_pcppages_bulk() uses get_freepage_migratetype() to choose
      the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
      the *desired* migratetype and not the page's original MIGRATE_RESERVE
      migratetype.
      
      This patch fixes the problem by moving the call to
      set_freepage_migratetype() from rmqueue_bulk() down to
      __rmqueue_smallest() and __rmqueue_fallback() where the actual page's
      migratetype (e.g.  from which free_list the page is taken from) is used.
      Note that this migratetype might be different from the pageblock's
      migratetype due to freepage stealing decisions.  This is OK, as page
      stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
      to leave all MIGRATE_CMA pages on the correct freelist.
      
      Therefore, as an additional benefit, the call to
      get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
      be removed completely.  This relies on the fact that MIGRATE_CMA
      pageblocks are created only during system init, and the above.  The
      related is_migrate_isolate() check is also unnecessary, as memory
      isolation has other ways to move pages between freelists, and drain pcp
      lists containing pages that should be isolated.  The buffered_rmqueue()
      can also benefit from calling get_freepage_migratetype() instead of
      get_pageblock_migratetype().
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
      Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Suggested-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Wang, Yalin" <Yalin.Wang@sonymobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4cd64dce
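      The misplacement described in 4cd64dce can be reproduced in a toy
      model.  Each free list holds at most one page, the MT_* constants and
      all function names are simplified stand-ins, and the reserve list is
      modelled as a last resort after the desired list is empty; none of this
      is the kernel's actual code.

```c
#include <stddef.h>

enum { MT_MOVABLE, MT_RESERVE, MT_TYPES };

/* Hypothetical page that remembers which free list it belongs to. */
struct fake_page { int freepage_migratetype; };

/* Toy zone: one page slot per migratetype free list. */
struct fake_zone { struct fake_page *free_list[MT_TYPES]; };

/* Take a page and record the list it really came from (the fix). */
static struct fake_page *rmqueue_smallest(struct fake_zone *zone,
                                          int migratetype)
{
    struct fake_page *page = zone->free_list[migratetype];

    if (!page)
        return NULL;
    zone->free_list[migratetype] = NULL;
    page->freepage_migratetype = migratetype;
    return page;
}

/* Try the desired list, then fall back to the reserve as a last resort. */
static struct fake_page *rmqueue(struct fake_zone *zone, int desired)
{
    struct fake_page *page = rmqueue_smallest(zone, desired);

    if (!page && desired != MT_RESERVE)
        page = rmqueue_smallest(zone, MT_RESERVE);
    return page;
}

/* Old behaviour: the bulk path overwrote the type with the desired one. */
static struct fake_page *rmqueue_bulk_old(struct fake_zone *zone, int desired)
{
    struct fake_page *page = rmqueue(zone, desired);

    if (page)
        page->freepage_migratetype = desired;   /* clobbers the real type */
    return page;
}

/* Freeing from the per-cpu list trusts the recorded type. */
static void free_pcppage(struct fake_zone *zone, struct fake_page *page)
{
    zone->free_list[page->freepage_migratetype] = page;
}
```

      With rmqueue() the reserve page returns to the reserve list when freed
      unused; with rmqueue_bulk_old() it lands on the movable list instead.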
    • mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY · 2ce666c1
      Mel Gorman authored
      commit 1a501907 upstream.
      
      Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
      ensured that file/anon lists were scanned proportionally for reclaim from
      kswapd but ignored it for direct reclaim.  The intent was to minimise
      direct reclaim latency but Yuanhan Liu pointed out that it substitutes one
      long stall for many small stalls and distorts aging for normal workloads
      like streaming readers/writers.  Hugh Dickins pointed out that a
      side-effect of the same commit was that when one LRU list dropped to zero
      the entirety of the other list was shrunk, leading to excessive
      reclaim in memcgs.  This patch scans the file/anon lists proportionally
      for direct reclaim to similarly age pages whether reclaimed by kswapd or
      direct reclaim but takes care to abort reclaim if one LRU drops to zero
      after reclaiming the requested number of pages.
      
      Based on ext4 and using the Intel VM scalability test
      
                                                    3.15.0-rc5            3.15.0-rc5
                                                      shrinker            proportion
      Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
      Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
      Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
      Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
      Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
      Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)
      
      The test cases are running multiple dd instances reading sparse files. The results are within
      the noise for the small test machine. The impact of the patch is more noticeable from the vmstats
      
                                  3.15.0-rc5  3.15.0-rc5
                                    shrinker  proportion
      Minor Faults                     35154       36784
      Major Faults                       611        1305
      Swap Ins                           394        1651
      Swap Outs                         4394        5891
      Allocation stalls               118616       44781
      Direct pages scanned           4935171     4602313
      Kswapd pages scanned          15921292    16258483
      Kswapd pages reclaimed        15913301    16248305
      Direct pages reclaimed         4933368     4601133
      Kswapd efficiency                  99%         99%
      Kswapd velocity             670088.047  682555.961
      Direct efficiency                  99%         99%
      Direct velocity             207709.217  193212.133
      Percentage direct scans            23%         22%
      Page writes by reclaim        4858.000    6232.000
      Page writes file                   464         341
      Page writes anon                  4394        5891
      
      Note that there are fewer allocation stalls even though the amount
      of direct reclaim scanning is very approximately the same.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2ce666c1
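      The arithmetic behind proportional scanning in 2ce666c1 can be sketched
      as below.  This is a simplified illustration loosely modelled on the
      description above, not the kernel's shrink_lruvec(); struct scan_state
      and both function names are assumptions of the sketch.

```c
#include <stdbool.h>

/* Scan bookkeeping for one LRU type (file or anon). */
struct scan_state {
    unsigned long target;       /* pages originally planned for this list */
    unsigned long remaining;    /* pages still to scan */
};

/*
 * Once enough pages have been reclaimed, stop scanning the list with less
 * work left and trim the other list so that both end up scanned to roughly
 * the same fraction of their original targets.
 */
static void rebalance_scan(struct scan_state *smaller,
                           struct scan_state *larger)
{
    unsigned long percentage, scanned;

    /* Share of the smaller list's target that is still unscanned. */
    percentage = smaller->remaining * 100 / (smaller->target + 1);
    smaller->remaining = 0;

    /* Give the larger list the same unscanned share, minus work done. */
    scanned = larger->target - larger->remaining;
    larger->remaining = larger->target * (100 - percentage) / 100;
    larger->remaining -= (larger->remaining < scanned) ?
                         larger->remaining : scanned;
}

/* Abort once the request is met and one list has nothing left to scan. */
static bool should_stop_reclaim(unsigned long nr_reclaimed,
                                unsigned long nr_to_reclaim,
                                unsigned long nr_file, unsigned long nr_anon)
{
    if (nr_reclaimed < nr_to_reclaim)
        return false;
    return nr_file == 0 || nr_anon == 0;
}
```

      For example, with an anon target of 100 (68 left) and a file target of
      1000 (968 left), about a third of the anon list has been scanned, so
      the file list is trimmed to scan about a third of its target in total.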
    • fs/superblock: avoid locking counting inodes and dentries before reclaiming them · 2d37a72e
      Tim Chen authored
      commit d23da150 upstream.
      
      Remove the call to grab_super_passive() from super_cache_count().
      That call becomes a scalability bottleneck when multiple threads are
      trying to do memory reclamation, e.g. when we are doing a large amount
      of file reads and the page cache is under pressure.  The cached objects
      are quickly reclaimed down to 0 and the cache_scan() reclaim is aborted,
      but the counting still creates a log jam acquiring the sb_lock.
      
      We are holding the shrinker_rwsem which ensures the safety of call to
      list_lru_count_node() and s_op->nr_cached_objects.  The shrinker is
      unregistered now before ->kill_sb() so the operation is safe when we are
      doing unmount.
      
      The impact will depend heavily on the machine and the workload but for a
      small machine using postmark tuned to use 4xRAM size the results were
      
                                        3.15.0-rc5            3.15.0-rc5
                                           vanilla         shrinker-v1r1
      Ops/sec Transactions         21.00 (  0.00%)       24.00 ( 14.29%)
      Ops/sec FilesCreate          39.00 (  0.00%)       44.00 ( 12.82%)
      Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
      Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
      Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
      Ops/sec DataRead/MB          25.97 (  0.00%)       29.10 ( 12.05%)
      Ops/sec DataWrite/MB         49.99 (  0.00%)       56.02 ( 12.06%)
      
      ffsb running in a configuration that is meant to simulate a mail server showed
      
                                       3.15.0-rc5             3.15.0-rc5
                                          vanilla          shrinker-v1r1
      Ops/sec readall           9402.63 (  0.00%)      9567.97 (  1.76%)
      Ops/sec create            4695.45 (  0.00%)      4735.00 (  0.84%)
      Ops/sec delete             173.72 (  0.00%)       179.83 (  3.52%)
      Ops/sec Transactions     14271.80 (  0.00%)     14482.81 (  1.48%)
      Ops/sec Read                37.00 (  0.00%)        37.60 (  1.62%)
      Ops/sec Write               18.20 (  0.00%)        18.30 (  0.55%)
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2d37a72e
    • fs/superblock: unregister sb shrinker before ->kill_sb() · 2f11e3a8
      Dave Chinner authored
      commit 28f2cd4f upstream.
      
      This series is aimed at regressions noticed during reclaim activity.  The
      first two patches are shrinker patches that were posted ages ago but never
      merged for reasons that are unclear to me.  I'm posting them again to see
      if there was a reason they were dropped or if they just got lost.  Dave?
      Tim?  The last patch adjusts proportional reclaim.  Yuanhan Liu, can you
      retest the vm scalability test cases on a larger machine?  Hugh, does this
      work for you on the memcg test cases?
      
      Based on ext4, I get the following results but unfortunately my larger
      test machines are all unavailable so this is based on a relatively small
      machine.
      
      postmark
                                        3.15.0-rc5            3.15.0-rc5
                                           vanilla       proportion-v1r4
      Ops/sec Transactions         21.00 (  0.00%)       25.00 ( 19.05%)
      Ops/sec FilesCreate          39.00 (  0.00%)       45.00 ( 15.38%)
      Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
      Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
      Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
      Ops/sec DataRead/MB          25.97 (  0.00%)       30.02 ( 15.59%)
      Ops/sec DataWrite/MB         49.99 (  0.00%)       57.78 ( 15.58%)
      
      ffsb (mail server simulator)
                                       3.15.0-rc5             3.15.0-rc5
                                          vanilla        proportion-v1r4
      Ops/sec readall           9402.63 (  0.00%)      9805.74 (  4.29%)
      Ops/sec create            4695.45 (  0.00%)      4781.39 (  1.83%)
      Ops/sec delete             173.72 (  0.00%)       177.23 (  2.02%)
      Ops/sec Transactions     14271.80 (  0.00%)     14764.37 (  3.45%)
      Ops/sec Read                37.00 (  0.00%)        38.50 (  4.05%)
      Ops/sec Write               18.20 (  0.00%)        18.50 (  1.65%)
      
      dd of a large file
                                      3.15.0-rc5            3.15.0-rc5
                                         vanilla       proportion-v1r4
      WallTime DownloadTar       75.00 (  0.00%)       61.00 ( 18.67%)
      WallTime DD               423.00 (  0.00%)      401.00 (  5.20%)
      WallTime Delete             2.00 (  0.00%)        5.00 (-150.00%)
      
      stutter (times mmap latency during large amounts of IO)
      
                                  3.15.0-rc5            3.15.0-rc5
                                     vanilla       proportion-v1r4
      Unit >5ms Delays  80252.0000 (  0.00%)  81523.0000 ( -1.58%)
      Unit Mmap min         8.2118 (  0.00%)      8.3206 ( -1.33%)
      Unit Mmap mean       17.4614 (  0.00%)     17.2868 (  1.00%)
      Unit Mmap stddev     24.9059 (  0.00%)     34.6771 (-39.23%)
      Unit Mmap max      2811.6433 (  0.00%)   2645.1398 (  5.92%)
      Unit Mmap 90%        20.5098 (  0.00%)     18.3105 ( 10.72%)
      Unit Mmap 93%        22.9180 (  0.00%)     20.1751 ( 11.97%)
      Unit Mmap 95%        25.2114 (  0.00%)     22.4988 ( 10.76%)
      Unit Mmap 99%        46.1430 (  0.00%)     43.5952 (  5.52%)
      Unit Ideal  Tput     85.2623 (  0.00%)     78.8906 (  7.47%)
      Unit Tput min        44.0666 (  0.00%)     43.9609 (  0.24%)
      Unit Tput mean       45.5646 (  0.00%)     45.2009 (  0.80%)
      Unit Tput stddev      0.9318 (  0.00%)      1.1084 (-18.95%)
      Unit Tput max        46.7375 (  0.00%)     46.7539 ( -0.04%)
      
      This patch (of 3):
      
      We would like to unregister the sb shrinker before ->kill_sb().  This will
      allow cached objects to be counted without call to grab_super_passive() to
      update ref count on sb.  We want to avoid locking during memory
      reclamation especially when we are skipping the memory reclaim when we are
      out of cached objects.
      
      This is safe because grab_super_passive does a try-lock on the
      sb->s_umount now, and so if we are in the unmount process, it won't ever
      block.  That means what used to be a deadlock and races we were avoiding
      by using grab_super_passive() is now:
      
              shrinker                        umount
      
              down_read(shrinker_rwsem)
                                              down_write(sb->s_umount)
                                              shrinker_unregister
                                                down_write(shrinker_rwsem)
                                                  <blocks>
              grab_super_passive(sb)
                down_read_trylock(sb->s_umount)
                  <fails>
              <shrinker aborts>
              ....
              <shrinkers finish running>
              up_read(shrinker_rwsem)
                                                <unblocks>
                                                <removes shrinker>
                                                up_write(shrinker_rwsem)
                                              ->kill_sb()
                                              ....
      
      So it is safe to deregister the shrinker before ->kill_sb().
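
      The key property in the diagram above is that the shrinker's read
      acquire of sb->s_umount is a trylock, so it fails instead of sleeping
      while umount holds the lock for write.  A minimal user-space sketch of
      that behaviour follows; every toy_* name is hypothetical and this is
      not the kernel's rwsem or shrinker code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a reader/writer semaphore; not the kernel's rwsem. */
struct toy_rwsem {
	int readers;
	bool writer;
};

/* Non-blocking read acquire: fails instead of sleeping if a writer holds it. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* Hypothetical superblock: only the fields this sketch needs. */
struct toy_super_block {
	struct toy_rwsem s_umount;
	long cached_objects;
};

/*
 * Hypothetical shrinker scan.  Returns the number of objects freed, or 0
 * when the trylock fails because an unmount is in progress.  It never blocks.
 */
static long toy_shrink_sb(struct toy_super_block *sb, long nr_to_scan)
{
	long freed;

	if (!toy_down_read_trylock(&sb->s_umount))
		return 0;	/* umount holds s_umount for write: abort */

	freed = nr_to_scan < sb->cached_objects ? nr_to_scan : sb->cached_objects;
	sb->cached_objects -= freed;
	toy_up_read(&sb->s_umount);
	return freed;
}
```

      With s_umount.writer set (umount in progress), toy_shrink_sb() returns 0
      and leaves cached_objects untouched, which corresponds to the
      "<shrinker aborts>" step.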
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: fix direct reclaim writeback regression · 94109cd2
      Hugh Dickins authored
      commit 8bdd6380 upstream.
      
      Shortly before 3.16-rc1, Dave Jones reported:
      
        WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
                 xfs_vm_writepage+0x5ce/0x630 [xfs]()
        CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
        Call Trace:
          xfs_vm_writepage+0x5ce/0x630 [xfs]
          shrink_page_list+0x8f9/0xb90
          shrink_inactive_list+0x253/0x510
          shrink_lruvec+0x563/0x6c0
          shrink_zone+0x3b/0x100
          shrink_zones+0x1f1/0x3c0
          try_to_free_pages+0x164/0x380
          __alloc_pages_nodemask+0x822/0xc90
          alloc_pages_vma+0xaf/0x1c0
          handle_mm_fault+0xa31/0xc50
        etc.
      
       970   if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
       971                   PF_MEMALLOC))
      
      I did not respond at the time, because a glance at the PageDirty block
      in shrink_page_list() quickly shows that this is impossible: we don't do
      writeback on file pages (other than tmpfs) from direct reclaim nowadays.
      Dave was hallucinating, but it would have been disrespectful to say so.
      
      However, my own /var/log/messages now shows similar complaints
      
        WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
        WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
      
      from stressing some mmotm trees during July.
      
      Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
      so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
      to mapping->a_ops->writepage()?
      
      Yes, 3.16-rc1's commit 68711a74 ("mm, migration: add destination
      page freeing callback") has provided such a way to compaction: if
      migrating a SwapBacked page fails, its newpage may be put back on the
      list for later use with PageSwapBacked still set, and nothing will clear
      it.
      
      Whether that can do anything worse than issue WARN_ON_ONCEs, and get
      some statistics wrong, is unclear: easier to fix than to think through
      the consequences.
      
      Fixing it here, before the put_new_page(), addresses the bug directly,
      but is probably the worst place to fix it.  Page migration is doing too
      many parts of the job on too many levels: fixing it in
      move_to_new_page() to complement its SetPageSwapBacked would be
      preferable, except why is it (and newpage->mapping and newpage->index)
      done there, rather than down in migrate_page_move_mapping(), once we are
      sure of success? Not a cleanup to get into right now, especially not
      with memcg cleanups coming in 3.17.
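
      The shape of the fix described above (clear the stale SwapBacked state
      before handing the unused newpage back) can be sketched in user space
      as follows.  The flag bit, the struct layouts and all toy_* names are
      assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PG_SWAPBACKED (1u << 0)	/* hypothetical flag bit */

struct toy_page {
	unsigned int flags;
	struct toy_page *next;
};

/* Hypothetical caller-owned freelist of migration targets. */
struct toy_freelist {
	struct toy_page *head;
	size_t nr;
};

/* Stand-in for the caller's put_new_page() callback. */
static void toy_put_new_page(struct toy_freelist *fl, struct toy_page *page)
{
	page->next = fl->head;
	fl->head = page;
	fl->nr++;
}

/* Called when migrating into newpage failed and newpage goes back unused. */
static void toy_migration_failed(struct toy_freelist *fl, struct toy_page *newpage)
{
	assert(newpage != NULL);
	/* Clear the stale flag first, so a later reuse does not inherit it. */
	newpage->flags &= ~TOY_PG_SWAPBACKED;
	toy_put_new_page(fl, newpage);
}
```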
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB · c3206478
      Shaohua Li authored
      commit b13b1d2d upstream.
      
      We use the accessed bit to age a page at page reclaim time,
      and currently we also flush the TLB when doing so.
      
      But in some workloads TLB flush overhead is very heavy. In my
      simple multithreaded app with a lot of swap to several pcie
      SSDs, removing the tlb flush gives about 20% ~ 30% swapout
      speedup.
      
      Fortunately just removing the TLB flush is a valid optimization:
      on x86 CPUs, clearing the accessed bit without a TLB flush
      doesn't cause data corruption.
      
      It could cause incorrect page aging and the (mistaken) reclaim of
      hot pages, but the chance of that should be relatively low.
      
      So, as a performance optimization, don't flush the TLB when
      clearing the accessed bit; it will eventually be flushed by
      a context switch or a VM operation anyway. [ In the rare
      event of it not getting flushed for a long time the delay
      shouldn't really matter because there's no real memory
      pressure for swapout to react to. ]
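
      A toy user-space model of the change described above: the accessed bit
      is tested and cleared in both variants, and only the old variant
      follows up with a TLB flush.  Bit 5 is the x86 PTE accessed bit; the
      toy_* helpers and the flush counter are hypothetical stand-ins for the
      real page-table and TLB primitives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_ACCESSED (UINT64_C(1) << 5)	/* x86 PTE accessed bit */

static unsigned long toy_tlb_flushes;	/* counts flushes in this model */

static void toy_flush_tlb_page(void)
{
	toy_tlb_flushes++;
}

/* Test and clear the accessed bit; returns true if the page was referenced. */
static bool toy_test_and_clear_young(uint64_t *pte)
{
	bool young;

	assert(pte != NULL);
	young = (*pte & TOY_PAGE_ACCESSED) != 0;
	*pte &= ~TOY_PAGE_ACCESSED;
	return young;
}

/* Old behaviour in this model: clear the bit, then flush if it was set. */
static bool toy_clear_flush_young_old(uint64_t *pte)
{
	bool young = toy_test_and_clear_young(pte);

	if (young)
		toy_flush_tlb_page();
	return young;
}

/* New behaviour in this model: clear the bit only, no flush. */
static bool toy_clear_flush_young_new(uint64_t *pte)
{
	return toy_test_and_clear_young(pte);
}
```

      Both variants report the same "young" result to reclaim; the new one
      simply never increments toy_tlb_flushes.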
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: linux-mm@kvack.org
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
      [ Rewrote the changelog and the code comments. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm, compaction: properly signal and act upon lock and need_sched() contention · b25fd5de
      Vlastimil Babka authored
      commit be976572 upstream.
      
      Compaction uses compact_checklock_irqsave() function to periodically check
      for lock contention and need_resched() to either abort async compaction,
      or to free the lock, schedule and retake the lock.  When aborting,
      cc->contended is set to signal the contended state to the caller.  Two
      problems have been identified in this mechanism.
      
      First, compaction also calls cond_resched() directly in both scanners when
      no lock is yet taken.  This call neither aborts async compaction nor sets
      cc->contended appropriately.  This patch introduces a new
      compact_should_abort() function to achieve both.  In isolate_freepages(),
      the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks to
      match what the migration scanner does in the preliminary page checks.  In
      case a pageblock is found suitable for calling isolate_freepages_block(),
      the checks within there are done at a higher frequency.
      
      Second, isolate_freepages() does not check if isolate_freepages_block()
      aborted due to contention, and advances to the next pageblock.  This
      violates the principle of aborting on contention, and might result in
      pageblocks not being scanned completely, since the scanning cursor is
      advanced.  This problem has been noticed in the code by Joonsoo Kim when
      reviewing related patches.  This patch makes isolate_freepages_block()
      check the cc->contended flag and abort.
      
      In case isolate_freepages() has already isolated some pages before
      aborting due to contention, page migration will proceed, which is OK since
      we do not want to waste the work that has been done, and page migration
      has its own checks for contention.  However, we do not want another isolation
      attempt by either of the scanners, so cc->contended flag check is added
      also to compaction_alloc() and compact_finished() to make sure compaction
      is aborted right after the migration.
      
      The outcome of the patch should be reduced lock contention by async
      compaction and lower latencies for higher-order allocations where direct
      compaction is involved.
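
      The behaviour described for compact_should_abort() (abort and flag
      contention for async compaction, otherwise just reschedule) can be
      modelled in user space roughly as below.  The "async" field, the
      pending-resched flag and every toy_* name are assumptions of this
      sketch, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of compact_control. */
struct toy_compact_control {
	bool async;	/* true for async compaction */
	bool contended;	/* set when compaction aborts due to contention */
};

static bool toy_resched_pending;	/* models need_resched() state */
static unsigned int toy_reschedules;	/* counts modelled reschedules */

static bool toy_need_resched(void)
{
	return toy_resched_pending;
}

static void toy_cond_resched(void)
{
	if (toy_resched_pending) {
		toy_resched_pending = false;
		toy_reschedules++;
	}
}

/*
 * Returns true if the caller should abort.  Async compaction aborts and
 * records the contention; sync compaction reschedules and carries on.
 */
static bool toy_compact_should_abort(struct toy_compact_control *cc)
{
	assert(cc != NULL);
	if (toy_need_resched()) {
		if (cc->async) {
			cc->contended = true;
			return true;
		}
		toy_cond_resched();
	}
	return false;
}
```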
      
      [akpm@linux-foundation.org: fix typo in comment]
      Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Tested-by: Shawn Guo <shawn.guo@linaro.org>
      Tested-by: Kevin Hilman <khilman@linaro.org>
      Tested-by: Stephen Warren <swarren@nvidia.com>
      Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm/compaction: avoid rescanning pageblocks in isolate_freepages · 71a5b801
      Vlastimil Babka authored
      commit e9ade569 upstream.
      
      The compaction free scanner in isolate_freepages() currently remembers PFN
      of the highest pageblock where it successfully isolates, to be used as the
      starting pageblock for the next invocation.  The rationale behind this is
      that page migration might return free pages to the allocator when
      migration fails and we don't want to skip them if the compaction
      continues.
      
      Since migration now returns free pages back to compaction code where they
      can be reused, this is no longer a concern.  This patch changes
      isolate_freepages() so that the PFN for restarting is updated with each
      pageblock where isolation is attempted.  Using stress-highalloc from
      mmtests, this resulted in 10% reduction of the pages scanned by the free
      scanner.
      
      Note that the somewhat similar functionality that records highest
      successful pageblock in zone->compact_cached_free_pfn, remains unchanged.
      This cache is used when the whole compaction is restarted, not for
      multiple invocations of the free scanner during single compaction.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm/compaction: do not count migratepages when unnecessary · 5e4084a6
      Vlastimil Babka authored
      commit f8c9301f upstream.
      
      During compaction, update_nr_listpages() has been used to count remaining
      non-migrated and free pages after a call to migrate_pages().  The
      freepages counting has become unnecessary, and it turns out that
      migratepages counting is also unnecessary in most cases.
      
      The only situation when it's needed to count cc->migratepages is when
      migrate_pages() returns with a negative error code.  Otherwise, the
      non-negative return value is the number of pages that were not migrated,
      which is exactly the count of remaining pages in the cc->migratepages
      list.
      
      Furthermore, any non-zero count is only interesting for the tracepoint of
      mm_compaction_migratepages events, because after that all remaining
      unmigrated pages are put back and their count is set to 0.
      
      This patch therefore removes update_nr_listpages() completely, and changes
      the tracepoint definition so that the manual counting is done only when
      the tracepoint is enabled, and only when migrate_pages() returns a
      negative error code.
      
      Furthermore, migrate_pages() and the tracepoints won't be called when
      there's nothing to migrate.  This potentially avoids some wasted cycles
      and reduces the volume of uninteresting mm_compaction_migratepages events
      where "nr_migrated=0 nr_failed=0".  In the stress-highalloc mmtest, this
      was about 75% of the events.  The mm_compaction_isolate_migratepages event
      is better for determining that nothing was isolated for migration, and
      this one was just duplicating the info.
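
      The counting rule described above (trust a non-negative return value,
      walk the list only on a negative error code) can be sketched as
      follows.  The list type and the toy_* helpers are hypothetical; the
      assert is a debug-only cross-check of the stated invariant and is
      compiled out with NDEBUG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked list node standing in for a page on a list. */
struct toy_page {
	struct toy_page *next;
};

static int toy_count_list(const struct toy_page *head)
{
	int n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return n;
}

/*
 * err:       return value of a migrate_pages()-like call
 * remaining: pages still on the migratepages-like list afterwards
 *
 * A non-negative err already is the number of pages not migrated, so the
 * list is walked only when err is a negative error code.
 */
static int toy_nr_failed(int err, const struct toy_page *remaining)
{
	assert(err < 0 || err == toy_count_list(remaining));
	return err >= 0 ? err : toy_count_list(remaining);
}
```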
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm, compaction: terminate async compaction when rescheduling · 0d9b7924
      David Rientjes authored
      commit aeef4b83 upstream.
      
      Async compaction terminates prematurely when need_resched(), see
      compact_checklock_irqsave().  This can never trigger, however, if the
      cond_resched() in isolate_migratepages_range() always takes care of the
      scheduling.
      
      If the cond_resched() actually triggers, then terminate this pageblock
      scan for async compaction as well.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm, compaction: embed migration mode in compact_control · 812fcdf3
      David Rientjes authored
      commit e0b9daeb upstream.
      
      We're going to want to manipulate the migration mode for compaction in the
      page allocator, and currently compact_control's sync field is only a bool.
      
      Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
      depending on the value of this bool.  Convert the bool to enum
      migrate_mode and pass the migration mode in directly.  Later, we'll want
      to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault path to
      avoid unnecessary latency.
      
      This also alters compaction triggered from sysfs, either for the entire
      system or for a node, to force MIGRATE_SYNC.
      
      [akpm@linux-foundation.org: fix build]
      [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm, compaction: add per-zone migration pfn cache for async compaction · 7e95430e
      David Rientjes authored
      commit 35979ef3 upstream.
      
      Each zone has a cached migration scanner pfn for memory compaction so that
      subsequent calls to memory compaction can start where the previous call
      left off.
      
      Currently, the compaction migration scanner only updates the per-zone
      cached pfn when pageblocks were not skipped for async compaction.  This
      creates a dependency on calling sync compaction to keep subsequent calls
      to async compaction from scanning an enormous number of non-MOVABLE
      pageblocks each time it is called.  On large machines, this could be
      very expensive.
      
      This patch adds a per-zone cached migration scanner pfn only for async
      compaction.  It is updated every time a pageblock has been scanned in its
      entirety and when no pages from it were successfully isolated.  The cached
      migration scanner pfn for sync compaction is updated only when called for
      sync compaction.
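
      One possible reading of the update rule above, as a user-space sketch:
      the async cached pfn advances whenever a fully scanned pageblock
      yielded no isolated pages, while the sync cached pfn advances only
      when the scan was a sync one.  The two-slot array layout and all
      toy_* names are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of struct zone. */
struct toy_zone {
	/* [0] is read by async compaction, [1] by sync compaction (assumed) */
	unsigned long cached_migrate_pfn[2];
};

/*
 * Record that the pageblock ending at pfn was scanned in its entirety.
 * Nothing is cached if any page from it was isolated.
 */
static void toy_update_cached_pfn(struct toy_zone *zone, unsigned long pfn,
				  bool sync, unsigned long nr_isolated)
{
	assert(zone != NULL);
	if (nr_isolated != 0)
		return;
	if (pfn > zone->cached_migrate_pfn[0])
		zone->cached_migrate_pfn[0] = pfn;
	if (sync && pfn > zone->cached_migrate_pfn[1])
		zone->cached_migrate_pfn[1] = pfn;
}

/* Where the migration scanner would restart for the given mode. */
static unsigned long toy_start_pfn(const struct toy_zone *zone, bool sync)
{
	return zone->cached_migrate_pfn[sync ? 1 : 0];
}
```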
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm, compaction: return failed migration target pages back to freelist · b264e9ab
      David Rientjes authored
      commit d53aea3d upstream.
      
      Greg reported that he found isolated free pages were returned back to the
      VM rather than the compaction freelist.  This will cause holes behind the
      free scanner and cause it to reallocate additional memory if necessary
      later.
      
      He detected the problem at runtime seeing that ext4 metadata pages (esp
      the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
      constantly visited by compaction calls of migrate_pages().  These pages
      had a non-zero b_count which caused fallback_migrate_page() ->
      try_to_release_page() -> try_to_free_buffers() to fail.
      
      Memory compaction works by having a "freeing scanner" scan from one end of
      a zone which isolates pages as migration targets while another "migrating
      scanner" scans from the other end of the same zone which isolates pages
      for migration.
      
      When page migration fails for an isolated page, the target page is
      returned to the system rather than the freelist built by the freeing
      scanner.  This may require the freeing scanner to continue scanning memory
      after suitable migration targets have already been returned to the system
      needlessly.
      
      This patch returns destination pages to the freeing scanner freelist when
      page migration fails.  This prevents unnecessary work done by the freeing
      scanner but also encourages memory to be as compacted as possible at the
      end of the zone.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm, migration: add destination page freeing callback · ee92d4d6
      David Rientjes authored
      commit 68711a74 upstream.
      
      Memory migration uses a callback defined by the caller to determine how to
      allocate destination pages.  When migration fails for a source page,
      however, it frees the destination page back to the system.
      
      This patch adds a memory migration callback defined by the caller to
      determine how to free destination pages.  If a caller, such as memory
      compaction, builds its own freelist for migration targets, this can reuse
      already freed memory instead of scanning additional memory.
      
      If the caller provides a function to handle freeing of destination pages,
      it is called when page migration fails.  If the caller passes NULL then
      freeing back to the system will be handled as usual.  This patch
      introduces no functional change.
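
      The callback pair described above can be modelled in user space as
      below: the caller supplies an allocation callback and, optionally, a
      freeing callback, and a NULL freeing callback falls back to freeing to
      the system.  The signatures, the "fail" parameter and every toy_* name
      are simplifications assumed for this sketch, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
	struct toy_page *next;
};

/* Caller-defined callbacks for obtaining and releasing destination pages. */
typedef struct toy_page *(*toy_new_page_t)(void *private_data);
typedef void (*toy_free_page_t)(struct toy_page *page, void *private_data);

static unsigned long toy_freed_to_system;	/* counts default frees */

static void toy_free_to_system(struct toy_page *page)
{
	(void)page;
	toy_freed_to_system++;
}

/* Migrate one page; "fail" forces the failure path.  Returns 0 or -1. */
static int toy_migrate_one(toy_new_page_t get_new_page,
			   toy_free_page_t put_new_page,
			   void *private_data, int fail)
{
	struct toy_page *newpage;

	assert(get_new_page != NULL);
	newpage = get_new_page(private_data);
	if (newpage == NULL)
		return -1;
	if (!fail)
		return 0;
	if (put_new_page != NULL)
		put_new_page(newpage, private_data);	/* back to the caller */
	else
		toy_free_to_system(newpage);		/* behaviour as before */
	return -1;
}

/* A compaction-like caller that keeps its own freelist of targets. */
struct toy_freelist {
	struct toy_page *head;
	unsigned long nr;
};

static struct toy_page *toy_compaction_alloc(void *private_data)
{
	struct toy_freelist *fl = private_data;
	struct toy_page *page = fl->head;

	if (page != NULL) {
		fl->head = page->next;
		fl->nr--;
	}
	return page;
}

static void toy_compaction_free(struct toy_page *page, void *private_data)
{
	struct toy_freelist *fl = private_data;

	page->next = fl->head;
	fl->head = page;
	fl->nr++;
}
```

      Passing toy_compaction_free keeps a failed target on the caller's
      freelist; passing NULL sends it to toy_free_to_system() instead.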
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm/compaction: cleanup isolate_freepages() · e644c10b
      Vlastimil Babka authored
      commit c96b9e50 upstream.
      
      isolate_freepages() is currently somewhat hard to follow.  The 'high_pfn'
      variable looks like it is related to the 'low_pfn' variable, but in fact
      it is not.
      
      This patch renames the 'high_pfn' variable to a hopefully less confusing name,
      and slightly changes its handling without a functional change. A comment made
      obsolete by recent changes is also updated.
      
      [akpm@linux-foundation.org: comment fixes, per Minchan]
      [iamjoonsoo.kim@lge.com: cleanups]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm/compaction: clean up unused code lines · 87db4a8a
      Heesub Shin authored
      commit 13fb44e4 upstream.
      
      Remove code lines currently not in use or never called.
      Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm/readahead.c: inline ra_submit · 32e8fcae
      Fabian Frederick authored
      commit 29f175d1 upstream.
      
      Commit f9acc8c7 ("readahead: sanify file_ra_state names") left
      ra_submit with a single function call.
      
      Move ra_submit to internal.h and inline it to save some stack.  Thanks
      to Andrew Morton for commenting on different versions.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • callers of iov_copy_from_user_atomic() don't need pagefault_disable() · 72ef5b50
      Al Viro authored
      commit 9e8c2af9 upstream.
      
      ... it does that itself (via kmap_atomic())
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: remove read_cache_page_async() · 4fb08e5a
      Sasha Levin authored
      commit 67f9fd91 upstream.
      
      This patch removes read_cache_page_async() which wasn't really needed
      anywhere and simplifies the code around it a bit.
      
      read_cache_page_async() is useful when we want to read a page into the
      cache without waiting for it to complete.  This happens when the
      appropriate callback 'filler' doesn't complete its read operation and
      releases the page lock immediately, and instead queues a different
      completion routine to do that.  This never actually happened anywhere in
      the code.
      
      read_cache_page_async() had 3 different callers:
      
      - read_cache_page() which is the sync version, it would just wait for
        the requested read to complete using wait_on_page_read().
      
      - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
        function it supplied doesn't do any async reads, and would complete
        before the filler function returns - making it actually a sync read.
      
      - CRAMFS would call it using the read_mapping_page_async() wrapper, with
        a similar story to JFFS2 - the filler function doesn't do anything that
        resembles async reads and would always complete before the filler
        function returns.
      
      To sum it up, the code in mm/filemap.c never took advantage of having
      read_cache_page_async().  While there are filler callbacks that do async
      reads (such as the block one), they were always called through
      read_cache_page().
      
      This patch adds a mandatory wait for read to complete when adding a new
      page to the cache, and removes read_cache_page_async() and its wrappers.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: madvise: fix MADV_WILLNEED on shmem swapouts · 5c3ce5b2
      Johannes Weiner authored
      commit 55231e5c upstream.
      
      MADV_WILLNEED currently does not read swapped out shmem pages back in.
      
      Commit 0cd6144a ("mm + fs: prepare for non-page entries in page
      cache radix trees") made find_get_page() filter exceptional radix tree
      entries but failed to convert all find_get_page() callers that WANT
      exceptional entries over to find_get_entry().  One of them is shmem swap
      readahead in madvise, which now skips over any swap-out records.
      
      Convert it to find_get_entry().
      
      Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm + fs: prepare for non-page entries in page cache radix trees · e714f0cf
      Johannes Weiner authored
      commit 0cd6144a upstream.
      
      shmem mappings already contain exceptional entries where swap slot
      information is remembered.
      
      To be able to store eviction information for regular page cache, prepare
      every site dealing with the radix trees directly to handle entries other
      than pages.
      
      The common lookup functions will filter out non-page entries and return
      NULL for page cache holes, just as before.  But provide a raw version of
      the API which returns non-page entries as well, and switch shmem over to
      use it.
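
      The split between a filtering lookup and a raw lookup can be
      illustrated with tagged pointers in user space.  In the sketch below a
      flat slot array stands in for the radix tree, and the tag bit, the
      shift and all toy_* names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_EXCEPTIONAL_BIT ((uintptr_t)2)	/* assumed tag for non-page entries */
#define TOY_NR_SLOTS 8

struct toy_page {
	int id;		/* int alignment keeps the tag bit clear in real pointers */
};

/* A flat slot array stands in for the page cache radix tree. */
struct toy_mapping {
	void *slots[TOY_NR_SLOTS];
};

static bool toy_is_exceptional(const void *entry)
{
	return ((uintptr_t)entry & TOY_EXCEPTIONAL_BIT) != 0;
}

/* Encode a swap slot number as a tagged, non-pointer entry. */
static void *toy_make_swap_entry(uintptr_t swap_slot)
{
	return (void *)((swap_slot << 2) | TOY_EXCEPTIONAL_BIT);
}

/* Raw lookup: returns whatever is stored, page or exceptional entry. */
static void *toy_find_get_entry(const struct toy_mapping *m, size_t index)
{
	assert(index < TOY_NR_SLOTS);
	return m->slots[index];
}

/* Filtering lookup: exceptional entries look like holes (NULL). */
static struct toy_page *toy_find_get_page(const struct toy_mapping *m, size_t index)
{
	void *entry = toy_find_get_entry(m, index);

	return toy_is_exceptional(entry) ? NULL : entry;
}
```

      A caller that wants swap-out records, such as the shmem case described
      above, would use the raw lookup; everyone else keeps seeing NULL for
      those slots.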
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: filemap: move radix tree hole searching here · 3721b421
      Johannes Weiner authored
      commit e7b563bb upstream.
      
      The radix tree hole searching code is only used for page cache, for
      example the readahead code trying to get a picture of the area
      surrounding a fault.
      
      So far it sufficed to rely on the radix tree definition of holes, which
      is "empty tree slot".  This is about to change, though, as shadow page
      descriptors will be stored in the page cache after the actual pages get
      evicted from memory.
      
      Move the functions over to mm/filemap.c and make them native page cache
      operations, where they can later be adapted to handle the new definition
      of "page cache hole".
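
A hole search over the page cache can be sketched as below. This is a minimal userspace illustration, assuming a flat slot array in place of the radix tree; `next_hole()` and `struct mapping` are hypothetical stand-ins, and the "hole" test is the current one (empty slot).

```c
#include <stddef.h>

#define SLOTS 16

/* Toy stand-in for a mapping's page cache: one slot per page index. */
struct mapping { void *slot[SLOTS]; };

/*
 * Scan at most max_scan slots upward from index and return the first
 * index that holds no entry. If every scanned slot is populated, the
 * returned index lies just past the scanned range.
 */
static unsigned long next_hole(const struct mapping *m,
                               unsigned long index, unsigned long max_scan)
{
    while (max_scan-- > 0 && index < SLOTS && m->slot[index] != NULL)
        index++;
    return index;
}
```

Keeping the test for "is this slot a hole" in one page cache helper is what would let a later change treat a non-page entry as a hole as well, without touching callers such as readahead.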
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: shmem: save one radix tree lookup when truncating swapped pages · a3d18e49
      Johannes Weiner authored
      commit 6dbaf22c upstream.
      
      Page cache radix tree slots are usually stabilized by the page lock, but
      shmem's swap cookies have no such thing.  Because the overall truncation
      loop is lockless, the swap entry is currently confirmed by a tree lookup
      and then deleted by another tree lookup under the same tree lock region.
      
      Use radix_tree_delete_item() instead, which does the verification and
      deletion with only one lookup.  This also allows removing the
      delete-only special case from shmem_radix_tree_replace().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • lib: radix-tree: add radix_tree_delete_item() · 50c4613d
      Johannes Weiner authored
      commit 53c59f26 upstream.
      
      Provide a function that does not just delete an entry at a given index,
      but also allows passing in an expected item.  Delete only if that item
      is still located at the specified index.
      
      This is handy when lockless tree traversals want to delete entries as
      well because they don't have to do a second, locked lookup to verify
      the slot has not changed under them before deleting the entry.
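
The "delete only if the expected item is still there" semantics can be sketched as follows. This is a userspace illustration over a flat array, not the radix tree code itself; `struct tree` and `tree_delete_item()` are hypothetical names, and treating a NULL `expected` as "delete unconditionally" is an assumption of this sketch.

```c
#include <stddef.h>

#define SLOTS 8

/* Toy stand-in for the tree: one slot per index. */
struct tree { void *slot[SLOTS]; };

/*
 * Delete the entry at index, but only if it is still `expected`.
 * A NULL `expected` deletes whatever is stored. Returns the deleted
 * entry, or NULL if nothing was deleted.
 */
static void *tree_delete_item(struct tree *t, unsigned long index,
                              void *expected)
{
    void *entry;

    if (index >= SLOTS)
        return NULL;
    entry = t->slot[index];
    if (entry == NULL || (expected != NULL && entry != expected))
        return NULL;
    t->slot[index] = NULL;
    return entry;
}
```

A lockless walker would remember the entry it saw, take the lock, and call the function with that entry; a NULL return tells it the slot changed in the meantime, so no separate verification lookup is needed.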
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: don't pointlessly use BUG_ON() for sanity check · c05ac84a
      Linus Torvalds authored
      commit 50f5aa8a upstream.
      
      BUG_ON() is a big hammer, and should be used _only_ if there is some
      major corruption that you cannot possibly recover from, making it
      imperative that the current process (and possibly the whole machine) be
      terminated with extreme prejudice.
      
      The trivial sanity check in the vmacache code is *not* such a fatal
      error.  Recovering from it is absolutely trivial, and using BUG_ON()
      just makes it harder to debug for no actual advantage.
      
      To make matters worse, the placement of the BUG_ON() (only if the range
      check matched) actually makes it harder to hit the sanity check to begin
      with, so _if_ there is a bug (and we just got a report from Srivatsa
      Bhat that this can indeed trigger), it is harder to debug not just
      because the machine is possibly dead, but because we don't have better
      coverage.
      
      BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
      because it is simply just about the worst thing you can ever do if you
      hit some "this cannot happen" situation.
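
The difference between a fatal check and a recoverable one can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel code: `warn_on()` is a hypothetical stand-in for a warn-and-continue macro, and `cache_find()`, `struct vma` and `struct mm` are simplified toy types.

```c
#include <stdio.h>
#include <stddef.h>

struct mm { int id; };
struct vma { struct mm *owner; unsigned long start, end; };

/*
 * Hypothetical warn-and-continue helper: logs when cond is true and
 * returns cond so the caller can branch on it and recover.
 */
static int warn_on(int cond, const char *what)
{
    if (cond)
        fprintf(stderr, "warning: %s\n", what);
    return cond;
}

#define CACHE_SIZE 4

static struct vma *cache_find(struct vma *cache[CACHE_SIZE],
                              struct mm *mm, unsigned long addr)
{
    for (int i = 0; i < CACHE_SIZE; i++) {
        struct vma *v = cache[i];

        if (v == NULL)
            continue;
        /*
         * Check every cached entry, not only those whose range
         * matches, and recover by treating the lookup as a miss.
         */
        if (warn_on(v->owner != mm, "cached vma belongs to another mm"))
            break;
        if (v->start <= addr && addr < v->end)
            return v;
    }
    return NULL;
}
```

Here a bad entry costs one cache miss and a log line, and because the check runs before the range comparison it fires on any lookup that walks past the bad slot, which is the better coverage the message argues for.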
      Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: per-thread vma caching · 9c007307
      Davidlohr Bueso authored
      commit 615d6e87 upstream.
      
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches was needed.  There are
      two things to consider when dealing with this: the cache hit rate and
      the latency of find_vma().  Improving the hit rate does not necessarily
      translate into finding the vma any faster, as the overhead of any fancy
      caching scheme can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme improves on the ~50% baseline hit rate just by adding a few
         more slots to the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while the baseline
         hit rate is just about non-existent.  The number of cycles can
         fluctuate anywhere from ~60 to ~116 for the baseline scheme, but this
         approach reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
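
The scheme described above (a small per-thread cache, invalidation by sequence number, and a replacement slot picked from the page number of the address) can be sketched in userspace C. This is a simplified illustration, not the patch itself: the slot count of 4, the 4 KiB page size, and all type and function names here are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* assumed 4 KiB pages */
#define CACHE_SIZE 4                       /* assumed slot count */
#define CACHE_HASH(addr) (((addr) >> PAGE_SHIFT) & (CACHE_SIZE - 1))

struct vma { unsigned long start, end; };

struct mm { uint32_t seqnum; };            /* bumped on every invalidation */

struct thread_cache {
    uint32_t seqnum;                       /* mm seqnum the slots belong to */
    struct vma *slot[CACHE_SIZE];
};

/* Invalidate every thread's cache by making its recorded seqnum stale. */
static void cache_invalidate(struct mm *mm)
{
    mm->seqnum++;  /* on wrap to 0, a real implementation must flush all caches */
}

/* Returns 1 if the cache is current; otherwise resets it and returns 0. */
static int cache_valid(struct thread_cache *tc, const struct mm *mm)
{
    if (tc->seqnum == mm->seqnum)
        return 1;
    tc->seqnum = mm->seqnum;
    for (int i = 0; i < CACHE_SIZE; i++)
        tc->slot[i] = NULL;
    return 0;
}

/* Look for a cached vma covering addr; NULL means a miss. */
static struct vma *cache_find(struct thread_cache *tc, const struct mm *mm,
                              unsigned long addr)
{
    if (!cache_valid(tc, mm))
        return NULL;
    for (int i = 0; i < CACHE_SIZE; i++) {
        struct vma *v = tc->slot[i];

        if (v && v->start <= addr && addr < v->end)
            return v;
    }
    return NULL;
}

/*
 * After a miss and a full lookup, store the result in the slot
 * chosen by the page number of the address that was looked up.
 */
static void cache_update(struct thread_cache *tc, const struct mm *mm,
                         unsigned long addr, struct vma *v)
{
    cache_valid(tc, mm);
    tc->slot[CACHE_HASH(addr)] = v;
}
```

Invalidation touches only the shared counter, so its cost does not depend on the number of threads; each thread discovers the stale state lazily on its next lookup.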
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state() · edce92fc
      Christoph Lameter authored
      commit 83da7510 upstream.
      
      reclaim_clean_pages_from_list() seems to be called with preemption
      enabled.  Therefore it must use mod_zone_page_state() instead of
      __mod_zone_page_state().
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>