1. 22 Sep, 2009 40 commits
    • Hugh Dickins's avatar
      mm: FOLL flags for GUP flags · 58fa879e
      Hugh Dickins authored
      __get_user_pages() has been taking its own GUP flags, then processing
      them into FOLL flags for follow_page().  Though oddly named, the FOLL
      flags are more widely used, so pass them to __get_user_pages() now.
      Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.
      
      (The patch to __get_user_pages() looks peculiar, with both gup_flags
      and foll_flags: the gup_flags remain constant; but as before there's
      an exceptional case, out of scope of the patch, in which foll_flags
      per page have FOLL_WRITE masked off.)
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58fa879e
    • Hugh Dickins's avatar
      mm: reinstate ZERO_PAGE · a13ea5b7
      Hugh Dickins authored
      KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
      advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
      using in 2.6.24.  And there were a couple of regression reports on LKML.
      
      Following suggestions from Linus, reinstate do_anonymous_page() use of
      the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
      with (map)count updates - let vm_normal_page() regard it as abnormal.
      
      Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
      most powerpc): that's not essential, but minimizes additional branches
      (keeping them in the unlikely pte_special case); and incidentally
      excludes mips (some models of which needed eight colours of ZERO_PAGE
      to avoid costly exceptions).
      
      Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
      callers won't want to make exceptions for it, so increment its count
      there.  Changes to mlock and migration? happily seems not needed.
      
      In most places it's quicker to check pfn than struct page address:
      prepare a __read_mostly zero_pfn for that.  Does get_dump_page()
      still need its ZERO_PAGE check? probably not, but keep it anyway.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a13ea5b7
    • Hugh Dickins's avatar
      mm: fix anonymous dirtying · 1ac0cb5d
      Hugh Dickins authored
      do_anonymous_page() has been wrong to dirty the pte regardless.
      If it's not going to mark the pte writable, then it won't help
      to mark it dirty here, and clogs up memory with pages which will
      need swap instead of being thrown away.  Especially wrong if no
      overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
      we could exceed the limit and OOM despite no overcommit.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: <stable@kernel.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ac0cb5d
    • Hugh Dickins's avatar
      mm: follow_hugetlb_page flags · 2a15efc9
      Hugh Dickins authored
      follow_hugetlb_page() shouldn't be guessing about the coredump case
      either: pass the foll_flags down to it, instead of just the write bit.
      
      Remove that obscure huge_zeropage_ok() test.  The decision is easy,
      though unlike the non-huge case - here vm_ops->fault is always set.
      But we know that a fault would serve up zeroes, unless there's
      already a hugetlbfs pagecache page to back the range.
      
      (Alternatively, since hugetlb pages aren't swapped out under pressure,
      you could save more dump space by arguing that a page not yet faulted
      into this process cannot be relevant to the dump; but that would be
      more surprising.)
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a15efc9
    • Hugh Dickins's avatar
      mm: FOLL_DUMP replace FOLL_ANON · 8e4b9a60
      Hugh Dickins authored
      The "FOLL_ANON optimization" and its use_zero_page() test have caused
      confusion and bugs: why does it test VM_SHARED? for the very good but
      unsatisfying reason that VMware crashed without.  As we look to maybe
      reinstating anonymous use of the ZERO_PAGE, we need to sort this out.
      
      Easily done: it's silly for __get_user_pages() and follow_page() to
      be guessing whether it's safe to assume that they're being used for
      a coredump (which can take a shortcut snapshot where other uses must
      handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.
      
      get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8e4b9a60
    • Hugh Dickins's avatar
      mm: add get_dump_page · f3e8fccd
      Hugh Dickins authored
      In preparation for the next patch, add a simple get_dump_page(addr)
      interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
      get_user_pages() directly.  They're not interested in errors: they
      just want to use holes as much as possible, to save space and make
      sure that the data is aligned where the headers said it would be.
      
      Oh, and don't use that horrid DUMP_SEEK(off) macro!
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3e8fccd
    • Hugh Dickins's avatar
      mm: remove unused GUP flags · 1c3aff1c
      Hugh Dickins authored
      GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
      flags added solely to prevent __get_user_pages() from doing some of
      what it usually does, in the munlock case: we can now remove them.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c3aff1c
    • Hugh Dickins's avatar
      mm: munlock use follow_page · 408e82b7
      Hugh Dickins authored
      Hiroaki Wakabayashi points out that when mlock() has been interrupted
      by SIGKILL, the subsequent munlock() takes unnecessarily long because
      its use of __get_user_pages() insists on faulting in all the pages
      which mlock() never reached.
      
      It's worse than slowness if mlock() is terminated by Out Of Memory kill:
      the munlock_vma_pages_all() in exit_mmap() insists on faulting in all the
      pages which mlock() could not find memory for; so innocent bystanders are
      killed too, and perhaps the system hangs.
      
      __get_user_pages() does a lot that's silly for munlock(): so remove the
      munlock option from __mlock_vma_pages_range(), and use a simple loop of
      follow_page()s in munlock_vma_pages_range() instead; ignoring absent
      pages, and not marking present pages as accessed or dirty.
      
      (Change munlock() to only go so far as mlock() reached?  That does not
      work out, given the convention that mlock() claims complete success even
      when it has to give up early - in part so that an underlying file can be
      extended later, and those pages locked which earlier would give SIGBUS.)
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: <stable@kernel.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarHiroaki Wakabayashi <primulaelatior@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      408e82b7
    • Minchan Kim's avatar
      mm: fix NUMA accounting in numastat.txt · 5d3bc270
      Minchan Kim authored
      In Documentation/numastat.txt, it confused me.  For example, there are
      nodes [0,1] in system.
      
      barrios:~$ cat /proc/zoneinfo | egrep 'numa|zone'
      Node 0, zone	DMA
      	numa_hit	33226
      	numa_miss	1739
      	numa_foreign	27978
      	..
      	..
      Node 1, zone	DMA
      	numa_hit	307
      	numa_miss	46900
      	numa_foreign	0
      
      1) In node 0,  NUMA_MISS means it wanted to allocate page
      in node 1 but ended up with page in node 0
      
      2) In node 0, NUMA_FOREIGN means it wanted to allocate page
      in node 0 but ended up with page from Node 1.
      
      But now, numastat explains it oppositely about (MISS, FOREIGN).
      Let's fix up with viewpoint of zone.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d3bc270
    • Mel Gorman's avatar
      page-allocator: maintain rolling count of pages to free from the PCP · a6f9edd6
      Mel Gorman authored
      When round-robin freeing pages from the PCP lists, empty lists may be
      encountered.  In the event one of the lists has more pages than another,
      there may be numerous checks for list_empty() which is undesirable.  This
      patch maintains a count of pages to free which is incremented when empty
      lists are encountered.  The intention is that more pages will then be
      freed from fuller lists than the empty ones reducing the number of empty
      list checks in the free path.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6f9edd6
    • Mel Gorman's avatar
      page-allocator: split per-cpu list into one-list-per-migrate-type · 5f8dcc21
      Mel Gorman authored
      The following two patches remove searching in the page allocator fast-path
      by maintaining multiple free-lists in the per-cpu structure.  At the time
      the search was introduced, increasing the per-cpu structures would waste a
      lot of memory as per-cpu structures were statically allocated at
      compile-time.  This is no longer the case.
      
      The patches are as follows. They are based on mmotm-2009-08-27.
      
      Patch 1 adds multiple lists to struct per_cpu_pages, one per
      	migratetype that can be stored on the PCP lists.
      
      Patch 2 notes that the pcpu drain path check empty lists multiple times. The
      	patch reduces the number of checks by maintaining a count of free
      	lists encountered. Lists containing pages will then free multiple
      	pages in batch
      
      The patches were tested with kernbench, netperf udp/tcp, hackbench and
      sysbench.  The netperf tests were not bound to any CPU in particular and
      were run such that the results should be 99% confidence that the reported
      results are within 1% of the estimated mean.  sysbench was run with a
      postgres background and read-only tests.  Similar to netperf, it was run
      multiple times so that it's 99% confidence results are within 1%.  The
      patches were tested on x86, x86-64 and ppc64 as
      
      x86:	Intel Pentium D 3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.34% to 2.28% gain
      	netperf-tcp	- 0.45% to 1.22% gain
      	hackbench	- Small variances, very close to noise
      	sysbench	- Very small gains
      
      x86-64:	AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 1.83% to 10.42% gains
      	netperf-tcp	- No conclusive until buffer >= PAGE_SIZE
      				4096	+15.83%
      				8192	+ 0.34% (not significant)
      				16384	+ 1%
      	hackbench	- Small gains, very close to noise
      	sysbench	- 0.79% to 1.6% gain
      
      ppc64:	PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
      	kernbench	- No significant difference, variance well within noise
      	netperf-udp	- 2-3% gain for almost all buffer sizes tested
      	netperf-tcp	- losses on small buffers, gains on larger buffers
      			  possibly indicates some bad caching effect.
      	hackbench	- No significant difference
      	sysbench	- 2-4% gain
      
      This patch:
      
      Currently the per-cpu page allocator searches the PCP list for pages of
      the correct migrate-type to reduce the possibility of pages being
      inappropriate placed from a fragmentation perspective.  This search is
      potentially expensive in a fast-path and undesirable.  Splitting the
      per-cpu list into multiple lists increases the size of a per-cpu structure
      and this was potentially a major problem at the time the search was
      introduced.  These problem has been mitigated as now only the necessary
      number of structures is allocated for the running system.
      
      This patch replaces a list search in the per-cpu allocator with one list
      per migrate type.  The potential snag with this approach is when bulk
      freeing pages.  We round-robin free pages based on migrate type which has
      little bearing on the cache hotness of the page and potentially checks
      empty lists repeatedly in the event the majority of PCP pages are of one
      type.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f8dcc21
    • KOSAKI Motohiro's avatar
      oom: fix oom_adjust_write() input sanity check · 5d863b89
      KOSAKI Motohiro authored
      Andrew Morton pointed out oom_adjust_write() has very strange EIO
      and new line handling. this patch fixes it.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d863b89
    • KOSAKI Motohiro's avatar
      oom: oom_kill doesn't kill vfork parent (or child) · 8c5cd6f3
      KOSAKI Motohiro authored
      Current oom_kill doesn't only kill the victim process, but also kill all
      thas shread the same mm.  it mean vfork parent will be killed.
      
      This is definitely incorrect.  another process have another oom_adj.  we
      shouldn't ignore their oom_adj (it might have OOM_DISABLE).
      
      following caller hit the minefield.
      
      ===============================
              switch (constraint) {
              case CONSTRAINT_MEMORY_POLICY:
                      oom_kill_process(current, gfp_mask, order, 0, NULL,
                                      "No available memory (MPOL_BIND)");
                      break;
      
      Note: force_sig(SIGKILL) send SIGKILL to all thread in the process.
      We don't need to care multi thread in here.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c5cd6f3
    • KOSAKI Motohiro's avatar
      oom: make oom_score to per-process value · 495789a5
      KOSAKI Motohiro authored
      oom-killer kills a process, not task.  Then oom_score should be calculated
      as per-process too.  it makes consistency more and makes speed up
      select_bad_process().
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      495789a5
    • KOSAKI Motohiro's avatar
      oom: move oom_adj value from task_struct to signal_struct · 28b83c51
      KOSAKI Motohiro authored
      Currently, OOM logic callflow is here.
      
          __out_of_memory()
              select_bad_process()            for each task
                  badness()                   calculate badness of one task
                      oom_kill_process()      search child
                          oom_kill_task()     kill target task and mm shared tasks with it
      
      example, process-A have two thread, thread-A and thread-B and it have very
      fat memory and each thread have following oom_adj and oom_score.
      
           thread-A: oom_adj = OOM_DISABLE, oom_score = 0
           thread-B: oom_adj = 0,           oom_score = very-high
      
      Then, select_bad_process() select thread-B, but oom_kill_task() refuse
      kill the task because thread-A have OOM_DISABLE.  Thus __out_of_memory()
      call select_bad_process() again.  but select_bad_process() select the same
      task.  It mean kernel fall in livelock.
      
      The fact is, select_bad_process() must select killable task.  otherwise
      OOM logic go into livelock.
      
      And root cause is, oom_adj shouldn't be per-thread value.  it should be
      per-process value because OOM-killer kill a process, not thread.  Thus
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct signal_struct.  it naturally prevent
      select_bad_process() choose wrong task.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28b83c51
    • Vincent Li's avatar
      mm/vmscan: remove page_queue_congested() comment · f168e1b6
      Vincent Li authored
      Commit 084f71ae(kill page_queue_congested()) removed
      page_queue_congested().  Remove the page_queue_congested() comment in
      vmscan pageout() too.
      Signed-off-by: default avatarVincent Li <macli@brc.ubc.ca>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f168e1b6
    • Wu Fengguang's avatar
      mm: do batched scans for mem_cgroup · f8629631
      Wu Fengguang authored
      For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
      which case shrink_list() _still_ calls isolate_pages() with the much
      larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list scan
      rate by up to 32 times.
      
      For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
      So when shrink_zone() expects to scan 4 pages in the active/inactive list,
      the active list will be scanned 4 pages, while the inactive list will be
      (over) scanned SWAP_CLUSTER_MAX=32 pages in effect.  And that could break
      the balance between the two lists.
      
      It can further impact the scan of anon active list, due to the anon
      active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
      
      inactive anon list over scanned => inactive_anon_is_low() == TRUE
                                      => shrink_active_list()
                                      => active anon list over scanned
      
      So the end result may be
      
      - anon inactive  => over scanned
      - anon active    => over scanned (maybe not as much)
      - file inactive  => over scanned
      - file active    => under scanned (relatively)
      
      The accesses to nr_saved_scan are not lock protected and so not 100%
      accurate, however we can tolerate small errors and the resulted small
      imbalanced scan rates between zones.
      
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f8629631
    • Alexey Dobriyan's avatar
    • Vincent Li's avatar
      mm/vmscan: rename zone_nr_pages() to zone_nr_lru_pages() · 0b217676
      Vincent Li authored
      The name `zone_nr_pages' can be mis-read as zone's (total) number pages,
      but it actually returns zone's LRU list number pages.
      Signed-off-by: default avatarVincent Li <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b217676
    • Jan Beulich's avatar
      mm: also use alloc_large_system_hash() for the PID hash table · 2c85f51d
      Jan Beulich authored
      This is being done by allowing boot time allocations to specify that they
      may want a sub-page sized amount of memory.
      
      Overall this seems more consistent with the other hash table allocations,
      and allows making two supposedly mm-only variables really mm-only
      (nr_{kernel,all}_pages).
      Signed-off-by: default avatarJan Beulich <jbeulich@novell.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c85f51d
    • Jan Beulich's avatar
      mm: don't use alloc_bootmem_low() where not strictly needed · 3c1596ef
      Jan Beulich authored
      Since alloc_bootmem() will never return inaccessible (via virtual
      addressing) memory anyway, using the ..._low() variant only makes sense
      when the physical address range of the allocated memory must fulfill
      further constraints, espacially since on 64-bits (or more generally in all
      cases where the pools the two variants allocate from are than the full
      available range.
      
      Probably the use in alloc_tce_table() could also be eliminated (based on
      code inspection of pci-calgary_64.c), but that seems too risky given I
      know nothing about that hardware and have no way to test it.
      Signed-off-by: default avatarJan Beulich <jbeulich@novell.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c1596ef
    • Jan Beulich's avatar
      mm: replace various uses of num_physpages by totalram_pages · 4481374c
      Jan Beulich authored
      Sizing of memory allocations shouldn't depend on the number of physical
      pages found in a system, as that generally includes (perhaps a huge amount
      of) non-RAM pages.  The amount of what actually is usable as storage
      should instead be used as a basis here.
      
      Some of the calculations (i.e.  those not intending to use high memory)
      should likely even use (totalram_pages - totalhigh_pages).
      Signed-off-by: default avatarJan Beulich <jbeulich@novell.com>
      Acked-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Acked-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4481374c
    • Jan Beulich's avatar
      memory hotplug: fix updating of num_physpages for hot plugged memory · 4738e1b9
      Jan Beulich authored
      Sizing of memory allocations shouldn't depend on the number of physical
      pages found in a system, as that generally includes (perhaps a huge amount
      of) non-RAM pages.  The amount of what actually is usable as storage
      should instead be used as a basis here.
      
      In line with that, the memory hotplug code should update num_physpages in
      a way that it retains its original (post-boot) meaning; in particular,
      decreasing the value should at best be done with great care - this patch
      doesn't try to ever decrease this value at all as it doesn't really seem
      meaningful to do so.
      Signed-off-by: default avatarJan Beulich <jbeulich@novell.com>
      Acked-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4738e1b9
    • Mel Gorman's avatar
      page-allocator: limit the number of MIGRATE_RESERVE pageblocks per zone · 78986a67
      Mel Gorman authored
      After anti-fragmentation was merged, a bug was reported whereby devices
      that depended on high-order atomic allocations were failing.  The solution
      was to preserve a property in the buddy allocator which tended to keep the
      minimum number of free pages in the zone at the lower physical addresses
      and contiguous.  To preserve this property, MIGRATE_RESERVE was introduced
      and a number of pageblocks at the start of a zone would be marked
      "reserve", the number of which depended on min_free_kbytes.
      
      Anti-fragmentation works by avoiding the mixing of page migratetypes
      within the same pageblock.  One way of helping this is to increase
      min_free_kbytes because it becomes less like that it will be necessary to
      place pages of of MIGRATE_RESERVE is unbounded, the free memory is kept
      there in large contiguous blocks instead of helping anti-fragmentation as
      much as it should.  With the page-allocator tracepoint patches applied, it
      was found during anti-fragmentation tests that the number of
      fragmentation-related events were far higher than expected even with
      min_free_kbytes at higher values.
      
      This patch limits the number of MIGRATE_RESERVE blocks that exist per zone
      to two.  For example, with a sufficient min_free_kbytes, 4MB of memory
      will be kept aside on an x86-64 and remain more or less free and
      contiguous for the systems uptime.  This should be sufficient for devices
      depending on high-order atomic allocations while helping fragmentation
      control when min_free_kbytes is tuned appropriately.  As side-effect of
      this patch is that the reserve variable is converted to int as unsigned
      long was the wrong type to use when ensuring that only the required number
      of reserve blocks are created.
      
      With the patches applied, fragmentation-related events as measured by the
      page allocator tracepoints were significantly reduced when running some
      fragmentation stress-tests on systems with min_free_kbytes tuned to a
      value appropriate for hugepage allocations at runtime.  On x86, the events
      recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by
      99.83%.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      78986a67
    • Johannes Weiner's avatar
      mm: document is_page_cache_freeable() · ceddc3a5
      Johannes Weiner authored
      Enlighten the reader of this code about what reference count makes a page
      cache page freeable.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ceddc3a5
    • Johannes Weiner's avatar
      mm: return boolean from page_has_private() · edcf4748
      Johannes Weiner authored
      Make page_has_private() return a true boolean value and remove the double
      negations from the two callsites using it for arithmetic.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      edcf4748
    • Johannes Weiner's avatar
      mm: return boolean from page_is_file_cache() · 6c0b1351
      Johannes Weiner authored
      page_is_file_cache() has been used for both boolean checks and LRU
      arithmetic, which was always a bit weird.
      
      Now that page_lru_base_type() exists for LRU arithmetic, make
      page_is_file_cache() a real predicate function and adjust the
      boolean-using callsites to drop those pesky double negations.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c0b1351
    • Johannes Weiner's avatar
      mm: introduce page_lru_base_type() · 401a8e1c
      Johannes Weiner authored
      Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
      another helper with a more appropriate name and convert the non-boolean
      users of page_is_file_cache() accordingly.
      
      This new helper gives the LRU base type a page is supposed to live on,
      inactive anon or inactive file.
      
      [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      401a8e1c
    • Johannes Weiner's avatar
      mm: drop unneeded double negations · b7c46d15
      Johannes Weiner authored
      Remove double negations where the operand is already boolean.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b7c46d15
    • Sage Weil's avatar
      mm: remove broken 'kzalloc' mempool · bba78819
      Sage Weil authored
      The kzalloc mempool zeros items when they are initially allocated, but
      does not rezero used items that are returned to the pool.  Consequently
      mempool_alloc()s may return non-zeroed memory.
      
      Since there are/were only two in-tree users for
      mempool_create_kzalloc_pool(), and 'fixing' this in a way that will
      re-zero used (but not new) items before first use is non-trivial, just
      remove it.
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bba78819
    • Sage Weil's avatar
      md: avoid use of broken kzalloc mempool · bbba809e
      Sage Weil authored
      The kzalloc mempool does not re-zero items that have been used and then
      returned to the pool.  Manually zero the allocated multipath_bh instead.
      Acked-by: default avatarNeil Brown <neilb@suse.de>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbba809e
    • Jaswinder Singh Rajput's avatar
      mm: includecheck fix for mm/nommu.c · 72ff13b7
      Jaswinder Singh Rajput authored
      Fix the following 'make includecheck' warning:
      
        mm/nommu.c: internal.h is included more than once.
      Signed-off-by: default avatarJaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: default avatarGreg Ungerer <gerg@snapgear.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72ff13b7
    • Jaswinder Singh Rajput's avatar
      mm: includecheck fix for mm/shmem.c · cff397e6
      Jaswinder Singh Rajput authored
      Fix the following 'make includecheck' warning:
      
        mm/shmem.c: linux/vfs.h is included more than once.
      Signed-off-by: default avatarJaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cff397e6
    • Daisuke Nishimura's avatar
      mm: add_to_swap_cache() does not return -EEXIST · 2ca4532a
      Daisuke Nishimura authored
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
      only the context which have set SWAP_HAS_CACHE flag by swapcache_prepare()
      or get_swap_page() would call add_to_swap_cache().  So add_to_swap_cache()
      doesn't return -EEXIST any more.
      
      Even though it doesn't return -EEXIST, it's not good behavior conceptually
      to call swapcache_prepare() in the -EEXIST case, because it means clearing
      SWAP_HAS_CACHE flag while the entry is on swap cache.
      
      This patch removes redundant codes and comments from callers of it, and
      adds VM_BUG_ON() in error path of add_to_swap_cache() and some comments.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ca4532a
    • Daisuke Nishimura's avatar
      mm: add_to_swap_cache() must not sleep · 31a56396
      Daisuke Nishimura authored
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
      read_swap_cache_async() will busy-wait while a entry doesn't exist in swap
      cache but it has SWAP_HAS_CACHE flag.
      
      Such entries can exist on add/delete path of swap cache.  On add path,
      add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag is set, and
      on delete path, swapcache_free() will be called (SWAP_HAS_CACHE flag is
      cleared) soon after __delete_from_swap_cache() is called.  So, the
      busy-wait works well in most cases.
      
      But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and
      read_swap_cache_async() tries to swap-in the same entry on the same cpu.
      
      This patch calls radix_tree_preload() before swapcache_prepare() and
      divides add_to_swap_cache() into two part: radix_tree_preload() part and
      radix_tree_insert() part(define it as __add_to_swap_cache()).
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31a56396
    • Mel Gorman's avatar
      tracing, documentation: Add a document on the kmem tracepoints · 8fbb398f
      Mel Gorman authored
      Knowing tracepoints exist is not quite the same as knowing what they
      should be used for.  This patch adds a document giving a basic description
      of the kmem tracepoints and why they might be useful to a performance
      analyst.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fbb398f
    • Mel Gorman's avatar
      tracing, documentation: add a document describing how to do some performance... · bb722220
      Mel Gorman authored
      tracing, documentation: add a document describing how to do some performance analysis with tracepoints
      
      The documentation for ftrace, events and tracepoints is pretty extensive.
      Similarly, the perf PCL tools help files --help are there and the code
      simple enough to figure out what much of the switches mean.  However,
      pulling the discrete bits and pieces together and translating that into
      "how do I solve a problem" requires a fair amount of imagination.
      
      This patch adds a simple document intended to get someone started on the
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb722220
    • Mel Gorman's avatar
      tracing, page-allocator: add a postprocessing script for page-allocator-related ftrace events · c9d05cfc
      Mel Gorman authored
      This patch adds a simple post-processing script for the
      page-allocator-related trace events.  It can be used to give an indication
      of who the most allocator-intensive processes are and how often the zone
      lock was taken during the tracing period.  Example output looks like
      
      Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
      details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                      under lock     direct  pagevec      drain
      swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
      Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
      modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
      xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
      awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
      thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
      hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
      akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
      xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
      khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0
      
      Optionally, the output can include information on the parent or aggregate
      based on process name instead of aggregating based on each pid. Example output
      including parent information and stripped out the PID looks something like;
      
      Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
      details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                           under lock     direct  pagevec      drain
      gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
      init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
      git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0
      
      This says that Xorg allocated 3796 pages and it's parent process is gdm
      with a PID of 3756;
      
      The postprocessor parses the text output of tracing.  While there is a
      binary format, the expectation is that the binary output can be readily
      translated into text and post-processed offline.  Obviously if the text
      format changes, the parser will break but the regular expression parser is
      fairly rudimentary so should be readily adjustable.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9d05cfc
    • Mel Gorman's avatar
      tracing, page-allocator: add trace event for page traffic related to the buddy lists · 0d3d062a
      Mel Gorman authored
      The page allocation trace event reports that a page was successfully
      allocated but it does not specify where it came from.  When analysing
      performance, it can be important to distinguish between pages coming from
      the per-cpu allocator and pages coming from the buddy lists as the latter
      requires the zone lock to the taken and more data structures to be
      examined.
      
      This patch adds a trace event for __rmqueue reporting when a page is being
      allocated from the buddy lists.  It distinguishes between being called to
      refill the per-cpu lists or whether it is a high-order allocation.
      Similarly, this patch adds an event to catch when the PCP lists are being
      drained a little and pages are going back to the buddy lists.
      
      This is trickier to draw conclusions from but high activity on those
      events could explain why there were a large number of cache misses on a
      page-allocator-intensive workload.  The coalescing and splitting of
      buddies involves a lot of writing of page metadata and cache line bounces
      not to mention the acquisition of an interrupt-safe lock necessary to
      enter this path.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0d3d062a
    • Mel Gorman's avatar
      tracing, page-allocator: add trace events for anti-fragmentation falling back to other migratetypes · e0fff1bd
      Mel Gorman authored
      Fragmentation avoidance depends on being able to use free pages from lists
      of the appropriate migrate type.  In the event this is not possible,
      __rmqueue_fallback() selects a different list and in some circumstances
      change the migratetype of the pageblock.  Simplistically, the more times
      this event occurs, the more likely that fragmentation will be a problem
      later for hugepage allocation at least but there are other considerations
      such as the order of page being split to satisfy the allocation.
      
      This patch adds a trace event for __rmqueue_fallback() that reports what
      page is being used for the fallback, the orders of relevant pages, the
      desired migratetype and the migratetype of the lists being used, whether
      the pageblock changed type and whether this event is important with
      respect to fragmentation avoidance or not.  This information can be used
      to help analyse fragmentation avoidance and help decide whether
      min_free_kbytes should be increased or not.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0fff1bd