1. 06 Mar, 2010 40 commits
    • cpuidle menu: remove 8 bytes of padding on 64 bit builds · 56e6943b
      Richard Kennedy authored
      Reorder struct menu_device to remove 8 bytes of padding on 64 bit builds.
      Size drops from 136 to 128 bytes, so it possibly needs one fewer cache
      line.
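The effect of member ordering on padding can be illustrated with a small sketch. The two structures below are hypothetical (they are not the real struct menu_device fields), and the 64-byte cache line is an assumption; ctypes follows the native C alignment rules of the platform it runs on.

```python
import ctypes


class Padded(ctypes.Structure):
    # Hypothetical field order: on a typical 64-bit ABI each 4-byte int is
    # followed by 4 bytes of padding so the 8-byte member stays aligned.
    _fields_ = [
        ("flag_a", ctypes.c_int),
        ("stamp", ctypes.c_uint64),
        ("flag_b", ctypes.c_int),
    ]


class Reordered(ctypes.Structure):
    # Same members, with the two ints placed next to each other so they
    # share one 8-byte slot and no padding is needed.
    _fields_ = [
        ("flag_a", ctypes.c_int),
        ("flag_b", ctypes.c_int),
        ("stamp", ctypes.c_uint64),
    ]


def cache_lines(size_bytes, line_size=64):
    """Number of cache lines a structure of this size can span (ceiling)."""
    return -(-size_bytes // line_size)


if __name__ == "__main__":
    print("padded:", ctypes.sizeof(Padded), "reordered:", ctypes.sizeof(Reordered))
    print("136 bytes:", cache_lines(136), "lines; 128 bytes:", cache_lines(128), "lines")
```

With 64-byte lines, 136 bytes needs three lines while 128 bytes fits in two, which is the saving the commit message hints at.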
      Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • alpha: PTR_ERR overwrites -EINVAL in syscall osf_mount · 77079dbe
      Roel Kluin authored
      The initial -EINVAL value is overwritten by `retval = PTR_ERR(name)'.  If
      this isn't an error pointer and typenr is not 1, 6 or 9, then this retval,
      a pointer cast to a long, is returned.
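The control flow described above can be modelled in a few lines. This is only a sketch: pointers are modelled as plain integers, negative values stand in for error pointers, and `osf_mount_fixed` shows one plausible way to avoid the overwrite, not necessarily what the actual patch does.

```python
EINVAL = 22
ENOMEM = 12
MAX_ERRNO = 4095
KNOWN_TYPES = {1, 6, 9}  # the typenr values mentioned in the commit message


def is_err(ptr):
    """Model of IS_ERR(): small negative values encode an errno."""
    return -MAX_ERRNO <= ptr < 0


def ptr_err(ptr):
    """Model of PTR_ERR(): reinterpret the pointer as a long."""
    return ptr


def osf_mount_buggy(typenr, name_ptr):
    retval = -EINVAL
    retval = ptr_err(name_ptr)      # unconditionally overwrites -EINVAL
    if is_err(name_ptr):
        return retval
    if typenr in KNOWN_TYPES:
        return 0                    # stand-in for the per-type mount call
    return retval                   # leaks the pointer value to the caller


def osf_mount_fixed(typenr, name_ptr):
    if is_err(name_ptr):
        return ptr_err(name_ptr)    # only take the error when there is one
    if typenr in KNOWN_TYPES:
        return 0
    return -EINVAL                  # unknown type keeps the initial error
```

For an unknown `typenr` and a valid name pointer, the buggy variant returns the pointer value itself, while the fixed variant returns -EINVAL.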
      Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
      Acked-by: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • frv: remove pci_dma_sync_single() and pci_dma_sync_sg() · 68221908
      FUJITA Tomonori authored
      No architecture except for frv has pci_dma_sync_single() and
      pci_dma_sync_sg().  The APIs are deprecated.
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add comment on swap_duplicate's error code · 08259d58
      Hugh Dickins authored
      swap_duplicate()'s loop appears to miss out on returning the error code
      from __swap_duplicate(), except when that's -ENOMEM.  In fact this is
      intentional: prior to -ENOMEM for swap_count_continuation,
      swap_duplicate() was void (and the case only occurs when copy_one_pte()
      hits a corrupt pte).  But that's surprising behaviour, which certainly
      deserves a comment.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nommu: get_user_pages(): pin last page on non-page-aligned start · c08c6e1f
      Steven J. Magnani authored
      The noMMU version of get_user_pages() fails to pin the last page when the
      start address isn't page-aligned.  The patch fixes this in a way that
      makes find_extend_vma() congruent to its MMU cousin.
      Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use the same log level for show_mem() · f047f4f3
      Amerigo Wang authored
      Use the same log level for printk's in show_mem(), so that those messages
      can be shown completely when using log level 6.
      Signed-off-by: WANG Cong <amwang@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add comment about deprecation of __GFP_NOFAIL · 478352e7
      David Rientjes authored
      __GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new
      users should be added.
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: detect mapped file pages used only once · 64574746
      Johannes Weiner authored
      The VM currently assumes that an inactive, mapped and referenced file page
      is in use and promotes it to the active list.
      
      However, every mapped file page starts out like this and thus a problem
      arises when workloads create a stream of such pages that are used only for
      a short time.  By flooding the active list with those pages, the VM
      quickly gets into trouble finding eligible reclaim candidates.  The result
      is long allocation latencies and eviction of the wrong pages.
      
      This patch reuses the PG_referenced page flag (used for unmapped file
      pages) to implement a usage detection that scales with the speed of LRU
      list cycling (i.e.  memory pressure).
      
      If the scanner encounters those pages, the flag is set and the page is
      cycled again on the inactive list.  It is activated only if it returns
      with another page table reference.  Otherwise it is reclaimed as 'not
      recently used cache'.
      
      This effectively changes the minimum lifetime of a used-once mapped file
      page from a full memory cycle to an inactive list cycle, which allows it
      to occur in linear streams without affecting the stable working set of the
      system.
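The decision described in this message can be sketched as a small state machine. This is a simplified model built only from the description above, not the kernel's actual code: the `Page` class, the outcome names, and the function name are all hypothetical.

```python
from dataclasses import dataclass

ACTIVATE = "activate"
KEEP_INACTIVE = "keep_inactive"
RECLAIM = "reclaim"


@dataclass
class Page:
    pte_referenced: bool = False  # a young bit was found in a page table entry
    pg_referenced: bool = False   # the PG_referenced flag on the page itself


def scan_mapped_file_page(page):
    """One pass of the inactive-list scanner over a mapped file page."""
    had_pte_ref = page.pte_referenced
    page.pte_referenced = False   # the scan clears the young bit
    had_flag = page.pg_referenced
    page.pg_referenced = False

    if had_pte_ref:
        # Remember that the page was seen referenced once.
        page.pg_referenced = True
        # Activate only when it was already flagged on an earlier pass.
        return ACTIVATE if had_flag else KEEP_INACTIVE
    # No new page table reference since the last pass.
    return RECLAIM
```

A page touched once goes through `KEEP_INACTIVE` and then `RECLAIM`; a page touched again between two scans is activated on the second pass.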
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: drop page_mapping_inuse() · 31c0569c
      Johannes Weiner authored
      page_mapping_inuse() is a historic predicate function for pages that are
      about to be reclaimed or deactivated.
      
      According to it, a page is in use when it is mapped into page tables OR
      part of swap cache OR backing an mmapped file.
      
      This function is used in combination with page_referenced(), which checks
      for young bits in ptes and the page descriptor itself for the
      PG_referenced bit.  Thus, checking for unmapped swap cache pages is
      meaningless as PG_referenced is not set for anonymous pages and unmapped
      pages do not have young ptes.  The test makes no difference.
      
      Protecting file pages that are not by themselves mapped but are part of a
      mapped file is also a historic leftover for short-lived things like the
      exec() code in libc.  However, the VM now does reference accounting and
      activation of pages at unmap time and thus the special treatment on
      reclaim is obsolete.
      
      This patch drops page_mapping_inuse() and switches the two callsites to
      use page_mapped() directly.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: factor out page reference checks · dfc8d636
      Johannes Weiner authored
      This is the used-once mapped file page detection patchset.
      
      It is meant to help workloads with large amounts of briefly used file
      mappings, like rtorrent hashing a file or git when dealing with loose
      objects (git gc on a bigger site?).
      
      Right now, the VM activates referenced mapped file pages on first
      encounter on the inactive list and it takes a full memory cycle to
      reclaim them again.  When those pages dominate memory, the system
      no longer has a meaningful notion of 'working set' and is required
      to give up the active list to make reclaim progress.  Obviously,
      this results in rather bad scanning latencies and the wrong pages
      being reclaimed.
      
      This patch makes the VM more careful about activating mapped file
      pages in the first place.  The minimum granted lifetime without
      another memory access becomes an inactive list cycle instead of the
      full memory cycle, which is more natural given the mentioned loads.
      
      This test resembles a hashing rtorrent process.  Sequentially, 32MB
      chunks of a file are mapped into memory, hashed (sha1) and unmapped
      again.  While this happens, every 5 seconds a process is launched and
      its execution time taken:
      
      	python2.4 -c 'import pydoc'
      	old: max=2.31s mean=1.26s (0.34)
      	new: max=1.25s mean=0.32s (0.32)
      
      	find /etc -type f
      	old: max=2.52s mean=1.44s (0.43)
      	new: max=1.92s mean=0.12s (0.17)
      
      	vim -c ':quit'
      	old: max=6.14s mean=4.03s (0.49)
      	new: max=3.48s mean=2.41s (0.25)
      
      	mplayer --help
      	old: max=8.08s mean=5.74s (1.02)
      	new: max=3.79s mean=1.32s (0.81)
      
      	overall hash time (stdev):
      	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
      	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
      
      I also tested kernbench with regular IO streaming in the background to
      see whether the delayed activation of frequently used mapped file
      pages had a negative impact on performance in the presence of pressure
      on the inactive list.  The patch made no significant difference in
      timing, neither for kernbench nor for the streaming IO throughput.
      
      The first patch submission raised concerns about the cost of the extra
      faults for actually activated pages on machines that have no hardware
      support for young page table entries.
      
      I created an artificial worst case scenario on an ARM machine with
      around 300MHz and 64MB of memory to figure out the dimensions
      involved.  The test would mmap a file of 20MB, then
      
        1. touch all its pages to fault them in
        2. force one full scan cycle on the inactive file LRU
        -- old: mapping pages activated
        -- new: mapping pages inactive
        3. touch the mapping pages again
        -- old and new: fault exceptions to set the young bits
        4. force another full scan cycle on the inactive file LRU
        5. touch the mapping pages one last time
        -- new: fault exceptions to set the young bits
      
      The test showed an overall increase of 6% in time over 100 iterations
      of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
      (100 * 5k pages), ignoring the execution time of the test itself,
      makes for about 25us overhead for every page that gets actually
      activated.  Note:
      
        1. File mapping the size of one third of main memory, _completely_
        in active use across memory pressure - i.e., most pages referenced
        within one LRU cycle.  This should be rare to non-existent,
        especially on such embedded setups.
      
        2. Many huge activation batches.  Those batches only occur when the
        working set fluctuates.  If it changes completely between every full
        LRU cycle, you have problematic reclaim overhead anyway.
      
        3. Access of activated pages at maximum speed: sequential loads from
        every single page without doing anything in between.  In reality,
        the extra faults will get distributed between actual operations on
        the data.
      
      So even if a workload manages to get the VM into the situation of
      activating a third of memory in one go on such a setup, it will take
      2.2 seconds instead of 2.1 without the patch.
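The per-page overhead quoted for the ARM worst case can be rechecked with a few lines of arithmetic. The 4 KiB page size is an assumption (the message only says "5k pages" for a 20MB mapping); the timings are the approximate ones given above.

```python
PAGE_SIZE = 4096                  # assumed 4 KiB pages
MAPPING_BYTES = 20 * 1024 * 1024  # the 20MB mmap'ed file
ITERATIONS = 100
OLD_TOTAL_S = 212.0               # approximate totals from the message
NEW_TOTAL_S = 225.0

pages = MAPPING_BYTES // PAGE_SIZE                     # about 5k pages
extra_s = NEW_TOTAL_S - OLD_TOTAL_S                    # total added time
per_page_us = extra_s / (ITERATIONS * pages) * 1e6     # overhead per activated page
relative_increase = extra_s / OLD_TOTAL_S              # overall slowdown
per_iteration_old_s = OLD_TOTAL_S / ITERATIONS
per_iteration_new_s = NEW_TOTAL_S / ITERATIONS

if __name__ == "__main__":
    print(f"pages per mapping: {pages}")
    print(f"overhead per page: {per_page_us:.1f} us")
    print(f"overall increase:  {relative_increase:.1%}")
    print(f"per iteration:     {per_iteration_old_s:.2f}s -> {per_iteration_new_s:.2f}s")
```

Under these assumptions the result lands at roughly 25us per page and about 6% overall, in line with the figures in the message.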
      
      Comparing the numbers (and my user-experience over several months),
      I think this change is an overall improvement to the VM.
      
      Patch 1 is only refactoring to break up that ugly compound conditional
      in shrink_page_list() and make it easy to document and add new checks
      in a readable fashion.
      
      Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
      strictly related to #3, but it was in the original submission and is a
      net simplification, so I kept it.
      
      Patch 3 implements used-once detection of mapped file pages.
      
      This patch:
      
      Moving the big conditional into its own predicate function makes the code
      a bit easier to read and allows for better commenting on the checks
      one-by-one.
      
      This is just cleaning up, no semantics should have been changed.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: document /sys/devices/system/node/nodeX · e7c84ee2
      Mel Gorman authored
      Add a bare description of what /sys/devices/system/node/nodeX is.  Others
      will follow in time but right now, none of that tree is documented.  The
      existence of this file might at least encourage people to document new
      entries.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: document /proc/pagetypeinfo · a1b57ac0
      Mel Gorman authored
      Add documentation for /proc/pagetypeinfo.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: suppress pfn range output for zones without pages · 72f0ba02
      David Rientjes authored
      free_area_init_nodes() emits pfn ranges for all zones on the system.
      There may be no pages on a higher zone, however, due to memory limitations
      or the use of the mem= kernel parameter.  For example:
      
      Zone PFN ranges:
        DMA      0x00000001 -> 0x00001000
        DMA32    0x00001000 -> 0x00100000
        Normal   0x00100000 -> 0x00100000
      
      The implementation copies the previous zone's highest pfn, if any, as the
      next zone's lowest pfn.  If its highest pfn is then greater than the
      amount of addressable memory, the upper memory limit is used instead.
      Thus, both the lowest and highest possible pfn for higher zones without
      memory may be the same.
      
      The pfn range for zones without memory is now shown as "empty" instead.
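A minimal sketch of the reporting rule: a zone whose lowest and highest pfn coincide is printed as "empty". The function name and the exact output layout are assumptions modelled on the example above, not the kernel's printk format.

```python
def format_zone_ranges(zones):
    """Render (name, start_pfn, end_pfn) tuples in the style of the boot log."""
    lines = ["Zone PFN ranges:"]
    for name, start_pfn, end_pfn in zones:
        if start_pfn == end_pfn:
            # The zone spans no pages, so there is no range to show.
            lines.append(f"  {name:<8} empty")
        else:
            lines.append(f"  {name:<8} {start_pfn:#010x} -> {end_pfn:#010x}")
    return lines


if __name__ == "__main__":
    example = [
        ("DMA", 0x1, 0x1000),
        ("DMA32", 0x1000, 0x100000),
        ("Normal", 0x100000, 0x100000),  # no memory above the DMA32 limit
    ]
    print("\n".join(format_zone_ranges(example)))
```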
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pm: force GFP_NOIO during suspend/hibernation and resume · 452aa699
      Rafael J. Wysocki authored
      There are quite a few GFP_KERNEL memory allocations made during
      suspend/hibernation and resume that may cause the system to hang, because
      the I/O operations they depend on cannot be completed due to the
      underlying devices being suspended.
      
      Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
      gfp_allowed_mask before suspend/hibernation and restoring the original
      values of these bits in gfp_allowed_mask during the subsequent resume.
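The save, clear, and restore sequence on the mask can be modelled as follows. The bit values and the `AllocatorMask` class are illustrative assumptions; only the idea of masking __GFP_IO and __GFP_FS out of every request during suspend comes from the message.

```python
# Illustrative bit values; the kernel's real flag values may differ.
GFP_WAIT = 0x10
GFP_IO = 0x40
GFP_FS = 0x80
GFP_KERNEL = GFP_WAIT | GFP_IO | GFP_FS
GFP_IOFS = GFP_IO | GFP_FS


class AllocatorMask:
    """Toy model of a global mask applied to every allocation request."""

    def __init__(self):
        self.allowed = 0xFFFFFFFF
        self._saved = 0

    def suspend(self):
        # Remember which of the two bits were set, then clear them.
        self._saved = self.allowed & GFP_IOFS
        self.allowed &= ~GFP_IOFS

    def resume(self):
        # Restore exactly the bits that were set before suspend.
        self.allowed |= self._saved
        self._saved = 0

    def effective(self, gfp_flags):
        """Flags an allocation actually runs with."""
        return gfp_flags & self.allowed
```

While suspended, a GFP_KERNEL request in this model degrades to one that may wait but can no longer start I/O or call into the filesystem.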
      
      [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Reported-by: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile.c: fix swapon size off-by-one · ad2bd7e0
      Hugh Dickins authored
      There's an off-by-one disagreement between mkswap and swapon about the
      meaning of swap_header last_page: mkswap (in all versions I've looked at:
      util-linux-ng and BusyBox and old util-linux; probably as far back as
      1999) consistently means the offset (in page units) of the last page of
      the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3)
      strangely takes it to mean the size (in page units) of the swap area.
      
      This disagreement is the safe way round; but it's worrying people, and
      loses us one page of swap.
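The two readings of last_page differ by exactly one page, which a tiny sketch makes concrete. The function names are hypothetical, and the 1024-page area is only an example value.

```python
def swap_pages_as_size(last_page):
    """Old sys_swapon reading: last_page is taken as the area size in pages."""
    return last_page


def swap_pages_as_offset(last_page):
    """mkswap's meaning: last_page is the offset of the final page."""
    return last_page + 1


if __name__ == "__main__":
    area_pages = 1024                 # example swap area size
    last_page = area_pages - 1        # what mkswap writes into the header
    print("kernel (old):", swap_pages_as_size(last_page), "pages")
    print("mkswap intent:", swap_pages_as_offset(last_page), "pages")
```

Because the old reading undercounts rather than overcounts, the kernel never touched pages beyond the area, which is why the message calls the disagreement "the safe way round".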
      
      The fix is not just to add one to nr_good_pages: we need to get maxpages
      (the size of the swap_map array) right before that; and though that is an
      unsigned long, be careful not to overflow the unsigned int p->max which
      later holds it (probably why header uses __u32 last_page instead of size).
      
      Why did we subtract one from the maximum swp_offset to calculate maxpages?
      Though it was probably me who made that change in 2.4.10, I don't get it:
      and now we should be adding one (without risk of overflow in this case).
      
      Fix the handling of swap_header badpages: it could have overrun the
      swap_map when a very large swap area was used on a more limited architecture.
      
      Remove pre-initializations of swap_header, nr_good_pages and maxpages:
      those date from when sys_swapon was supporting other versions of header.
      Reported-by: Nitin Gupta <ngupta@vflare.org>
      Reported-by: Jarkko Lavinen <jarkko.lavinen@nokia.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove VM_LOCK_RMAP code · fc148a5f
      Rik van Riel authored
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rmap: move exclusively owned pages to own anon_vma in do_wp_page() · c44b6743
      Rik van Riel authored
      When the parent process breaks the COW on a page, both the original page,
      which is mapped in the child, and the new page, which is mapped in the
      parent, end up in that same anon_vma.  Generally this won't be a problem,
      but for some workloads it could preserve the O(N) rmap scanning complexity.
      
      A simple fix is to ensure that, when a page which is mapped in the child
      gets reused in do_wp_page, because we already are the exclusive owner, the
      page gets moved to our own exclusive child's anon_vma.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rmap: remove obsolete check from __page_check_anon_rmap() · 033a64b5
      Rik van Riel authored
      When an anonymous page is inherited from a parent process, the
      vma->anon_vma can differ from the page anon_vma.  This can trip up
      __page_check_anon_rmap, which is indirectly called from do_swap_page().
      
      Remove that obsolete check to prevent an oops.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel authored
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach into the tens of
      thousands.  Real workloads are still a factor of 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
      This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
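The "from O(N) to 2" figure follows from a weighted average, sketched below. The cost model is an assumption made for illustration: walking a child-private anon_vma costs 1 unit, walking the shared parent anon_vma costs N units, and exactly 1/N of all pages still sit in the shared one.

```python
def total_anon_pages(children, pages_per_child):
    """Pages that end up in one anon_vma under the old scheme."""
    return children * pages_per_child


def old_walk_cost(n_processes):
    """Old scheme: every page walk visits all N processes."""
    return n_processes


def new_average_walk_cost(n_processes):
    """New scheme: 1/N of the pages cost N, the rest cost 1."""
    shared_fraction = 1.0 / n_processes
    return (1.0 - shared_fraction) * 1 + shared_fraction * n_processes


if __name__ == "__main__":
    n = 1000
    print("pages in one anon_vma (old):", total_anon_pages(n, 1000))
    print("walk cost per page (old):", old_walk_cost(n))
    print("average walk cost (new):", new_average_walk_cost(n))
```

The average works out to 2 - 1/N, which approaches 2 as the number of processes grows.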
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time; there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: fix "integer as NULL pointer" sparse warning · 648bcc77
      Thiago Farina authored
      mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer
      Signed-off-by: Thiago Farina <tfransosi@gmail.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/fs.h: convert FMODE_* constants to hex · 19adf9c5
      Andrew Morton authored
      It was tolerable until Eric went and added 8388608.
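Why hex reads better for bit flags can be seen by converting the value mentioned above. The helper names are hypothetical; the only value taken from the message is 8388608.

```python
def to_hex(value):
    """Render a flag value the way it would appear as a hex constant."""
    return f"{value:#x}"


def bit_position(value):
    """Return which single bit a power-of-two flag occupies."""
    if value <= 0 or value & (value - 1):
        raise ValueError("not a single-bit flag")
    return value.bit_length() - 1


if __name__ == "__main__":
    for flag in (1, 2, 4, 8, 16, 8388608):
        print(f"{flag:>8} = {to_hex(flag):>9} (bit {bit_position(flag)})")
```

In decimal it takes some effort to see that 8388608 is a single bit; as 0x800000 it is obvious at a glance.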
      
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM · 0141450f
      Wu Fengguang authored
      This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
      
      POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
      a 16K read will be carried out in 4 _sync_ 1-page reads.
      
      In other places, ra_pages==0 means
      - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
      - some IO error happened
      where multi-page read IO won't help or should be avoided.
      
      POSIX_FADV_RANDOM actually wants different semantics: to disable the
      *heuristic* readahead algorithm, and to use a dumb one which faithfully
      submits read IO for whatever the application requests.
      
      So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
      
      Note that the random hint is not likely to help random read performance
      noticeably.  And it may be too permissive on huge request sizes (its IO
      size is not limited by read_ahead_kb).
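From user space, the hint discussed here is set with posix_fadvise(). The sketch below shows an application issuing 16K random reads after giving the hint; the helper names are hypothetical and the 4 KiB page size in `pages_for` is an assumption. How the kernel services each read depends on the kernel version.

```python
import os


def pages_for(length, page_size=4096):
    """Number of pages a read of this length covers (ceiling division)."""
    return -(-length // page_size)


def read_randomly(path, offsets, length=16 * 1024):
    """Read `length` bytes at each offset after hinting random access."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # offset=0, len=0 applies the advice to the whole file.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
            except OSError:
                pass  # the hint is advisory; carry on without it
        return [os.pread(fd, length, offset) for offset in offsets]
    finally:
        os.close(fd)
```

A 16K request covers four 4K pages, which is where the "4 _sync_ 1-page reads" figure for the old behaviour comes from.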
      
      In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
      (NFS read) performance of the application increased by 313%!
      Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org>			[2.6.33.x]
      Cc: <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: take f_lock on modifying f_mode after open time · 42e49608
      Wu Fengguang authored
      We'll introduce FMODE_RANDOM which will be runtime modified.  So protect
      all runtime modification to f_mode with f_lock to avoid races.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org>			[2.6.33.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate.c: kill anon local variable from migrate_page_copy · 85f1fb72
      KOSAKI Motohiro authored
      commit 01b1ae63 ("memcg: simple migration handling") removed
      mem_cgroup_uncharge_cache_page() call from migrate_page_copy.  Local
      variable `anon' is now unused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: fix indentation of the comments of do_migrate_pages · da0aa138
      KOSAKI Motohiro authored
      Currently, do_migrate_pages() has a very long comment that is not
      indented properly.  I often mistake it for the function's leading comment
      and get confused.
      
      This patch fixes it.
      
      Note: this patch doesn't break the 80 column rule.  I guess the original
      author intended this indentation, but an accident corrupted it.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: create /sys/firmware/memmap entry for new memory · d96ae530
      akpm@linux-foundation.org authored
      A memmap is a directory in sysfs which includes 3 text files: start, end
      and type.  For example:
      
      start: 	0x100000
      end:	0x7e7b1cff
      type:	System RAM
      
      The interface firmware_map_add() was not called explicitly.  Remove it
      and add firmware_map_add_hotplug() as the hotplug interface for memmap.
      
      Each memory entry has a memmap in sysfs.  When we hot-add new memory,
      sysfs does not export a memmap entry for it, so add a call to
      firmware_map_add_hotplug() in add_memory().
      
      Add a new function, add_sysfs_fw_map_entry(), to create the memmap entry;
      it is called both when initializing memmap and when hot-adding memory.
      
      [akpm@linux-foundation.org: un-kerneldoc a no longer kerneldoc comment]
      Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d96ae530
    • KOSAKI Motohiro's avatar
      mm: fix mbind vma merge problem · 9d8cebd4
      KOSAKI Motohiro authored
      Strangely, the current mbind() doesn't merge a vma with its neighboring
      vmas even when that is possible.  Unfortunately, having many vmas can
      reduce performance.
      
      This patch fixes it.
      
          reproducer program
          ----------------------------------------------------------------
           #include <numaif.h>
           #include <numa.h>
           #include <sys/mman.h>
           #include <stdio.h>
           #include <unistd.h>
           #include <stdlib.h>
           #include <string.h>
      
          static unsigned long pagesize;
      
          int main(int argc, char** argv)
          {
          	void* addr;
          	int ch;
          	int node;
          	struct bitmask *nmask = numa_allocate_nodemask();
          	int err;
          	int node_set = 0;
          	char buf[128];
      
          	while ((ch = getopt(argc, argv, "n:")) != -1){
          		switch (ch){
          		case 'n':
          			node = strtol(optarg, NULL, 0);
          			numa_bitmask_setbit(nmask, node);
          			node_set = 1;
          			break;
          		default:
          			;
          		}
          	}
          	argc -= optind;
          	argv += optind;
      
          	if (!node_set)
          		numa_bitmask_setbit(nmask, 0);
      
          	pagesize = getpagesize();
      
          	addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
          		    MAP_ANON|MAP_PRIVATE, -1, 0);
          	if (addr == MAP_FAILED)
          		perror("mmap "), exit(1);
      
          	fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr);
      
          	/* make page populate */
          	memset(addr, 0, pagesize*3);
      
          	/* first mbind */
          	err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
          		    nmask->size, MPOL_MF_MOVE_ALL);
          	if (err)
          		perror("mbind1 "), exit(1);
      
          	/* second mbind */
          	err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
          	if (err)
          		perror("mbind2 "), exit(1);
      
          	sprintf(buf, "cat /proc/%d/maps", getpid());
          	system(buf);
      
          	return 0;
          }
          ----------------------------------------------------------------
      
      result without this patch
      
      	addr = 0x7fe26ef09000
      	[snip]
      	7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0
      	7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0
      	7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0
      	7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0
      
      	=> The range 0x7fe26ef09000-0x7fe26ef0c000 is covered by three vmas.
      
      result with this patch
      
      	addr = 0x7fc9ebc76000
      	[snip]
      	7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0
      	7fffbe690000-7fffbe6a5000 rw-p 00000000	00:00 0	[stack]
      
      	=> The range 0x7fc9ebc76000-0x7fc9ebc7a000 is covered by only one vma.
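The merging behaviour shown in the two results can be modelled with a toy routine. This sketch is not the kernel's vma_merge(); `struct demo_region` and `demo_merge_regions()` are hypothetical names, and a single integer stands in for the memory policy.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a vma: a half-open address range plus a policy id. */
struct demo_region {
	unsigned long start;
	unsigned long end;
	int policy;
};

/*
 * Merge neighbours in place when they touch and carry the same policy.
 * The array must be sorted by start address and must not overlap.
 * Returns the number of regions left.
 */
static size_t demo_merge_regions(struct demo_region *r, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	for (size_t i = 1; i < n; i++) {
		assert(r[i].start >= r[out].end);
		if (r[i].start == r[out].end && r[i].policy == r[out].policy)
			r[out].end = r[i].end;	/* extend the previous region */
		else
			r[++out] = r[i];	/* keep as a separate region */
	}
	return out + 1;
}
```

With three adjacent one-page regions that all carry the default policy, the routine leaves a single region, which mirrors the "with this patch" output; if the middle region keeps a different policy, all three remain.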
      
      [minchan.kim@gmail.com: fix file offset passed to vma_merge()]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d8cebd4
    • KOSAKI Motohiro's avatar
      mm: restore zone->all_unreclaimable to independence word · 93e4a89a
      KOSAKI Motohiro authored
      Commit e815af95 ("change all_unreclaimable zone member to flags") changed
      the all_unreclaimable member to a bit flag, but that had an undesirable
      side effect: free_one_page() is one of the hottest paths in the Linux
      kernel, and the extra atomic ops in it can reduce kernel performance a bit.
      
      Thus, this patch partially reverts that commit.  At the least,
      all_unreclaimable shouldn't share a memory word with the other zone flags.
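The trade-off can be illustrated with a small userspace model using C11 atomics. Everything here is a hypothetical sketch (`struct demo_zone`, `DEMO_ZONE_RECLAIM_LOCKED`, both helpers), and it assumes the caller of the free path already serialises access with a lock, so a plain store is safe there.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_ZONE_RECLAIM_LOCKED 0x1u	/* hypothetical flag bit */

struct demo_zone {
	atomic_uint flags;	/* shared flag bits: need atomic read-modify-write */
	int all_unreclaimable;	/* own word: plain store, caller holds the lock */
};

/* A flag in a shared word: clearing one bit must not disturb the others. */
static void demo_zone_clear_flag(struct demo_zone *z, unsigned int bit)
{
	assert(bit != 0);
	atomic_fetch_and(&z->flags, ~bit);
}

/* Hot free path: a plain store is enough for a field that owns its word. */
static void demo_free_one_page(struct demo_zone *z)
{
	z->all_unreclaimable = 0;
}
```

Keeping the field in its own word means the hot path avoids the atomic read-modify-write that a bit in the shared flags word would require.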
      
      [akpm@linux-foundation.org: fix patch interaction]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93e4a89a
    • Li Hong's avatar
      mm: remove free_hot_page() · fc91668e
      Li Hong authored
      free_hot_page() is just a wrapper around free_hot_cold_page() with
      parameter 'cold = 0'.  After adding a clear comment for
      free_hot_cold_page(), it is reasonable to remove a level of call.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Li Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc91668e
    • Li Hong's avatar
      mm/page_alloc.c: adjust a call site to trace_mm_page_free_direct · c475dab6
      Li Hong authored
      Move the call to trace_mm_page_free_direct() from free_hot_page() to
      free_hot_cold_page().  This is clearer and puts the call close to
      kmemcheck_free_shadow(), as is done in __free_pages_ok().
      Signed-off-by: Li Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c475dab6
    • Li Hong's avatar
      mm/page_alloc.c: remove duplicate call to trace_mm_page_free_direct · f650316c
      Li Hong authored
      trace_mm_page_free_direct() is called in __free_pages().  But it is
      called again in free_hot_page() if order == 0, producing duplicate
      records in the trace file for the mm_page_free_direct event, as shown below:
      
      TASK-PID    CPU#    TIMESTAMP  FUNCTION
        gnome-terminal-1567  [000]  4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
        gnome-terminal-1567  [000]  4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
      
      This patch removes the first call and adds a call to
      trace_mm_page_free_direct() in __free_pages_ok().
      Signed-off-by: Li Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f650316c
    • KOSAKI Motohiro's avatar
      mm, lockdep: annotate reclaim context to zone reclaim too · 76ca542d
      KOSAKI Motohiro authored
      Commit cf40bd16 ("lockdep: annotate reclaim context") introduced reclaim
      context annotation, but it didn't annotate zone reclaim.  This patch
      does so.
      
      The point is that commit cf40bd16 annotates __alloc_pages_direct_reclaim(),
      but zone reclaim doesn't use __alloc_pages_direct_reclaim().
      
      The current call graph is:
      
      __alloc_pages_nodemask
         get_page_from_freelist
             zone_reclaim()
         __alloc_pages_slowpath
             __alloc_pages_direct_reclaim
                 try_to_free_pages
      
      In fact, if zone_reclaim_mode=1, the VM never calls
      __alloc_pages_direct_reclaim() under usual VM pressure.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76ca542d
    • KOSAKI Motohiro's avatar
      vmscan: get_scan_ratio() cleanup · 84b18490
      KOSAKI Motohiro authored
      get_scan_ratio() should contain all scan-ratio related calculations.
      Thus, this patch moves some calculations into get_scan_ratio().
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84b18490
    • Minchan Kim's avatar
      vmscan: check high watermark after shrink zone · 45973d74
      Minchan Kim authored
      kswapd checks that a zone has sufficient free pages via zone_watermark_ok().
      
      If any zone doesn't have enough pages, we set all_zones_ok to zero.
      !all_zones_ok makes kswapd retry rather than sleep.
      
      I think the watermark check before shrink_zone() is pointless.  Only after
      kswapd has tried to shrink the zone is the check meaningful.
      
      Move the check to after the call to shrink_zone().
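The ordering argued for above can be sketched as a toy balancing pass. This is not the kernel's balance_pgdat(); `struct demo_kzone`, `demo_shrink_zone()` and `demo_balance_pass()` are hypothetical, and the sketch shrinks every zone unconditionally for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Toy zone: free pages, the high watermark, and pages that can be reclaimed. */
struct demo_kzone {
	long free_pages;
	long high_wmark;
	long reclaimable;
};

/* Reclaim up to 'nr' pages from the zone and return how many were freed. */
static long demo_shrink_zone(struct demo_kzone *z, long nr)
{
	long freed = nr < z->reclaimable ? nr : z->reclaimable;

	z->reclaimable -= freed;
	z->free_pages += freed;
	return freed;
}

/*
 * One balancing pass: shrink each zone first, then test its high watermark.
 * Returns 1 if every zone ended at or above its watermark, 0 otherwise.
 */
static int demo_balance_pass(struct demo_kzone *zones, size_t n, long nr)
{
	int all_zones_ok = 1;

	assert(nr >= 0);
	for (size_t i = 0; i < n; i++) {
		demo_shrink_zone(&zones[i], nr);
		if (zones[i].free_pages < zones[i].high_wmark)
			all_zones_ok = 0;
	}
	return all_zones_ok;
}
```

Checking after the shrink means a zone that reclaim has just brought above its watermark no longer forces another pass.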
      
      [akpm@linux-foundation.org: fix comment, layout]
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45973d74
    • Jiri Slaby's avatar
      mm: use rlimit helpers · 59e99e5b
      Jiri Slaby authored
      Make sure the compiler won't do weird things with limits.  E.g. fetching
      them twice may return two different values once writable limits are
      implemented.
      
      I.e. either use the rlimit helpers added in 3e10e716 ("resource: add
      helpers for fetching rlimits") or ACCESS_ONCE() where they are not
      applicable.
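A minimal userspace sketch of the "fetch once" pattern follows. `DEMO_ACCESS_ONCE`, `struct demo_rlimit` and both helpers are hypothetical; the macro follows the same volatile-cast idea as the kernel's ACCESS_ONCE() and relies on the GCC/Clang `__typeof__` extension.

```c
#include <assert.h>

/* Force exactly one load of x by going through a volatile-qualified lvalue. */
#define DEMO_ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct demo_rlimit {
	unsigned long cur;
	unsigned long max;
};

/* Hypothetical helper in the spirit of the rlimit helpers named above. */
static unsigned long demo_rlimit_cur(struct demo_rlimit *lim)
{
	return DEMO_ACCESS_ONCE(lim->cur);
}

/* Read the limit once, then use that snapshot for both the test and the clamp. */
static unsigned long demo_clamp_to_limit(struct demo_rlimit *lim,
					 unsigned long want)
{
	unsigned long limit = demo_rlimit_cur(lim);

	return want > limit ? limit : want;
}
```

If the code instead read `lim->cur` in both the comparison and the result, a concurrent writer could make the two reads disagree.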
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59e99e5b
    • KOSAKI Motohiro's avatar
      mm: mlock_vma_pages_range() only return success or failure · 06f9d8c2
      KOSAKI Motohiro authored
      Currently, mlock_vma_pages_range() only returns len or 0, so the current
      error handling in mmap_region() is needlessly complex.
      
      This patch simplifies it and makes it consistent with the brk() code.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06f9d8c2
    • KOSAKI Motohiro's avatar
      mm: mlock_vma_pages_range() never return negative value · c58267c3
      KOSAKI Motohiro authored
      Currently, mlock_vma_pages_range() never returns a negative value, so we
      can remove some pointless error checks.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c58267c3
    • KAMEZAWA Hiroyuki's avatar
      mm: count swap usage · b084d435
      KAMEZAWA Hiroyuki authored
      A frequent question from users about memory management is how many swap
      entries are used by each process.  This information will also give some
      hints to the oom-killer.
      
      Although we can count the number of swapents per process by scanning
      /proc/<pid>/smaps, that is very slow and not suitable for ordinary process
      information tools such as 'ps' or 'top'.  (ps and top are slow enough
      already.)
      
      This patch adds a swapents counter to mm_counter and updates it at each
      swap event.  The information is exported via the /proc/<pid>/status file as
      
      [kamezawa@bluextal memory]$ cat /proc/self/status
      Name:   cat
      State:  R (running)
      Tgid:   2910
      Pid:    2910
      PPid:   2823
      TracerPid:      0
      Uid:    500     500     500     500
      Gid:    500     500     500     500
      FDSize: 256
      Groups: 500
      VmPeak:    82696 kB
      VmSize:    82696 kB
      VmLck:         0 kB
      VmHWM:       432 kB
      VmRSS:       432 kB
      VmData:      172 kB
      VmStk:        84 kB
      VmExe:        48 kB
      VmLib:      1568 kB
      VmPTE:        40 kB
      VmSwap:        0 kB <=============== this.
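A tool could pick the new field out of the status text along these lines. The function name `demo_parse_vmswap_kb()` is hypothetical, and the sketch assumes the caller has already read /proc/<pid>/status into a NUL-terminated buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the "VmSwap:" line in a buffer holding /proc/<pid>/status text and
 * return its value in kB, or -1 if the field is absent or malformed.
 */
static long demo_parse_vmswap_kb(const char *status)
{
	static const char key[] = "VmSwap:";
	const char *line = status;

	assert(status != NULL);
	while (line && *line) {
		if (strncmp(line, key, sizeof(key) - 1) == 0) {
			long kb;

			if (sscanf(line + sizeof(key) - 1, "%ld kB", &kb) == 1)
				return kb;
			return -1;
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Reading one line from the status file this way avoids the per-mapping walk that scanning smaps requires.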
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b084d435
    • KAMEZAWA Hiroyuki's avatar
      mm: avoid false sharing of mm_counter · 34e55232
      KAMEZAWA Hiroyuki authored
      Per-mm stats are by nature shared among threads, so they can be a
      cache-miss point in the page fault path.
      
      This patch adds a per-thread cache for mm_counter.  RSS values are
      counted in a struct in task_struct and synchronized with the mm's
      counters at certain events.
      
      In this patch, the event is the number of calls to handle_mm_fault():
      the per-thread value is added to the mm every 64 calls.
      
       A rough estimate with a small benchmark on parallel threads (2 threads) shows:
       [before]
           4.5 cache-miss/faults
       [after]
           4.0 cache-miss/faults
       Anyway, the most contended object is mmap_sem if the number of threads grows.
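The batching scheme can be sketched as a single-threaded model. The structure and function names below are hypothetical; only the flush interval of 64 comes from the text above, and a concurrent version would need an atomic add where the shared counter is updated.

```c
#include <assert.h>

#define DEMO_SYNC_THRESHOLD 64	/* flush interval taken from the commit text */

struct demo_mm {
	long rss;		/* shared counter; contended if every fault touches it */
};

struct demo_task {
	struct demo_mm *mm;
	long cached_rss;	/* private delta not yet folded into mm->rss */
	int events;		/* faults handled since the last flush */
};

/* Fold the private delta into the shared counter and reset the cache. */
static void demo_sync_rss(struct demo_task *t)
{
	assert(t->mm);
	t->mm->rss += t->cached_rss;	/* would be an atomic add if threaded */
	t->cached_rss = 0;
	t->events = 0;
}

/* Called once per simulated page fault that maps 'pages' new pages. */
static void demo_handle_fault(struct demo_task *t, long pages)
{
	t->cached_rss += pages;
	if (++t->events >= DEMO_SYNC_THRESHOLD)
		demo_sync_rss(t);
}
```

Between flushes the shared counter lags behind by at most the cached delta, which is the price paid for touching the shared cache line once per 64 faults instead of on every fault.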
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34e55232
    • KAMEZAWA Hiroyuki's avatar
      mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki authored
      Presently, the per-mm statistics counters are defined by macros in sched.h.
      
      This patch changes them to
        - be defined in mm.h as inline functions
        - use an array instead of macro-generated names.
      
      This patch is meant to reduce the size of a future patch that modifies
      the implementation of the per-mm counters.
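The shape of such an interface might look like the sketch below. The enum values, structure and function names are illustrative assumptions, not copied from mm.h.

```c
#include <assert.h>

/* Counter indices; one array slot per statistic instead of one field each. */
enum demo_counter {
	DEMO_FILEPAGES,
	DEMO_ANONPAGES,
	DEMO_NR_COUNTERS
};

struct demo_mm_stats {
	long count[DEMO_NR_COUNTERS];
};

static inline long demo_get_counter(const struct demo_mm_stats *s,
				    enum demo_counter member)
{
	assert(member < DEMO_NR_COUNTERS);
	return s->count[member];
}

static inline void demo_add_counter(struct demo_mm_stats *s,
				    enum demo_counter member, long value)
{
	assert(member < DEMO_NR_COUNTERS);
	s->count[member] += value;
}

static inline void demo_inc_counter(struct demo_mm_stats *s,
				    enum demo_counter member)
{
	demo_add_counter(s, member, 1);
}
```

Because every access goes through the same few functions, a later change to how the counters are stored only has to touch those functions rather than every call site.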
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d559db08