1. 22 Sep, 2009 40 commits
    • md: avoid use of broken kzalloc mempool · bbba809e
      Sage Weil authored
      The kzalloc mempool does not re-zero items that have been used and then
      returned to the pool.  Manually zero the allocated multipath_bh instead.
      Acked-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
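The failure mode described in the commit above can be illustrated with a small user-space model. This is a hypothetical Python sketch, not kernel code: `ZeroedPool` stands in for a pool whose elements are zeroed only when first created, and `alloc_zeroed` shows the kind of manual zeroing the commit message describes. None of these names correspond to kernel symbols.

```python
class ZeroedPool:
    """Toy model of a pool whose elements are zeroed only at creation.

    Hypothetical illustration; names do not correspond to kernel symbols.
    """

    def __init__(self, count, size):
        # Every reserved element starts out zero-filled.
        self._free = [bytearray(size) for _ in range(count)]

    def alloc(self):
        # Hands back a reserved element as-is: nothing re-zeroes it here.
        return self._free.pop()

    def free(self, item):
        # Returns the element to the reserve with its old contents intact.
        self._free.append(item)


def alloc_zeroed(pool):
    """Allocate from the pool and clear the element manually."""
    item = pool.alloc()
    item[:] = bytes(len(item))  # explicit zeroing, in the spirit of memset(p, 0, n)
    return item
```

A caller that relies on `alloc()` returning zeroed memory sees stale data on reuse, while `alloc_zeroed()` always starts from a clean element.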
    • mm: includecheck fix for mm/nommu.c · 72ff13b7
      Jaswinder Singh Rajput authored
      Fix the following 'make includecheck' warning:
      
        mm/nommu.c: internal.h is included more than once.
      Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: includecheck fix for mm/shmem.c · cff397e6
      Jaswinder Singh Rajput authored
      Fix the following 'make includecheck' warning:
      
        mm/shmem.c: linux/vfs.h is included more than once.
      Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add_to_swap_cache() does not return -EEXIST · 2ca4532a
      Daisuke Nishimura authored
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
      only contexts that have set the SWAP_HAS_CACHE flag via swapcache_prepare()
      or get_swap_page() call add_to_swap_cache().  So add_to_swap_cache()
      doesn't return -EEXIST any more.
      
      Even though it doesn't return -EEXIST, it's not good behavior conceptually
      to call swapcache_prepare() in the -EEXIST case, because it means clearing
      SWAP_HAS_CACHE flag while the entry is on swap cache.
      
      This patch removes redundant code and comments from its callers, and
      adds VM_BUG_ON() in the error path of add_to_swap_cache() along with some
      comments.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add_to_swap_cache() must not sleep · 31a56396
      Daisuke Nishimura authored
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
      read_swap_cache_async() will busy-wait while an entry doesn't exist in the
      swap cache but has the SWAP_HAS_CACHE flag set.
      
      Such entries can exist on add/delete path of swap cache.  On add path,
      add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag is set, and
      on delete path, swapcache_free() will be called (SWAP_HAS_CACHE flag is
      cleared) soon after __delete_from_swap_cache() is called.  So, the
      busy-wait works well in most cases.
      
      But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and
      read_swap_cache_async() tries to swap-in the same entry on the same cpu.
      
      This patch calls radix_tree_preload() before swapcache_prepare() and
      divides add_to_swap_cache() into two parts: a radix_tree_preload() part
      and a radix_tree_insert() part (defined as __add_to_swap_cache()).
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing, documentation: Add a document on the kmem tracepoints · 8fbb398f
      Mel Gorman authored
      Knowing tracepoints exist is not quite the same as knowing what they
      should be used for.  This patch adds a document giving a basic description
      of the kmem tracepoints and why they might be useful to a performance
      analyst.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing, documentation: add a document describing how to do some performance analysis with tracepoints · bb722220
      Mel Gorman authored
      The documentation for ftrace, events and tracepoints is pretty extensive.
      Similarly, the --help output of the perf PCL tools is there, and the code
      is simple enough to figure out what most of the switches mean.  However,
      pulling the discrete bits and pieces together and translating that into
      "how do I solve a problem" requires a fair amount of imagination.
      
      This patch adds a simple document intended to get someone started on the
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing, page-allocator: add a postprocessing script for page-allocator-related ftrace events · c9d05cfc
      Mel Gorman authored
      This patch adds a simple post-processing script for the
      page-allocator-related trace events.  It can be used to give an indication
      of who the most allocator-intensive processes are and how often the zone
      lock was taken during the tracing period.  Example output looks like
      
      Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
      details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                      under lock     direct  pagevec      drain
      swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
      Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
      modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
      xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
      awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
      thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
      hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
      akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
      xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
      khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0
      
      Optionally, the output can include information on the parent, or aggregate
      based on process name instead of on each PID.  Example output including
      parent information, with the PID stripped out, looks something like:
      
      Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
      details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                           under lock     direct  pagevec      drain
      gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
      init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
      git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0
      
      This says that Xorg allocated 3796 pages and its parent process is gdm
      with a PID of 3756.
      
      The postprocessor parses the text output of tracing.  While there is a
      binary format, the expectation is that the binary output can be readily
      translated into text and post-processed offline.  Obviously if the text
      format changes, the parser will break but the regular expression parser is
      fairly rudimentary so should be readily adjustable.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
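The regular-expression approach described in the commit above can be sketched in a few lines. This is not the script the patch adds; it is a minimal Python illustration under stated assumptions: the trace line layout in the comment below, the `mm_page_alloc` event name as the default, and the choice to count `2**order` pages per event are all assumptions of this sketch, and `pages_allocated` is a hypothetical name.

```python
import re
from collections import Counter

# Assumed layout of one text trace line (an assumption of this sketch):
#   <comm>-<pid>  [<cpu>]  <timestamp>: <event>: <key=value ...>
TRACE_LINE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>[\d.]+):\s+(?P<event>\w+):\s*(?P<details>.*)$"
)
ORDER_FIELD = re.compile(r"\border=(\d+)")


def pages_allocated(lines, by_name=False, event="mm_page_alloc"):
    """Sum pages allocated per process from text trace lines.

    Lines that do not match the assumed layout, or that belong to another
    event, are skipped. With by_name=True the totals are aggregated on the
    process name instead of on name-pid.
    """
    totals = Counter()
    for line in lines:
        m = TRACE_LINE.match(line)
        if m is None or m.group("event") != event:
            continue
        order = ORDER_FIELD.search(m.group("details"))
        # An order-n allocation covers 2**n pages; default to one page.
        pages = 1 << int(order.group(1)) if order else 1
        if by_name:
            key = m.group("comm")
        else:
            key = f"{m.group('comm')}-{m.group('pid')}"
        totals[key] += pages
    return totals
```

If the text format changes, only `TRACE_LINE` and `ORDER_FIELD` need adjusting, which matches the commit's point that a rudimentary regex parser is easy to adapt.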
    • tracing, page-allocator: add trace event for page traffic related to the buddy lists · 0d3d062a
      Mel Gorman authored
      The page allocation trace event reports that a page was successfully
      allocated but it does not specify where it came from.  When analysing
      performance, it can be important to distinguish between pages coming from
      the per-cpu allocator and pages coming from the buddy lists as the latter
      requires the zone lock to be taken and more data structures to be
      examined.
      
      This patch adds a trace event for __rmqueue reporting when a page is being
      allocated from the buddy lists.  It distinguishes between being called to
      refill the per-cpu lists or whether it is a high-order allocation.
      Similarly, this patch adds an event to catch when the PCP lists are being
      drained a little and pages are going back to the buddy lists.
      
      This is trickier to draw conclusions from but high activity on those
      events could explain why there were a large number of cache misses on a
      page-allocator-intensive workload.  The coalescing and splitting of
      buddies involves a lot of writing of page metadata and cache line bounces
      not to mention the acquisition of an interrupt-safe lock necessary to
      enter this path.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing, page-allocator: add trace events for anti-fragmentation falling back to other migratetypes · e0fff1bd
      Mel Gorman authored
      Fragmentation avoidance depends on being able to use free pages from lists
      of the appropriate migrate type.  In the event this is not possible,
      __rmqueue_fallback() selects a different list and in some circumstances
      changes the migratetype of the pageblock.  Simplistically, the more times
      this event occurs, the more likely that fragmentation will be a problem
      later for hugepage allocation at least but there are other considerations
      such as the order of page being split to satisfy the allocation.
      
      This patch adds a trace event for __rmqueue_fallback() that reports what
      page is being used for the fallback, the orders of relevant pages, the
      desired migratetype and the migratetype of the lists being used, whether
      the pageblock changed type and whether this event is important with
      respect to fragmentation avoidance or not.  This information can be used
      to help analyse fragmentation avoidance and help decide whether
      min_free_kbytes should be increased or not.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing, page-allocator: add trace events for page allocation and page freeing · 4b4f278c
      Mel Gorman authored
      This patch adds trace events for the allocation and freeing of pages,
      including the freeing of pagevecs.  Using the events, it will be known
      what struct page and pfns are being allocated and freed and what the call
      site was in many cases.
      
      The page alloc tracepoints can be used as an indicator as to whether the
      workload was heavily dependent on the page allocator or not.  You can make
      a guess based on vmstat but you can't get a per-process breakdown.
      Depending on the call path, the call_site for page allocation may be
      __get_free_pages() instead of a useful callsite.  Instead of passing down
      a return address similar to slab debugging, the user should enable the
      stacktrace and sym-addr options to get a proper stack trace.

      The pagevec free tracepoint has a different usecase.  It can be used to
      get an idea of how many pages are being dumped off the LRU and whether it
      is kswapd doing the work or a process doing direct reclaim.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • page-allocator: remove dead function free_cold_page() · 38a39857
      Mel Gorman authored
      The function free_cold_page() has no callers so delete it.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arches: drop superfluous casts in nr_free_pages() callers · cc013a88
      Geert Uytterhoeven authored
      Commit 96177299 ("Drop free_pages()")
      modified nr_free_pages() to return 'unsigned long' instead of 'unsigned
      int'.  This made the casts to 'unsigned long' in most callers superfluous,
      so remove them.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Chris Zankel <zankel@tensilica.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: /proc/kcore should use vread · 73d7c33e
      KAMEZAWA Hiroyuki authored
      /proc/kcore has its own routine to access the vmalloc area.  It can be
      replaced with vread(), which lets /proc/kcore access the vmalloc area
      safely.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Mike Smith <scgtrp@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: fix vread/vwrite to be aware of holes · d0107eb0
      KAMEZAWA Hiroyuki authored
      vread/vwrite access the vmalloc area without checking whether a page is
      present.  In most cases, this works well.

      In the past, the only caller of get_vm_area() was IOREMAP, and there was
      no memory hole within a vm_struct's [addr...addr + size - PAGE_SIZE]
      (the -PAGE_SIZE is for a guard page).

      After the per-cpu-alloc patch, get_vm_area() is used to reserve a
      contiguous range of virtual addresses that is remapped _later_.  As a
      result, there tend to be holes in valid vmalloc areas on the vm_struct
      list, and skipping such holes (pages that are not mapped) is necessary.
      This patch updates vread/vwrite() to avoid memory holes.

      Routines which access the vmalloc area without knowing what an address
      is used for are
        - /proc/kcore
        - /dev/kmem

      kcore checks IOREMAP, /dev/kmem doesn't.  After this patch, IOREMAP is
      checked and /dev/kmem will avoid reading/writing it.  Fixes to /proc/kcore
      will be in the next patch in series.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Mike Smith <scgtrp@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
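The idea of reading a range while skipping unmapped pages can be modelled in user space. The following Python sketch is hypothetical and is not the kernel implementation: `mapped` is a dict standing in for the page mappings, `PAGE_SIZE` is an assumed value, and reporting holes as zero bytes is a choice made for this model.

```python
PAGE_SIZE = 4096  # assumed page size for the model


def read_skipping_holes(mapped, addr, count):
    """Read `count` bytes starting at `addr` from a sparse area.

    `mapped` maps a page number to that page's bytes; page numbers missing
    from the dict are holes. Holes are never dereferenced; this model
    reports them as zero bytes.
    """
    out = bytearray()
    end = addr + count
    while addr < end:
        page, offset = divmod(addr, PAGE_SIZE)
        # Stay within the current page so each page is checked separately.
        length = min(PAGE_SIZE - offset, end - addr)
        data = mapped.get(page)
        if data is None:
            out += bytes(length)  # hole: no access, zero-fill
        else:
            out += data[offset:offset + length]
        addr += length
    return bytes(out)
```

Walking the range one page at a time is what makes the check possible: a read that spans a mapped page, a hole and another mapped page is split into three pieces, and only the mapped pieces are copied.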
    • vmalloc: unmap vmalloc area after hiding it · dd32c279
      KAMEZAWA Hiroyuki authored
      The vmap area should be purged after the vm_struct is removed from the
      list, because vread/vwrite etc. believe the range is valid while it is on
      the vm_struct list.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Mike Smith <scgtrp@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • page-allocator: change migratetype for all pageblocks within a high-order page during __rmqueue_fallback · 2f66a68f
      Mel Gorman authored
      When there are no pages of a target migratetype free, the page allocator
      selects a high-order block of another migratetype to allocate from.  When
      the order of the page taken is greater than pageblock_order, all
      pageblocks within that high-order page should change migratetype so that
      pages are later freed to the correct free-lists.
      
      The current behaviour is that pageblocks change migratetype if the order
      being split matches the pageblock_order.  When pageblock_order <
      MAX_ORDER-1, ownership does not change correctly and pages are later
      freed to the incorrect list, which impacts fragmentation avoidance.
      
      This patch changes all pageblocks within the high-order page being split
      to the correct migratetype.  Without the patch, allocation success rates
      for hugepages under stress were about 59% of physical memory on x86-64.
      With the patch applied, this goes up to 65%.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
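The arithmetic behind "all pageblocks within that high-order page" can be made concrete. Because both sizes are powers of two, a page of order `order` spans `1 << (order - pageblock_order)` pageblocks. The Python sketch below is a hypothetical illustration, not the patch itself; the function name and the example orders used afterwards are assumptions.

```python
def pageblocks_to_retype(first_pfn, order, pageblock_order):
    """Return the starting pfn of every pageblock inside a 2**order page.

    Hypothetical helper: it only lists which pageblocks would need their
    migratetype changed when a page of `order` is taken for fallback.
    """
    if order < pageblock_order:
        # The page is smaller than one pageblock, so there is no whole
        # pageblock to retype in this model.
        return []
    pageblock_nr_pages = 1 << pageblock_order
    nr_pageblocks = 1 << (order - pageblock_order)
    return [first_pfn + i * pageblock_nr_pages for i in range(nr_pageblocks)]
```

For example, with an assumed `pageblock_order` of 9, an order-10 page starting at pfn 0 covers two pageblocks, at pfns 0 and 512. Changing only the first would leave the second with its old migratetype, which is the ownership problem the commit describes.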
    • mm: kmem_cache_create(): make it easier to catch NULL cache names · fe1ff49d
      Benjamin Herrenschmidt authored
      Right now, if you inadvertently pass NULL to kmem_cache_create() at boot
      time, it crashes much later after boot somewhere deep inside sysfs, which
      makes it far from obvious what is going on.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagemap clear_refs: modify to specify anon or mapped vma clearing · 398499d5
      Moussa A. Ba authored
      The patch makes the clear_refs more versatile in adding the option to
      select anonymous pages or file backed pages for clearing.  This addition
      has a measurable impact on user space application performance as it
      decreases the number of pagewalks in scenarios where one is only
      interested in a specific type of page (anonymous or file mapped).
      
      The patch adds anonymous and file backed filters to the clear_refs interface.
      
      echo 1 > /proc/PID/clear_refs resets the bits on all pages
      echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
      echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only
      
      Any other value is ignored.
      Signed-off-by: Moussa A. Ba <moussa.a.ba@gmail.com>
      Signed-off-by: Jared E. Hulbert <jaredeh@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
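The three values listed in the commit above can be wrapped in a small helper. This is a hypothetical Python sketch: the constant names, the `clear_refs` function and its `proc_root` parameter (present only so the helper can be pointed at a scratch directory) are assumptions of this example, while the values 1, 2 and 3 come from the commit message.

```python
import os

# Values taken from the commit message above; the names are hypothetical.
CLEAR_REFS_ALL = 1    # all pages
CLEAR_REFS_ANON = 2   # anonymous pages only
CLEAR_REFS_FILE = 3   # file backed pages only


def clear_refs(pid, mode, proc_root="/proc"):
    """Write `mode` to <proc_root>/<pid>/clear_refs and return the path."""
    if mode not in (CLEAR_REFS_ALL, CLEAR_REFS_ANON, CLEAR_REFS_FILE):
        # The commit says other values are ignored; this helper rejects
        # them up front so a typo does not pass silently.
        raise ValueError(f"unsupported clear_refs mode: {mode!r}")
    path = os.path.join(proc_root, str(pid), "clear_refs")
    with open(path, "w") as f:
        f.write(f"{mode}\n")
    return path
```

A caller interested only in anonymous memory would use `clear_refs(pid, CLEAR_REFS_ANON)`, which corresponds to `echo 2 > /proc/PID/clear_refs`.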
    • ksm: mremap use err from ksm_madvise · 7103ad32
      Hugh Dickins authored
      mremap move's use of ksm_madvise() was assuming -ENOMEM on failure,
      because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says
      -ENOMEM (letting madvise convert that to -EAGAIN), and can also say
      -ERESTARTSYS when signalled: so pass the error from ksm_madvise.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: unmerge is an origin of OOMs · 35451bee
      Hugh Dickins authored
      Just as the swapoff system call allocates many pages of RAM to various
      processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
      (unmerge) is liable to allocate many pages of RAM to various processes,
      perhaps triggering OOM; and each is normally run from a modest admin
      process (swapoff or shell), easily repeated until it succeeds.
      
      So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
      try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
      with that, to ask the OOM killer to kill them first, to prevent them from
      spawning more and more OOM kills.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: clean up obsolete references · a913e182
      Hugh Dickins authored
      A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
      longer applies, and it can be made private to ksm.c; there's no more
      reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: remove VM_MERGEABLE_FLAGS · 8314c4f2
      Hugh Dickins authored
      KSM originally stood for Kernel Shared Memory: but the kernel has long
      supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
      something else.  So we switched to saying "merge" instead of "share".
      
      But Chris Wright points out that this is confusing where mmap.c merges
      adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
      is_mergeable_vma() to let vmas be merged despite flags being different.
      
      Call it VMA_MERGE_DESPITE_FLAGS?  Perhaps, but at present it consists
      only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
      that directly, with a comment on it in is_mergeable_vma().
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: add some documentation · 7701c9c0
      Hugh Dickins authored
      Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: sysfs and defaults · 2ffd8679
      Hugh Dickins authored
      At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y
      to provide the /sys/kernel/mm/ksm files to tune and activate it.
      
      Make KSM depend on SYSFS?  Could do, but it might be better to provide
      some defaults so that KSM works out-of-the-box, ready for testers to
      madvise MADV_MERGEABLE, even without SYSFS.
      
      Though anyone serious is likely to want to retune the numbers to their
      taste once they have experience; and whether these settings ever reach
      2.6.32 can be discussed along the way.
      
      Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: fix deadlock with munlock in exit_mmap · 1c2fb7a4
      Andrea Arcangeli authored
      Rawhide users have reported hang at startup when cryptsetup is run: the
      same problem can be simply reproduced by running a program int main() {
      mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
      
      The problem is that exit_mmap() applies munlock_vma_pages_all() to
      clean up VM_LOCKED areas, and its current implementation (stupidly)
      tries to fault in absent pages, for example where PROT_NONE prevented
      them being faulted in when mlocking.  Whereas the "ksm: fix oom
      deadlock" patch, knowing there's a race by which KSM might try to fault
      in pages after exit_mmap() had finally zapped the range, backs out of
      such faults doing nothing when its ksm_test_exit() notices mm_users 0.
      
      So revert that part of "ksm: fix oom deadlock" which moved the
      ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
      and remove those ksm_test_exit() checks from the page fault paths, so
      allowing the munlocking to proceed without interference.
      
      ksm_exit, if there are rmap_items still chained on this mm slot, takes
      mmap_sem write side: so preventing KSM from working on an mm while
      exit_mmap runs.  And KSM will bail out as soon as it notices that
      mm_users is already zero, thanks to its internal ksm_test_exit checks.
      So that when a task is killed by OOM killer or the user, KSM will not
      indefinitely prevent it from running exit_mmap to release its memory.
      
      This does break a part of what "ksm: fix oom deadlock" was trying to
      achieve.  When unmerging KSM (echo 2 >/sys/kernel/mm/ksm/run), and even
      when ksmd itself has to cancel a KSM page, it is possible that the
      first OOM-kill victim would be the KSM process being faulted: then its
      memory won't be freed until a second victim has been selected (freeing
      memory for the unmerging fault to complete).
      
      But the OOM killer is already liable to kill a second victim once the
      intended victim's p->mm goes to NULL: so there's not much point in
      rejecting this KSM patch before fixing that OOM behaviour.  It is very
      much more important to allow KSM users to boot up, than to haggle over
      an unlikely and poorly supported OOM case.
      
      We also intend to fix munlocking to not fault pages: at which point
      this patch _could_ be reverted; though that would be controversial, so
      we hope to find a better solution.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Justin M. Forbes <jforbes@redhat.com>
      Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c2fb7a4
    • Hugh Dickins's avatar
      ksm: fix oom deadlock · 9ba69294
      Hugh Dickins authored
      There's a now-obvious deadlock in KSM's out-of-memory handling:
      imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
      trying to allocate a page to break KSM in an mm which becomes the
      OOM victim (quite likely in the unmerge case): it's killed and goes
      to exit, and hangs there waiting to acquire ksm_thread_mutex.
      
      Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
      though that made everything else: perhaps use mmap_sem somehow?
      And part of the answer lies in the comments on unmerge_ksm_pages:
      __ksm_exit should also leave all the rmap_item removal to ksmd.
      
      But there's a fundamental problem, that KSM relies upon mmap_sem to
      guarantee the consistency of the mm it's dealing with, yet exit_mmap
      tears down an mm without taking mmap_sem.  And bumping mm_users won't
      help at all, that just ensures that the pages the OOM killer assumes
      are on their way to being freed will not be freed.
      
      The best answer seems to be, to move the ksm_exit callout from just
      before exit_mmap, to the middle of exit_mmap: after the mm's pages
      have been freed (if the mmu_gather is flushed), but before its page
      tables and vma structures have been freed; and down_write,up_write
      mmap_sem there to serialize with KSM's own reliance on mmap_sem.
      
      But KSM then needs to be careful, whenever it downs mmap_sem, to
      check that the mm is not already exiting: there's a danger of using
      find_vma on a layout that's being torn apart, or writing into page
      tables which have been freed for reuse; and even do_anonymous_page
      and __do_fault need to check they're not being called by break_ksm
      to reinstate a pte after zap_pte_range has zapped that page table.
      
      Though it might be clearer to add an exiting flag, set while holding
      mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
      a zapped pte.  All we need is to check whether mm_users is 0 - but
      must remember that ksmd may detect that before __ksm_exit is reached.
      So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
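
      The mm_users check described above can be sketched in userspace C.
      This is only an illustration under stated assumptions: struct mock_mm,
      mock_ksm_test_exit and mock_scan_step are hypothetical stand-ins, not
      the kernel's definitions, and the locking is reduced to a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's mm_struct: only the user count. */
struct mock_mm {
	atomic_int mm_users;
};

/* The idea behind ksm_test_exit(): an mm with no users left is exiting. */
bool mock_ksm_test_exit(struct mock_mm *mm)
{
	return atomic_load(&mm->mm_users) == 0;
}

/*
 * Hypothetical scanner step.  The kernel would take mmap_sem before the
 * check; here we only show that the scanner backs out, doing nothing,
 * once it notices the mm is exiting.
 */
bool mock_scan_step(struct mock_mm *mm, int *pages_scanned)
{
	if (mock_ksm_test_exit(mm))
		return false;	/* mm is being torn down: touch nothing */
	(*pages_scanned)++;
	return true;
}
```

      The point of the design is that no extra "exiting" flag is needed: the
      user count alone tells the scanner when to stop.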
      
      __ksm_exit now has to leave clearing up the rmap_items to ksmd,
      since that needs ksm_thread_mutex; but it shifts the exiting mm to just
      after the ksm_scan cursor so that it will soon be dealt with.
      __ksm_enter raises mm_count to hold the mm_struct, and ksmd's exit
      processing (exactly like its processing when it finds all VM_MERGEABLEs
      unmapped) mmdrops it; a similar procedure applies to KSM_RUN_UNMERGE
      (which has stopped ksmd).
      
      But also give __ksm_exit a fast path: when there's no complication
      (no rmap_items attached to mm and it's not at the ksm_scan cursor),
      it can safely do all the exiting work itself.  This is not just an
      optimization: when ksmd is not running, the raised mm_count would
      otherwise leak mm_structs.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ba69294
    • Hugh Dickins's avatar
      ksm: distribute remove_mm_from_lists · cd551f97
      Hugh Dickins authored
      Do some housekeeping in ksm.c, to help make the next patch easier
      to understand: remove the function remove_mm_from_lists, distributing
      its code to its callsites scan_get_next_rmap_item and __ksm_exit.
      
      That turns out to be a win in scan_get_next_rmap_item: move its
      remove_trailing_rmap_items and cursor advancement up, and it becomes
      simpler than before.  __ksm_exit becomes messier, but will change
      again; and moving its remove_trailing_rmap_items up lets us strengthen
      the unstable tree item's age condition in remove_rmap_item_from_tree.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd551f97
    • Hugh Dickins's avatar
      ksm: fix endless loop on oom · d952b791
      Hugh Dickins authored
      break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should
      only be a problem for ksmd when a memory control group imposes limits
      (normally the OOM killer will kill others with an mm until it succeeds);
      but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
      do need to route the error (or kill) back to the caller (or sighandling).
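
      A minimal sketch of the loop shape this implies, assuming a
      hypothetical fault callback and made-up MOCK_FAULT_* bits (the real
      break_ksm works on vmas and pages, which are not modelled here): the
      loop retries until the fault reports a final outcome, and an OOM
      outcome is turned into -ENOMEM for the caller instead of being retried
      forever.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result bits, loosely modelled on VM_FAULT_* outcomes. */
enum {
	MOCK_FAULT_WRITE  = 1 << 0,	/* COW was broken: done */
	MOCK_FAULT_SIGBUS = 1 << 1,	/* hard failure: stop */
	MOCK_FAULT_OOM    = 1 << 2,	/* out of memory: stop and report */
};

typedef int (*mock_fault_fn)(void *ctx);

/* Retry the fault until it reaches a final state; route OOM to the caller. */
int mock_break_ksm(mock_fault_fn fault, void *ctx)
{
	int ret;

	do {
		ret = fault(ctx);
	} while (!(ret & (MOCK_FAULT_WRITE | MOCK_FAULT_SIGBUS | MOCK_FAULT_OOM)));

	return (ret & MOCK_FAULT_OOM) ? -ENOMEM : 0;
}

/* Example callback: asks for a few retries, then returns a planned result. */
struct mock_fault_plan {
	int retries;
	int final;
};

int mock_planned_fault(void *ctx)
{
	struct mock_fault_plan *plan = ctx;

	if (plan->retries > 0) {
		plan->retries--;
		return 0;	/* no final outcome yet: caller loops again */
	}
	return plan->final;
}
```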
      
      Test signal_pending in unmerge_ksm_pages, which could be a lengthy
      procedure if it has to spill into swap: returning -ERESTARTSYS so that
      trivial signals will restart but fatals will terminate (is that right?
      we do different things in different places in mm, none exactly this).
      
      unmerge_and_remove_all_rmap_items was forgetting to lock when going
      down the mm_list: fix that.  Whether it's successful or not, reset
      ksm_scan cursor to head; but only if it's successful, reset seqnr
      (shown in full_scans) - page counts will have gone down to zero.
      
      This patch leaves a significant OOM deadlock, but it's a good step
      on the way, and that deadlock is fixed in a subsequent patch.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d952b791
    • Hugh Dickins's avatar
      ksm: five little cleanups · 81464e30
      Hugh Dickins authored
      1. We don't use __break_cow entry point now: merge it into break_cow.
      2. remove_all_slot_rmap_items is just a special case of
         remove_trailing_rmap_items: use the latter instead.
      3. Extend comment on unmerge_ksm_pages and rmap_items.
      4. try_to_merge_two_pages should use try_to_merge_with_ksm_page
         instead of duplicating its code; and so swap them around.
      5. Comment on cmp_and_merge_page described last year's: update it.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81464e30
    • Hugh Dickins's avatar
      ksm: keep quiet while list empty · 6e158384
      Hugh Dickins authored
      ksm_scan_thread already sleeps in wait_event_interruptible until setting
      ksm_run activates it; but if there's nothing on its list to look at, i.e.
      nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking
      up system time and full_scans: ksmd_should_run added to check that too.
      
      And move the mutex_lock out around it: the new counts showed that when
      ksm_run is stopped, a little work often got done afterwards, because it
      had been read before taking the mutex.
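
      The wake-up condition can be pictured as a predicate over two pieces
      of state.  The sketch below is an assumption-laden userspace model:
      struct mock_ksm_state, MOCK_KSM_RUN_MERGE and mock_ksmd_should_run are
      hypothetical names, and the mm list is reduced to a length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_KSM_RUN_MERGE 1u	/* hypothetical "run" bit for merging */

/* Hypothetical snapshot of the state the scanner thread looks at. */
struct mock_ksm_state {
	unsigned int run;	/* value last written to the run control */
	size_t mm_list_len;	/* how many mms have registered areas */
};

/* Wake the scanner only if it is enabled and there is something to scan. */
bool mock_ksmd_should_run(const struct mock_ksm_state *s)
{
	return (s->run & MOCK_KSM_RUN_MERGE) && s->mm_list_len > 0;
}
```

      Evaluating such a predicate with the mutex already held is what avoids
      acting on a stale read of the run setting.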
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e158384
    • Hugh Dickins's avatar
      ksm: break cow once unshared · 26465d3e
      Hugh Dickins authored
      We kept agreeing not to bother about the unswappable shared KSM pages
      which later become unshared by others: observation suggests they're not
      a significant proportion.  But they are disadvantageous, and it is easier
      to break COW to replace them by swappable pages, than offer statistics
      to show that they don't matter; then we can stop worrying about them.
      
      Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on
      this pass: give them a good chance of getting into the unstable tree
      on the next pass, or back into the stable, by computing checksum now.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26465d3e
    • Hugh Dickins's avatar
      ksm: pages_unshared and pages_volatile · 473b0ce4
      Hugh Dickins authored
      The pages_shared and pages_sharing counts give a good picture of how
      successful KSM is at sharing; but no clue to how much wasted work it's
      doing to get there.  Add pages_unshared (count of unique pages waiting
      in the unstable tree, hoping to find a mate) and pages_volatile.
      
      pages_volatile is harder to define.  It includes those pages changing
      too fast to get into the unstable tree, but also whatever other edge
      conditions prevent a page getting into the trees: a high value may
      deserve investigation.  Don't try to calculate it from the various
      conditions: it's the total of rmap_items less those accounted for.
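
      As a worked example of that subtraction, assuming "those accounted
      for" means exactly the items counted in pages_shared, pages_sharing
      and pages_unshared (struct mock_ksm_counts and mock_pages_volatile are
      hypothetical names):

```c
#include <assert.h>

/* Hypothetical snapshot of the counters, all taken under one lock. */
struct mock_ksm_counts {
	long rmap_items;	/* every page slot KSM is tracking */
	long pages_shared;	/* nodes in the stable tree */
	long pages_sharing;	/* further slots hanging off those nodes */
	long pages_unshared;	/* unique pages waiting in the unstable tree */
};

/* Whatever is tracked but in neither tree is counted as volatile. */
long mock_pages_volatile(const struct mock_ksm_counts *c)
{
	return c->rmap_items - c->pages_shared - c->pages_sharing
	       - c->pages_unshared;
}
```

      With 1000 tracked items, of which 100 are shared, 500 sharing and 300
      unshared, the remaining 100 are reported as volatile.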
      
      Also show full_scans: the number of completed scans of everything
      registered in the mm list.
      
      The locking for all these counts is simply ksm_thread_mutex.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      473b0ce4
    • Hugh Dickins's avatar
      ksm: move pages_sharing updates · e178dfde
      Hugh Dickins authored
      The pages_shared count is incremented and decremented when adding a node
      to and removing a node from the stable tree: easy to understand.  But the
      pages_sharing count was hard to follow, being adjusted in various places:
      increment and decrement it when adding to and removing from the stable tree.
      
      And the pages_sharing variable used to include the pages_shared, then those
      were subtracted when shown in the pages_sharing sysfs file: now keep it as
      an exclusive count of leaves hanging off the stable tree nodes, throughout.
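
      A small model of the two counters as exclusive counts, with
      hypothetical helper names (the real updates live in ksm.c's tree
      insertion and removal paths, which are not reproduced here):

```c
#include <assert.h>

/* Hypothetical pair of counters, kept separate at all times. */
struct mock_ksm_stats {
	unsigned long pages_shared;	/* stable tree nodes */
	unsigned long pages_sharing;	/* leaves hanging off those nodes */
};

/* A node enters or leaves the stable tree. */
void mock_stable_node_added(struct mock_ksm_stats *s)   { s->pages_shared++; }
void mock_stable_node_removed(struct mock_ksm_stats *s) { s->pages_shared--; }

/* A further page slot starts or stops sharing an existing node. */
void mock_leaf_added(struct mock_ksm_stats *s)   { s->pages_sharing++; }
void mock_leaf_removed(struct mock_ksm_stats *s) { s->pages_sharing--; }
```

      One KSM page mapped at three places then shows up as pages_shared 1
      and pages_sharing 2, with no subtraction needed when the values are
      displayed.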
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e178dfde
    • Hugh Dickins's avatar
      ksm: rename kernel_pages_allocated · b4028260
      Hugh Dickins authored
      We're not implementing swapping of KSM pages in its first release;
      but when that follows, "kernel_pages_allocated" will be a very poor
      name for the sysfs file showing number of nodes in the stable tree:
      rename that to "pages_shared" throughout.
      
      But we already have a "pages_shared", counting those page slots
      sharing the shared pages: first rename that to... "pages_sharing".
      
      What will become of "max_kernel_pages" when the pages shared can
      be swapped?  I guess it will just be removed, so keep that name.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4028260
    • Izik Eidus's avatar
      ksm: change ksm nice level to be 5 · 339aa624
      Izik Eidus authored
      ksm should try not to disturb other tasks as much as possible.
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      339aa624
    • Izik Eidus's avatar
      ksm: change copyright message · 36b2528d
      Izik Eidus authored
      Adding Hugh Dickins into the authors list.
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36b2528d
    • Hugh Dickins's avatar
      ksm: prevent mremap move poisoning · 1ff82995
      Hugh Dickins authored
      KSM's scan allows for user pages to be COWed or unmapped at any time,
      without requiring any notification.  But its stable tree does assume that
      when it finds a KSM page where it placed a KSM page, then it is the same
      KSM page that it placed there.
      
      mremap move could break that assumption: if an area containing a KSM page
      was unmapped, then an area containing a different KSM page was moved with
      mremap into the place of the original, before KSM's scan came around to
      notice.  That could then poison a node of the stable tree, so that memcmps
      would "lie" and upset the ordering of the tree.
      
      Probably no one will ever need mremap move on a VM_MERGEABLE area; except
      that prohibiting it would make trouble for schemes in which we try making
      everything VM_MERGEABLE, e.g. for testing: an mremap which normally works
      would then fail mysteriously.
      
      There's no need to go to any trouble, such as re-sorting KSM's list of
      rmap_items to match the new layout: simply unmerge the area to COW all its
      KSM pages before moving, but leave VM_MERGEABLE on so that they're
      remerged later.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ff82995
    • Izik Eidus's avatar
      ksm: Kernel SamePage Merging · 31dbd01f
      Izik Eidus authored
      Ksm is code that allows merging of identical pages between one or more
      applications, in a way invisible to the applications that use it.  Pages
      that are merged are marked as read-only, then COWed when any application
      tries to change them.
      
      Whereas fork() allows sharing anonymous pages between parent and child,
      ksm can share anonymous pages between unrelated processes.
      
      Ksm works by walking over the memory pages of the applications it scans,
      in order to find identical pages.  It uses two sorted data structures,
      called the stable and unstable trees, to locate identical pages
      efficiently.
      
      When ksm finds two identical pages, it marks them as readonly and merges
      them into a single page.  After the pages have been marked as readonly and
      merged into one, Linux treats them as normal copy-on-write pages, copying
      to a fresh anonymous page if write access is required later.
      
      Ksm scans and merges anonymous pages only in those memory areas that have
      been registered with it by madvise(addr, length, MADV_MERGEABLE).
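
      A minimal userspace sketch of registering an area, assuming a Linux
      libc that exposes MADV_MERGEABLE; map_mergeable is a hypothetical
      helper name.  On kernels built without KSM the madvise call is
      expected to fail (typically with EINVAL), so the helper reports that
      separately rather than treating it as fatal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `length` bytes of private anonymous memory and ask KSM to consider
 * it for merging.  Returns the mapping, or NULL if mmap itself failed.
 * *merge_err is set to 0 on success, or to the errno from madvise.
 */
void *map_mergeable(size_t length, int *merge_err)
{
	void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return NULL;

	*merge_err = (madvise(addr, length, MADV_MERGEABLE) == 0) ? 0 : errno;
	return addr;
}
```

      The area stays usable either way; registration only makes its pages
      candidates for the scanner.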
      
      The ksm scanner is controlled by sysfs files in /sys/kernel/mm/ksm/:
      
      max_kernel_pages - the maximum number of unswappable kernel pages
                         which may be allocated by ksm (0 for unlimited).
      
      kernel_pages_allocated - how many ksm pages are currently allocated,
                               sharing identical content between different
                               processes (pages unswappable in this release).
      
      pages_shared - how many pages have been saved by sharing with ksm pages
                     (kernel_pages_allocated being excluded from this count).
      
      pages_to_scan - how many pages ksm should scan before sleeping.
      
      sleep_millisecs - how many milliseconds ksm should sleep between scans.
      
      run - write 0 to disable ksm, read 0 while ksm is disabled (default),
            write 1 to run ksm, read 1 while ksm is running,
            write 2 to disable ksm and unmerge all its pages.
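
      These files hold plain decimal numbers, so they can be driven from C
      with ordinary stdio.  The helpers below are a sketch under
      assumptions: ksm_write_knob and ksm_read_knob are hypothetical names,
      and the directory is passed in so the same code can be pointed at
      KSM_SYSFS_DIR or at a scratch directory.  Writing the real files
      normally requires root.

```c
#include <assert.h>
#include <stdio.h>

#define KSM_SYSFS_DIR "/sys/kernel/mm/ksm"	/* directory named above */

/* Write a decimal value to <dir>/<name>.  Returns 0 on success, -1 on error. */
int ksm_write_knob(const char *dir, const char *name, unsigned long value)
{
	char path[256];
	int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	FILE *f;
	int ok;

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path would not fit */
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;		/* sysfs may report errors only on flush */
	return ok ? 0 : -1;
}

/* Read a decimal value from <dir>/<name>.  Returns 0 on success, -1 on error. */
int ksm_read_knob(const char *dir, const char *name, unsigned long *value)
{
	char path[256];
	int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	FILE *f;
	int ok;

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%lu", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

      For instance, ksm_write_knob(KSM_SYSFS_DIR, "run", 1) would start the
      scanner, and ksm_read_knob(KSM_SYSFS_DIR, "pages_shared", &n) would
      fetch one of the counters.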
      
      Includes contributions by Andrea Arcangeli, Chris Wright and Hugh Dickins.
      
      [hugh.dickins@tiscali.co.uk: fix rare page leak]
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31dbd01f
    • Hugh Dickins's avatar
      ksm: identify PageKsm pages · 9a840895
      Hugh Dickins authored
      KSM will need to identify its kernel merged pages unambiguously, and
      /proc/kpageflags will probably like to do so too.
      
      Since KSM will only be substituting anonymous pages, statistics are best
      preserved by making a PageKsm page a special PageAnon page: one with no
      anon_vma.
      
      But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near
      PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons.
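
      The "PageAnon with no anon_vma" idea can be modelled in a few lines.
      The sketch assumes the anonymous flag is carried in the low bit of the
      page's mapping word, with the anon_vma pointer in the remaining bits;
      struct mock_page, MOCK_PAGE_MAPPING_ANON and the helpers are
      hypothetical userspace stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_PAGE_MAPPING_ANON ((uintptr_t)1)	/* assumed low-bit flag */

struct mock_anon_vma { int unused; };		/* placeholder type */
struct mock_page { uintptr_t mapping; };	/* 0 means no mapping set */

/* An anonymous page has the flag bit set in its mapping word. */
bool mock_page_anon(const struct mock_page *p)
{
	return (p->mapping & MOCK_PAGE_MAPPING_ANON) != 0;
}

/* A KSM page is anonymous but points at no anon_vma at all. */
bool mock_page_ksm(const struct mock_page *p)
{
	return p->mapping == MOCK_PAGE_MAPPING_ANON;
}

void mock_set_anon(struct mock_page *p, struct mock_anon_vma *av)
{
	p->mapping = (uintptr_t)av | MOCK_PAGE_MAPPING_ANON;
}

void mock_set_ksm(struct mock_page *p)
{
	p->mapping = MOCK_PAGE_MAPPING_ANON;
}
```

      Because a KSM page still passes the anonymous test, counters keyed on
      anonymous pages keep working, while the stricter test singles out the
      merged pages.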
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Chris Wright <chrisw@redhat.com>
      Signed-off-by: Izik Eidus <ieidus@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a840895