1. 26 Jul, 2011 40 commits
    • m68k, exec: remove redundant set_fs(USER_DS) · b7de1100
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so those calls to
      set_fs(USER_DS) are redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7de1100
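      A minimal sketch of the pattern this series removes (not the literal
      m68k diff): the arch start_thread() no longer needs its own
      set_fs(USER_DS), because fs/exec.c already resets the address limit in
      flush_old_exec() for every exec.

          /* rough shape of an affected arch's start_thread(): */
          static inline void start_thread(struct pt_regs *regs,
                                          unsigned long pc, unsigned long usp)
          {
                  set_fs(USER_DS);        /* redundant: flush_old_exec() did it */
                  regs->pc = pc;
                  /* remaining arch-specific register/stack setup ... */
          }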
    • m32r, exec: remove redundant set_fs(USER_DS) · f7960625
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so this
      set_fs(USER_DS) is redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7960625
    • alpha, exec: remove redundant set_fs(USER_DS) · 6fa80900
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so this
      set_fs(USER_DS) is redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fa80900
    • h8300, exec: remove redundant set_fs(USER_DS) · 023f21b9
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so those calls to
      set_fs(USER_DS) are redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      023f21b9
    • writeback: account NR_WRITTEN at IO completion time · 99b12e3d
      Wu Fengguang authored
      NR_WRITTEN is currently accounted at block IO enqueue time, which does
      not match the common understanding of "written".  This moves NR_WRITTEN
      accounting to IO completion time and makes it more consistent with
      BDI_WRITTEN, which is used for bandwidth estimation.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99b12e3d
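      A hedged sketch of the idea (simplified, not the verbatim patch): the
      NR_WRITTEN bump moves out of the submission path and into the
      writeback-completion path, next to where NR_WRITEBACK is cleared.

          /* mm/page-writeback.c, rough shape after the change: */
          int test_clear_page_writeback(struct page *page)
          {
                  int ret = TestClearPageWriteback(page);

                  if (ret) {      /* writeback of this page just completed */
                          dec_zone_page_state(page, NR_WRITEBACK);
                          inc_zone_page_state(page, NR_WRITTEN);
                  }
                  return ret;
          }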
    • tmpfs: simplify unuse and writepage · 48f170fb
      Hugh Dickins authored
      shmem_unuse_inode() and shmem_writepage() contain a little code to cope
      with pages inserted independently into the filecache, probably by a
      filesystem stacked on top of tmpfs, then fed to its ->readpage() or
      ->writepage().
      
      Unionfs was indeed experimenting with working in that way three years ago,
      but I find no current examples: nowadays the stacking filesystems use vfs
      interfaces to the lower filesystem.
      
      It's now illegal: remove most of that code, adding some WARN_ON_ONCEs.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Erez Zadok <ezk@fsl.cs.sunysb.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48f170fb
    • tmpfs: simplify filepage/swappage · 27ab7006
      Hugh Dickins authored
      We can now simplify shmem_getpage_gfp(): there is no longer a dilemma of
      filepage passed in via shmem_readpage(), then swappage found, which must
      then be copied over to it.
      
      Although at first it's tempting to replace the **pagep arg by returning
      struct page *, that makes a mess of IS_ERR_OR_NULL(page)s in all the
      callers, so leave as is.
      
      Insert BUG_ON(!PageUptodate) when we find and lock page: some of the
      complication came from uninitialized pages inserted into filecache prior
      to readpage; but now we're in control, and only release pagelock on
      filecache once it's uptodate (if an error occurs in reading back from
      swap, the page remains in swapcache, never moved to filecache).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ab7006
    • tmpfs: simplify prealloc_page · e83c32e8
      Hugh Dickins authored
      The prealloc_page handling in shmem_getpage_gfp() is unnecessarily
      complicated: first simplify that before going on to filepage/swappage.
      
      That's right, don't report ENOMEM when the preallocation fails: we may or
      may not need the page.  But simply report ENOMEM once we find we do need
      it, instead of dropping lock, repeating allocation, unwinding on failure
      etc.  And leave the out label on the fast path, don't goto.
      
      Fix something that looks like a bug but turns out not to be: set
      PageSwapBacked on prealloc_page before its mem_cgroup_cache_charge(), as
      the removed case was doing.  That's important before adding to LRU
      (determines which LRU the page goes on), and does affect which path it
      takes through memcontrol.c, but in the end MEM_CGROUP_CHANGE_TYPE_SHMEM
      is handled no differently from CACHE.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Shaohua Li <shaohua.li@intel.com>
      Cc: "Zhang, Yanmin" <yanmin.zhang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e83c32e8
    • tmpfs: remove_shmem_readpage · 9276aad6
      Hugh Dickins authored
      Remove that pernicious shmem_readpage() at last: the things we needed it
      for (splice, loop, sendfile, i915 GEM) are now fully taken care of by
      shmem_file_splice_read() and shmem_read_mapping_page_gfp().
      
      This removal clears the way for a simpler shmem_getpage_gfp(), since page
      is never passed in; but leave most of that cleanup until after.
      
      sys_readahead() and sys_fadvise(POSIX_FADV_WILLNEED) will now EINVAL,
      instead of unexpectedly trying to read ahead on tmpfs: if that proves to
      be an issue for someone, then we can either arrange for them to return
      success instead, or try to implement async readahead on tmpfs.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9276aad6
    • tmpfs: pass gfp to shmem_getpage_gfp · 68da9f05
      Hugh Dickins authored
      Make shmem_getpage() a wrapper, passing mapping_gfp_mask() down to
      shmem_getpage_gfp(), which in turn passes gfp down to shmem_swp_alloc().
      
      Change shmem_read_mapping_page_gfp() to use shmem_getpage_gfp() in the
      CONFIG_SHMEM case; but leave tiny !SHMEM using read_cache_page_gfp().
      
      Add a BUG_ON() in case anyone happens to call this on a non-shmem mapping;
      though we might later want to let that case route to read_cache_page_gfp().
      
      It annoys me to have these two almost-redundant args, gfp and fault_type:
      I can't find a better way; but initialize fault_type only in shmem_fault().
      
      Note that before, read_cache_page_gfp() was allocating i915_gem's pages
      with __GFP_NORETRY as intended; but the corresponding swap vector pages
      got allocated without it, leaving a small possibility of OOM.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68da9f05
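      The resulting wrapper is roughly (simplified sketch):

          /* mm/shmem.c: shmem_getpage() just supplies the mapping's gfp mask */
          static inline int shmem_getpage(struct inode *inode, pgoff_t index,
                          struct page **pagep, enum sgp_type sgp, int *fault_type)
          {
                  return shmem_getpage_gfp(inode, index, pagep, sgp,
                                  mapping_gfp_mask(inode->i_mapping), fault_type);
          }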
    • tmpfs: refine shmem_file_splice_read · 71f0e07a
      Hugh Dickins authored
      Tidy up shmem_file_splice_read():
      
      Remove readahead: okay, we could implement shmem readahead on swap,
      but have never done so before, swap being the slow exceptional path.
      
      Use shmem_getpage() instead of find_or_create_page() plus ->readpage().
      
      Remove several comments: sorry, I found them more distracting than
      helpful, and this will not be the reference version of splice_read().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71f0e07a
    • tmpfs: clone shmem_file_splice_read() · 708e3508
      Hugh Dickins authored
      Copy __generic_file_splice_read() and generic_file_splice_read() from
      fs/splice.c to shmem_file_splice_read() in mm/shmem.c.  Make
      page_cache_pipe_buf_ops and spd_release_page() accessible to it.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      708e3508
    • mm/futex: fix futex writes on archs with SW tracking of dirty & young · 2efaca92
      Benjamin Herrenschmidt authored
      I haven't reproduced it myself but the fail scenario is that on such
      machines (notably ARM and some embedded powerpc), if you manage to hit
      that futex path on a writable page whose dirty bit has gone from the PTE,
      you'll livelock inside the kernel from what I can tell.
      
      It will go into a loop of trying the atomic access, failing, trying gup
      to "fix it up", getting success from gup, going back to the atomic
      access, failing again because dirty wasn't fixed, etc...
      
      So I think you essentially hang in the kernel.
      
      The scenario is probably rare-ish because the affected architectures are
      embedded and tend to not swap much (if at all), so we probably rarely hit
      the case where dirty is missing or young is missing, but I think Shan has
      a piece of SW that can reliably reproduce it using a shared writable
      mapping & fork or something like that.
      
      On archs that use SW tracking of dirty & young, a page without dirty is
      effectively mapped read-only, and a page without young is effectively
      inaccessible in the PTE.
      
      Additionally, some architectures might lazily flush the TLB when relaxing
      write protection (by doing only a local flush), and expect a fault to
      invalidate the stale entry if it's still present on another processor.
      
      The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
      "fix it up" by causing get_user_pages() which would then be equivalent to
      taking the fault.
      
      However that isn't the case.  get_user_pages() will not call
      handle_mm_fault() in the case where the PTE seems to have the right
      permissions, regardless of the dirty and young state.  It will eventually
      update those bits ...  in the struct page, but not in the PTE.
      
      Additionally, it will not handle the lazy TLB flushing that can be
      required by some architectures in the fault case.
      
      Basically, gup is the wrong interface for the job.  The patch provides a
      more appropriate one which boils down to just calling handle_mm_fault()
      since what we are trying to do is simulate a real page fault.
      
      The futex code currently attempts to write to user memory within a
      pagefault disabled section, and if that fails, tries to fix it up using
      get_user_pages().
      
      This doesn't work on archs where the dirty and young bits are maintained
      by software, since they will gate access permission in the TLB, and will
      not be updated by gup().
      
      In addition, there's an expectation on some archs that a spurious write
      fault triggers a local TLB flush, and that is missing from the picture as
      well.
      
      I decided that adding those "features" to gup() would be too much for this
      already too complex function, and instead added a new simpler
      fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
      which the futex code can call.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reported-by: Shan Hai <haishan.bai@gmail.com>
      Tested-by: Shan Hai <haishan.bai@gmail.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Darren Hart <darren.hart@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2efaca92
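      A hedged sketch of the new helper and its futex call site as described
      above (simplified; the real code handles more of the fault-return cases
      and the task fault statistics):

          /* simulate a real (write) fault at one user address */
          int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long address, unsigned int fault_flags)
          {
                  struct vm_area_struct *vma;
                  int ret;

                  vma = find_vma(mm, address);
                  if (!vma || address < vma->vm_start)
                          return -EFAULT;

                  ret = handle_mm_fault(mm, vma, address, fault_flags);
                  if (ret & VM_FAULT_OOM)
                          return -ENOMEM;
                  if (ret & VM_FAULT_ERROR)
                          return -EFAULT;
                  return 0;
          }

          /* kernel/futex.c: used instead of get_user_pages() after -EFAULT */
          static int fault_in_user_writeable(u32 __user *uaddr)
          {
                  struct mm_struct *mm = current->mm;
                  int ret;

                  down_read(&mm->mmap_sem);
                  ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
                                         FAULT_FLAG_WRITE);
                  up_read(&mm->mmap_sem);

                  return ret < 0 ? ret : 0;
          }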
    • mm: remove useless rcu lock-unlock from mapping_tagged() · 72c47832
      Konstantin Khlebnikov authored
      radix_tree_tagged() is lockless - it reads from a member of the radix-tree
      root node.  It does not require any protection.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72c47832
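      The resulting function is tiny; roughly (sketch):

          /* mm/page-writeback.c: no rcu_read_lock()/unlock() pair needed,
           * radix_tree_tagged() only reads a word in the tree root. */
          int mapping_tagged(struct address_space *mapping, int tag)
          {
                  return radix_tree_tagged(&mapping->page_tree, tag);
          }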
    • mm: page allocator: reconsider zones for allocation after direct reclaim · 76d3fbf8
      Mel Gorman authored
      With zone_reclaim_mode enabled, it's possible for zones to be considered
      full in the zonelist_cache so they are skipped in the future.  If the
      process enters direct reclaim, the ZLC may still consider zones to be full
      even after reclaiming pages.  Reconsider all zones for allocation if
      direct reclaim returns successfully.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76d3fbf8
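      A simplified sketch of the mechanism (not the verbatim patch): once
      direct reclaim reports progress, the zonelist cache's "zone full" hints
      are wiped so the retry reconsiders every zone.

          /* mm/page_alloc.c: forget all "zone full" hints */
          static void zlc_clear_zones_full(struct zonelist *zonelist)
          {
                  struct zonelist_cache *zlc = zonelist->zlcache_ptr;

                  if (!zlc)
                          return;
                  bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
          }

          /* in __alloc_pages_direct_reclaim(), after progress was made: */
          if (unlikely(!(*did_some_progress)))
                  return NULL;
          if (NUMA_BUILD)
                  zlc_clear_zones_full(zonelist); /* reconsider all zones */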
    • mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim · cd38b115
      Mel Gorman authored
      There have been a small number of complaints about significant stalls
      while copying large amounts of data on NUMA machines reported on a
      distribution bugzilla.  In these cases, zone_reclaim was enabled by
      default due to large NUMA distances.  In general, the complaints have not
      been about the workload itself unless it was a file server (in which case
      the recommendation was to disable zone_reclaim).
      
      The stalls are mostly due to significant amounts of time spent scanning
      the preferred zone for pages to free.  After a failure, it might fall back
      to another node (as zonelists are often node-ordered rather than
      zone-ordered) but stall quickly again when the next allocation attempt
      occurs.  In bad cases, each page allocated results in a full scan of the
      preferred zone.
      
      Patch 1 checks the preferred zone for recent allocation failure
              which is particularly important if zone_reclaim has failed
              recently.  This avoids rescanning the zone in the near future and
              instead falling back to another node.  This may hurt node locality
              in some cases but a failure to zone_reclaim is more expensive than
              a remote access.
      
      Patch 2 clears the zlc information after direct reclaim.
              Otherwise, zone_reclaim can mark zones full, direct reclaim can
              reclaim enough pages but the zone is still not considered for
              allocation.
      
      This was tested on a 24-thread 2-node x86_64 machine.  The tests were
      focused on large amounts of IO.  All tests were bound to the CPUs on
      node-0 to avoid disturbances due to processes being scheduled on different
      nodes.  The kernels tested are
      
      3.0-rc6-vanilla		Vanilla 3.0-rc6
      zlcfirst		Patch 1 applied
      zlcreconsider		Patches 1+2 applied
      
      FS-Mark
      ./fs_mark  -d  /tmp/fsmark-10813  -D  100  -N  5000  -n  208  -L  35  -t  24  -S0  -s  524288
                      fsmark-3.0-rc6       3.0-rc6       		3.0-rc6
                         vanilla          zlcfirst        zlcreconsider
      Files/s  min          54.90 ( 0.00%)       49.80 (-10.24%)       49.10 (-11.81%)
      Files/s  mean        100.11 ( 0.00%)      135.17 (25.94%)      146.93 (31.87%)
      Files/s  stddev       57.51 ( 0.00%)      138.97 (58.62%)      158.69 (63.76%)
      Files/s  max         361.10 ( 0.00%)      834.40 (56.72%)      802.40 (55.00%)
      Overhead min       76704.00 ( 0.00%)    76501.00 ( 0.27%)    77784.00 (-1.39%)
      Overhead mean    1485356.51 ( 0.00%)  1035797.83 (43.40%)  1594680.26 (-6.86%)
      Overhead stddev  1848122.53 ( 0.00%)   881489.88 (109.66%)  1772354.90 ( 4.27%)
      Overhead max     7989060.00 ( 0.00%)  3369118.00 (137.13%) 10135324.00 (-21.18%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)        501.49    493.91    499.93
      Total Elapsed Time (seconds)               2451.57   2257.48   2215.92
      
      MMTests Statistics: vmstat
      Page Ins                                       46268       63840       66008
      Page Outs                                   90821596    90671128    88043732
      Swap Ins                                           0           0           0
      Swap Outs                                          0           0           0
      Direct pages scanned                        13091697     8966863     8971790
      Kswapd pages scanned                               0     1830011     1831116
      Kswapd pages reclaimed                             0     1829068     1829930
      Direct pages reclaimed                      13037777     8956828     8648314
      Kswapd efficiency                               100%         99%         99%
      Kswapd velocity                                0.000     810.643     826.346
      Direct efficiency                                99%         99%         96%
      Direct velocity                             5340.128    3972.068    4048.788
      Percentage direct scans                         100%         83%         83%
      Page writes by reclaim                             0           3           0
      Slabs scanned                                 796672      720640      720256
      Direct inode steals                          7422667     7160012     7088638
      Kswapd inode steals                                0     1736840     2021238
      
      Test completes far faster with a large increase in the number of files
      created per second.  Standard deviation is high as a small number of
      iterations were much higher than the mean.  The number of pages scanned by
      zone_reclaim is reduced and kswapd is used for more work.
      
      LARGE DD
                     		3.0-rc6       3.0-rc6       3.0-rc6
                         	vanilla     zlcfirst     zlcreconsider
      download tar           59 ( 0.00%)   59 ( 0.00%)   55 ( 7.27%)
      dd source files       527 ( 0.00%)  296 (78.04%)  320 (64.69%)
      delete source          36 ( 0.00%)   19 (89.47%)   20 (80.00%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)        125.03    118.98    122.01
      Total Elapsed Time (seconds)                624.56    375.02    398.06
      
      MMTests Statistics: vmstat
      Page Ins                                     3594216      439368      407032
      Page Outs                                   23380832    23380488    23377444
      Swap Ins                                           0           0           0
      Swap Outs                                          0         436         287
      Direct pages scanned                        17482342    69315973    82864918
      Kswapd pages scanned                               0      519123      575425
      Kswapd pages reclaimed                             0      466501      522487
      Direct pages reclaimed                       5858054     2732949     2712547
      Kswapd efficiency                               100%         89%         90%
      Kswapd velocity                                0.000    1384.254    1445.574
      Direct efficiency                                33%          3%          3%
      Direct velocity                            27991.453  184832.737  208171.929
      Percentage direct scans                         100%         99%         99%
      Page writes by reclaim                             0        5082       13917
      Slabs scanned                                  17280       29952       35328
      Direct inode steals                           115257     1431122      332201
      Kswapd inode steals                                0           0      979532
      
      This test downloads a large tarfile and copies it with dd a number of
      times - similar to the most recent bug report I've dealt with.  Time to
      completion is reduced.  The number of pages scanned directly is still
      disturbingly high with a low efficiency but this is likely due to the
      number of dirty pages encountered.  The figures could probably be improved
      with more work around how kswapd is used and how dirty pages are handled
      but that is separate work and this result is significant on its own.
      
      Streaming Mapped Writer
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)        124.47    111.67    112.64
      Total Elapsed Time (seconds)               2138.14   1816.30   1867.56
      
      MMTests Statistics: vmstat
      Page Ins                                       90760       89124       89516
      Page Outs                                  121028340   120199524   120736696
      Swap Ins                                           0          86          55
      Swap Outs                                          0           0           0
      Direct pages scanned                       114989363    96461439    96330619
      Kswapd pages scanned                        56430948    56965763    57075875
      Kswapd pages reclaimed                      27743219    27752044    27766606
      Direct pages reclaimed                         49777       46884       36655
      Kswapd efficiency                                49%         48%         48%
      Kswapd velocity                            26392.541   31363.631   30561.736
      Direct efficiency                                 0%          0%          0%
      Direct velocity                            53780.091   53108.759   51581.004
      Percentage direct scans                          67%         62%         62%
      Page writes by reclaim                           385         122        1513
      Slabs scanned                                  43008       39040       42112
      Direct inode steals                                0          10           8
      Kswapd inode steals                              733         534         477
      
      This test just creates a large file mapping and writes to it linearly.
      Time to completion is again reduced.
      
      The gains are mostly down to two things.  In many cases, there is less
      scanning as zone_reclaim simply gives up faster due to recent failures.
      The second reason is that memory is used more efficiently.  Instead of
      scanning the preferred zone every time, the allocator falls back to
      another zone and uses it instead, improving overall memory utilisation.
      
      This patch: initialise ZLC for first zone eligible for zone_reclaim.
      
      The zonelist cache (ZLC) is used among other things to record if
      zone_reclaim() failed for a particular zone recently.  The intention is to
      avoid the high cost of scanning extremely long zonelists, or of scanning
      uselessly within the zone.
      
      Currently the zonelist cache is setup only after the first zone has been
      considered and zone_reclaim() has been called.  The objective was to avoid
      a costly setup but zone_reclaim is itself quite expensive.  If it is
      failing regularly such as the first eligible zone having mostly mapped
      pages, the cost in scanning and allocation stalls is far higher than the
      ZLC initialisation step.
      
      This patch initialises the ZLC before the first eligible zone calls
      zone_reclaim().  Once initialised, it is checked whether the zone failed
      zone_reclaim recently.  If it has, the zone is skipped.  As the first
      zone is now being checked, additional care has to be taken about zones
      marked full.  A zone can be marked "full" merely because it does not
      have enough unmapped pages for zone_reclaim, but that is excessive, as
      direct reclaim or kswapd may succeed where zone_reclaim fails.  So only
      mark a zone "full" after zone_reclaim fails because it could not reclaim
      enough pages after scanning.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd38b115
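      A rough sketch of the allocator-side check this adds (heavily simplified
      from get_page_from_freelist(); variable names are illustrative): consult
      the zonelist cache before paying for zone_reclaim(), even on the first
      eligible zone.

          /* before attempting zone_reclaim() on this zone: */
          if (NUMA_BUILD && !did_zlc_setup) {
                  allowednodes = zlc_setup(zonelist, alloc_flags);
                  zlc_active = 1;
                  did_zlc_setup = 1;
          }
          if (NUMA_BUILD && zlc_active &&
              !zlc_zone_worth_trying(zonelist, z, allowednodes))
                  continue;       /* zone_reclaim failed here recently: skip it */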
    • mm: preallocate page before lock_page() at filemap COW · 1d65f86d
      KAMEZAWA Hiroyuki authored
      Currently we keep the faulted page locked throughout the whole
      __do_fault() call (except for the page_mkwrite code path) after calling
      the file system's fault code.  If we do early COW, we allocate a new
      page which has to be charged to a memcg (mem_cgroup_newpage_charge).

      This function, however, might block for an unbounded amount of time if
      the memcg oom killer is disabled or a fork-bomb is running, because the
      only way out of the OOM situation is either an external event or an
      OOM-situation fix.

      In the end we are keeping the faulted page locked and blocking other
      processes from faulting it in, which is not good at all because we are
      potentially punishing an unrelated process for an OOM condition in a
      different group (I have seen a stuck system because of ld-2.11.1.so
      being locked).
      
      We can test this easily.
      
       % cgcreate -g memory:A
       % cgset -r memory.limit_in_bytes=64M A
       % cgset -r memory.memsw.limit_in_bytes=64M A
       % cd kernel_dir; cgexec -g memory:A make -j
      
      Then the whole system will be live-locked until you kill 'make -j' by
      hand (or push reboot...).  This is because some important pages in a
      shared library are locked.
      
      On reflection, the new page does not need to be allocated with
      lock_page() held.  And the usual page allocation may dive into a long
      memory-reclaim loop while holding lock_page(), which can cause very
      long latency.
      
      There are 3 ways.
        1. do allocation/charge before lock_page()
           Pros. - simple and can handle page allocation in the same manner.
                   This will reduce holding time of lock_page() in general.
           Cons. - we do page allocation even if ->fault() returns error.
      
        2. do charge after unlock_page(). Even if charge fails, it's just OOM.
           Pros. - no impact to non-memcg path.
           Cons. - implementation requires special care of LRU and we need to modify
                   page_add_new_anon_rmap()...
      
        3. do unlock->charge->lock again method.
           Pros. - no impact to non-memcg path.
           Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to release
                   lock and get it again...
      
      This patch moves "charge" and memory allocation for COW page
      before lock_page(). Then, we can avoid scanning LRU with holding
      a lock on a page and latency under lock_page() will be reduced.
      
      Then, above livelock disappears.
      
      [akpm@linux-foundation.org: fix code layout]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reported-by: Lutz Vieweg <lvml@5t9.de>
      Original-idea-by: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d65f86d
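      The shape of the fix in __do_fault(), per option 1 above (simplified
      sketch):

          /* allocate and charge the COW page before ->fault() hands back a
           * locked page, so a memcg OOM stall can no longer happen while
           * the page lock is held */
          if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
                  cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
                  if (!cow_page)
                          return VM_FAULT_OOM;
                  if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
                          page_cache_release(cow_page);
                          return VM_FAULT_OOM;
                  }
          }

          ret = vma->vm_ops->fault(vma, &vmf);    /* page comes back locked */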
    • tmpfs: no need to use i_lock · d515afe8
      Hugh Dickins authored
      2.6.36's 7e496299 ("tmpfs: make tmpfs scalable with percpu_counter for
      used blocks") used inode->i_lock in place of sbinfo->stat_lock around
      i_blocks updates; but that was adverse to scalability, and unnecessary,
      since info->lock is already held there in the fast paths.
      
      Remove those uses of i_lock, and add info->lock in the three error paths
      where it's then needed across shmem_free_blocks().  It's not actually
      needed across shmem_unacct_blocks(), but they're so often paired that it
      looks wrong to split them apart.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d515afe8
    • mm: pincer in truncate_inode_pages_range · d0823576
      Hugh Dickins authored
      truncate_inode_pages_range()'s final loop has a nice pincer property,
      bringing start and end together, squeezing out the last pages.  But the
      range handling missed out on that, just sliding up the range, perhaps
      letting pages come in behind it.  Add one more test to give it the same
      pincer effect.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0823576
    • mm: consistent truncate and invalidate loops · b85e0eff
      Hugh Dickins authored
      Make the pagevec_lookup loops in truncate_inode_pages_range(),
      invalidate_mapping_pages() and invalidate_inode_pages2_range() more
      consistent with each other.
      
      They were relying upon page->index of an unlocked page, but apologizing
      for it: accept it, embrace it, add comments and WARN_ONs, and simplify the
      index handling.
      
      invalidate_inode_pages2_range() had special handling for a wrapped
      page->index + 1 = 0 case; but MAX_LFS_FILESIZE doesn't let us anywhere
      near there, and a corrupt page->index in the radix_tree could cause more
      trouble than that would catch.  Remove that wrapped handling.
      
      invalidate_inode_pages2_range() uses min() to limit the pagevec_lookup
      when near the end of the range: copy that into the other two, although
      it's less useful than you might think (it limits the use of the buffer,
      rather than the indices looked up).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b85e0eff
    • mm: tidy vmtruncate_range and related functions · 8a549bea
      Hugh Dickins authored
      Use consistent variable names in truncate_pagecache(), truncate_setsize(),
      vmtruncate() and vmtruncate_range().
      
      unmap_mapping_range() and vmtruncate_range() have mismatched interfaces:
      don't change either, but make the vmtruncates more precise about what they
      expect unmap_mapping_range() to do.
      
      vmtruncate_range() is currently called only with page-aligned start and
      end+1: can handle unaligned start, but unaligned end+1 would hit BUG_ON in
      truncate_inode_pages_range() (lacks partial clearing of the end page).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a549bea
    • mm: truncate functions are in truncate.c · 85821aab
      Hugh Dickins authored
      Correct comment on truncate_inode_pages*() in linux/mm.h; and remove
      declaration of page_unuse(), it didn't exist even in 2.2.26 or 2.4.0!
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85821aab
    • mm: cleanup descriptions of filler arg · 5e5358e7
      Hugh Dickins authored
      The often-NULL data arg to read_cache_page() and read_mapping_page()
      functions is misdescribed as "destination for read data": no, it's the
      first arg to the filler function, often struct file * to ->readpage().
      
      Satisfy checkpatch.pl on those filler prototypes, and tidy up the
      declarations in linux/pagemap.h.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e5358e7
    • sparc64: add support for _PAGE_SPECIAL · 683d2fa6
      David S. Miller authored
      Luckily there are still a few software PTE bits remaining and they even
      match up in both the sun4u and sun4v pte layouts.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      683d2fa6
    • sparc64: use RCU page table freeing · 4a0100f7
      David S. Miller authored
      Make use of the generic RCU page table freeing on Sparc64; doing so
      allows for race-free software page-table walkers like gup_fast().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a0100f7
    • sparc64: kill page table quicklists · 4dedbf8d
      David S. Miller authored
      With the recent mmu_gather changes that included generic RCU freeing of
      page-tables, it is now quite straightforward to implement gup_fast() on
      sparc64.
      
      This patch:
      
      Remove the page table quicklists.  They are pointless and make it harder
      to use RCU page table freeing and share code with other architectures.
      
      BTW, this is the second time this has happened, see commit 3c936465
      ("[SPARC64]: Kill pgtable quicklists and use SLAB.")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4dedbf8d
    • mmap: fix and tidy up overcommit page arithmetic · c15bef30
      Dmitry Fink authored
      - Shmem pages are not immediately available, but they are not
        potentially available either: even if we swap them out, they will
        just relocate from memory into swap, and the total amount of
        immediately and potentially available memory is not going to be
        affected, so we shouldn't count them as potentially free in the
        first place.

      - nr_free_pages() is not an expensive operation anymore, so there is no
        need to split the decision making in two halves and repeat code.
      Signed-off-by: Dmitry Fink <dmitry.fink@palm.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c15bef30
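      A hedged sketch of the resulting OVERCOMMIT_GUESS arithmetic (simplified
      from __vm_enough_memory(); reserve handling omitted): shmem pages are
      subtracted instead of being treated as freeable, and the free-page count
      is read directly rather than via a split fast/slow path.

          if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
                  unsigned long free;

                  free  = global_page_state(NR_FREE_PAGES);
                  free += global_page_state(NR_FILE_PAGES);

                  /* shmem pages can only move to swap, never become free,
                   * so don't count them as potentially available */
                  free -= global_page_state(NR_SHMEM);

                  free += nr_swap_pages;
                  /* ... add reclaimable slab, subtract reserves, then
                   *     compare against the requested 'pages' ... */
          }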
    • mm/memblock.c: avoid abuse of RED_INACTIVE · c9d8c3d0
      Andrew Morton authored
      RED_INACTIVE is a slab thing, and reusing it for memblock was
      inappropriate, because memblock is dealing with phys_addr_t's which have a
      Kconfigurable sizeof().
      
      Create a new poison type for this application.  Fixes the sparse warning
      
          warning: cast truncates bits from constant value (9f911029d74e35b becomes 9d74e35b)
      Reported-by: H Hartley Sweeten <hartleys@visionengravers.com>
      Tested-by: H Hartley Sweeten <hartleys@visionengravers.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9d8c3d0
    • oom: make deprecated use of oom_adj more verbose · be8f684d
      David Rientjes authored
      /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012
      according to Documentation/feature-removal-schedule.txt.
      
      This patch makes the warning more verbose by making it appear as a more
      serious problem (the presence of a stack trace and being multiline should
      attract more attention) so that applications still using the old interface
      can get fixed.
      
      Very popular users of the old interface have been converted since the oom
      killer rewrite has been introduced.  udevd switched to the
      /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and
      opensshd switched in 5.7p1.
      
      At the start of 2012, this should be changed into a WARN() to emit all
      such incidents and then finally remove the tunable in August 2012 as
      scheduled.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be8f684d
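      Roughly, the quiet deprecation printk in the oom_adj write handler
      becomes a WARN_ONCE() so it arrives with a stack trace (sketch; exact
      wording may differ):

          /* fs/proc/base.c, oom_adjust_write(): */
          WARN_ONCE(1, "%s (%d): /proc/%d/oom_adj is deprecated, "
                    "please use /proc/%d/oom_score_adj instead.\n",
                    current->comm, task_pid_nr(current),
                    task_pid_nr(task), task_pid_nr(task));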
    • oom: remove references to old badness() function · 11239836
      David Rientjes authored
      The badness() function in the oom killer was renamed to oom_badness()
      in a63d83f4 ("oom: badness heuristic rewrite") for clarity, since it is
      a globally exported function.
      
      The prototype for the old function still existed in linux/oom.h, so remove
      it.  There are no existing users.
      
      Also fixes documentation and comment references to badness() and adjusts
      them accordingly.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11239836
    • mm/memory.c: remove ZAP_BLOCK_SIZE · 6ac47520
      Andrew Morton authored
      ZAP_BLOCK_SIZE became unused in the preemptible-mmu_gather work ("mm:
      Remove i_mmap_lock lockbreak").  So zap it.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ac47520
    • mm: hugetlb: fix coding style issues · 32f84528
      Chris Forbes authored
      Fix coding style issues flagged by checkpatch.pl
      Signed-off-by: Chris Forbes <chrisf@ijw.co.nz>
      Acked-by: Eric B Munson <emunson@mgebm.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32f84528
    • mm/huge_memory.c: minor lock simplification in __khugepaged_exit · d788e80a
      Chris Wright authored
      The lock is released first thing in all three branches.  Simplify this
      by unconditionally releasing the lock and removing the else clause,
      which was only there to be sure the lock was released.
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d788e80a
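      The simplified locking shape described above, roughly (abridged sketch):

          /* mm/huge_memory.c __khugepaged_exit(): one unconditional unlock
           * instead of an unlock in every branch */
          spin_lock(&khugepaged_mm_lock);
          mm_slot = get_mm_slot(mm);
          if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
                  hlist_del(&mm_slot->hash);
                  list_del(&mm_slot->mm_node);
                  free = 1;
          }
          spin_unlock(&khugepaged_mm_lock);

          if (free) {
                  clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
                  free_mm_slot(mm_slot);
                  mmdrop(mm);
          }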
    • mm/page_cgroup.c: simplify code by using SECTION_ALIGN_UP() and SECTION_ALIGN_DOWN() macros · 1bb36fbd
      Daniel Kiper authored
      Commit a539f353 ("mm: add SECTION_ALIGN_UP() and
      SECTION_ALIGN_DOWN() macro") introduced the SECTION_ALIGN_UP() and
      SECTION_ALIGN_DOWN() macros.  Use those macros to increase code
      readability.
      Signed-off-by: Daniel Kiper <dkiper@net-space.pl>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1bb36fbd
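      For reference, the helpers are presumably along these lines (sketch
      based on the section constants in mmzone.h), letting page_cgroup.c
      round pfns without open-coded arithmetic:

          #define SECTION_ALIGN_UP(pfn)   (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
          #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)

          /* usage sketch in mm/page_cgroup.c: */
          unsigned long start = SECTION_ALIGN_DOWN(start_pfn);
          unsigned long end   = SECTION_ALIGN_UP(start_pfn + nr_pages);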
    • mm: remove the leftovers of noswapaccount · 00a66d29
      WANG Cong authored
      In commit a2c8990a ("memsw: remove noswapaccount kernel parameter"),
      Michal forgot to remove some leftover pieces of noswapaccount from the
      tree; this patch removes them all.
      Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00a66d29
    • pagewalk: fix code comment for THP · dd78553b
      KOSAKI Motohiro authored
      Commit bae9c19b ("thp: split_huge_page_mm/vma") changed locking behavior
      of walk_page_range().  Thus this patch changes the comment too.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd78553b
    • pagewalk: add locking-rule comments · c27fe4c8
      KOSAKI Motohiro authored
      Originally, walk_hugetlb_range() didn't require the caller to take any
      lock.  But commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
      walk_page_range") changed that rule, because it added a find_vma() call
      to walk_hugetlb_range().

      Any commit that changes a locking rule should document it too.
      
      [akpm@linux-foundation.org: clarify comment]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c27fe4c8
    • pagewalk: don't look up vma if walk->hugetlb_entry is unused · 6c6d5280
      KOSAKI Motohiro authored
      Currently, walk_page_range() calls find_vma() on every page-table
      iteration of the walk, but this is completely unnecessary if
      walk->hugetlb_entry is unused.  And we should not assume find_vma() is
      a lightweight operation.  So this patch checks walk->hugetlb_entry and
      avoids the find_vma() call where possible.

      This patch also makes some cleanups: 1) remove the ugly
      uninitialized_var() and 2) remove an #ifdef from a function body.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c6d5280
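      A simplified sketch of the walk loop after the change: the vma lookup is
      only paid for when a hugetlb callback is actually registered.

          /* mm/pagewalk.c walk_page_range() loop (abridged): */
          do {
                  struct vm_area_struct *vma = NULL;

                  next = pgd_addr_end(addr, end);

                  /* only look up the vma if a hugetlb callback exists */
                  if (walk->hugetlb_entry)
                          vma = find_vma(walk->mm, addr);

                  if (vma && is_vm_hugetlb_page(vma)) {
                          if (vma->vm_end < next)
                                  next = vma->vm_end;
                          walk_hugetlb_range(vma, addr, next, walk);
                          continue;       /* handled as hugetlb */
                  }
                  /* ... normal pgd/pud/pmd/pte walk of [addr, next) ... */
          } while (addr = next, addr < end);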
    • pagewalk: fix walk_page_range() don't check find_vma() result properly · 4b6ddbf7
      KOSAKI Motohiro authored
      The doc of find_vma() says,
      
          /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
          struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
          {
           (snip)
      
      Thus, the caller should confirm whether the returned vma matches the
      desired one.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b6ddbf7
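      The kind of check being added, roughly (sketch): only treat the range as
      hugetlb if the returned vma actually covers addr.

          vma = find_vma(walk->mm, addr);
          /* find_vma() only guarantees addr < vma->vm_end (or NULL); it may
           * return a vma that starts beyond addr, so test vm_start too
           * before believing addr lies inside it */
          if (vma && vma->vm_start <= addr && addr < vma->vm_end &&
              is_vm_hugetlb_page(vma)) {
                  if (vma->vm_end < next)
                          next = vma->vm_end;
                  walk_hugetlb_range(vma, addr, next, walk);
          }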