1. 05 Feb, 2008 40 commits
    • add mm argument to pte/pmd/pud/pgd_free · 5e541973
      Benjamin Herrenschmidt authored
      (with Martin Schwidefsky <schwidefsky@de.ibm.com>)
      
      The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
      first argument.  The free functions do not get the mm_struct argument.  This
      is 1) asymmetrical and 2) the mm argument is needed on the free functions
      as well in order to do mm-related page table allocations.
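
      A minimal sketch of the resulting interface (not quoted from the patch;
      the usual generic helpers are shown, and exact per-architecture
      prototypes vary):

      /* allocation already took mm; free now takes it too, restoring symmetry */
      pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
      void pte_free_kernel(struct mm_struct *mm, pte_t *pte);  /* was pte_free_kernel(pte) */
      void pmd_free(struct mm_struct *mm, pmd_t *pmd);         /* was pmd_free(pmd) */
      void pgd_free(struct mm_struct *mm, pgd_t *pgd);         /* was pgd_free(pgd) */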
      
      [kamalesh@linux.vnet.ibm.com: i386 fix]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Page allocator: clean up pcp draining functions · 9f8f2172
      Christoph Lameter authored
      - Add comments explaining how drain_pages() works.
      
      - Eliminate useless functions
      
      - Rename drain_all_local_pages to drain_all_pages(). It does drain
        all pages not only those of the local processor.
      
      - Eliminate useless interrupt off / on sequences. drain_pages()
        disables interrupts on its own. The execution thread is
        pinned to processor by the caller. So there is no need to
        disable interrupts.
      
      - Put drain_all_pages() declaration in gfp.h and remove the
        declarations from suspend.h and from mm/memory_hotplug.c
      
      - Make software suspend call drain_all_pages(). Draining only the
        processor-local pages may not be the right approach if software
        suspend wants to support SMP. If it calls drain_all_pages() then
        we can make drain_pages() static.
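
      A minimal sketch of the consolidated interface described above (placement
      per the patch description; exact qualifiers are an assumption):

      /* include/linux/gfp.h */
      void drain_all_pages(void);                 /* drain pcp lists on all processors */

      /* mm/page_alloc.c */
      static void drain_pages(unsigned int cpu);  /* per-cpu worker, now static */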
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Daniel Walker <dwalker@mvista.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix-tree: avoid atomic allocations for preloaded insertions · e2848a0e
      Nick Piggin authored
      Most pagecache (and some other) radix tree insertions have the great
      opportunity to preallocate a few nodes with relaxed gfp flags.  But the
      preallocation is squandered when it comes time to allocate a node: we
      default to first attempting a GFP_ATOMIC allocation -- that doesn't
      normally fail, but it can eat into atomic memory reserves that we don't
      need to be using.
      
      Another upshot of this is that it removes the sometimes highly contended
      zone->lock from underneath tree_lock.  Pagecache insertions are always
      performed with a radix tree preload, and after this change, such a
      situation will never fall back to kmem_cache_alloc within
      radix_tree_node_alloc.
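
      The preload pattern in question looks roughly like this (an illustrative
      sketch of an add_to_page_cache-style caller; locking details follow the
      kernel of that era and are not quoted from the patch):

      if (radix_tree_preload(GFP_KERNEL))         /* may sleep, fills per-cpu node pool */
              return -ENOMEM;
      write_lock_irq(&mapping->tree_lock);
      error = radix_tree_insert(&mapping->page_tree, index, page);
      write_unlock_irq(&mapping->tree_lock);
      radix_tree_preload_end();                   /* re-enables preemption */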
      
      David Miller reports seeing this allocation fail on a highly threaded
      sparc64 system:
      
      [527319.459981] dd: page allocation failure. order:0, mode:0x20
      [527319.460403] Call Trace:
      [527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
      [527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
      [527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
      [527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
      [527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
      [527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
      [527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
      [527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
      [527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
      [527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
      [527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
      [527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
      [527319.461361]  [00000000004bb920] sys_read+0x34/0x60
      [527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40
      
      The calltrace is significant: __do_page_cache_readahead allocates a number
      of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
      memory to satisfy GFP_ATOMIC allocations.  However after the list of pages
      goes to mpage_readpages, there can be significant intervals (including disk
      IO) before all the pages are inserted into the radix-tree.  So the reserves
      can easily be depleted at that point.  The patch is confirmed to fix the
      problem.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make __vmalloc_area_node() static · e31d9eb5
      Adrian Bunk authored
      __vmalloc_area_node() can become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove unused code from mm/tiny-shmem.c · 625d9573
      Balbir Singh authored
      This code in mm/tiny-shmem.c is under #if 0 - remove it.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: make a function static · f61eaf9f
      Adrian Bunk authored
      task_dirty_limit() can become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: make page monitoring /proc file optional · 1e883281
      Matt Mackall authored
      Make /proc page monitoring configurable
      
      This puts the following files under an embedded config option:
      
      /proc/pid/clear_refs
      /proc/pid/smaps
      /proc/pid/pagemap
      /proc/kpagecount
      /proc/kpageflags
      
      [akpm@linux-foundation.org: Kconfig fix]
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add /proc/kpageflags interface · 304daa81
      Matt Mackall authored
      This makes a subset of physical page flags available to userspace. Together
      with /proc/pid/pagemap, this allows tracking of a wide variety of VM behaviors.
      
      Exported flags are decoupled from the kernel's internal flags. This
      allows us to reorder flag bits, and synthesize any bits that get
      redefined in terms of other bits.
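
      A hedged userspace sketch of how the file is read (one 64-bit word per
      page frame number; the bit layout of the word is defined by the interface
      and not decoded here):

      #include <stdint.h>
      #include <fcntl.h>
      #include <unistd.h>

      /* return the exported flags word for physical page frame number pfn */
      static uint64_t read_kpageflags(unsigned long pfn)
      {
              uint64_t flags = 0;
              int fd = open("/proc/kpageflags", O_RDONLY);

              if (fd >= 0) {
                      pread(fd, &flags, sizeof(flags), pfn * sizeof(flags));
                      close(fd);
              }
              return flags;
      }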
      
      [akpm@linux-foundation.org: remove unneeded access_ok()]
      [akpm@linux-foundation.org: s/0/NULL/]
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add /proc/kpagecount interface · 161f47bf
      Matt Mackall authored
      This makes physical page map counts available to userspace. Together
      with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
      monitor memory usage on a per-page basis.
      
      [akpm@linux-foundation.org: remove unneeded access_ok()]
      [bunk@stusta.de: make struct proc_kpagemap static]
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add /proc/pid/pagemap interface · 85863e47
      Matt Mackall authored
      This interface provides a mapping for each page in an address space to its
      physical page frame number, allowing precise determination of what pages are
      mapped and what pages are shared between processes.
      
      New in this version:
      
      - headers gone again (as recommended by Dave Hansen and Alan Cox)
      - 64-bit entries (as per discussion with Andi Kleen)
      - swap pte information exported (from Dave Hansen)
      - page walker callback for holes (from Dave Hansen)
      - direct put_user I/O (as suggested by Rusty Russell)
      
      This patch folds in cleanups and swap PTE support from Dave Hansen
      <haveblue@us.ibm.com>.
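
      A hedged userspace sketch of the access pattern (64-bit entries indexed by
      virtual page number, as described above; the field layout inside each
      entry is defined by the interface and not decoded here):

      #include <stdint.h>
      #include <fcntl.h>
      #include <unistd.h>

      /* read the pagemap entry covering virtual address addr in this process */
      static uint64_t read_pagemap_entry(unsigned long addr)
      {
              uint64_t entry = 0;
              long psize = sysconf(_SC_PAGESIZE);
              int fd = open("/proc/self/pagemap", O_RDONLY);

              if (fd >= 0) {
                      pread(fd, &entry, sizeof(entry), (addr / psize) * sizeof(entry));
                      close(fd);
              }
              return entry;
      }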
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: regroup task_mmu by interface · a6198797
      Matt Mackall authored
      Reorder source so that all the code and data for each interface is together.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: move clear_refs code to task_mmu.c · f248dcb3
      Matt Mackall authored
      This puts all the clear_refs code where it belongs and probably lets things
      compile on MMU-less systems as well.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: simplify interdependence of maps and smaps · 4752c369
      Matt Mackall authored
      This pulls the shared map display code out of show_map and puts it in
      show_smap where it belongs.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: use pagewalker in clear_refs and smaps · b3ae5acb
      Matt Mackall authored
      Use the generic pagewalker for smaps and clear_refs
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: introduce a generic page walker · e6473092
      Matt Mackall authored
      Introduce a general page table walker
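
      A rough sketch of the shape such a walker takes (callback names and exact
      signatures are an approximation, not quoted from the patch):

      struct mm_walk {
              int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
                               unsigned long next, void *private);
              int (*pte_entry)(pte_t *pte, unsigned long addr,
                               unsigned long next, void *private);
              int (*pte_hole)(unsigned long addr, unsigned long next,
                              void *private);
      };

      int walk_page_range(struct mm_struct *mm, unsigned long addr,
                          unsigned long end, struct mm_walk *walk,
                          void *private);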
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: move is_swap_pte · 698dd4ba
      Matt Mackall authored
      Move is_swap_pte helper function to swapops.h for use by pagemap code
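
      The helper itself is tiny; roughly (a sketch, details hedged):

      static inline int is_swap_pte(pte_t pte)
      {
              return !pte_none(pte) && !pte_present(pte) && !pte_file(pte);
      }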
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: rework TASK_SIZE macros · 82455257
      Dave Hansen authored
      The following replaces the earlier patches sent.  It should address
      David Rientjes's comments, and has been compile tested on all the
      architectures that it touches, save for parisc.
      
      For the /proc/<pid>/pagemap code[1], we need to able to query how
      much virtual address space a particular task has.  The trick is
      that we do it through /proc and can't use TASK_SIZE since it
      references "current" on some arches.  The process opening the
      /proc file might be a 32-bit process opening a 64-bit process's
      pagemap file.
      
      x86_64 already has a TASK_SIZE_OF() macro:
      
      #define TASK_SIZE_OF(child)     ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)
      
      I'd like to have that for other architectures.  So, add it
      for all the architectures that actually use "current" in
      their TASK_SIZE.  For the others, just add a quick #define
      in sched.h to use plain old TASK_SIZE.
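
      That quick fallback amounts to something like the following sketch of the
      sched.h definition described above:

      #ifndef TASK_SIZE_OF
      #define TASK_SIZE_OF(tsk)       TASK_SIZE
      #endif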
      
      1. http://www.linuxworld.com/news/2007/042407-kernel.html
      
      - MIPS portion from Ralf Baechle <ralf@linux-mips.org>
      
      [akpm@linux-foundation.org: fix mips build]
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add proportional set size accounting in smaps · ec4dd3eb
      Fengguang Wu authored
      The "proportional set size" (PSS) of a process is the count of pages it has
      in memory, where each page is divided by the number of processes sharing
      it.  So if a process has 1000 pages all to itself, and 1000 shared with one
      other process, its PSS will be 1500.
      
                     - lwn.net: "ELC: How much memory are applications really using?"
      
      The PSS proposed by Matt Mackall is a very nice metric for measuring a
      process's memory footprint.  So collect and export it via
      /proc/<pid>/smaps.
      
      Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
      job.  They are comprehensive tools.  But for PSS, let's do it in the simple
      way.
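
      The accounting itself boils down to one line per examined page;
      conceptually (a sketch, fixed-point shift and names assumed):

      /* each mapped page contributes PAGE_SIZE / mapcount, kept in fixed point */
      mss->pss += (PAGE_SIZE << PSS_SHIFT) / page_mapcount(page);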
      
      Cc: John Berthels <jjberthels@gmail.com>
      Cc: Bernardo Innocenti <bernie@codewiz.org>
      Cc: Padraig Brady <P@draigBrady.com>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • clean up vmtruncate · 61d5048f
      Christoph Hellwig authored
      vmtruncate is a twisted maze of gotos; this patch cleans it up to have a
      proper if/else for the two major cases of extending and truncating, and
      thus makes it a lot more readable while keeping exactly the same
      functionality.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: fix shmem_swaplist races · 1b1b32f2
      Hugh Dickins authored
      Intensive swapoff testing shows shmem_unuse spinning on an entry in
      shmem_swaplist pointing to itself: how does that come about?  Days pass...
      
      First guess is this: shmem_delete_inode tests list_empty without taking the
      global mutex (so the swapping case doesn't slow down the common case); but
      there's an instant in shmem_unuse_inode's list_move_tail when the list entry
      may appear empty (a rare case, because it's actually moving the head not
      the list member).  So there's a danger of leaving the inode on the swaplist
      when it's freed, then reinitialized to point to itself when reused.  Fix that
      by skipping the list_move_tail when it's a no-op, which happens to plug this.
      
      But this same spinning then surfaces on another machine.  Ah, I'd never
      suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
      still hold page lock, which would hold off inode deletion if the page were in
      pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
      doesn't wait on locked pages).  Hmm: we could put the inode on swaplist
      earlier, but then shmem_unuse_inode could never prune unswapped inodes.
      
      Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
      though I am a little uneasy about the iput which has to follow - it works, and
      I see nothing wrong with it, but it is surprising that shmem inode deletion
      may now occur below shmem_writepage.  Revisit this fix later?
      
      And while we're looking at these races: the way shmem_unuse tests swapped
      without holding info->lock looks unsafe, if we've more than one swap area: a
      racing shmem_writepage on another page of the same inode could be putting it
      in swapcache, just as we're deciding to remove the inode from swaplist -
      there's a danger of going on swap without being listed, so a later swapoff
      would hang, being unable to locate the entry.  Move that test and removal down
      into shmem_unuse_inode, once info->lock is held.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: radix_tree_preloading · b409f9fc
      Hugh Dickins authored
      Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
      or swap cache, without any radix tree preload: so tending to deplete emergency
      reserves of memory.
      
      GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
      being called under memory pressure, so must not wait for more memory to become
      available.  But shmem_unuse_inode now has a window in which it can and should
      preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
      add_to_page_cache.
      
      shmem_getpage is not so straightforward: its filepage/swappage integrity
      relies upon exchanging between caches under spinlock, and it would need a lot
      of restructuring to place the preloads correctly.  Instead, follow its pattern
      of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
      add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
      radix_tree_preload, followed immediately by radix_tree_preload_end - that
      won't guarantee success in the next add_to_page_cache, but doesn't need to.
      
      And we can then remove that bothersome congestion_wait: when needed, it'll
      automatically get done in the course of the radix_tree_preload.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Looks-good-to: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: open a window in shmem_unuse_inode · 2e0e26c7
      Hugh Dickins authored
      There are a couple of reasons (patches follow) why it would be good to open a
      window for sleep in shmem_unuse_inode, between its search for a matching swap
      entry, and its handling of the entry found.
      
      shmem_unuse_inode must then use igrab to hold the inode against deletion in
      that window, and its corresponding iput might result in deletion: so it had
      better unlock_page before the iput, and might as well release the page too.
      
      Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
      leave the loop.  So this unwinding moves from try_to_unuse and shmem_unuse
      into shmem_unuse_inode, in the case when it finds a match.
      
      Let try_to_unuse break on error in the shmem_unuse case, as it does in the
      unuse_mm case: though at this point in the series, no error to break on.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: make shmem_unuse more preemptible · cb5f7b9a
      Hugh Dickins authored
      shmem_unuse is at present an unbroken search through every swap vector page of
      every tmpfs file which might be swapped, all under shmem_swaplist_lock.  This
      dates from long ago, when the caller held mmlist_lock over it all too: long
      gone, but there's never been much pressure for preemptible swapoff.
      
      Make it a little more preemptible, replacing shmem_swaplist_lock by
      shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a
      cond_resched_lock (on info->lock) at one convenient point in the
      shmem_unuse_inode loop, where it has no outstanding kmap_atomic.
      
      If we're serious about preemptible swapoff, there's much further to go e.g.
      I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM
      case dictate preemptibility for other configs.  But as in the earlier patch
      to make swapoff scan ptes preemptibly, my hidden agenda is really towards
      making memcgroups work, hardly about preemptibility at all.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: allocate on read when stacked · a0ee5ec5
      Hugh Dickins authored
      tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or
      size=0).  But if a stacked filesystem such as unionfs gets pages from a sparse
      tmpfs file by reading holes, and then writes to them, it can easily exceed any
      such limit at present.
      
      So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when
      reading for the kernel (a KERNEL_DS check, ugh, sorry about that).  Indeed,
      pessimistically mark such pages as dirty, so they cannot get reclaimed and
      unaccounted by mistake.  The venerable shmem_recalc_inode code (originally to
      account for the reclaim of clean pages) suffices to get the accounting right
      when swappages are dropped in favour of more uptodate filepages.
      
      This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage,
      caused by unionfs writing to a very sparse tmpfs file: to minimize memory
      allocation in swapout, tmpfs requires the swap vector be allocated upfront,
      which wasn't always happening in this stacked case.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: allow filepage alongside swappage · d9fe526a
      Hugh Dickins authored
      tmpfs has long allowed for a fresh filepage to be created in pagecache, just
      before shmem_getpage gets the chance to match it up with the swappage which
      already belongs to that offset.  But unionfs_writepage now does a
      find_or_create_page, divorced from shmem_getpage, which leaves conflicting
      filepage and swappage outstanding indefinitely, when unionfs is over tmpfs.
      
      Therefore shmem_writepage (where a page is swizzled from file to swap) must
      now be on the lookout for existing swap, ready to free it in favour of the
      more uptodate filepage, instead of BUGging on that clash.  And when the
      add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate
      filepage, otherwise swapoff would hang.  Whereas when add_to_page_cache fails
      in shmem_getpage, it should retry in the same way it already does.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: move swap swizzling into shmem · 73b1262f
      Hugh Dickins authored
      move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
      between tmpfs page cache and swap cache, to avoid page copying) are only used
      by shmem.c; and our subsequent fix for unionfs needs different treatments in
      the two instances of move_from_swap_cache.  Move them from swap_state.c into
      their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
      add_to_swap_cache externally visible.
      
      shmem.c likes to say set_page_dirty where swap_state.c liked to say
      SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
      makes moot (and implies we should lose that "shift page from clean_pages to
      dirty_pages list" comment: it's on neither).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: shuffle add_to_swap_caches · f000944d
      Hugh Dickins authored
      add_to_swap_cache doesn't amount to much: merge it into its sole caller
      read_swap_cache_async.  But we'll be needing to call __add_to_swap_cache from
      shmem.c, so promote it to the new add_to_swap_cache.  Both were static, so
      there's no interface confusion to worry about.
      
      And lose that inappropriate "Anon pages are already on the LRU" comment in the
      merging: they're not already on the LRU, as Nick Piggin noticed.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      No-problems-with: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: move swap_state stats update · bb63be0a
      Hugh Dickins authored
      Both unionfs and memcgroups pose challenges to tmpfs and shmem.  To help fix,
      it's best to move the swap swizzling functions from swap_state.c to shmem.c.
      As a preliminary to that, move swap stats updating down into
      __add_to_swap_cache, which will remain internal to swap_state.c.
      
      Well, actually, just move down the incrementation of add_total: remove
      noent_race and exist_race completely, they are relics of my 2.4.11 testing.
      Alt-SysRq-m users will be thrilled if 2.6.25 is at last free of "race M+N"s.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: fix mounts when size is less than the page size · 818db359
      Michael Marineau authored
      When tmpfs is mounted with a size less than one page, the number of blocks
      is set to 0 which makes the tmpfs mount unlimited.  This can lead to a
      quick and surprising death if someone typos a tmpfs mount command and
      writes too much.
      
      tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0,
      as Documentation/filesystems/tmpfs.txt says.
      
      Hugh: do this by rounding size up instead of down in all cases: which
      slightly expands other odd-sized tmpfs mounts, but in a consistent way.
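
      A hedged sketch of the rounding change when "size=" is parsed (field and
      macro names assumed):

      /* round the requested byte count up, so size=1 gives one block, not zero */
      sbinfo->max_blocks = DIV_ROUND_UP(size, PAGE_CACHE_SIZE);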
      Signed-off-by: Michael Marineau <mike@marineau.org>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: factor out sbi->free_inodes manipulations · 5b04c689
      Pavel Emelyanov authored
      The shmem_sb_info structure has a number of free_inodes. This
      value is altered in appropriate places under spinlock and with
      the sbi->max_inodes != 0 check.
      
      Consolidate these manipulations into two helpers.
      
      This is minus 42 bytes of shmem.o and minus 4 :) lines of code.
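
      A hedged sketch of what the two helpers look like (bodies reconstructed
      from the description above, not quoted from the patch):

      static int shmem_reserve_inode(struct super_block *sb)
      {
              struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

              if (sbinfo->max_inodes) {
                      spin_lock(&sbinfo->stat_lock);
                      if (!sbinfo->free_inodes) {
                              spin_unlock(&sbinfo->stat_lock);
                              return -ENOSPC;
                      }
                      sbinfo->free_inodes--;
                      spin_unlock(&sbinfo->stat_lock);
              }
              return 0;
      }

      static void shmem_free_inode(struct super_block *sb)
      {
              struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

              if (sbinfo->max_inodes) {
                      spin_lock(&sbinfo->stat_lock);
                      sbinfo->free_inodes++;
                      spin_unlock(&sbinfo->stat_lock);
              }
      }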
      
      [akpm@linux-foundation.org: fix error return values]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapoff: scan ptes preemptibly · 2e441889
      Hugh Dickins authored
      Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce latency
      in swapoff by scanning the page table preemptibly: so long as unuse_pte is
      careful to recheck that entry under pte lock.
      
      (To tell the truth, this patch was not inspired by any cries for lower
      latency here: rather, this restructuring permits a future memory controller
      patch to allocate with GFP_KERNEL in unuse_pte, where before it could not.
      But it would be wrong to tuck this change away inside a memcgroup patch.)
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapin: fix valid_swaphandles defect · 8952898b
      Hugh Dickins authored
      valid_swaphandles is supposed to do a quick pass over the swap map entries
      neighbouring the entry which swapin_readahead is targeting, to determine for
      it a range worth reading all together.  But since it always starts its search
      from the beginning of the swap "cluster", a reject (free entry) there
      immediately curtails the readaround, and every swapin_readahead from that
      cluster is for just a single page.  Instead scan forwards and backwards around
      the target entry.
      
      Use better names for some variables: a swap_info pointer is usually called
      "si" not "swapdev".  And at the end, if only the target page should be read,
      return count of 0 to disable readaround, to avoid the unnecessarily repeated
      call to read_swap_cache_async.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem_file_write is redundant · 5402b976
      Hugh Dickins authored
      With the old aops, writing to a tmpfs file had to use its own special method:
      the generic method would pass in a fresh page to prepare_write when the right
      page was there in swapcache - which was inefficient to handle, even once we'd
      concocted the code to handle it.
      
      With the new aops, the generic method uses shmem_write_end, which lets
      shmem_getpage find the right page: so now abandon shmem_file_write in favour
      of the generic method.  Yes, that does do several things that tmpfs hasn't
      really needed (notably balance_dirty_pages_ratelimited, which ramfs also
      calls); but more use of common code is preferable.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem_getpage return page locked · d3602444
      Hugh Dickins authored
      In the new aops, write_begin is supposed to return the page locked: though
      I've seen no ill effects, that's been overlooked in the case of
      shmem_write_begin, and should be fixed.  Then shmem_write_end must unlock the
      page: do so _after_ updating i_size, as we found to be important in other
      filesystems (though since shmem pages don't go the usual writeback route, they
      never suffered from that corruption).
      
      For shmem_write_begin to return the page locked, we need shmem_getpage to
      return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's
      simplify the interface and return it locked even when SGP_READ.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • shmem: SGP_QUICK and SGP_FAULT redundant · 27d54b39
      Hugh Dickins authored
      Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no
      users now.  Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and
      shmem_getpage is about to return with page always locked).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapin needs gfp_mask for loop on tmpfs · 02098fea
      Hugh Dickins authored
      Building in a filesystem on a loop device on a tmpfs file can hang when
      swapping, the loop thread caught in that infamous throttle_vm_writeout.
      
      In theory this is a long standing problem, which I've either never seen in
      practice, or long ago suppressed the recollection, after discounting my load
      and my tmpfs size as unrealistically high.  But now, with the new aops, it has
      become easy to hang on one machine.
      
      Loop used to grab_cache_page before the old prepare_write to tmpfs, which
      seems to have been enough to free up some memory for any swapin needed; but
      the new write_begin lets tmpfs find or allocate the page (much nicer, since
      grab_cache_page missed tmpfs pages in swapcache).
      
      When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
      has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
      break out when __GFP_IO or __GFP_FS is unset; but when tmpfs swaps in,
      read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
      mapping_gfp_mask - hence the hang.
      
      So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
      swapin_readahead to read_swap_cache_async to add_to_swap_cache.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapin_readahead: move and rearrange args · 46017e95
      Hugh Dickins authored
      swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
      beside its kindred read_swap_cache_async.  Why were its args in a different
      order?  rearrange them.  And since it was always followed by a
      read_swap_cache_async of the target page, fold that in and return struct
      page*.  Then CONFIG_SWAP=n no longer needs valid_swaphandles and
      read_swap_cache_async stubs.
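
      A sketch of the relocated helper's shape after this change (argument order
      as rearranged here; a gfp_mask parameter is added by the separate "swapin
      needs gfp_mask for loop on tmpfs" patch in this series):

      /* mm/swap_state.c: read the target entry plus its readaround window,
       * returning the target page */
      struct page *swapin_readahead(swp_entry_t entry,
                                    struct vm_area_struct *vma,
                                    unsigned long addr);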
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swapin_readahead: excise NUMA bogosity · c4cc6d07
      Hugh Dickins authored
      For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA
      code, advancing addr, and stepping on to the next vma at the boundary, to line
      up the mempolicy for each page allocation.
      
      It _might_ be a good idea to allocate swap more according to vma layout; but
      the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is
      allocated as needed for pages as they sink to the bottom of the inactive LRUs.
       Sometimes that may match vma layout, but not so often that it's worth going
      to these misleading vma->vm_next lengths: rip all that out.
      
      Originally I intended to retain the incrementation of addr, but correct its
      initial value: valid_swaphandles generally supplies an offset below the target
      addr (this is readaround rather than readahead), but addr has not been
      adjusted accordingly, so in the interleave case it has usually been allocating
      the target page from the "wrong" node (though that may not matter very much).
      
      But look at the equivalent shmem_swapin code: either by oversight or by
      design, though it has all the apparatus for choosing a new mempolicy per page,
      it uses the same idx throughout, choosing the same mempolicy and interleave
      node for each page of the cluster.
      
      Which is actually a much better strategy: each node has its own LRUs and its
      own kswapd, so if you're betting on any particular relationship between swap
      and node, the best bet is that nearby swap entries belong to pages from the
      same node - even when the mempolicy of the target page is to interleave.  And
      examining a map of nodes corresponding to swap entries on a numa=fake system
      bears this out.  (We could later tweak swap allocation to make it even more
      likely, but this patch is merely about removing cruft.)
      
      So, neither adjust nor increment addr in swapin_readahead, and then
      shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up
      once per cluster, and so few fields of pvma are used, let's skip the memset -
      from shmem_alloc_page also.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: allow sticky directory mount option · 75897d60
      Ken Chen authored
      Allow the sticky directory mount option for hugetlbfs.  This allows an
      admin to create a shared hugetlbfs mount point for multiple users, while
      preventing accidental file deletion where users may step on each other's
      files.  It is similar to the default tmpfs mount option, or the typical
      option used on /tmp.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bufferhead: revert constructor removal · b98938c3
      Christoph Lameter authored
      The constructor for buffer_head slabs was removed recently.  We need the
      constructor back in slab defrag in order to ensure that slab objects always
      have a definite state even before we allocate them.

      I think we mistakenly merged the removal of the constructor into a cleanup
      patch.  You (ie: akpm) had a test that showed that the removal of the
      constructor led to a small regression.  The prior state makes things easier
      for slab defrag.
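
      A hedged sketch of the kind of init-once constructor being restored (name
      and body illustrative, not quoted from the patch):

      static void init_buffer_head(struct kmem_cache *cachep, void *data)
      {
              struct buffer_head *bh = data;

              memset(bh, 0, sizeof(*bh));
              INIT_LIST_HEAD(&bh->b_assoc_buffers);
      }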
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>