1. 05 Feb, 2008 40 commits
    • Fengguang Wu's avatar
      writeback: speed up writeback of big dirty files · 8bc3be27
      Fengguang Wu authored
      After making dirty a 100M file, the normal behavior is to start the
      writeback for all data after 30s delays.  But sometimes the following
      happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analyze shows that the internal io dispatch queues goes like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      1) initial state with a 100M file and a 1K file
      
      2) 4M written, nr_to_write <= 0, so write more
      
      3) 1K written, nr_to_write > 0, no more writes(BUG)
      
      nr_to_write > 0 in (3) fools the upper layer to think that data have all
      been written out.  The big dirty file is actually still sitting in
      s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
      becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
      may starve newly expired inodes in s_dirty.  It is also not an option to
      draw inodes from both s_more_io and s_dirty, an let the loop go on: this
      might lead to live locks, and might also starve other superblocks in sync
      time(well kupdate may still starve some superblocks, that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0
      does not necessarily mean that "all data are written".  This patch
      introduces a flag writeback_control.more_io to indicate that more io should
      be done.  With it the big dirty file no longer has to wait for the next
      kupdate invokation 5s later.
      
      In sync_sb_inodes() we only set more_io on super_blocks we actually
      visited.  This avoids the interaction between two pdflush deamons.
      
      Also in __sync_single_inode() we don't blindly keep requeuing the io if the
      filesystem cannot progress.  Failing to do so may lead to 100% iowait.
      Tested-by: default avatarMike Snitzer <snitzer@gmail.com>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8bc3be27
    • Sam Ravnborg's avatar
      mm: fix section mismatch warning in sparse.c · a322f8ab
      Sam Ravnborg authored
      Fix following warning:
      WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node()
      
      static sparse_early_usemap_alloc() were used only by sparse_init()
      and with sparse_init() annotated _init it is safe to
      annotate sparse_early_usemap_alloc with __init too.
      Signed-off-by: default avatarSam Ravnborg <sam@ravnborg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a322f8ab
    • Nick Piggin's avatar
      mm: fix PageUptodate data race · 0ed361de
      Nick Piggin authored
      After running SetPageUptodate, preceeding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
      
      Many places that test PageUptodate, do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing allows us to get rid of the smp_wmb's in the page
      copying functions, which were especially added for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smb side) of the
      ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering. Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, wheras flush_dcache_page should be called each time
      it is written to through the kernel mapping. It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ed361de
    • Shaohua Li's avatar
      page migraton: handle orphaned pages · 62e1c553
      Shaohua Li authored
      Orphaned page might have fs-private metadata, the page is truncated.  As
      the page hasn't mapping, page migration refuse to migrate the page.  It
      appears the page is only freed in page reclaim and if zone watermark is
      low, the page is never freed, as a result migration always fail.  I thought
      we could free the metadata so such page can be freed in migration and make
      migration more reliable.
      
      [akpm@linux-foundation.org: go direct to try_to_free_buffers()]
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62e1c553
    • Yasunori Goto's avatar
      Document lowmem_reserve_ratio · 7786fa9a
      Yasunori Goto authored
      Though the lower_zone_protection was changed to lowmem_reserve_ratio, the
      document has been not changed.  The lowmem_reserve_ratio seems quite hard
      to estimate, but there is no guidance.  This patch is to change document
      for it.
      Signed-off-by: default avatarYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andrea Arcangeli <andrea@cpushare.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7786fa9a
    • Masatake YAMATO's avatar
      check ADVICE of fadvise64_64 even if get_xip_page is given · b5beb1ca
      Masatake YAMATO authored
      I've written some test programs in ltp project.  During writing I met an
      problem which I cannot solve in user land.  So I wrote a patch for linux
      kernel.  Please, include this patch if acceptable.
      
      The test program tests the 4th parameter of fadvise64_64:
      
          long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
      
      My test case calls fadvise64_64 with invalid advice value and checks errno is
      set to EINVAL.  About the advice parameter man page says:
      
          ...
          Permissible values for advice include:
      
      	   POSIX_FADV_NORMAL
                        ...
      	   POSIX_FADV_SEQUENTIAL
                        ...
      	   POSIX_FADV_RANDOM
      		  ...
      	   POSIX_FADV_NOREUSE
                        ...
      	   POSIX_FADV_WILLNEED
                        ...
      	   POSIX_FADV_DONTNEED
      		  ...
          ERRORS
                 ...
      	   EINVAL An invalid value was specified for advice.
      
      However, I got a bug report that the system call invocations
      in my test case returned 0 unexpectedly.
      
      I've inspected the kernel code:
      
          asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
          {
      	    struct file *file = fget(fd);
      	    struct address_space *mapping;
      	    struct backing_dev_info *bdi;
      	    loff_t endbyte;			/* inclusive */
      	    pgoff_t start_index;
      	    pgoff_t end_index;
      	    unsigned long nrpages;
      	    int ret = 0;
      
      	    if (!file)
      		    return -EBADF;
      
      	    if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
      		    ret = -ESPIPE;
      		    goto out;
      	    }
      
      	    mapping = file->f_mapping;
      	    if (!mapping || len < 0) {
      		    ret = -EINVAL;
      		    goto out;
      	    }
      
      	    if (mapping->a_ops->get_xip_page)
      		    /* no bad return value, but ignore advice */
      		    goto out;
          ...
          out:
      	    fput(file);
      	    return ret;
          }
      
      I found the advice parameter is just ignored in the case
      mapping->a_ops->get_xip_page is given. This behavior is different from
      what is written on the man page. Is this o.k.?
      
      get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
      Anyway I cannot find the easy way to detect get_xip_page
      field is given or CONFIG_EXT2_FS_XIP is true from the
      user space.
      
      I propose the following patch which checks the advice parameter
      even if get_xip_page is given.
      Signed-off-by: default avatarMasatake YAMATO <yamato@redhat.com>
      Acked-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5beb1ca
    • Larry Woodman's avatar
      Include count of pagecache pages in show_mem() output · e6f3602d
      Larry Woodman authored
      The show_mem() output does not include the total number of pagecache
      pages.  This would be helpful when analyzing the debug information in
      the /var/log/messages file after OOM kills occur.
      
      This patch includes the total pagecache pages in that output.
      Signed-off-by: default avatarLarry Woodman <lwoodman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6f3602d
    • Bjorn Steinbrink's avatar
      Fix dirty page accounting leak with ext3 data=journal · a2b34564
      Bjorn Steinbrink authored
      In 46d2277c ("Clean up and make
      try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
      was changed to bail out if the page was dirty.
      
      That in turn caused truncate_complete_page to leak massive amounts of
      memory, because the dirty bit was only cleared after the call to
      try_to_free_buffers.
      
      So the call to cancel_dirty_page was moved up to have the dirty bit
      cleared early in 3e67c098 ("truncate:
      clear page dirtiness before running try_to_free_buffers()").
      
      The problem with that fix is, that the page can be redirtied after
      cancel_dirty_page was called, eg. like this:
      
      truncate_complete_page()
        cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
        do_invalidatepage()
          ext3_invalidatepage()
            journal_invalidatepage()
              journal_unmap_buffer()
                __dispose_buffer()
                  __journal_unfile_buffer()
                    __journal_temp_unlink_buffer()
                      mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
      
      And then we end up with dirty pages being wrongly accounted.
      
      As a result, in ecdfc978 ("Resurrect
      'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
      were reverted, so the original reason for the massive memory leak is
      gone, and we can also revert the move of the call to cancel_dirty_page
      from truncate_complete_page and get the accounting right again.
      
      I'm not sure if it matters, but opposed to the final check in
      __remove_from_page_cache, this one also cares about the task io
      accounting, so maybe we want to use this instead, although it's not
      quite the clean fix either.
      Signed-off-by: default avatarBjörn Steinbrink <B.Steinbrink@gmx.de>
      Tested-by: default avatarKrzysztof Piotr Oledzki <ole@ans.pl>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Osterried <osterried@jesse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a2b34564
    • Qi Yong's avatar
      set_page_refcounted() VM_BUG_ON fix · ae1276b9
      Qi Yong authored
      The current PageTail semantic is that a PageTail page is first a
      PageCompound page.  So remove the redundant PageCompound test in
      set_page_refcounted().
      Signed-off-by: default avatarQi Yong <qiyong@fc-cn.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae1276b9
    • Harvey Harrison's avatar
      mm: remove fastcall from mm/ · 920c7a5d
      Harvey Harrison authored
      fastcall is always defined to be empty, remove it
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      920c7a5d
    • Andi Kleen's avatar
    • Qi Yong's avatar
      skip writing data pages when inode is under I_SYNC · 2d544564
      Qi Yong authored
      Since I_SYNC was split out from I_LOCK, the concern in commit
      4b89eed9 ("Write back inode data pages
      even when the inode itself is locked") is not longer valid.
      
      We should revert to the original behavior: in __writeback_single_inode(),
      when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
      skip writing entirely.  Otherwise, we are double calling do_writepages()
      Signed-off-by: default avatarQi Yong <qiyong@fc-cn.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Joern Engel <joern@wohnheim.fh-wedel.de>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d544564
    • Hugh Dickins's avatar
      mm: don't waste swap on locked pages · 5a9bbdcd
      Hugh Dickins authored
      try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
      migrating), and recycles it back to the active list.  But if it's an
      anonymous page, we've already allocated swap to it: just wasting swap.
      Spot locked pages in page_referenced_one and treat them as referenced.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Tested-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ethan Solomita <solo@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5a9bbdcd
    • Christoph Lameter's avatar
      vmstat: remove prefetch · 9eccf2a8
      Christoph Lameter authored
      Remove the prefetch logic in order to avoid touching impossible per cpu
      areas.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9eccf2a8
    • Andrea Arcangeli's avatar
      Fix /proc dcache deadlock in do_exit · 7766755a
      Andrea Arcangeli authored
      This patch fixes a sles9 system hang in start_this_handle from a customer
      with some heavy workload where all tasks are waiting on kjournald to commit
      the transaction, but kjournald waits on t_updates to go down to zero (it
      never does).
      
      This was reported as a lowmem shortage deadlock but when checking the debug
      data I noticed the VM wasn't under pressure at all (well it was really
      under vm pressure, because lots of tasks hanged in the VM prune_dcache
      methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
      mode, the holder of the journal handle should have if this was a vm issue
      in the first place).
      
      No task was apparently holding the leftover handle in the committing
      transaction, so I deduced t_updates was stuck to 1 because a journal_stop
      was never run by some path (this turned out to be correct).  With a debug
      patch adding proper reverse links and stack trace logging in ext3 deployed
      in production, I found journal_stop is never run because
      mark_inode_dirty_sync is called inside release_task called by do_exit.
      (that was quite fun because I would have never thought about this
      subtleness, I thought a regular path in ext3 had a bug and it forgot to
      call journal_stop)
      
      do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
      come back to run journal_stop)
      
      The reason is that shrink_dcache_parent is racy by design (feature not
      a bug) and it can do blocking I/O in some case, but the point is that
      calling shrink_dcache_parent at the last stage of do_exit isn't safe
      for self-reaping tasks.
      
      I guess the memory pressure of the unbalanced highmem system allowed
      to trigger this more easily.
      
      Now mainline doesn't have this line in iput (like sles9 has):
      
          	     if (inode->i_state & I_DIRTY_DELAYED)
      	     			mark_inode_dirty_sync(inode);
      
      so it will probably not crash with ext3, but for example ext2 implements an
      I/O-blocking ext2_put_inode that will lead to similar screwups with
      ext2_free_blocks never coming back and it's definitely wrong to call
      blocking-IO paths inside do_exit.  So this should fix a subtle bug in
      mainline too (not verified in practice though).  The equivalent fix for
      ext3 is also not verified yet to fix the problem in sles9 but I don't have
      doubt it will (it usually takes days to crash, so it'll take weeks to be
      sure).
      
      An alternate fix would be to offload that work to a kernel thread, but I
      don't think a reschedule for this is worth it, the vm should be able to
      collect those entries for the synchronous release_task.
      Signed-off-by: default avatarAndrea Arcangeli <andrea@suse.de>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7766755a
    • Bron Gondwana's avatar
      mm/page-writeback: highmem_is_dirtyable option · 195cf453
      Bron Gondwana authored
      Add vm.highmem_is_dirtyable toggle
      
      A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
      approximately 2Gb size which contains a hash format that is written
      randomly by the dbclean process.  On 2.6.16 this process took a few
      minutes.  With lowmem only accounting of dirty ratios, this takes about 12
      hours of 100% disk IO, all random writes.
      
      Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
      add the highmem back to the total available memory count.
      
      [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
      Signed-off-by: default avatarBron Gondwana <brong@fastmail.fm>
      Cc: Ethan Solomita <solo@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      195cf453
    • Christoph Lameter's avatar
      Page allocator: get rid of the list of cold pages · 3dfa5721
      Christoph Lameter authored
      We have repeatedly discussed if the cold pages still have a point. There is
      one way to join the two lists: Use a single list and put the cold pages at the
      end and the hot pages at the beginning. That way a single list can serve for
      both types of allocations.
      
      The discussion of the RFC for this and Mel's measurements indicate that
      there may not be too much of a point left to having separate lists for
      hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Martin Bligh <mbligh@mbligh.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3dfa5721
    • Robert Bragg's avatar
      mm: don't allow ioremapping of ranges larger than vmalloc space · 5dc33185
      Robert Bragg authored
      When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the
      vmlist search routine in __get_vm_area_node can mistakenly allow a driver
      to ioremap a range larger than vmalloc space.
      
      If at the time of the ioremap all existing vmlist areas sit below the
      determined alignment then the search routine continues past all entries and
      exits the for loop - straight into the found: label - without ever testing
      for integer wrapping or that the requested size fits.
      
      We were seeing a driver successfully ioremap 128M of flash even though
      there was only 120M of vmalloc space.  From that point the system was left
      with the remainder of the first 16M of space to vmalloc/ioremap within.
      Signed-off-by: default avatarRobert Bragg <robert@sixbynine.org>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5dc33185
    • Christoph Lameter's avatar
      vmstat: small revisions to refresh_cpu_vm_stats() · a7f75e25
      Christoph Lameter authored
      1. Add comments explaining how the function can be called.
      
      2. Collect global diffs in a local array and only spill
         them once into the global counters when the zone scan
         is finished. This means that we only touch each global
         counter once instead of each time we fold cpu counters
         into zone counters.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7f75e25
    • Martin Schwidefsky's avatar
      arch_rebalance_pgtables call · 08e7d9b5
      Martin Schwidefsky authored
      In order to change the layout of the page tables after an mmap has crossed the
      adress space limit of the current page table layout a architecture hook in
      get_unmapped_area is needed.  The arguments are the address of the new mapping
      and the length of it.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      08e7d9b5
    • Benjamin Herrenschmidt's avatar
      add mm argument to pte/pmd/pud/pgd_free · 5e541973
      Benjamin Herrenschmidt authored
      (with Martin Schwidefsky <schwidefsky@de.ibm.com>)
      
      The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
      first argument.  The free functions do not get the mm_struct argument.  This
      is 1) asymmetrical and 2) to do mm related page table allocations the mm
      argument is needed on the free function as well.
      
      [kamalesh@linux.vnet.ibm.com: i386 fix]
      [akpm@linux-foundation.org: coding-syle fixes]
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5e541973
    • Christoph Lameter's avatar
      Page allocator: clean up pcp draining functions · 9f8f2172
      Christoph Lameter authored
      - Add comments explaing how drain_pages() works.
      
      - Eliminate useless functions
      
      - Rename drain_all_local_pages to drain_all_pages(). It does drain
        all pages not only those of the local processor.
      
      - Eliminate useless interrupt off / on sequences. drain_pages()
        disables interrupts on its own. The execution thread is
        pinned to processor by the caller. So there is no need to
        disable interrupts.
      
      - Put drain_all_pages() declaration in gfp.h and remove the
        declarations from suspend.h and from mm/memory_hotplug.c
      
      - Make software suspend call drain_all_pages(). The draining
        of processor local pages is may not the right approach if
        software suspend wants to support SMP. If they call drain_all_pages
        then we can make drain_pages() static.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Daniel Walker <dwalker@mvista.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f8f2172
    • Nick Piggin's avatar
      radix-tree: avoid atomic allocations for preloaded insertions · e2848a0e
      Nick Piggin authored
      Most pagecache (and some other) radix tree insertions have the great
      opportunity to preallocate a few nodes with relaxed gfp flags.  But the
      preallocation is squandered when it comes time to allocate a node, we
      default to first attempting a GFP_ATOMIC allocation -- that doesn't
      normally fail, but it can eat into atomic memory reserves that we don't
      need to be using.
      
      Another upshot of this is that it removes the sometimes highly contended
      zone->lock from underneath tree_lock.  Pagecache insertions are always
      performed with a radix tree preload, and after this change, such a
      situation will never fall back to kmem_cache_alloc within
      radix_tree_node_alloc.
      
      David Miller reports seeing this allocation fail on a highly threaded
      sparc64 system:
      
      [527319.459981] dd: page allocation failure. order:0, mode:0x20
      [527319.460403] Call Trace:
      [527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
      [527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
      [527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
      [527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
      [527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
      [527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
      [527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
      [527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
      [527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
      [527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
      [527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
      [527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
      [527319.461361]  [00000000004bb920] sys_read+0x34/0x60
      [527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40
      
      The calltrace is significant: __do_page_cache_readahead allocates a number
      of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
      memory to satisfy GFP_ATOMIC allocations.  However after the list of pages
      goes to mpage_readpages, there can be significant intervals (including disk
      IO) before all the pages are inserted into the radix-tree.  So the reserves
      can easily be depleted at that point.  The patch is confirmed to fix the
      problem.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e2848a0e
    • Adrian Bunk's avatar
      make __vmalloc_area_node() static · e31d9eb5
      Adrian Bunk authored
      __vmalloc_area_node() can become static.
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e31d9eb5
    • Balbir Singh's avatar
      Remove unused code from mm/tiny-shmem.c · 625d9573
      Balbir Singh authored
      This code in mm/tiny-shmem.c is under #if 0 - remove it.
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      625d9573
    • Adrian Bunk's avatar
      mm/page-writeback.c: make a function static · f61eaf9f
      Adrian Bunk authored
      task_dirty_limit() can become static.
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f61eaf9f
    • Matt Mackall's avatar
      maps4: make page monitoring /proc file optional · 1e883281
      Matt Mackall authored
      Make /proc/ page monitoring configurable
      
      This puts the following files under an embedded config option:
      
      /proc/pid/clear_refs
      /proc/pid/smaps
      /proc/pid/pagemap
      /proc/kpagecount
      /proc/kpageflags
      
      [akpm@linux-foundation.org: Kconfig fix]
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e883281
    • Matt Mackall's avatar
      maps4: add /proc/kpageflags interface · 304daa81
      Matt Mackall authored
      This makes a subset of physical page flags available to userspace. Together
      with /proc/pid/kpagemap, this allows tracking of a wide variety of VM behaviors.
      
      Exported flags are decoupled from the kernel's internal flags. This
      allows us to reorder flag bits, and synthesize any bits that get
      redefined in terms of other bits.
      
      [akpm@linux-foundation.org: remove unneeded access_ok()]
      [akpm@linux-foundation.org: s/0/NULL/]
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      304daa81
    • Matt Mackall's avatar
      maps4: add /proc/kpagecount interface · 161f47bf
      Matt Mackall authored
      This makes physical page map counts available to userspace. Together
      with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
      monitor memory usage on a per-page basis.
      
      [akpm@linux-foundation.org: remove unneeded access_ok()]
      [bunk@stusta.de: make struct proc_kpagemap static]
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      161f47bf
    • Matt Mackall's avatar
      maps4: add /proc/pid/pagemap interface · 85863e47
      Matt Mackall authored
      This interface provides a mapping for each page in an address space to its
      physical page frame number, allowing precise determination of what pages are
      mapped and what pages are shared between processes.
      
      New in this version:
      
      - headers gone again (as recommended by Dave Hansen and Alan Cox)
      - 64-bit entries (as per discussion with Andi Kleen)
      - swap pte information exported (from Dave Hansen)
      - page walker callback for holes (from Dave Hansen)
      - direct put_user I/O (as suggested by Rusty Russell)
      
      This patch folds in cleanups and swap PTE support from Dave Hansen
      <haveblue@us.ibm.com>.
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85863e47
    • Matt Mackall's avatar
      maps4: regroup task_mmu by interface · a6198797
      Matt Mackall authored
      Reorder source so that all the code and data for each interface is together.
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6198797
    • Matt Mackall's avatar
      maps4: move clear_refs code to task_mmu.c · f248dcb3
      Matt Mackall authored
      This puts all the clear_refs code where it belongs and probably lets things
      compile on MMU-less systems as well.
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f248dcb3
    • Matt Mackall's avatar
      maps4: simplify interdependence of maps and smaps · 4752c369
      Matt Mackall authored
      This pulls the shared map display code out of show_map and puts it in
      show_smap where it belongs.
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4752c369
    • Matt Mackall's avatar
      maps4: use pagewalker in clear_refs and smaps · b3ae5acb
      Matt Mackall authored
      Use the generic pagewalker for smaps and clear_refs
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3ae5acb
    • Matt Mackall's avatar
      maps4: introduce a generic page walker · e6473092
      Matt Mackall authored
      Introduce a general page table walker
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6473092
    • Matt Mackall's avatar
      maps4: move is_swap_pte · 698dd4ba
      Matt Mackall authored
      Move is_swap_pte helper function to swapops.h for use by pagemap code
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      698dd4ba
    • Dave Hansen's avatar
      maps4: rework TASK_SIZE macros · 82455257
      Dave Hansen authored
      The following replaces the earlier patches sent.  It should address
      David Rientjes's comments, and has been compile tested on all the
      architectures that it touches, save for parisc.
      
      For the /proc/<pid>/pagemap code[1], we need to able to query how
      much virtual address space a particular task has.  The trick is
      that we do it through /proc and can't use TASK_SIZE since it
      references "current" on some arches.  The process opening the
      /proc file might be a 32-bit process opening a 64-bit process's
      pagemap file.
      
      x86_64 already has a TASK_SIZE_OF() macro:
      
      #define TASK_SIZE_OF(child)     ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)
      
      I'd like to have that for other architectures.  So, add it
      for all the architectures that actually use "current" in
      their TASK_SIZE.  For the others, just add a quick #define
      in sched.h to use plain old TASK_SIZE.
      
      1. http://www.linuxworld.com/news/2007/042407-kernel.html
      
      - MIPS portion from Ralf Baechle <ralf@linux-mips.org>
      
      [akpm@linux-foundation.org: fix mips build]
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82455257
    • Fengguang Wu's avatar
      maps4: add proportional set size accounting in smaps · ec4dd3eb
      Fengguang Wu authored
      The "proportional set size" (PSS) of a process is the count of pages it has
      in memory, where each page is divided by the number of processes sharing
      it.  So if a process has 1000 pages all to itself, and 1000 shared with one
      other process, its PSS will be 1500.
      
                     - lwn.net: "ELC: How much memory are applications really using?"
      
      The PSS proposed by Matt Mackall is a very nice metic for measuring an
      process's memory footprint.  So collect and export it via
      /proc/<pid>/smaps.
      
      Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
      job.  They are comprehensive tools.  But for PSS, let's do it in the simple
      way.
      
      Cc: John Berthels <jjberthels@gmail.com>
      Cc: Bernardo Innocenti <bernie@codewiz.org>
      Cc: Padraig Brady <P@draigBrady.com>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec4dd3eb
    • Christoph Hellwig's avatar
      clean up vmtruncate · 61d5048f
      Christoph Hellwig authored
      vmtruncate is a twisted maze of gotos, this patch cleans it up to have a
      proper if else for the two major cases of extending and truncating truncate
      and thus makes it a lot more readable while keeping exactly the same
      functinality.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      61d5048f
    • Hugh Dickins's avatar
      tmpfs: fix shmem_swaplist races · 1b1b32f2
      Hugh Dickins authored
      Intensive swapoff testing shows shmem_unuse spinning on an entry in
      shmem_swaplist pointing to itself: how does that come about?  Days pass...
      
      First guess is this: shmem_delete_inode tests list_empty without taking the
      global mutex (so the swapping case doesn't slow down the common case); but
      there's an instant in shmem_unuse_inode's list_move_tail when the list entry
      may appear empty (a rare case, because it's actually moving the head not the
      the list member).  So there's a danger of leaving the inode on the swaplist
      when it's freed, then reinitialized to point to itself when reused.  Fix that
      by skipping the list_move_tail when it's a no-op, which happens to plug this.
      
      But this same spinning then surfaces on another machine.  Ah, I'd never
      suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
      still hold page lock, which would hold off inode deletion if the page were in
      pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
      doesn't wait on locked pages).  Hmm: we could put the the inode on swaplist
      earlier, but then shmem_unuse_inode could never prune unswapped inodes.
      
      Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
      though I am a little uneasy about the iput which has to follow - it works, and
      I see nothing wrong with it, but it is surprising that shmem inode deletion
      may now occur below shmem_writepage.  Revisit this fix later?
      
      And while we're looking at these races: the way shmem_unuse tests swapped
      without holding info->lock looks unsafe, if we've more than one swap area: a
      racing shmem_writepage on another page of the same inode could be putting it
      in swapcache, just as we're deciding to remove the inode from swaplist -
      there's a danger of going on swap without being listed, so a later swapoff
      would hang, being unable to locate the entry.  Move that test and removal down
      into shmem_unuse_inode, once info->lock is held.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b1b32f2