1. 24 Jul, 2008 40 commits
    • Andy Whitcroft's avatar
      mm: record MAP_NORESERVE status on vmas and fix small page mprotect reservations · cdfd4325
      Andy Whitcroft authored
      With Mel's hugetlb private reservation support patches applied, strict
      overcommit semantics are applied to both shared and private huge page
      mappings.  This can be a problem if an application relied on unlimited
      overcommit semantics for private mappings.  An example of this would be an
      application which maps a huge area with the intention of using it very
      sparsely.  These application would benefit from being able to opt-out of
      the strict overcommit.  It should be noted that prior to hugetlb
      supporting demand faulting all mappings were fully populated and so
      applications of this type should be rare.
      
      This patch stack implements the MAP_NORESERVE mmap() flag for huge page
      mappings.  This flag has the same meaning as for small page mappings,
      suppressing reservations for that mapping.
      
      Thanks to Mel Gorman for reviewing a number of early versions of these
      patches.
      
      This patch:
      
      When a small page mapping is created with mmap() reservations are created
      by default for any memory pages required.  When the region is read/write
      the reservation is increased for every page, no reservation is needed for
      read-only regions (as they implicitly share the zero page).  Reservations
      are tracked via the VM_ACCOUNT vma flag which is present when the region
      has reservation backing it.  When we convert a region from read-only to
      read-write new reservations are aquired and VM_ACCOUNT is set.  However,
      when a read-only map is created with MAP_NORESERVE it is indistinguishable
      from a normal mapping.  When we then convert that to read/write we are
      forced to incorrectly create reservations for it as we have no record of
      the original MAP_NORESERVE.
      
      This patch introduces a new vma flag VM_NORESERVE which records the
      presence of the original MAP_NORESERVE flag.  This allows us to
      distinguish these two circumstances and correctly account the reserve.
      
      As well as fixing this FIXME in the code, this makes it much easier to
      introduce MAP_NORESERVE support for huge pages as this flag is available
      consistantly for the life of the mapping.  VM_ACCOUNT on the other hand is
      heavily used at the generic level in association with small pages.
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cdfd4325
    • Andy Whitcroft's avatar
      huge page private reservation review cleanups · e7c4b0bf
      Andy Whitcroft authored
      Create some new accessors for vma private data to cut down on and contain
      the casts.  Encapsulates the huge and small page offset calculations.
      Also adds a couple of VM_BUG_ONs for consistency.
      
      [akpm@linux-foundation.org: Make things static]
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7c4b0bf
    • Mel Gorman's avatar
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE)... · 04f2cbe3
      Mel Gorman authored
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediatly
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occured.
      
      2. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications depending on the
      no-reserves been set are unlikely to exist as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04f2cbe3
    • Mel Gorman's avatar
      hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() · a1e78772
      Mel Gorman authored
      This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
      a similar manner to the reservations taken for MAP_SHARED mappings.  The
      reserve count is accounted both globally and on a per-VMA basis for
      private mappings.  This guarantees that a process that successfully calls
      mmap() will successfully fault all pages in the future unless fork() is
      called.
      
      The characteristics of private mappings of hugetlbfs files behaviour after
      this patch are;
      
      1. The process calling mmap() is guaranteed to succeed all future faults until
         it forks().
      2. On fork(), the parent may die due to SIGKILL on writes to the private
         mapping if enough pages are not available for the COW. For reasonably
         reliable behaviour in the face of a small huge page pool, children of
         hugepage-aware processes should not reference the mappings; such as
         might occur when fork()ing to exec().
      3. On fork(), the child VMAs inherit no reserves. Reads on pages already
         faulted by the parent will succeed. Successful writes will depend on enough
         huge pages being free in the pool.
      4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
         and at fault time otherwise.
      
      Before this patch, all reads or writes in the child potentially needs page
      allocations that can later lead to the death of the parent.  This applies
      to reads and writes of uninstantiated pages as well as COW.  After the
      patch it is only a write to an instantiated page that causes problems.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1e78772
    • Mel Gorman's avatar
      hugetlb: move hugetlb_acct_memory() · fc1b8a73
      Mel Gorman authored
      This is a patchset to give reliable behaviour to a process that
      successfully calls mmap(MAP_PRIVATE) on a hugetlbfs file.  Currently, it
      is possible for the process to be killed due to a small hugepage pool size
      even if it calls mlock().
      
      MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time.  This
      guarantees all future faults against the mapping will succeed.  This
      allows local allocations at first use improving NUMA locality whilst
      retaining reliability.
      
      MAP_PRIVATE mappings do not reserve pages.  This can result in an
      application being SIGKILLed later if a huge page is not available at fault
      time.  This makes huge pages usage very ill-advised in some cases as the
      unexpected application failure cannot be detected and handled as it is
      immediately fatal.  Although an application may force instantiation of the
      pages using mlock(), this may lead to poor memory placement and the
      process may still be killed when performing COW.
      
      This patchset introduces a reliability guarantee for the process which
      creates a private mapping, i.e.  the process that calls mmap() on a
      hugetlbfs file successfully.  The first patch of the set is purely
      mechanical code move to make later diffs easier to read.  The second patch
      will guarantee faults up until the process calls fork().  After patch two,
      as long as the child keeps the mappings, the parent is no longer
      guaranteed to be reliable.  Patch 3 guarantees that the parent will always
      successfully COW by unmapping the pages from the child in the event there
      are insufficient pages in the hugepage pool in allocate a new page, be it
      via a static or dynamic pool.
      
      Existing hugepage-aware applications are unlikely to be affected by this
      change.  For much of hugetlbfs's history, pages were pre-faulted at mmap()
      time or mmap() failed which acts in a reserve-like manner.  If the pool is
      sized correctly already so that parent and child can fault reliably, the
      application will not even notice the reserves.  It's only when the pool is
      too small for the application to function perfectly reliably that the
      reserves come into play.
      
      Credit goes to Andy Whitcroft for cleaning up a number of mistakes during
      review before the patches were released.
      
      This patch:
      
      A later patch in this set needs to call hugetlb_acct_memory() before it is
      defined.  This patch moves the function without modification.  This makes
      later diffs easier to read.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc1b8a73
    • Johannes Weiner's avatar
      mm: drop unneeded pgdat argument from free_area_init_node() · 9109fb7b
      Johannes Weiner authored
      free_area_init_node() gets passed in the node id as well as the node
      descriptor.  This is redundant as the function can trivially get the node
      descriptor itself by means of NODE_DATA() and the node's id.
      
      I checked all the users and NODE_DATA() seems to be usable everywhere
      from where this function is called.
      Signed-off-by: default avatarJohannes Weiner <hannes@saeurebad.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9109fb7b
    • Andrew Morton's avatar
      mapping_set_error: add unlikely() · 2185e69f
      Andrew Morton authored
      This is called on a per-page basis and in the vast majority of cases
      `error' is zero.
      
      Cc: Guillaume Chazarain <guichaz@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2185e69f
    • Andy Whitcroft's avatar
      slob: record page flag overlays explicitly · 9023cb7e
      Andy Whitcroft authored
      SLOB reuses two page bits for internal purposes, it overlays PG_active and
      PG_private.  This is hidden away in slob.c.  Document these overlays
      explicitly in the main page-flags enum along with all the others.
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9023cb7e
    • Andy Whitcroft's avatar
      slub: record page flag overlays explicitly · 8a38082d
      Andy Whitcroft authored
      SLUB reuses two page bits for internal purposes, it overlays PG_active and
      PG_error.  This is hidden away in slub.c.  Document these overlays
      explicitly in the main page-flags enum along with all the others.
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Tested-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8a38082d
    • Andy Whitcroft's avatar
      page-flags: record page flag overlays explicitly · 0cad47cf
      Andy Whitcroft authored
      With the recent page flag reorganisation we have a single enum which
      defines the valid page flags and their values, nice and clear.  However
      there are a number of bits which are overloaded by different subsystems.
      Firstly there is PG_owner_priv_1 which is used by filesystems and by XEN.
      Secondly both SLOB and SLUB use a couple of extra page bits to manage
      internal state for pages they own; both overlay other bits.  All of these
      "aliases" are scattered about the source making it very hard for a reader
      to know if the bits are safe to rely on in all contexts; confusion here is
      bad.
      
      As we now have a single place where the bits are clearly assigned it makes
      sense to clarify the reuse of bits by making the aliases explicit and
      visible with the original bit assignments.  This patch creates explicit
      aliases within the enum itself for the overloaded bits, creates standard
      bit accessors PageFoo etc.  and uses those throughout.
      
      This version pulls the bit manipulation out to standard named page bit
      accessors as suggested by Christoph, it retains the explicit mapping to
      the overlayed bits.  A fusion of both ideas.  This has been SLUB and SLOB
      have been compile tested on x86_64 only, and SLUB boot tested.  If people
      feel this is worth doing then I can run a fuller set of testing.
      
      This patch:
      
      Some page flags are used for more than one purpose, for example
      PG_owner_priv_1.  Currently there are individual accessors for each user,
      each built using the common flag name far away from the bit definitions.
      This makes it hard to see all possible uses of these bits.
      
      Now that we have a single enum to generate the bit orders it makes sense
      to express overlays in the same place.  So create per use aliases for this
      bit in the main page-flags enum and use those in the accessors.
      
      [akpm@linux-foundation.org: fix xen]
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0cad47cf
    • Kentaro Makita's avatar
      fix soft lock up at NFS mount via per-SB LRU-list of unused dentries · da3bbdd4
      Kentaro Makita authored
      [Summary]
      
       Split LRU-list of unused dentries to one per superblock to avoid soft
       lock up during NFS mounts and remounting of any filesystem.
      
       Previously I posted here:
       http://lkml.org/lkml/2008/3/5/590
      
      [Descriptions]
      
      - background
      
        dentry_unused is a list of dentries which are not referenced.
        dentry_unused grows up when references on directories or files are
        released.  This list can be very long if there is huge free memory.
      
      - the problem
      
        When shrink_dcache_sb() is called, it scans all dentry_unused linearly
        under spin_lock(), and if dentry->d_sb is differnt from given
        superblock, scan next dentry.  This scan costs very much if there are
        many entries, and very ineffective if there are many superblocks.
      
        IOW, When we need to shrink unused dentries on one dentry, but scans
        unused dentries on all superblocks in the system.  For example, we scan
        500 dentries to unmount a filesystem, but scans 1,000,000 or more unused
        dentries on other superblocks.
      
        In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink
        unused dentries on NFS, but scans 100,000,000 unused dentries on
        superblocks in the system such as local ext3 filesystems.  I hear NFS
        mounting took 1 min on some system in use.
      
      * : NFS uses virtual filesystem in rpc layer, so NFS is affected by
        this problem.
      
        100,000,000 is possible number on large systems.
      
        Per-superblock LRU of unused dentried can reduce the cost in
        reasonable manner.
      
      - How to fix
      
        I found this problem is solved by David Chinner's "Per-superblock
        unused dentry LRU lists V3"(1), so I rebase it and add some fix to
        reclaim with fairness, which is in Andrew Morton's comments(2).
      
        1) http://lkml.org/lkml/2006/5/25/318
        2) http://lkml.org/lkml/2006/5/25/320
      
        Split LRU-list of unused dentries to each superblocks.  Then, NFS
        mounting will check dentries under a superblock instead of all.  But
        this spliting will break LRU of dentry-unused.  So, I've attempted to
        make reclaim unused dentrins with fairness by calculate number of
        dentries to scan on this sb based on following way
      
        number of dentries to scan on this sb =
        count * (number of dentries on this sb / number of dentries in the machine)
      
      - ToDo
       - I have to measuring performance number and do stress tests.
      
       - When unmount occurs during prune_dcache(), scanning on same
        superblock, It is unable to reach next superblock because it is gone
        away.  We restart scannig superblock from first one, it causes
        unfairness of reclaim unused dentries on first superblock.  But I think
        this happens very rarely.
      
      - Test Results
      
        Result on 6GB boxes with excessive unused dentries.
      
      Without patch:
      
      $ cat /proc/sys/fs/dentry-state
      10181835        10180203        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m1.830s
      user    0m0.001s
      sys     0m1.653s
      
       With this patch:
      $ cat /proc/sys/fs/dentry-state
      10236610        10234751        45      0       0       0
      # mount -t nfs 10.124.60.70:/work/kernel-src nfs
      real    0m0.106s
      user    0m0.002s
      sys     0m0.032s
      
      [akpm@linux-foundation.org: fix comments]
      Signed-off-by: default avatarKentaro Makita <k-makita@np.css.fujitsu.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: David Chinner <dgc@sgi.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da3bbdd4
    • Andy Whitcroft's avatar
      buddy: clarify comments describing buddy merge · 3c82d0ce
      Andy Whitcroft authored
      In __free_one_page(), the comment "Move the buddy up one level" appears
      attached to the break and by implication when the break is taken we are
      moving it up one level:
      
      	if (!page_is_buddy(page, buddy, order))
      		break;          /* Move the buddy up one level. */
      
      In reality the inverse is true, we break out when we can no longer merge
      this page with its buddy.  Looking back into pre-history (into the full
      git history) it appears that these two lines accidentally got joined as
      part of another change.
      
      Move the comment down where it belongs below the if and clarify its
      language.
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c82d0ce
    • Jan Beulich's avatar
      mm: remove double indirection on tlb parameter to free_pgd_range() & Co · 42b77728
      Jan Beulich authored
      The double indirection here is not needed anywhere and hence (at least)
      confusing.
      Signed-off-by: default avatarJan Beulich <jbeulich@novell.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: default avatarJeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      42b77728
    • Benjamin Herrenschmidt's avatar
      spufs: use new vm_ops->access to allow local state access from gdb · a352894d
      Benjamin Herrenschmidt authored
      This uses the new vm_ops->access to allow gdb to access the SPU local
      store.  We currently prevent access to problem state registers, this can
      be done later if really needed but it's safer not to.
      
      [akpm@linux-foundation.org: fix typo]
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a352894d
    • Benjamin Herrenschmidt's avatar
      powerpc ioremap_prot · a1f242ff
      Benjamin Herrenschmidt authored
      This adds ioremap_prot and pte_pgprot() so that one can extract protection
      bits from a PTE and use them to ioremap_prot() (in order to support ptrace
      of VM_IO | VM_PFNMAP as per Rik's patch).
      
      This moves a couple of flag checks around in the ioremap implementations
      of arch/powerpc.  There's a side effect of allowing non-cacheable and
      non-guarded mappings on ppc32 which before would always have _PAGE_GUARDED
      set whenever _PAGE_NO_CACHE is.
      
      (standard ioremap will still set _PAGE_GUARDED, but ioremap_prot will be
      capable of setting such a non guarded mapping).
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1f242ff
    • Rik van Riel's avatar
      use generic_access_phys for /dev/mem mappings · 7ae8ed50
      Rik van Riel authored
      Use generic_access_phys as the access_process_vm access function for
      /dev/mem mappings.  This makes it possible to debug the X server.
      
      [akpm@linux-foundation.org: repair all the architectures which broke]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Benjamin Herrensmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ae8ed50
    • Rik van Riel's avatar
      access_process_vm device memory infrastructure · 28b2ee20
      Rik van Riel authored
      In order to be able to debug things like the X server and programs using
      the PPC Cell SPUs, the debugger needs to be able to access device memory
      through ptrace and /proc/pid/mem.
      
      This patch:
      
      Add the generic_access_phys access function and put the hooks in place
      to allow access_process_vm to access device or PPC Cell SPU memory.
      
      [riel@redhat.com: Add documentation for the vm_ops->access function]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarBenjamin Herrensmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28b2ee20
    • Nick Piggin's avatar
      mm: remove nopfn · 0d71d10a
      Nick Piggin authored
      There are no users of nopfn in the tree. Remove it.
      
      [hugh@veritas.com: fix build error]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0d71d10a
    • Christoph Hellwig's avatar
      kill generic_file_direct_IO() · a969e903
      Christoph Hellwig authored
      generic_file_direct_IO is a common helper around the invocation of
      ->direct_IO.  But there's almost nothing shared between the read and write
      side, so we're better off without this helper.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a969e903
    • Adrian Bunk's avatar
      mm/hugetlb.c: fix duplicate variable · 75353bed
      Adrian Bunk authored
      It's confusing that set_max_huge_pages() contained two different
      variables named "ret", and although the code works correctly this should
      be fixed.
      
      The inner of the two variables can simply be removed.
      
      Spotted by sparse.
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Cc:  "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75353bed
    • Adrian Bunk's avatar
      mm/vmstat.c: proper externs · c748e134
      Adrian Bunk authored
      This patch adds proper extern declarations for five variables in
      include/linux/vmstat.h
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c748e134
    • Adrian Bunk's avatar
      mm/migrate.c should #include <linux/syscalls.h> · 4f5ca265
      Adrian Bunk authored
      Every file should include the headers containing the externs for its
      global functions (in this case for sys_move_pages()).
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f5ca265
    • KOSAKI Motohiro's avatar
      page allocator: inline some __alloc_pages() wrappers · e4048e5d
      KOSAKI Motohiro authored
      Two zonelist patch series rewrote __page_alloc() largely.  Now, it is just
      a wrapper function.  Inlining them will save a function call.
      
      [akpm@linux-foundation.org: export __alloc_pages_internal]
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4048e5d
    • Nick Piggin's avatar
      mspec: convert nopfn to fault · efe9e779
      Nick Piggin authored
      [akpm@linux-foundation.org: remove unused variable]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarJes Sorensen <jes@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      efe9e779
    • Johannes Weiner's avatar
      mm: unexport __alloc_bootmem_core() · ffc6421f
      Johannes Weiner authored
      This function has no external callers, so unexport it.  Also fix its naming
      inconsistency.
      Signed-off-by: default avatarJohannes Weiner <hannes@saeurebad.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ffc6421f
    • Johannes Weiner's avatar
      mm: normalize internal argument passing of bootmem data · 8ae04463
      Johannes Weiner authored
      All _core functions only need the bootmem data, not the whole node descriptor.
      Adjust the two functions that take the node descriptor unneededly.
      Signed-off-by: default avatarJohannes Weiner <hannes@saeurebad.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ae04463
    • Johannes Weiner's avatar
      mm: fix free_all_bootmem_core alignment check · 6b312c0e
      Johannes Weiner authored
      The check for node_boot_start is bogus because we start freeing at the
      corresponding pfn.  So check if the pfn is properly aligned instead in a more
      readable way and adjust the documentation.
      
      Also remove an unneeded accounting variable.
      Signed-off-by: default avatarJohannes Weiner <hannes@saeurebad.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b312c0e
    • Johannes Weiner's avatar
      mm: move bootmem descriptors definition to a single place · b61bfa3c
      Johannes Weiner authored
      There are a lot of places that define either a single bootmem descriptor or an
      array of them.  Use only one central array with MAX_NUMNODES items instead.
      Signed-off-by: default avatarJohannes Weiner <hannes@saeurebad.de>
      Acked-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Kyle McMartin <kyle@parisc-linux.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b61bfa3c
    • FUJITA Tomonori's avatar
      add a helper function to test if an object is on the stack · 8b05c7e6
      FUJITA Tomonori authored
      lib/debugobjects.c has a function to test if an object is on the stack.
      The block layer and ide needs it (they need to avoid DMA from/to stack
      buffers).  This patch moves the function to include/linux/sched.h so that
      everyone can use it.
      
      lib/debugobjects.c uses current->stack but this patch uses a
      task_stack_page() accessor, which is a preferable way to access the stack.
      Signed-off-by: default avatarFUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b05c7e6
    • Mel Gorman's avatar
      mm: print out the zonelists on request for manual verification · 68ad8df4
      Mel Gorman authored
      This patch prints out the zonelists during boot for manual verification by the
      user if the mminit_loglevel is MMINIT_VERIFY or higher.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68ad8df4
    • Mel Gorman's avatar
      mm: make defensive checks around PFN values registered for memory usage · 2dbb51c4
      Mel Gorman authored
      There are a number of different views to how much memory is currently active.
      There is the arch-independent zone-sizing view, the bootmem allocator and
      memory models view.
      
      Architectures register this information at different times and is not
      necessarily in sync particularly with respect to some SPARSEMEM limitations.
      
      This patch introduces mminit_validate_memmodel_limits() which is able to
      validate and correct PFN ranges with respect to the memory model.  It is only
      SPARSEMEM that currently validates itself.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2dbb51c4
    • Mel Gorman's avatar
      mm: verify the page links and memory model · 708614e6
      Mel Gorman authored
      Print out information on how the page flags are being used if mminit_loglevel
      is MMINIT_VERIFY or higher and unconditionally performs sanity checks on the
      flags regardless of loglevel.
      
      When the page flags are updated with section, node and zone information, a
      check are made to ensure the values can be retrieved correctly.  Finally we
      confirm that pfn_to_page and page_to_pfn are the correct inverse functions.
      
      [akpm@linux-foundation.org: fix printk warnings]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      708614e6
    • Mel Gorman's avatar
      mm: add a basic debugging framework for memory initialisation · 6b74ab97
      Mel Gorman authored
      Boot initialisation is very complex, with significant numbers of
      architecture-specific routines, hooks and code ordering.  While significant
      amounts of the initialisation is architecture-independent, it trusts the data
      received from the architecture layer.  This is a mistake, and has resulted in
      a number of difficult-to-diagnose bugs.
      
      This patchset adds some validation and tracing to memory initialisation.  It
      also introduces a few basic defensive measures.  The validation code can be
      explicitly disabled for embedded systems.
      
      This patch:
      
      Add additional debugging and verification code for memory initialisation.
      
      Once enabled, the verification checks are always run and when required
      additional debugging information may be outputted via a mminit_loglevel=
      command-line parameter.
      
      The verification code is placed in a new file mm/mm_init.c.  Ideally other mm
      initialisation code will be moved here over time.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b74ab97
    • David Brownell's avatar
      add HAVE_CLK to Kconfig, for driver dependencies · 9483a578
      David Brownell authored
      Flag platforms as HAVE_CLK (or not) in Kconfig, based on whether they
      support <linux/clk.h> calls, so that otherwise portable drivers which need
      those calls can list that dependency.
      
      Something like this is a prerequisite for merging the musb_hdrc driver,
      currently used on platforms including Davinci, OMAP2430, OMAP3xx ...  and
      the discrete TUSB6010 chip, which doesn't have a natural platform
      dependency.  (Used with OMAP 2420 in current Nokia N8x0 tablets.)
      Signed-off-by: default avatarDavid Brownell <dbrownell@users.sourceforge.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Acked-by: default avatarHaavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9483a578
    • Adrian Bunk's avatar
      remove is_tty() · d7ce20b2
      Adrian Bunk authored
      This patch removes the no longer used is_tty().
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Acked-by: default avatarAlan Cox <alan@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7ce20b2
    • Clemens Ladisch's avatar
      hpet: clarify maintainer entry · d36e74c4
      Clemens Ladisch authored
      The existing HPET maintainer entries are somewhat unclear about which one
      applies to what part of the kernel.
      Signed-off-by: default avatarClemens Ladisch <clemens@ladisch.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d36e74c4
    • Akinobu Mita's avatar
      move memory_read_from_buffer() from fs.h to string.h · e108526e
      Akinobu Mita authored
      James Bottomley warns that inclusion of linux/fs.h in a low level
      driver was always a danger signal.  This patch moves
      memory_read_from_buffer() from fs.h to string.h and fixes includes in
      existing memory_read_from_buffer() users.
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Bob Moore <robert.moore@intel.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e108526e
    • Linus Torvalds's avatar
      Merge branch 'x86/auditsc' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland · 338b9bb3
      Linus Torvalds authored
      * 'x86/auditsc' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
        i386 syscall audit fast-path
        x86_64 ia32 syscall audit fast-path
        x86_64 syscall audit fast-path
        x86_64: remove bogus optimization in sysret_signal
      338b9bb3
    • Linus Torvalds's avatar
      Merge branch 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip · 7f9dce38
      Linus Torvalds authored
      * 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        sched: hrtick_enabled() should use cpu_active()
        sched, x86: clean up hrtick implementation
        sched: fix build error, provide partition_sched_domains() unconditionally
        sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
        cpu hotplug: Make cpu_active_map synchronization dependency clear
        cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
        sched: rework of "prioritize non-migratable tasks over migratable ones"
        sched: reduce stack size in isolated_cpu_setup()
        Revert parts of "ftrace: do not trace scheduler functions"
      
      Fixed up conflicts in include/asm-x86/thread_info.h (due to the
      TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
      kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
      introduction).
      7f9dce38
    • Linus Torvalds's avatar
      Merge branch 'cpus4096-for-linus' of... · 26dcce0f
      Linus Torvalds authored
      Merge branch 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
        NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
        cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
        NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
        NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
        NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
        NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
        NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
        NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
        cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
        cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
        cpumask: Provide a generic set of CPUMASK_ALLOC macros
        cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
        cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
        cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
        cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
        cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
        cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
        Revert "cpumask: introduce new APIs"
        cpumask: make for_each_cpu_mask a bit smaller
        net: Pass reference to cpumask variable in net/sunrpc/svc.c
        ...
      
      Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
      26dcce0f