1. 15 Dec, 2009 30 commits
    • David Rientjes's avatar
      mm: add gfp flags for NODEMASK_ALLOC slab allocations · bad44b5b
      David Rientjes authored
      Objects passed to NODEMASK_ALLOC() are relatively small in size and are
      backed by slab caches that are not of large order, traditionally never
      greater than PAGE_ALLOC_COSTLY_ORDER.
      
      Thus, using GFP_KERNEL for these allocations on large machines when
      CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
      the allocation attempt, each time invoking both direct reclaim or the oom
      killer.
      
      This is of particular interest when using NODEMASK_ALLOC() from a
      mempolicy context (either directly in mm/mempolicy.c or the mempolicy
      constrained hugetlb allocations) since the oom killer always kills current
      when allocations are constrained by mempolicies.  So for all present use
      cases in the kernel, current would end up being oom killed when direct
      reclaim fails.  That would allow the NODEMASK_ALLOC() to succeed but
      current would have sacrificed itself upon returning.
      
      This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
      CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
      All current use cases either directly from hugetlb code or indirectly via
      NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
      killer when the slab allocator needs to allocate additional pages.
      
      The side-effect of this change is that all current use cases of either
      NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
      when the allocation fails (never for CONFIG_NODES_SHIFT <= 8).  All
      current use cases were audited and do have appropriate error handling at
      this time.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bad44b5b
    • Lee Schermerhorn's avatar
      hugetlb: offload per node attribute registrations · 39da08cb
      Lee Schermerhorn authored
      Offload the registration and unregistration of per node hstate sysfs
      attributes to a worker thread rather than attempt the
      allocation/attachment or detachment/freeing of the attributes in the
      context of the memory hotplug handler.
      
      I don't know that this is absolutely required, but the registration can
      sleep in allocations and other mem hot plug handlers do it this way.  If
      it turns out this is NOT required, we can drop this patch.
      
      N.B.,  Only tested build, boot, libhugetlbfs regression.
             i.e., no memory hotplug testing.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39da08cb
    • Lee Schermerhorn's avatar
      hugetlb: handle memory hot-plug events · 4faf8d95
      Lee Schermerhorn authored
      Register per node hstate attributes only for nodes with memory.  As
      suggested by David Rientjes.
      
      With Memory Hotplug, memory can be added to a memoryless node and a node
      with memory can become memoryless.  Therefore, add a memory on/off-line
      notifier callback to [un]register a node's attributes on transition
      to/from memoryless state.
      
      N.B.,  Only tested build, boot, libhugetlbfs regression.
             i.e., no memory hotplug testing.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4faf8d95
    • David Rientjes's avatar
      mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlined · 8fe23e05
      David Rientjes authored
      When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
      there are no present pages left.
      
      In such a situation, kswapd must also be stopped since it has nothing left
      to do.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fe23e05
    • Lee Schermerhorn's avatar
      hugetlb: use only nodes with memory for huge pages · 9b5e5d0f
      Lee Schermerhorn authored
      Register per node hstate sysfs attributes only for nodes with memory.
      Global replacement of 'all online nodes" with "all nodes with memory" in
      mm/hugetlb.c.  Suggested by David Rientjes.
      
      A subsequent patch will handle adding/removing of per node hstate sysfs
      attributes when nodes transition to/from memoryless state via memory
      hotplug.
      
      NOTE: this patch has not been tested with memoryless nodes.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b5e5d0f
    • Lee Schermerhorn's avatar
      hugetlb: update hugetlb documentation for NUMA controls · 267b4c28
      Lee Schermerhorn authored
      Update the kernel huge tlb documentation to describe the numa memory
      policy based huge page management.  Additionaly, the patch includes a fair
      amount of rework to improve consistency, eliminate duplication and set the
      context for documenting the memory policy interaction.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      267b4c28
    • Lee Schermerhorn's avatar
      hugetlb: add per node hstate attributes · 9a305230
      Lee Schermerhorn authored
      Add the per huge page size control/query attributes to the per node
      sysdevs:
      
      /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
      	nr_hugepages       - r/w
      	free_huge_pages    - r/o
      	surplus_huge_pages - r/o
      
      The patch attempts to re-use/share as much of the existing global hstate
      attribute initialization and handling, and the "nodes_allowed" constraint
      processing as possible.
      
      Calling set_max_huge_pages() with no node indicates a change to global
      hstate parameters.  In this case, any non-default task mempolicy will be
      used to generate the nodes_allowed mask.  A valid node id indicates an
      update to that node's hstate parameters, and the count argument specifies
      the target count for the specified node.  From this info, we compute the
      target global count for the hstate and construct a nodes_allowed node mask
      contain only the specified node.
      
      Setting the node specific nr_hugepages via the per node attribute
      effectively ignores any task mempolicy or cpuset constraints.
      
      With this patch:
      
      (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
      ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
      
      Starting from:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     0
      Node 2 HugePages_Free:      0
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 0
      
      Allocate 16 persistent huge pages on node 2:
      (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
      
      [Note that this is equivalent to:
      	numactl -m 2 hugeadmin --pool-pages-min 2M:+16
      ]
      
      Yields:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:    16
      Node 2 HugePages_Free:     16
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 16
      
      Global controls work as expected--reduce pool to 8 persistent huge pages:
      (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     8
      Node 2 HugePages_Free:      8
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a305230
    • Lee Schermerhorn's avatar
      hugetlb: add generic definition of NUMA_NO_NODE · 4e25b257
      Lee Schermerhorn authored
      Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
      to generic header 'linux/numa.h' for use in generic code.  NUMA_NO_NODE
      replaces bare '-1' where it's used in this series to indicate "no node id
      specified".  Ultimately, it can be used to replace the -1 elsewhere where
      it is used similarly.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e25b257
    • Lee Schermerhorn's avatar
      hugetlb: derive huge pages nodes allowed from task mempolicy · 06808b08
      Lee Schermerhorn authored
      This patch derives a "nodes_allowed" node mask from the numa mempolicy of
      the task modifying the number of persistent huge pages to control the
      allocation, freeing and adjusting of surplus huge pages when the pool page
      count is modified via the new sysctl or sysfs attribute
      "nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:
      
      * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
        is produced.  This will cause the hugetlb subsystem to use
        node_online_map as the "nodes_allowed".  This preserves the
        behavior before this patch.
      * For "preferred" mempolicy, including explicit local allocation,
        a nodemask with the single preferred node will be produced.
        "local" policy will NOT track any internode migrations of the
        task adjusting nr_hugepages.
      * For "bind" and "interleave" policy, the mempolicy's nodemask
        will be used.
      * Other than to inform the construction of the nodes_allowed node
        mask, the actual mempolicy mode is ignored.  That is, all modes
        behave like interleave over the resulting nodes_allowed mask
        with no "fallback".
      
      See the updated documentation [next patch] for more information
      about the implications of this patch.
      
      Examples:
      
      Starting with:
      
      	Node 0 HugePages_Total:     0
      	Node 1 HugePages_Total:     0
      	Node 2 HugePages_Total:     0
      	Node 3 HugePages_Total:     0
      
      Default behavior [with or without this patch] balances persistent
      hugepage allocation across nodes [with sufficient contiguous memory]:
      
      	sysctl vm.nr_hugepages[_mempolicy]=32
      
      yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:     8
      	Node 3 HugePages_Total:     8
      
      Of course, we only have nr_hugepages_mempolicy with the patch,
      but with default mempolicy, nr_hugepages_mempolicy behaves the
      same as nr_hugepages.
      
      Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
      '--membind' because it allows multiple nodes to be specified
      and it's easy to type]--we can allocate huge pages on
      individual nodes or sets of nodes.  So, starting from the
      condition above, with 8 huge pages per node, add 8 more to
      node 2 using:
      
      	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
      
      This yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The incremental 8 huge pages were restricted to node 2 by the
      specified mempolicy.
      
      Similarly, we can use mempolicy to free persistent huge pages
      from specified nodes:
      
      	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
      
      yields:
      
      	Node 0 HugePages_Total:     4
      	Node 1 HugePages_Total:     4
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The 8 huge pages freed were balanced over nodes 0 and 1.
      
      [rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      06808b08
    • Lee Schermerhorn's avatar
      hugetlb: factor init_nodemask_of_node() · c1e6c8d0
      Lee Schermerhorn authored
      Factor init_nodemask_of_node() out of the nodemask_of_node() macro.
      
      This will be used to populate the huge pages "nodes_allowed" nodemask for
      a single node when basing nodes_allowed on a preferred/local mempolicy or
      when a persistent huge page pool page count is modified via a per node
      sysfs attribute.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1e6c8d0
    • Lee Schermerhorn's avatar
      hugetlb: add nodemask arg to huge page alloc, free and surplus adjust functions · 6ae11b27
      Lee Schermerhorn authored
      In preparation for constraining huge page allocation and freeing by the
      controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
      to the allocate, free and surplus adjustment functions.  For now, pass
      NULL to indicate default behavior--i.e., use node_online_map.  A
      subsqeuent patch will derive a non-default mask from the controlling
      task's numa mempolicy.
      
      Note that this method of updating the global hstate nr_hugepages under the
      constraint of a nodemask simplifies keeping the global state
      consistent--especially the number of persistent and surplus pages relative
      to reservations and overcommit limits.  There are undoubtedly other ways
      to do this, but this works for both interfaces: mempolicy and per node
      attributes.
      
      [rientjes@google.com: fix HIGHMEM compile error]
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6ae11b27
    • Lee Schermerhorn's avatar
      hugetlb: rework hstate_next_node_* functions · 9a76db09
      Lee Schermerhorn authored
      Modify the hstate_next_node* functions to allow them to be called to
      obtain the "start_nid".  Then, whereas prior to this patch we
      unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
      we successfully allocated/freed a huge page on the node, now we only call
      these functions on failure to alloc/free to advance to next allowed node.
      
      Factor out the next_node_allowed() function to handle wrap at end of
      node_online_map.  In this version, the allowed nodes include all of the
      online nodes.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a76db09
    • David Rientjes's avatar
      nodemask: make NODEMASK_ALLOC more general · 4e7b8a6c
      David Rientjes authored
      This is a series of patches to provide control over the location of the
      allocation and freeing of persistent huge pages on a NUMA platform.
      Please consider for merging into mmotm.
      
      This series uses two mechanisms to constrain the nodes from which
      persistent huge pages are allocated: 1) the task NUMA mempolicy of the
      task modifying a new sysctl "nr_hugepages_mempolicy", based on a
      suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
      attributes have been added [in V4] to each node system device under:
      
      	/sys/devices/node/node[0-9]*/hugepages
      
      The per node attibutes allow direct assignment of a huge page count on a
      specific node, regardless of the task's mempolicy or cpuset constraints.
      
      This patch:
      
      NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary.
      It's perfectly reasonable to use this macro to allocate a nodemask_t,
      which is anonymous, either dynamically or on the stack depending on
      NODES_SHIFT.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e7b8a6c
    • KOSAKI Motohiro's avatar
      mm: move inc_zone_page_state(NR_ISOLATED) to just isolated place · 6d9c285a
      KOSAKI Motohiro authored
      Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
      in right after isolate_page().
      
      This patch does it.
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d9c285a
    • Wu Fengguang's avatar
      /dev/mem: remove redundant parameter from do_write_kmem() · ee32398f
      Wu Fengguang authored
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee32398f
    • Wu Fengguang's avatar
      /dev/mem: remove the "written" variable in write_kmem() · 80ad89a0
      Wu Fengguang authored
      Also rename "len" to "sz". No behavior change.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80ad89a0
    • Wu Fengguang's avatar
      /dev/mem: make size_inside_page() logic straight · 7fabaddd
      Wu Fengguang authored
      Also convert more size_inside_page() users.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7fabaddd
    • Wu Fengguang's avatar
      /dev/mem: cleanup unxlate_dev_mem_ptr() calls · fa29e97b
      Wu Fengguang authored
      No behaviour change.
      
      [akpm@linux-foundation.org: cleanuplets]
      [akpm@linux-foundation.org: remove unused `ret']
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Avi Kivity <avi@qumranet.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa29e97b
    • Wu Fengguang's avatar
      /dev/mem: introduce size_inside_page() · f222318e
      Wu Fengguang authored
      Introduce size_inside_page() to replace duplicate /dev/mem code.
      
      Also apply it to /dev/kmem, whose alignment logic was buggy.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Avi Kivity <avi@qumranet.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f222318e
    • Wu Fengguang's avatar
      /dev/mem: remove redundant test on len · 4ea2f43f
      Wu Fengguang authored
      The len test in write_kmem() is always true, so can be reduced.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Avi Kivity <avi@qumranet.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ea2f43f
    • KOSAKI Motohiro's avatar
      mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap() · 659ace58
      KOSAKI Motohiro authored
      On ia64, the following test program exit abnormally, because glibc thread
      library called abort().
      
       ========================================================
       (gdb) bt
       #0  0xa000000000010620 in __kernel_syscall_via_break ()
       #1  0x20000000003208e0 in raise () from /lib/libc.so.6.1
       #2  0x2000000000324090 in abort () from /lib/libc.so.6.1
       #3  0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
       #4  0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
       #5  0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
       ========================================================
      
      The fact is, glibc call munmap() when thread exitng time for freeing
      stack, and it assume munlock() never fail.  However, munmap() often make
      vma splitting and it with many mapcount make -ENOMEM.
      
      Oh well, that's crazy, because stack unmapping never increase mapcount.
      The maxcount exceeding is only temporary.  internal temporary exceeding
      shouldn't make ENOMEM.
      
      This patch does it.
      
       test_max_mapcount.c
       ==================================================================
        #include<stdio.h>
        #include<stdlib.h>
        #include<string.h>
        #include<pthread.h>
        #include<errno.h>
        #include<unistd.h>
      
        #define THREAD_NUM 30000
        #define MAL_SIZE (8*1024*1024)
      
       void *wait_thread(void *args)
       {
       	void *addr;
      
       	addr = malloc(MAL_SIZE);
       	sleep(10);
      
       	return NULL;
       }
      
       void *wait_thread2(void *args)
       {
       	sleep(60);
      
       	return NULL;
       }
      
       int main(int argc, char *argv[])
       {
       	int i;
       	pthread_t thread[THREAD_NUM], th;
       	int ret, count = 0;
       	pthread_attr_t attr;
      
       	ret = pthread_attr_init(&attr);
       	if(ret) {
       		perror("pthread_attr_init");
       	}
      
       	ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
       	if(ret) {
       		perror("pthread_attr_setdetachstate");
       	}
      
       	for (i = 0; i < THREAD_NUM; i++) {
       		ret = pthread_create(&th, &attr, wait_thread, NULL);
       		if(ret) {
       			fprintf(stderr, "[%d] ", count);
       			perror("pthread_create");
       		} else {
       			printf("[%d] create OK.\n", count);
       		}
       		count++;
      
       		ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
       		if(ret) {
       			fprintf(stderr, "[%d] ", count);
       			perror("pthread_create");
       		} else {
       			printf("[%d] create OK.\n", count);
       		}
       		count++;
       	}
      
       	sleep(3600);
       	return 0;
       }
       ==================================================================
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      659ace58
    • Alex Chiang's avatar
      page-types: exit early when invoked with -d|--describe · bb86a733
      Alex Chiang authored
      On a system with large amount of memory (256GB), invoking page-types can
      take quite a long time, which is unreasonable considering the user only
      wants a description of the flags:
      
      	# time ./page-types -d 0x10
      	0x0000000000000010	____D_____________________________	dirty
      
      	real	0m34.285s
      	user	0m1.966s
      	sys	0m32.313s
      
      This is because we still walk the entire address range.
      
      Exiting early seems like a reasonble solution:
      
      # time ./page-types -d 0x10
      	0x0000000000000010	____D_____________________________	dirty
      
      	real	0m0.007s
      	user	0m0.001s
      	sys	0m0.005s
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Haicheng Li <haicheng.li@intel.com>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb86a733
    • Alex Chiang's avatar
      page-types: whitespace alignment · 9fdcd886
      Alex Chiang authored
      Align the output when page-type -h is invoked.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9fdcd886
    • Alex Chiang's avatar
      page-types: learn to describe flags directly from command line · dcfe730c
      Alex Chiang authored
      Teach page-types to describe page flags directly from the command line.
      
      Why is this useful?  For instance, if you're using memory hotplug and see
      this in /var/log/messages:
      
      	kernel: removing from LRU failed 3836dd0/1/1e00000000000010
      
      It would be nice to decode those page flags without staring at the source.
      
      Example usage and output:
      
      # Documentation/vm/page-types -d 0x10
      0x0000000000000010	____D_____________________________	dirty
      
      # Documentation/vm/page-types -d anon
      0x0000000000001000	____________a_____________________	anonymous
      
      # Documentation/vm/page-types -d anon,0x10
      0x0000000000001010	____D_______a_____________________	dirty,anonymous
      
      [achiang@hp.com: documentation]
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Haicheng Li <haicheng.li@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcfe730c
    • Roel Kluin's avatar
      page-types: unsigned cannot be less than 0 in add_page() · f1327bf1
      Roel Kluin authored
      If not signed, testing of the read() return value in this function
      will not work.
      Signed-off-by: default avatarRoel Kluin <roel.kluin@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1327bf1
    • Tommi Rantala's avatar
      page-types: constify read only arrays · 3428838d
      Tommi Rantala authored
      Signed-off-by: default avatarTommi Rantala <tt.rantala@gmail.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3428838d
    • David Rientjes's avatar
      oom: dump stack and VM state when oom killer panics · 1b604d75
      David Rientjes authored
      The oom killer header, including information such as the allocation order
      and gfp mask, current's cpuset and memory controller, call trace, and VM
      state information is currently only shown when the oom killer has selected
      a task to kill.
      
      This information is omitted, however, when the oom killer panics either
      because of panic_on_oom sysctl settings or when no killable task was
      found.  It is still relevant to know crucial pieces of information such as
      the allocation order and VM state when diagnosing such issues, especially
      at boot.
      
      This patch displays the oom killer header whenever it panics so that bug
      reports can include pertinent information to debug the issue, if possible.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b604d75
    • Michal Marek's avatar
      MAINTAINERS: new kbuild maintainer · 5ce45962
      Michal Marek authored
      Sam was fine with handing over kbuild maintainership to me. The git
      trees are already in linux-next, a merge request will follow shortly.
      Acked-by: default avatarSam Ravnborg <sam@ravnborg.org>
      Signed-off-by: default avatarMichal Marek <mmarek@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ce45962
    • Amerigo Wang's avatar
      hfs: fix a potential buffer overflow · ec81aecb
      Amerigo Wang authored
      A specially-crafted Hierarchical File System (HFS) filesystem could cause
      a buffer overflow to occur in a process's kernel stack during a memcpy()
      call within the hfs_bnode_read() function (at fs/hfs/bnode.c:24).  The
      attacker can provide the source buffer and length, and the destination
      buffer is a local variable of a fixed length.  This local variable (passed
      as "&entry" from fs/hfs/dir.c:112 and allocated on line 60) is stored in
      the stack frame of hfs_bnode_read()'s caller, which is hfs_readdir().
      Because the hfs_readdir() function executes upon any attempt to read a
      directory on the filesystem, it gets called whenever a user attempts to
      inspect any filesystem contents.
      
      [amwang@redhat.com: modify this patch and fix coding style problems]
      Signed-off-by: default avatarWANG Cong <amwang@redhat.com>
      Cc: Eugene Teo <eteo@redhat.com>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dave Anderson <anderson@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec81aecb
    • Alexey Dobriyan's avatar
      bsdacct: fix uid/gid misreporting · 4b731d50
      Alexey Dobriyan authored
      commit d8e180dc "bsdacct: switch
      credentials for writing to the accounting file" introduced credential
      switching during final acct data collecting.  However, uid/gid pair
      continued to be collected from current which became credentials of who
      created acct file, not who exits.
      
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Reported-by: default avatarJuho K. Juopperi <jkj@kapsi.fi>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Reviewed-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b731d50
  2. 14 Dec, 2009 10 commits