1. 09 Oct, 2012 40 commits
    • memory-hotplug: preparation to notify memory block's state at memory hot remove · a16cee10
      Wen Congyang authored
      remove_memory() is called in two cases:
      1. echo offline >/sys/devices/system/memory/memoryXX/state
      2. hot remove a memory device
      
      In the 1st case, the memory block's state is changed and a notification
      of the state change is sent to userland after calling remove_memory(),
      so the user can notice that the memory block has changed.
      
      But in the 2nd case, the memory block's state is not changed and no
      notification is sent to userspace even though remove_memory() is called,
      so the user cannot notice that the memory block has changed.
      
      To prepare for adding the notification at memory hot remove, the patch
      changes the two cases as follows:
      1st case uses offline_pages() for offlining memory.
      2nd case uses remove_memory() for offlining memory, changing the memory
          block's state, and notifying userspace of the change.
      
      This patch does not yet implement the notification in remove_memory().
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid section mismatch warning for memblock_type_name · c2233116
      Raghavendra D Prabhu authored
      The following section mismatch warning is thrown during the build:
      
          WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
          The function memblock_type_name() references
          the variable __meminitdata memblock.
          This is often because memblock_type_name lacks a __meminitdata
          annotation or the annotation of memblock is wrong.
      
      This is because memblock_type_name() references the memblock variable,
      which carries the __meminitdata attribute.  Hence the warning (even if
      the function is inline).
      
      [akpm@linux-foundation.org: remove inline]
      Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make GFP_NOTRACK definition unconditional · 3e648ebe
      Glauber Costa authored
      There was a general sentiment in a recent discussion (see
      https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
      defined unconditionally.  Currently, the only offender is GFP_NOTRACK,
      which is conditional on KMEMCHECK.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cma: decrease cc.nr_migratepages after reclaiming pagelist · beb51eaa
      Minchan Kim authored
      reclaim_clean_pages_from_list() reclaims clean pages before migration so
      cc.nr_migratepages should be updated.  Currently this causes no problem,
      but the value could be wrong if we try to use it in the future.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • CMA: migrate mlocked pages · e46a2879
      Minchan Kim authored
      Presently CMA cannot migrate mlocked pages, so it ends up failing to
      allocate contiguous memory space.
      
      This patch lets CMA migrate mlocked pages out.  Of course, it can affect
      realtime processes, but in the CMA use case a contiguous memory allocation
      failing is far worse than access latency to an mlocked page being variable
      while CMA is running.  Anyone who wants a realtime system shouldn't
      enable CMA, because stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kpageflags: fix wrong KPF_THP on non-huge compound pages · 7a71932d
      Naoya Horiguchi authored
      KPF_THP can be set on non-huge compound pages (like slab pages or pages
      allocated by drivers with __GFP_COMP) because PageTransCompound only
      checks PG_head and PG_tail.  Obviously this is a bug and breaks user space
      applications which look for thp via /proc/kpageflags.
      
      This patch rules out setting KPF_THP wrongly by additionally checking
      PageLRU on the head pages.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/fs-writeback.c: remove unnecessary parameter of __writeback_single_inode() · cd8ed2a4
      Yan Hong authored
      The parameter 'wb' is never used in this function.
      Signed-off-by: Yan Hong <clouds.yan@gmail.com>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unevictable_pgs_mlockfreed · 8befedfe
      Hugh Dickins authored
      Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
      from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
      have been used by any tool, and of course we can restore it easily enough
      if that turns out to be wrong.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: fix zone stat mismatch · 5a883813
      Minchan Kim authored
      During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] increasing,
      causing the kernel to hang.  When the system doesn't have enough free
      pages, it enters reclaim but never reclaims any pages because
      too_many_isolated()==true, and loops forever.
      
      The cause is that when we do memory-hotadd after memory-remove,
      __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
      although the vm_stat_diff of all CPUs still have values.
      
      In addition, when we offline all pages of the zone, we reset them in
      zone_pcp_reset() without draining, so we lose some zone stat items.
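      
      The loss can be illustrated with a toy model in C, assuming a zone
      counter made of one global value plus per-CPU deltas.  The names
      zone_stat, fold_cpu_diffs() and reset_cpu_diffs() are hypothetical and
      are not the kernel's; this is only a sketch of why resetting without
      draining leaves the global counter stuck at a stale value.

```c
#include <assert.h>

#define NR_CPUS 2	/* hypothetical CPU count for the toy model */

/* Toy counter: a global value plus per-CPU deltas not yet folded in. */
struct zone_stat {
	long global;
	long cpu_diff[NR_CPUS];
};

/* Drain: fold every pending per-CPU delta into the global value. */
static void fold_cpu_diffs(struct zone_stat *z)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		z->global += z->cpu_diff[cpu];
		z->cpu_diff[cpu] = 0;
	}
}

/* Reset without draining: pending deltas are simply discarded. */
static void reset_cpu_diffs(struct zone_stat *z)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		z->cpu_diff[cpu] = 0;
}
```

      With a global value of 10 and pending deltas of -3 and -7, a reset alone
      leaves the counter at 10, whereas folding before the reset brings it to 0.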
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range") · 08270807
      Minchan Kim authored
      Revert commit 0def08e3 because check_range() can't fail in
      migrate_to_node() given the current use cases.
      
      Quote from Johannes
      
      : I think it makes sense to revert.  Not because of the semantics, but I
      : just don't see how check_range() could even fail for this callsite:
      :
      : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
      :    find_vma()
      :
      : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
      :    and so can not fail
      :
      : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
      :    continue until addr == end, so we never fail with -EIO
      
      I also added a new VM_BUG_ON to check for future migrate_to_node() use
      cases which might pass MPOL_MF_STRICT.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vasiliy Kulikov <segooon@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end · 6bdb913f
      Haggai Eran authored
      In order to allow sleeping during invalidate_page mmu notifier calls, we
      need to avoid calling it while holding the PT lock.  In addition to its direct
      calls, invalidate_page can also be called as a substitute for a change_pte
      call, in case the notifier client hasn't implemented change_pte.
      
      This patch drops the invalidate_page call from change_pte, and instead
      wraps all calls to change_pte with invalidate_range_start and
      invalidate_range_end calls.
      
      Note that change_pte still cannot sleep after this patch, and that clients
      implementing change_pte should not take action on it in case the number of
      outstanding invalidate_range_start calls is larger than one, otherwise
      they might miss a later invalidation.
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Cc: Andrea Arcangeli <andrea@qumranet.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Sagi Grimberg authored
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling the invalidate_page notification after releasing the
      lock (but before freeing the page itself), or by wrapping the page
      invalidation with calls to invalidate_range_start and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_start, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
        This will allow users the same flexibility they get when swapping any
        other part of their processes' address spaces.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages need to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address space as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach · 36e4f20a
      Michal Hocko authored
      Commit 0c176d52 ("mm: hugetlb: fix pgoff computation when unmapping
      page from vma") fixed the pgoff calculation, but it replaced it with
      vma_hugecache_offset(), which is not appropriate for offsets used by
      vma_prio_tree_foreach() because that one expects an index in page units
      rather than in huge_page_shift units.
      
      Johannes said:
      
      : The resulting index may not be too big, but it can be too small: assume
      : hpage size of 2M and the address to unmap to be 0x200000.  This is regular
      : page index 512 and hpage index 1.  If you have a VMA that maps the file
      : only starting at the second huge page, that VMA's vm_pgoff will be 512 but
      : you ask for offset 1 and miss it even though it does map the page of
      : interest.  hugetlb_cow() will try to unmap, miss the vma, and retry the
      : cow until the allocation succeeds or the skipped vma(s) go away.
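      
      The arithmetic in Johannes's example can be checked with a small C
      sketch.  It assumes 4 KiB base pages and 2 MiB huge pages, and the
      helper names (base_page_index(), huge_page_index(), vma_covers()) are
      hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>

#define PAGE_SHIFT	12UL	/* assumed 4 KiB base pages */
#define HPAGE_SHIFT	21UL	/* assumed 2 MiB huge pages */

/* Offset expressed in base-page units, the unit vm_pgoff uses. */
static unsigned long base_page_index(unsigned long offset)
{
	return offset >> PAGE_SHIFT;
}

/* Offset expressed in huge-page units. */
static unsigned long huge_page_index(unsigned long offset)
{
	return offset >> HPAGE_SHIFT;
}

/* A VMA matches a lookup index if it lies in [vm_pgoff, vm_pgoff + npages). */
static int vma_covers(unsigned long vm_pgoff, unsigned long npages,
		      unsigned long index)
{
	return index >= vm_pgoff && index < vm_pgoff + npages;
}
```

      For offset 0x200000 the base-page index is 512 and the huge-page index
      is 1, so a VMA whose vm_pgoff is 512 is found with the former and missed
      with the latter.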
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP · 027ef6c8
      Andrea Arcangeli authored
      In many places !pmd_present has been converted to pmd_none.  For pmds
      that's equivalent and pmd_none is quicker so using pmd_none is better.
      
      However (unless we delete pmd_present) we should provide an accurate
      pmd_present too.  This will avoid the risk of code thinking the pmd is non
      present because it's under __split_huge_page_map, see the pmd_mknotpresent
      there and the comment above it.
      
      If the page has been mprotected as PROT_NONE, it would also lead to a
      pmd_present false negative in the same way as the race with
      split_huge_page.
      
      Because the PSE bit stays on at all times (both during split_huge_page and
      when the _PAGE_PROTNONE bit gets set), we could check only for the PSE bit,
      but checking the PROTNONE bit too is still good as a reminder that
      pmd_present must always take PROT_NONE into account.
      
      This explains a non-reproducible BUG_ON that was seldom reported on the
      lists.
      
      The same issue is in pmd_large, it would go wrong with both PROT_NONE and
      if it races with split_huge_page.
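      
      A minimal sketch of the check described above, written as plain C over
      a flags word.  The bit positions are assumptions chosen to resemble the
      x86 layout, and pmd_present_naive() and pmd_present_fixed() are
      hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define PAGE_PRESENT	(1ULL << 0)
#define PAGE_PSE	(1ULL << 7)
#define PAGE_PROTNONE	(1ULL << 8)

/* Looks at the present bit only: false negative during a split or PROT_NONE. */
static int pmd_present_naive(uint64_t flags)
{
	return (flags & PAGE_PRESENT) != 0;
}

/* Treats the pmd as present if any of the three bits is set. */
static int pmd_present_fixed(uint64_t flags)
{
	return (flags & (PAGE_PRESENT | PAGE_PROTNONE | PAGE_PSE)) != 0;
}
```

      A huge pmd with the present bit temporarily cleared still has PSE set,
      so only the second helper reports it as present; an all-zero pmd is
      reported as not present by both.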
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory.txt: remove stray information · 00ea8990
      Jiri Kosina authored
      Andi removed some outdated documentation from Documentation/memory.txt
      back in 2009 by commit 3b2b9a87 ("Documentation/memory.txt: remove
      some very outdated recommendations"), but the resulting document is not
      in good shape either.
      
      It seems to me like we are not losing anything by completely removing the
      file now.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, numa: reclaim from all nodes within reclaim distance · 957f822a
      David Rientjes authored
      RECLAIM_DISTANCE represents the distance between nodes at which it is
      deemed too costly to allocate from; it's preferred to try to reclaim from
      a local zone before falling back to allocating on a remote node with such
      a distance.
      
      To do this, zone_reclaim_mode is set if the distance between any two
      nodes on the system is greater than this distance.  This, however, ends
      up causing the page allocator to reclaim from every zone regardless of
      its affinity.
      
      What we really want is to reclaim only from zones that are closer than
      RECLAIM_DISTANCE.  This patch adds a nodemask to each node that
      represents the set of nodes that are within this distance.  During the
      zone iteration, if the bit for a zone's node is set for the local node,
      then reclaim is attempted; otherwise, the zone is skipped.
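      
      The per-node mask can be sketched in plain C as follows.  This assumes
      a hypothetical 4-node system, a threshold of 30, and a strict "closer
      than" comparison; MAX_NODES, build_reclaim_masks() and should_reclaim()
      are illustrative names, and the kernel's own nodemask helpers are not
      used here.

```c
#include <assert.h>

#define MAX_NODES		4	/* hypothetical small system */
#define RECLAIM_DISTANCE	30	/* assumed threshold for this sketch */

/* For each node i, set bit j when node j is closer than RECLAIM_DISTANCE. */
static void build_reclaim_masks(int dist[MAX_NODES][MAX_NODES],
				unsigned int masks[MAX_NODES])
{
	for (int i = 0; i < MAX_NODES; i++) {
		masks[i] = 0;
		for (int j = 0; j < MAX_NODES; j++)
			if (dist[i][j] < RECLAIM_DISTANCE)
				masks[i] |= 1u << j;
	}
}

/* During zone iteration: reclaim only if the zone's node is in the local mask. */
static int should_reclaim(const unsigned int masks[MAX_NODES],
			  int local_node, int zone_node)
{
	return (masks[local_node] >> zone_node) & 1u;
}
```

      With nodes 0 and 1 at distance 20 from each other and both at distance
      40 from nodes 2 and 3, node 0 would attempt reclaim on zones of nodes 0
      and 1 and skip zones of nodes 2 and 3.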
      
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove free_page_mlock · a0c5e813
      Hugh Dickins authored
      We should not be seeing non-0 unevictable_pgs_mlockfreed any longer.  So
      remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
      already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
      checking it, reporting "BUG: Bad page state" if it's ever found set.
      Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use clear_page_mlock() in page_remove_rmap() · e6c509f8
      Hugh Dickins authored
      We had thought that pages could no longer get freed while still marked as
      mlocked; but Johannes Weiner posted this program to demonstrate that
      truncating an mlocked private file mapping containing COWed pages is still
      mishandled:
      
      #include <sys/types.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdio.h>
      
      int main(void)
      {
      	char *map;
      	int fd;
      
      	system("grep mlockfreed /proc/vmstat");
      	fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR, 0600);
      	unlink("chigurh");
      	ftruncate(fd, 4096);
      	map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
      	map[0] = 11;
      	mlock(map, sizeof(fd));
      	ftruncate(fd, 0);
      	close(fd);
      	munlock(map, sizeof(fd));
      	munmap(map, 4096);
      	system("grep mlockfreed /proc/vmstat");
      	return 0;
      }
      
      The anon COWed pages are not caught by truncation's clear_page_mlock() of
      the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
      look out for them there in page_remove_rmap().  Indeed, why should
      truncation or invalidation be doing the clear_page_mlock() when removing
      from pagecache?  mlock is a property of mapping in userspace, not a
      property of pagecache: an mlocked unmapped page is nonsensical.
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove vma arg from page_evictable · 39b5f29a
      Hugh Dickins authored
      page_evictable(page, vma) is an irritant: almost all its callers pass
      NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
      explicitly in the couple of places it's needed.  But in those places we
      don't even need page_evictable() itself!  They're dealing with a freshly
      allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix invalidate_complete_page2() lock ordering · ec4d9f62
      Hugh Dickins authored
      In fuzzing with trinity, lockdep protested "possible irq lock inversion
      dependency detected" when isolate_lru_page() reenabled interrupts while
      still holding the supposedly irq-safe tree_lock:
      
      invalidate_inode_pages2
        invalidate_complete_page2
          spin_lock_irq(&mapping->tree_lock)
          clear_page_mlock
            isolate_lru_page
              spin_unlock_irq(&zone->lru_lock)
      
      isolate_lru_page() is correct to enable interrupts unconditionally:
      invalidate_complete_page2() is incorrect to call clear_page_mlock() while
      holding tree_lock, which is supposed to nest inside lru_lock.
      
      Both truncate_complete_page() and invalidate_complete_page() call
      clear_page_mlock() before taking tree_lock to remove page from radix_tree.
      I guess invalidate_complete_page2() preferred to test PageDirty (again)
      under tree_lock before committing to the munlock; but since the page has
      already been unmapped, its state is already somewhat inconsistent, and no
      worse if clear_page_mlock() moved up.
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Deciphered-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: move mem_cgroup_is_root upwards · 7ffc0edc
      Michal Hocko authored
      kmem code uses this function and it is better to not use forward
      declarations for static inline functions as some (older) compilers don't
      like it:
      
      gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)
      
        mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
        mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Sachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: cleanup kmem tcp ifdefs · 4bd2c1ee
      Michal Hocko authored
      TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
      the code is not used if !CONFIG_INET so we should rather test for both.
      The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
      let's keep those outside of any ifdefs because it is considered safer
      wrt. future maintainability.
      
      Tested with
      - CONFIG_INET && CONFIG_MEMCG_KMEM
      - !CONFIG_INET && CONFIG_MEMCG_KMEM
      - CONFIG_INET && !CONFIG_MEMCG_KMEM
      - !CONFIG_INET && !CONFIG_MEMCG_KMEM
      Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: trivial fixes for Documentation/cgroups/memory.txt · 1939c557
      Michael Kerrisk authored
      While reading through Documentation/cgroups/memory.txt, I found a number
      of minor wordos and typos.  The patch below is a conservative handling of
      some of these: it provides just a number of "obviously correct" fixes to
      the English that improve the readability of the document somewhat.
      Obviously some more significant fixes need to be made to the document, but
      some of those may not be in the "obviously correct" category.
      Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1939c557
    • Jianguo Wu's avatar
      mm: fix-up zone present pages · 7f1290f2
      Jianguo Wu authored
      I think zone->present_pages indicates the pages that the buddy system can
      manage, so it should be:
      
      	zone->present_pages = spanned pages - absent pages - bootmem pages,
      
      but it is now:
      	zone->present_pages = spanned pages - absent pages - memmap pages.
      
      spanned pages: total size, including holes.
      absent pages: holes.
      bootmem pages: pages used in system boot, managed by bootmem allocator.
      memmap pages: pages used by page structs.
      
      This may cause zone->present_pages to be less than it should be.  For
      example, if NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE, its memmap and
      other bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
      present_pages should be spanned pages - absent pages.  But currently the
      memmap pages, which are actually allocated from ZONE_MOVABLE, are also
      subtracted (in free_area_init_core()).  When offlining all memory of a
      zone, this makes zone->present_pages drop below 0.  Because present_pages
      is an unsigned long, it wraps around to a very large integer.  That
      indirectly makes zone->watermark[WMARK_MIN] a large integer
      (setup_per_zone_wmarks()), which in turn makes totalreserve_pages a large
      integer (calculate_totalreserve_pages()), and finally causes memory
      allocation to fail when forking a process (__vm_enough_memory()).
      
      [root@localhost ~]# dmesg
      -bash: fork: Cannot allocate memory
      
      I think the bug described in
      
        http://marc.info/?l=linux-mm&m=134502182714186&w=2
      
      is also caused by wrong zone present pages.
      
      This patch intends to fix up zone->present_pages when memory is freed to
      the buddy system on x86_64 and IA64 platforms.
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reported-by: Petr Tesarik <ptesarik@suse.cz>
      Tested-by: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f1290f2
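The wraparound described in the commit above (7f1290f2) can be illustrated with a small model. This is only a sketch, not kernel code: it assumes a 64-bit unsigned long, and the page counts (1000 spanned pages, a 16-page memmap) and helper names are made up for illustration.

```python
ULONG_MASK = (1 << 64) - 1  # model a 64-bit unsigned long (assumption)


def present_pages(spanned, absent, subtracted):
    """Return spanned - absent - subtracted, wrapped like an unsigned long."""
    return (spanned - absent - subtracted) & ULONG_MASK


def offline(present, nr_pages):
    """Subtract offlined pages from present_pages with unsigned wraparound."""
    return (present - nr_pages) & ULONG_MASK


# Hypothetical ZONE_NORMAL: 1000 spanned pages and no holes. Its memmap
# (16 pages here) really lives in ZONE_MOVABLE, so nothing should be
# subtracted from this zone.
correct = present_pages(1000, 0, 0)
buggy = present_pages(1000, 0, 16)   # memmap pages wrongly subtracted

# Offline all 1000 pages of the zone.
after_correct = offline(correct, 1000)
after_buggy = offline(buggy, 1000)   # goes "below zero" and wraps
```

In this model `after_correct` ends at 0 while `after_buggy` wraps to a value just under 2**64, which is the kind of huge present_pages count that the message says then inflates the watermark and reserve calculations.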
    • Rik van Riel's avatar
      mm: enable CONFIG_COMPACTION by default · 05106e6a
      Rik van Riel authored
      Now that lumpy reclaim has been removed, compaction is the only way left
      to free up contiguous memory areas.  It is time to just enable
      CONFIG_COMPACTION by default.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05106e6a
    • Catalin Marinas's avatar
      mm: thp: fix the update_mmu_cache() last argument passing in mm/huge_memory.c · eab1eef9
      Catalin Marinas authored
      update_mmu_cache() takes a pointer (to pte_t by default) as its last
      argument, but huge_memory.c passes a pmd_t value.  The patch changes
      the argument to a pmd_t * pointer.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eab1eef9
    • Catalin Marinas's avatar
      mm: thp: fix the pmd_clear() arguments in pmdp_get_and_clear() · 2d28a227
      Catalin Marinas authored
      The CONFIG_TRANSPARENT_HUGEPAGE implementation of pmdp_get_and_clear()
      calls pmd_clear() with 3 arguments instead of 1.
      
      This happens only for !__HAVE_ARCH_PMDP_GET_AND_CLEAR, which does not seem
      to occur in practice because x86 defines it and uses pmd_update.
      
      [mhocko@suse.cz: changelog addition]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d28a227
    • Xiao Guangrong's avatar
      thp: khugepaged_prealloc_page() forgot to reset the page alloc indicator · e3b4126c
      Xiao Guangrong authored
      If NUMA is enabled, the indicator is not reset if the previous page
      request failed, causing us to trigger the BUG_ON() in
      khugepaged_alloc_page().
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3b4126c
    • Minchan Kim's avatar
      memory-hotplug: don't replace lowmem pages with highmem · 74c08f98
      Minchan Kim authored
      The changelog for commit 6a6dccba ("mm: cma: don't replace lowmem
      pages with highmem") mentioned that lowmem pages can be replaced by
      highmem pages during CMA migration.  6a6dccba fixed that issue.
      
      Quote from that changelog:
      
      :   The filesystem layer expects pages in the block device's mapping to not
      :   be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
      :   currently replace lowmem pages with highmem pages, leading to crashes in
      :   filesystem code such as the one below:
      :
      :     Unable to handle kernel NULL pointer dereference at virtual address 00000400
      :     pgd = c0c98000
      :     [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
      :     Internal error: Oops: 817 [#1] PREEMPT SMP ARM
      :     CPU: 0    Not tainted  (3.5.0-rc5+ #80)
      :     PC is at __memzero+0x24/0x80
      :     ...
      :     Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
      :     Backtrace:
      :     [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
      :     [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
      :      r4:c15337f0
      :     [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
      :     [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
      :      r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
      :     [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
      :      r6:beccdcf0 r5:00074000 r4:beccdbbc
      :     [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
      
      Memory-hotplug has the same problem as CMA, so the same fix can be
      applied to memory-hotplug as well.
      
      Fix it by reusing the allocation helper that CMA migration uses.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74c08f98
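The rule behind the commit above (74c08f98) can be sketched as a choice of allocation flags for the replacement page. This is a minimal model under an assumption that the message does not spell out: that the shared fix only permits a highmem replacement when the page being migrated is itself in highmem. The `Gfp` members are hypothetical stand-ins for the kernel's GFP_USER, __GFP_MOVABLE and __GFP_HIGHMEM, and `migrate_target_mask` is an illustrative name.

```python
from enum import Flag, auto


class Gfp(Flag):
    # Hypothetical stand-ins for GFP_USER, __GFP_MOVABLE and __GFP_HIGHMEM.
    USER = auto()
    MOVABLE = auto()
    HIGHMEM = auto()


def migrate_target_mask(source_is_highmem: bool) -> Gfp:
    """Pick allocation flags for the page that replaces a migrated page."""
    mask = Gfp.USER | Gfp.MOVABLE
    if source_is_highmem:
        # Only a highmem source may be replaced by a highmem page, so a
        # lowmem page (e.g. block device page cache) stays in lowmem.
        mask |= Gfp.HIGHMEM
    return mask
```

With this rule a lowmem source page never receives a highmem replacement, which is the property the filesystem layer relies on in the quoted changelog.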
    • Minchan Kim's avatar
      mm/page_alloc: refactor out __alloc_contig_migrate_alloc() · 723a0644
      Minchan Kim authored
      __alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
      it out (move + rename as a common name) into page_isolation.c.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      723a0644
    • Sachin Kamat's avatar
    • Mel Gorman's avatar
      mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Mel Gorman authored
      Compaction caches if a pageblock was scanned and no pages were isolated so
      that the pageblocks can be skipped in the future to reduce scanning.  This
      information is not cleared by the page allocator based on activity due to
      the impact it would have on the page allocator fast paths.  Hence there is
      a requirement that something clear the cache or pageblocks will be skipped
      forever.  Currently the cache is cleared if there were a number of recent
      allocation failures and it has not been cleared within the last 5 seconds.
      Time-based decisions like this are terrible as they have no relationship
      to VM activity and are basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62997027
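The two cache-clearing cases in the commit above (62997027) can be sketched as a small state model. It is a simplification, not the kernel implementation: `compact_blockskip_flush` is the flag named in the message, while the `Zone` class, the function names and the `deferral_just_expired` parameter are hypothetical.

```python
class Zone:
    def __init__(self, nr_pageblocks):
        # One cached "skip this pageblock" bit per pageblock (PG_migrate_skip).
        self.skip = [False] * nr_pageblocks
        # Set when a full compaction cycle completes outside kswapd.
        self.compact_blockskip_flush = False


def reset_skip_bits(zone):
    """Clear the whole cache so every pageblock is scanned again."""
    zone.skip = [False] * len(zone.skip)
    zone.compact_blockskip_flush = False


def compaction_cycle_complete(zone, is_kswapd):
    """Case 1, first half: the scanners met in a non-kswapd process."""
    if not is_kswapd:
        zone.compact_blockskip_flush = True


def kswapd_going_to_sleep(zone):
    """Case 1, second half: kswapd clears the cache on its way to sleep."""
    if zone.compact_blockskip_flush:
        reset_skip_bits(zone)


def compaction_restarting(zone, deferral_just_expired):
    """Case 2: repeated failures, and the deferral period has just ended."""
    if deferral_just_expired:
        reset_skip_bits(zone)
```

The point of the split in case 1 is that the process doing compaction only sets a flag, and the cost of clearing the bits is paid by kswapd when it is otherwise idle.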
    • Mel Gorman's avatar
      mm: compaction: Restart compaction from near where it left off · c89511ab
      Mel Gorman authored
      This is almost entirely based on Rik's previous patches and discussions
      with him about how this might be implemented.
      
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
      This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      This patch caches where the migration and free scanner should start from
      on subsequent compaction invocations using the pageblock-skip information.
      When compaction starts it begins from the cached restart points and will
      update the cached restart points until a page is isolated or a pageblock
      is skipped that would have been scanned by synchronous compaction.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c89511ab
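The quadratic behaviour described in the commit above (c89511ab), and the effect of caching a restart point, can be seen in a toy model of the free scanner. This is a sketch under simplifying assumptions: one free page per pageblock, one page needed per invocation, and a cache update rule that is far cruder than the one the patch describes. The function names are illustrative.

```python
def free_scan(free_pages, needed, start):
    """Scan pageblocks downward from `start` until `needed` pages are taken.

    Returns (pageblocks visited, index of the pageblock where scanning stopped).
    """
    visited = 0
    i = start
    while i >= 0:
        visited += 1
        take = min(free_pages[i], needed)
        free_pages[i] -= take
        needed -= take
        if needed == 0:
            break
        i -= 1
    return visited, max(i, 0)


def total_blocks_visited(nr_blocks, invocations, use_cache):
    """Total pageblocks visited over repeated compaction invocations."""
    free_pages = [1] * nr_blocks   # hypothetical: one free page per pageblock
    cached = nr_blocks - 1         # the free scanner starts at the end of the zone
    total = 0
    for _ in range(invocations):
        start = cached if use_cache else nr_blocks - 1
        visited, stop = free_scan(free_pages, 1, start)
        total += visited
        cached = stop              # remember where this invocation stopped
    return total
```

For 100 pageblocks and 50 invocations, restarting from the end of the zone each time visits 1 + 2 + ... + 50 pageblocks, while restarting from the cached position visits about two per invocation, so the first total grows quadratically with the number of invocations and the second grows linearly.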
    • Mel Gorman's avatar
      mm: compaction: cache if a pageblock was scanned and no pages were isolated · bb13ffeb
      Mel Gorman authored
      When compaction was implemented it was known that scanning could
      potentially be excessive.  The ideal was that a counter be maintained for
      each pageblock but maintaining this information would incur a severe
      penalty due to a shared writable cache line.  It has reached the point
      where the scanning costs are a serious problem, particularly on
      long-lived systems where a large process starts and allocates a large
      number of THPs at the same time.
      
      Instead of using a shared counter, this patch adds another bit to the
      pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
      either migrate or free scanner and 0 pages were isolated, the pageblock is
      marked to be skipped in the future.  When scanning, this bit is checked
      before any scanning takes place and the block skipped if set.
      
      The main difficulty with a patch like this is "when to ignore the cached
      information?" If it's ignored too often, the scanning rates will still be
      excessive.  If the information is too stale then allocations will fail
      that might have otherwise succeeded.  In this patch:
      
      o CMA always ignores the information
      o If the migrate and free scanner meet then the cached information will
        be discarded if it's at least 5 seconds since the last time the cache
        was discarded
      o If there are a large number of allocation failures, discard the cache.
      
      The time-based heuristic is very clumsy but there are few choices for a
      better event.  Depending solely on multiple allocation failures still
      allows excessive scanning when THP allocations are failing in quick
      succession due to memory pressure.  Waiting until memory pressure is
      relieved would cause compaction to continually fail instead of using
      reclaim/compaction to try to allocate the page.  The time-based mechanism
      is clumsy but a better option is not obvious.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb13ffeb
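The skip bit introduced by the commit above (bb13ffeb) can be modelled as one boolean per pageblock that is checked before scanning and set when a scan isolates nothing. This is a sketch only: the list-based layout, the function name and the `ignore_skip_hint` parameter (standing in for callers such as CMA that always ignore the information) are assumptions made for illustration.

```python
def scan_zone(isolatable, skip, ignore_skip_hint=False):
    """One scanner pass over all pageblocks of a zone.

    isolatable[i]: pages that could be isolated from pageblock i.
    skip[i]: cached "nothing was isolated here last time" bit.
    Returns the number of pageblocks actually scanned.
    """
    scanned = 0
    for i, nr_pages in enumerate(isolatable):
        if skip[i] and not ignore_skip_hint:
            continue             # bit is checked before any scanning happens
        scanned += 1
        if nr_pages == 0:
            skip[i] = True       # scanned and isolated 0 pages: skip in future
    return scanned
```

On a second pass over an unchanged zone only the pageblocks that previously yielded pages are scanned again, which is where the reduction in scanning comes from; the stale-information problem the message discusses shows up when `isolatable` changes but the bits are not cleared.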
    • Mel Gorman's avatar
      revert "mm: have order > 0 compaction start off where it left" · 753341a4
      Mel Gorman authored
      This reverts commit 7db8889a ("mm: have order > 0 compaction start
      off where it left") and commit de74f1cc ("mm: have order > 0 compaction
      start near a pageblock with free pages").  These patches were a good
      idea and tests confirmed that they massively reduced the amount of
      scanning but the implementation is complex and tricky to understand.  A
      later patch will cache what pageblocks should be skipped and
      reimplement the concept of compact_cached_free_pfn on top for both
      migration and free scanners.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      753341a4
    • Mel Gorman's avatar
      mm: compaction: acquire the zone->lock as late as possible · f40d1e42
      Mel Gorman authored
      Compaction's free scanner acquires the zone->lock when checking for
      PageBuddy pages and isolating them.  It does this even if there are no
      PageBuddy pages in the range.
      
      This patch defers acquiring the zone lock for as long as possible.  In the
      event there are no free pages in the pageblock then the lock will not be
      acquired at all which reduces contention on zone->lock.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f40d1e42
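The idea in the commit above (f40d1e42), taking the lock only once a candidate page has been seen, can be sketched with an ordinary mutex. This is a userspace illustration, not the kernel code: `CountingLock` is a hypothetical wrapper standing in for zone->lock, and a pageblock is modelled as a list of booleans where True means a free (PageBuddy) page.

```python
import threading


class CountingLock:
    """A mutex that counts acquisitions; stands in for zone->lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def acquire(self):
        self._lock.acquire()
        self.acquisitions += 1

    def release(self):
        self._lock.release()


def isolate_free_pages(pageblock, lock):
    """Isolate free pages, acquiring the lock only when one is first seen."""
    isolated = []
    locked = False
    try:
        for idx, is_free in enumerate(pageblock):
            if not is_free:
                continue                  # cheap check made without the lock
            if not locked:
                lock.acquire()
                locked = True
                if not pageblock[idx]:    # recheck now that the lock is held
                    continue
            pageblock[idx] = False        # take the page off the free list
            isolated.append(idx)
    finally:
        if locked:
            lock.release()
    return isolated
```

A pageblock with no free pages is walked without ever touching the lock, which is the case the message singles out as reducing contention; the recheck after acquisition is there because the first look at the page happened unlocked.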
    • Mel Gorman's avatar
      mm: compaction: acquire the zone->lru_lock as late as possible · 2a1402aa
      Mel Gorman authored
      Richard Davies and Shaohua Li have both reported lock contention problems
      in compaction on the zone and LRU locks as well as significant amounts of
      time being spent in compaction.  This series aims to reduce lock
      contention and scanning rates to reduce that CPU usage.  Richard reported
      at https://lkml.org/lkml/2012/9/21/91 that this series made a big
      difference to a problem he reported in August:
      
         http://marc.info/?l=kvm&m=134511507015614&w=2
      
      Patch 1 defers acquiring the zone->lru_lock as long as possible.
      
      Patch 2 defers acquiring the zone->lock as long as possible.
      
      Patch 3 reverts Rik's "skip-free" patches as the core concept gets
      	reimplemented later and the remaining patches are easier to
      	understand if this is reverted first.
      
      Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
      	pageblocks should be skipped by the migrate and free scanners.
      	This drastically reduces the amount of scanning compaction has
      	to do.
      
      Patch 5 reimplements something similar to Rik's idea except it uses the
      	pageblock-skip information to decide where the scanners should
      	restart from and does not need to wrap around.
      
      I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were
      
      akpm-20120920	3.6-rc6 + linux-next/akpm as of September 20th, 2012
      lesslock	Patches 1-6
      revert		Patches 1-7
      cachefail	Patches 1-8
      skipuseless	Patches 1-9
      
      Stress high-order allocation tests looked ok.  Success rates are more or
      less the same with the full series applied but there is an expectation
      that there is less opportunity to race with other allocation requests if
      there is less scanning.  The time to complete the tests did not vary that
      much and are uninteresting as were the vmstat statistics so I will not
      present them here.
      
      Using ftrace I recorded how much scanning was done by compaction and got this
      
                                     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6
                                 akpm-20120920      lockless   revert-v2r2     cachefail   skipuseless

      Total free    scanned          360753976     515414028     565479007      17103281      18916589
      Total free    isolated           2852429       3597369       4048601        670493        727840
      Total free    efficiency         0.0079%       0.0070%       0.0072%       0.0392%       0.0385%
      Total migrate scanned          247728664     822729112    1004645830      17946827      14118903
      Total migrate isolated           2555324       3245937       3437501        616359        658616
      Total migrate efficiency         0.0103%       0.0039%       0.0034%       0.0343%       0.0466%
      
      The efficiency is worthless because of the nature of the test and the
      number of failures.  The really interesting point as far as this patch
      series is concerned is the number of pages scanned.  Note that reverting
      Rik's patches massively increases the number of pages scanned indicating
      that those patches really did make a difference to CPU usage.
      
      However, caching what pageblocks should be skipped has a much higher
      impact.  With patches 1-8 applied, free page and migrate page scanning are
      reduced by roughly 95% and 93% respectively in comparison to the akpm
      kernel.  If the basic concept of Rik's patches is implemented on top, the
      free scanner barely changes but migrate scanning is further reduced.  That
      said, tests on 3.6-rc5 indicated that the last patch had greater impact
      than what was measured here so it is a bit variable.
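
The size of the reduction can be checked directly from the "scanned" rows of the table above. The snippet below is just that arithmetic; the dictionary layout and the `reduction` helper are illustrative, and the numbers are copied from the table.

```python
scanned = {
    # kernel: (free pages scanned, migrate pages scanned), from the table
    "akpm-20120920": (360753976, 247728664),
    "cachefail": (17103281, 17946827),
    "skipuseless": (18916589, 14118903),
}


def reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return 1 - value / baseline


base_free, base_migrate = scanned["akpm-20120920"]
free_cut = reduction(base_free, scanned["cachefail"][0])
migrate_cut = reduction(base_migrate, scanned["cachefail"][1])
```

By these figures the cachefail kernel scans roughly 95% fewer pages in the free scanner and about 93% fewer in the migrate scanner than the akpm baseline.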
      
      One way or the other, this series has a large impact on the amount of
      scanning compaction does when there is a storm of THP allocations.
      
      This patch:
      
      Compaction's migrate scanner acquires the zone->lru_lock when scanning a
      range of pages looking for LRU pages to acquire.  It does this even if
      there are no LRU pages in the range.  If multiple processes are compacting
      then this can cause severe locking contention.  To make matters worse
      commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration") releases the lru_lock every
      SWAP_CLUSTER_MAX pages that are scanned.
      
      This patch makes two changes to how the migrate scanner acquires the LRU
      lock.  First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
      if the lock is contended.  This reduces the number of times it
      unnecessarily disables and re-enables IRQs.  The second is that it defers
      acquiring the LRU lock for as long as possible.  If there are no LRU pages
      or the only LRU pages are transhuge then the LRU lock will not be acquired
      at all which reduces contention on zone->lru_lock.
      
      [minchan@kernel.org: augment comment]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a1402aa
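The two changes to LRU lock handling in the commit above (2a1402aa) can be sketched by counting lock acquisitions during a scan. This is a simplified model rather than the kernel code: the batch size of 32 is assumed for illustration, `lock_contended` is a hypothetical probe standing in for a contention check on zone->lru_lock, and acquisition is deferred in both modes so that the comparison isolates the batch-release change.

```python
SWAP_CLUSTER_MAX = 32  # batch size; value assumed for this sketch


def count_lock_acquisitions(pages, lock_contended, always_release):
    """Count LRU lock acquisitions while scanning `pages`.

    pages[i] is True if page i is an LRU page that needs the lock.
    lock_contended(i) -> bool reports whether someone else wants the lock.
    always_release=True models dropping the lock at every batch boundary.
    """
    locked = False
    acquisitions = 0
    for i, is_lru in enumerate(pages):
        # At a batch boundary, drop the lock unconditionally (old behaviour)
        # or only when it is contended (new behaviour).
        if locked and i % SWAP_CLUSTER_MAX == 0:
            if always_release or lock_contended(i):
                locked = False
        if not is_lru:
            continue              # nothing to isolate: no lock needed
        if not locked:
            locked = True         # deferred acquisition
            acquisitions += 1
    return acquisitions
```

Over 128 LRU pages with no contention, releasing at every boundary costs four acquisitions where the contention-aware version needs one, and a range with no LRU pages never takes the lock at all.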
    • Mel Gorman's avatar
      mm: compaction: Update try_to_compact_pages() kerneldoc comment · 661c4cb9
      Mel Gorman authored
      Parameters were added without documentation, tut tut.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      661c4cb9
    • Mel Gorman's avatar
      mm: compaction: move fatal signal check out of compact_checklock_irqsave · 3cc668f4
      Mel Gorman authored
      Commit c67fe375 ("mm: compaction: Abort async compaction if locks
      are contended or taking too long") addressed a lock contention problem
      in compaction by introducing compact_checklock_irqsave(), which effectively
      aborts async compaction in the event of contention.
      
      To preserve existing behaviour it also moved a fatal_signal_pending()
      check into compact_checklock_irqsave() but that is very misleading.  It
      "hides" the check within a locking function but has nothing to do with
      locking as such.  It just happens to work in a desirable fashion.
      
      This patch moves the fatal_signal_pending() check to
      isolate_migratepages_range() where it belongs.  Arguably the same check
      should also happen when isolating pages for freeing but it's overkill.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cc668f4