1. 25 Jun, 2015 40 commits
    • mm: kmemleak: fix delete_object_*() race when called on the same memory block · e781a9ab
      Catalin Marinas authored
      Calling delete_object_*() on the same pointer is not a standard use case
      (unless there is a bug in the code calling kmemleak_free()).  However,
      during kmemleak disabling (error or user triggered via /sys), there is a
      potential race between kmemleak_free() calls on a CPU and
      __kmemleak_do_cleanup() on a different CPU.
      
      The current delete_object_*() implementation first performs a look-up
      holding kmemleak_lock, increments the object->use_count and then
      re-acquires kmemleak_lock to remove the object from object_tree_root and
      object_list.
      
      This patch simplifies the delete_object_*() mechanism to both look up
      and remove an object from the object_tree_root and object_list
      atomically (guarded by kmemleak_lock).  This allows safe concurrent
      calls to delete_object_*() on the same pointer without additional
      locking for synchronising the kmemleak_free_enabled flag.
      
      A side effect is a slight improvement in the delete_object_*() performance
      by avoiding acquiring kmemleak_lock twice and incrementing/decrementing
      object->use_count.
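
      The single-critical-section idea above can be sketched as follows.  This
      is a toy model only, not the kernel code: ObjectTable, find_and_remove()
      and race_delete() are hypothetical names, and a Python dict stands in
      for object_tree_root/object_list.

```python
import threading


class ObjectTable:
    """Toy stand-in for kmemleak's object_tree_root/object_list pair."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of kmemleak_lock
        self._objects = {}             # pointer -> metadata

    def create(self, ptr, size):
        with self._lock:
            self._objects[ptr] = {"ptr": ptr, "size": size}

    def find_and_remove(self, ptr):
        """Look up and unlink in one critical section; None if already gone."""
        with self._lock:
            return self._objects.pop(ptr, None)


def race_delete(table, ptr, n_threads=8):
    """Call find_and_remove() on the same pointer from several threads."""
    results = []
    results_lock = threading.Lock()

    def worker():
        obj = table.find_and_remove(ptr)
        with results_lock:
            results.append(obj)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

      Because the look-up and the removal share one lock hold, at most one
      caller ever obtains the object; every other caller sees None.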
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kmemleak: allow safe memory scanning during kmemleak disabling · c5f3b1a5
      Catalin Marinas authored
      The kmemleak scanning thread can run for minutes.  Callbacks like
      kmemleak_free() are allowed during this time, the race being taken care
      of by the object->lock spinlock.  Such lock also prevents a memory block
      from being freed or unmapped while it is being scanned by blocking the
      kmemleak_free() -> ...  -> __delete_object() function until the lock is
      released in scan_object().
      
      When a kmemleak error occurs (e.g.  it fails to allocate its metadata),
      kmemleak_enabled is cleared and __delete_object() is no longer called on
      freed objects.  If kmemleak_scan is running at the same time,
      kmemleak_free() no longer waits for the object scanning to complete,
      allowing the corresponding memory block to be freed or unmapped (in the
      case of vfree()).  This leads to kmemleak_scan potentially triggering a
      page fault.
      
      This patch separates the kmemleak_free() enabling/disabling from the
      overall kmemleak_enabled knob so that we can defer the disabling of the
      object freeing tracking until the scanning thread has completed.  The
      kmemleak_free_part() is deliberately ignored by this patch since this is
      only called during boot before the scanning thread has started.
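
      A minimal sketch of the two-flag split, assuming a much simplified
      control flow: KmemleakModel and its methods are hypothetical, and the
      real code stops the scan thread from a cleanup path rather than using
      scan_begin()/scan_end() hooks.

```python
class KmemleakModel:
    """Toy model: allocation tracking and free tracking use separate flags."""

    def __init__(self):
        self.enabled = True       # models kmemleak_enabled
        self.free_enabled = True  # models kmemleak_free_enabled
        self.scanning = False
        self.tracked = set()

    def alloc(self, ptr):
        if self.enabled:
            self.tracked.add(ptr)

    def free(self, ptr):
        """Return True if the free went through the object-deletion path."""
        if self.free_enabled:
            self.tracked.discard(ptr)
            return True
        return False

    def scan_begin(self):
        self.scanning = True

    def scan_end(self):
        self.scanning = False
        if not self.enabled:
            # Deferred: free tracking is only dropped once scanning is over.
            self.free_enabled = False

    def disable(self):
        self.enabled = False
        if not self.scanning:
            self.free_enabled = False
```

      In this model a free() issued while a scan is in flight still takes the
      deletion path even after disable(), which is the property the patch
      description asks for.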
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
      Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: convert mem_cgroup->under_oom from atomic_t to int · c2b42d3c
      Tejun Heo authored
      memcg->under_oom tracks whether the memcg is under OOM conditions and is
      an atomic_t counter managed with mem_cgroup_[un]mark_under_oom().  While
      atomic_t appears to be simple synchronization-wise, when used as a
      synchronization construct like here, it's trickier and more error-prone
      due to weak memory ordering rules, especially around atomic_read(), and
      gives a false sense of security.
      
      For example, both non-trivial read sites of memcg->under_oom are a bit
      problematic although not actually broken.
      
      * mem_cgroup_oom_register_event()
      
        It isn't explicit what guarantees the memory ordering between event
        addition and memcg->under_oom check.  This isn't broken only because
        memcg_oom_lock is used for both event list and memcg->oom_lock.
      
      * memcg_oom_recover()
      
        The lockless test doesn't have any explanation why this would be
        safe.
      
      mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
      in avoiding locking memcg_oom_lock there.  This patch converts
      memcg->under_oom from atomic_t to int, puts their modifications under
      memcg_oom_lock and documents why the lockless test in
      memcg_oom_recover() is safe.
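
      The shape of the conversion can be sketched as below.  MemcgModel is a
      hypothetical stand-in, with a Python lock playing memcg_oom_lock; the
      reasoning for why the lockless read is acceptable lives in the code
      comment added by the patch and is not reproduced here.

```python
import threading


class MemcgModel:
    """Toy model: under_oom is a plain int, written only under the lock."""

    def __init__(self):
        self.oom_lock = threading.Lock()  # stands in for memcg_oom_lock
        self.under_oom = 0                # plain counter, no atomic type
        self.wakeups = 0

    def mark_under_oom(self):
        with self.oom_lock:
            self.under_oom += 1

    def unmark_under_oom(self):
        with self.oom_lock:
            # Never drop below zero on an unbalanced unmark.
            if self.under_oom > 0:
                self.under_oom -= 1

    def oom_recover(self):
        # Lockless test, mirroring the read site in memcg_oom_recover().
        if self.under_oom:
            self.wakeups += 1
            return True
        return False
```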
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove unused mem_cgroup->oom_wakeups · f4b90b70
      Tejun Heo authored
      Since commit 49426420 ("mm: memcg: handle non-error OOM situations
      more gracefully"), nobody uses mem_cgroup->oom_wakeups.  Remove it.
      
      While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
      is its only user.  This cleanup was suggested by Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • frontswap: allow multiple backends · d1dc6f1b
      Dan Streetman authored
      Change the single frontswap pointer to a singly linked list of frontswap
      implementations.  Update the Xen tmem implementation, as register no longer
      returns anything.
      
      Frontswap only keeps track of a single implementation; any
      implementation that registers second (or later) will replace the
      previously registered implementation, and gets a pointer to the previous
      implementation that the new implementation is expected to pass all
      frontswap functions to if it can't handle the function itself.  However
      that method doesn't really make much sense, as passing that work on to
      every implementation adds unnecessary work to implementations; instead,
      frontswap should simply keep a list of all registered implementations
      and try each implementation for any function.  Most importantly, neither
      of the two currently existing frontswap implementations in the kernel
      actually do anything with any previous frontswap implementation that
      they replace when registering.
      
      This allows frontswap to successfully manage multiple implementations by
      keeping a list of them all.
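
      A rough sketch of the list-of-backends approach, under stated
      assumptions: Backend and Frontswap are hypothetical classes, backends
      are tried in registration order, and a key stored in one backend is
      invalidated in the others.  None of this is taken from the patch itself.

```python
class Backend:
    """Hypothetical backend with a fixed number of page slots."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.pages = {}

    def store(self, key, data):
        if key not in self.pages and len(self.pages) >= self.capacity:
            return False  # cannot handle this page
        self.pages[key] = data
        return True

    def load(self, key):
        return self.pages.get(key)

    def invalidate(self, key):
        self.pages.pop(key, None)


class Frontswap:
    """Keeps every registered backend and tries each one in turn."""

    def __init__(self):
        self.backends = []

    def register(self, backend):
        # Registration returns nothing; no backend replaces another.
        self.backends.append(backend)

    def store(self, key, data):
        for backend in self.backends:
            if backend.store(key, data):
                # Drop stale copies held by the other backends.
                for other in self.backends:
                    if other is not backend:
                        other.invalidate(key)
                return backend.name
        return None

    def load(self, key):
        for backend in self.backends:
            data = backend.load(key)
            if data is not None:
                return data
        return None
```

      With this layout no backend needs to know about any other, which is the
      simplification the description argues for.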
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86, mirror: x86 enabling - find mirrored memory ranges · b05b9f5f
      Tony Luck authored
      UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
      address ranges.  See UEFI 2.5 spec pages 157-158:
      
        http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf
      
      On EFI enabled systems scan the memory map and tell memblock about any
      mirrored ranges.
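
      The scan can be pictured roughly as below.  The descriptor tuples and
      the mirrored_ranges() helper are illustrative assumptions; only the
      attribute bit value (EFI_MEMORY_MORE_RELIABLE, 0x10000) and the 4 KiB
      EFI page size come from the UEFI specification.

```python
EFI_MEMORY_MORE_RELIABLE = 0x10000  # attribute bit for mirrored memory
EFI_PAGE_SHIFT = 12                 # EFI pages are 4 KiB


def mirrored_ranges(descriptors):
    """Return (start, size_in_bytes) for every descriptor marked mirrored.

    `descriptors` is a list of (phys_addr, num_pages, attribute) tuples,
    a simplified view of the entries returned by GetMemoryMap().
    """
    ranges = []
    for phys_addr, num_pages, attribute in descriptors:
        if attribute & EFI_MEMORY_MORE_RELIABLE:
            ranges.append((phys_addr, num_pages << EFI_PAGE_SHIFT))
    return ranges
```

      Each returned range is what would then be reported to memblock as
      mirrored.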
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: allocate boot time data structures from mirrored memory · a3f5bafc
      Tony Luck authored
      Try to allocate all boot time kernel data structures from mirrored
      memory.
      
      If we run out of mirrored memory print warnings, but fall back to using
      non-mirrored memory to make sure that we still boot.
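
      The try-mirrored-then-fall-back policy might look like the sketch
      below.  alloc_boot() and the region dictionaries are hypothetical; the
      real allocator works on memblock regions and addresses, not on byte
      counters.

```python
def alloc_boot(regions, size, warnings):
    """Carve `size` bytes out of `regions`, preferring mirrored memory.

    `regions` is a list of dicts with "free" (bytes) and "mirrored" (bool).
    Returns the index of the region used, or None if nothing fits.
    """
    # First pass: mirrored regions only.
    for idx, region in enumerate(regions):
        if region["mirrored"] and region["free"] >= size:
            region["free"] -= size
            return idx

    # Second pass: warn, then accept any region so that boot can continue.
    warnings.append("could not allocate %d bytes of mirrored memory" % size)
    for idx, region in enumerate(regions):
        if region["free"] >= size:
            region["free"] -= size
            return idx
    return None
```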
      
      By number of bytes, most of what we allocate at boot time is the page
      structures.  64 bytes per 4K page on x86_64 ...  or about 1.5% of total
      system memory.  For workloads where the bulk of memory is allocated to
      applications this may represent a useful improvement to system
      availability since 1.5% of total memory might be a third of the memory
      allocated to the kernel.
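
      The 1.5% figure follows directly from 64 / 4096 = 1.5625%.  A small
      helper (hypothetical name, and assuming exactly one 64-byte page
      structure per 4K page) makes the absolute size concrete:

```python
STRUCT_PAGE_BYTES = 64  # per-page metadata size quoted above for x86_64
PAGE_BYTES = 4096       # 4K base page


def page_struct_overhead(total_bytes):
    """Bytes of page structures needed to describe `total_bytes` of RAM."""
    pages = total_bytes // PAGE_BYTES
    return pages * STRUCT_PAGE_BYTES


def overhead_fraction():
    """Share of memory consumed by page structures."""
    return STRUCT_PAGE_BYTES / PAGE_BYTES
```

      For a machine with 1 TiB of memory this comes to 16 GiB of page
      structures.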
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck authored
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases we are unlikely to ever
      be able to recover.
      
      Enter memory mirroring. Actually 3rd generation of memory mirroring.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory behind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not ignore mapping_gfp_mask in page cache allocation paths · 6afdb859
      Michal Hocko authored
      page_cache_read, do_generic_file_read, __generic_file_splice_read and
      __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
      add_to_page_cache_lru which might cause recursion into fs down in the
      direct reclaim path if the mapping really relies on GFP_NOFS semantics.
      
      This doesn't seem to be the case now because page_cache_read (page fault
      path) doesn't seem to suffer from the reclaim recursion issues and
      do_generic_file_read and __generic_file_splice_read also shouldn't be
      called under fs locks which would deadlock in the reclaim path.  Anyway it
      is better to obey the mapping gfp mask and prevent later breakage.
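
      One way to picture "obeying the mapping gfp mask" is to AND the flags
      the caller would like with the flags the mapping permits.  The bit
      values below are made up for illustration and do not match any kernel
      version; page_cache_gfp() is a hypothetical helper.

```python
# Hypothetical bit assignments, for illustration only.
GFP_WAIT = 0x1
GFP_IO = 0x2
GFP_FS = 0x4

GFP_KERNEL = GFP_WAIT | GFP_IO | GFP_FS
GFP_NOFS = GFP_WAIT | GFP_IO


def page_cache_gfp(mapping_gfp_mask, wanted=GFP_KERNEL):
    """Restrict the wanted allocation flags to what the mapping allows."""
    return wanted & mapping_gfp_mask
```

      A mapping that asks for GFP_NOFS then never sees the fs bit in its page
      cache allocations, so reclaim cannot recurse into the filesystem on its
      behalf.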
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/oom_kill.c: print points as unsigned int · f0d6647e
      Wang Long authored
      In oom_kill_process(), the variable 'points' is unsigned int.  Print it as
      such.
      Signed-off-by: Wang Long <long.wanglong@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages · 33039678
      Mike Kravetz authored
      alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
      number of pages which will be added to the reserve map.  Subpool and
      global reserve counts are adjusted based on the output of region_chg.
      Before the pages are actually added to the reserve map, these routines
      could race and add fewer pages than expected.  If this happens, the
      subpool and global reserve counts are not correct.
      
      Compare the number of pages actually added (region_add) to those expected
      to be added (region_chg).  If fewer pages are actually added, this indicates
      a race, so adjust the counters accordingly.
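
      The detect-and-adjust step can be sketched with a toy reserve map.
      ReserveMap, reserve_with_fixup() and the racing_add hook are
      hypothetical, and a set of page indices replaces the kernel's region
      list; only the region_chg/region_add comparison mirrors the description.

```python
class ReserveMap:
    """Toy reserve map: a set of reserved page indices."""

    def __init__(self):
        self.pages = set()

    def region_chg(self, start, end):
        """Pages in [start, end) that are not yet in the map."""
        return sum(1 for i in range(start, end) if i not in self.pages)

    def region_add(self, start, end):
        """Add [start, end) and return how many pages were newly added."""
        new = [i for i in range(start, end) if i not in self.pages]
        self.pages.update(new)
        return len(new)


def reserve_with_fixup(resv, counters, start, end, racing_add=None):
    chg = resv.region_chg(start, end)
    counters["resv_huge_pages"] += chg  # adjusted from the expected change
    if racing_add is not None:
        racing_add()                    # simulates another task running here
    added = resv.region_add(start, end)
    if added < chg:
        # Fewer pages were added than expected: a race happened, so give
        # back the difference.
        counters["resv_huge_pages"] -= chg - added
    return added
```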
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: compute/return the number of regions added by region_add() · cf3ad20b
      Mike Kravetz authored
      Modify region_add() to keep track of regions(pages) added to the reserve
      map and return this value.  The return value can be compared to the return
      value of region_chg() to determine if the map was modified between calls.
      
      Make vma_commit_reservation() also pass along the return value of
      region_add().  In the normal case, we want vma_commit_reservation to
      return the same value as the preceding call to vma_needs_reservation.
      Create a common __vma_reservation_common routine to help keep the special
      case return values in sync.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: document the reserve map/region tracking routines · 1dd308a7
      Mike Kravetz authored
      While working on hugetlbfs fallocate support, I noticed the following race
      in the existing code.  It is unlikely that this race is hit very often in
      the current code.  However, if more functionality to add and remove pages
      to hugetlbfs mappings (such as fallocate) is added the likelihood of
      hitting this race will increase.
      
      alloc_huge_page and hugetlb_reserve_pages use information from the reserve
      map to determine if there are enough available huge pages to complete the
      operation, as well as adjust global reserve and subpool usage counts.  The
      order of operations is as follows:
      
      - call region_chg() to determine the expected change based on reserve map
      - determine if enough resources are available for this operation
      - adjust global counts based on the expected change
      - call region_add() to update the reserve map
      
      The issue is that reserve map could change between the call to region_chg
      and region_add.  In this case, the counters which were adjusted based on
      the output of region_chg will not be correct.
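
      The four steps and the window between them can be modelled as below.
      This is a toy reproduction, not the kernel path: ReserveMap,
      reserve_unfixed(), the pool dictionary and the interleave hook are all
      hypothetical, and the second task is simulated by a callback.

```python
class ReserveMap:
    """Toy reserve map: a set of reserved page indices."""

    def __init__(self):
        self.pages = set()

    def region_chg(self, start, end):
        return sum(1 for i in range(start, end) if i not in self.pages)

    def region_add(self, start, end):
        self.pages.update(range(start, end))


def reserve_unfixed(resv, pool, start, end, interleave=None):
    """Follow the four steps in order, with no race detection."""
    chg = resv.region_chg(start, end)          # 1. expected change
    if pool["free"] - pool["reserved"] < chg:  # 2. enough resources?
        return False
    pool["reserved"] += chg                    # 3. adjust global count
    if interleave is not None:
        interleave()                           # another task runs here
    resv.region_add(start, end)                # 4. update the reserve map
    return True
```

      If a second task reserves part of the same range inside the window, the
      global count ends up larger than the number of pages in the map.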
      
      In order to hit this race today, there must be an existing shared hugetlb
      mmap created with the MAP_NORESERVE flag.  A page fault to allocate a huge
      page via this mapping must occur at the same time another task is mapping the
      same region without the MAP_NORESERVE flag.
      
      The patch set does not prevent the race from happening.  Rather, it adds
      simple functionality to detect when the race has occurred.  If a race is
      detected, then the incorrect counts are adjusted.
      
      Review comments pointed out the need for documentation of the existing
      region/reserve map routines.  This patch set also adds documentation in
      this area.
      
      This patch (of 3):
      
      This is a documentation only patch and does not modify any code.
      Descriptions of the routines used for reserve map/region tracking are
      added.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/vm/unevictable-lru.txt: clarify MAP_LOCKED behavior · 9b012a29
      Michal Hocko authored
      There is a very subtle difference between mmap()+mlock() vs
      mmap(MAP_LOCKED) semantics.  The former fails if the population of the
      area fails while the latter doesn't.  This basically means that
      mmap(MAP_LOCKED) areas might see major faults after the mmap syscall returns
      which is not the case for mlock.  The mmap man page has already been altered
      but Documentation/vm/unevictable-lru.txt deserves a clarification as well.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: nommu: refactor debug and warning prints · 22cc877b
      Leon Romanovsky authored
      kenter/kleave/kdebug are wrapper macros to print function flow and debug
      information.  This set was written before pr_devel() was introduced, so it
      was controlled by an "#if 0" construct.  It is questionable if anyone is
      using them [1] now.
      
      This patch removes these macros, converts numerous printk(KERN_WARNING,
      ...) calls to use the general pr_warn(...) and removes the debug print line
      from the validate_mmap_request() function.
      Signed-off-by: Leon Romanovsky <leon@leon.nu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clarify that the function operates on hugepage pte · 8809aa2d
      Aneesh Kumar K.V authored
      We have confusing functions to clear pmd, pmd_clear_* and pmd_clear.  Add
      _huge_ to pmdp_clear functions so that we are clear that they operate on
      hugepage pte.
      
      We don't bother about other functions like pmdp_set_wrprotect,
      pmdp_clear_flush_young, because they operate on PTE bits and hence
      indicate they are operating on hugepage ptes.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • powerpc/mm: use generic version of pmdp_clear_flush() · f28b6ff8
      Aneesh Kumar K.V authored
      Also move the pmd_trans_huge check to generic code.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: split out pmd collapse flush into separate functions · 15a25b2e
      Aneesh Kumar K.V authored
      Architectures like ppc64 [1] need to do special things while clearing pmd
      before a collapse.  For them this operation is largely different from a
      normal hugepage pte clear.  Hence add a separate function to clear pmd
      before collapse.  After this patch pmdp_* functions operate only on
      hugepage pte, and not on regular pmd_t values pointing to page table.
      
      [1] ppc64 needs to invalidate all the normal page pte mappings we already
      have inserted in the hardware hash page table.  But before doing that we
      need to make sure there are no parallel hash page table inserts going on.
      So we need to do a kick_all_cpus_sync() before flushing the older hash
      table entries.  By moving this to a separate function we capture these
      details and mention how it is different from a hugepage pte clear.
      
      This patch is a cleanup and only does code movement for clarity.  There
      should not be any change in functionality.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing: add trace event for memory-failure · 97f0b134
      Xie XiuQi authored
      RAS user space tools like rasdaemon, which are based on trace events, can
      receive MCE error events but no memory recovery result event.  So, I want
      to add this event to make this scenario complete.
      
      This patch adds an event to the ras group for memory-failure.
      
      The output looks like this:
      #  tracer: nop
      #
      #  entries-in-buffer/entries-written: 2/2   #P:24
      #
      #                               _-----=> irqs-off
      #                              / _----=> need-resched
      #                             | / _---=> hardirq/softirq
      #                             || / _--=> preempt-depth
      #                             ||| /     delay
      #            TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
      #               | |       |   ||||       |         |
             mce-inject-13150 [001] ....   277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed
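
      A user space consumer could pick such a line apart roughly as follows.
      The regular expression is inferred from the single sample line above
      and is an assumption, not a documented format; parse_memory_failure()
      is a hypothetical helper.

```python
import re

# Pattern inferred from the one sample line; other outputs may differ.
LINE_RE = re.compile(
    r"memory_failure_event: pfn (0x[0-9a-fA-F]+): "
    r"recovery action for (.+): (\w+)\s*$"
)


def parse_memory_failure(line):
    """Return pfn, page type and result from a trace line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "pfn": int(match.group(1), 16),
        "page_type": match.group(2),
        "result": match.group(3),
    }
```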
      
      [xiexiuqi@huawei.com: fix build error]
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-failure: change type of action_result's param 3 to enum · cc3e2af4
      Xie XiuQi authored
      Change type of action_result's param 3 to enum for type consistency,
      and rename mf_outcome to mf_result for clarity.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-failure: export page_type and action result · cc637b17
      Xie XiuQi authored
      Export 'outcome' and 'action_page_type' to mm.h, so we can use
      these enums outside.
      
      This patch is preparation for adding trace events for memory-failure
      recovery action.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: Try charging a page before setting page up to date · eb3c24f3
      Mel Gorman authored
      Historically memcg overhead was high even if memcg was unused.  This has
      improved a lot but it still showed up in a profile summary as being a
      problem.
      
      /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
        mem_cgroup_try_charge                                                        2.950%   175781
        __mem_cgroup_count_vm_event                                                  1.431%    85239
        mem_cgroup_page_lruvec                                                       0.456%    27156
        mem_cgroup_commit_charge                                                     0.392%    23342
        uncharge_list                                                                0.323%    19256
        mem_cgroup_update_lru_size                                                   0.278%    16538
        memcg_check_events                                                           0.216%    12858
        mem_cgroup_charge_statistics.isra.22                                         0.188%    11172
        try_charge                                                                   0.150%     8928
        commit_charge                                                                0.141%     8388
        get_mem_cgroup_from_mm                                                       0.121%     7184
      
      That is showing that 6.64% of system CPU cycles were in memcontrol.c and
      dominated by mem_cgroup_try_charge.  The annotation shows that the bulk
      of the cost was checking PageSwapCache which is expected to be cache hot
      but is very expensive.  The problem appears to be that __SetPageUptodate
      is called just before the check which is a write barrier.  It is
      required to make sure struct page and page data is written before the
      PTE is updated and the data visible to userspace.  memcg charging does
      not require or need the barrier but gets unfairly hit with the cost so
      this patch attempts the charging before the barrier.  Aside from the
      accidental cost to memcg there is the added benefit that the barrier is
      avoided if the page cannot be charged.  When applied the relevant
      profile summary is as follows.
      
      /usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c                  3.7907   223277
        __mem_cgroup_count_vm_event                                                  1.143%    67312
        mem_cgroup_page_lruvec                                                       0.465%    27403
        mem_cgroup_commit_charge                                                     0.381%    22452
        uncharge_list                                                                0.332%    19543
        mem_cgroup_update_lru_size                                                   0.284%    16704
        get_mem_cgroup_from_mm                                                       0.271%    15952
        mem_cgroup_try_charge                                                        0.237%    13982
        memcg_check_events                                                           0.222%    13058
        mem_cgroup_charge_statistics.isra.22                                         0.185%    10920
        commit_charge                                                                0.140%     8235
        try_charge                                                                   0.131%     7716
      
      That brings the overhead down to 3.79% and leaves the memcg fault
      accounting to the root cgroup but it's an improvement.  The difference
      in headline performance of the page fault microbench is marginal as
      memcg is such a small component of it.
      
      pft faults
                                             4.0.0                  4.0.0
                                           vanilla            chargefirst
      Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1509075.7561 (  4.56%)
      Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1339160.7113 ( -0.09%)
      Hmean    faults/cpu-5  875599.0222 (  0.00%)  874174.1255 ( -0.16%)
      Hmean    faults/cpu-7  601146.6726 (  0.00%)  601370.9977 (  0.04%)
      Hmean    faults/cpu-8  510728.2754 (  0.00%)  510598.8214 ( -0.03%)
      Hmean    faults/sec-1 1432084.7845 (  0.00%) 1497935.5274 (  4.60%)
      Hmean    faults/sec-3 3943818.1437 (  0.00%) 3941920.1520 ( -0.05%)
      Hmean    faults/sec-5 3877573.5867 (  0.00%) 3869385.7553 ( -0.21%)
      Hmean    faults/sec-7 3991832.0418 (  0.00%) 3992181.4189 (  0.01%)
      Hmean    faults/sec-8 3987189.8167 (  0.00%) 3986452.2204 ( -0.02%)
      
      The improvement is only visible in the single-threaded case.  The
      overhead is still there at higher thread counts but other factors
      dominate.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb3c24f3
    • hugetlb: do not account hugetlb pages as NR_FILE_PAGES · 4165b9b4
      Michal Hocko authored
      hugetlb pages use add_to_page_cache to track shared mappings.  This is
      OK from the data structure point of view but it is less so from the
      NR_FILE_PAGES accounting:
      
      	- huge pages are accounted as 4k which is clearly wrong
      	- this counter is used as the amount of the reclaimable page
      	  cache which is incorrect as well because hugetlb pages are
      	  special and not reclaimable
      	- the counter is then exported to userspace via /proc/meminfo
      	  (in Cached:), /proc/vmstat and /proc/zoneinfo as
      	  nr_file_pages which is confusing at least:
      	  Cached:          8883504 kB
      	  HugePages_Free:     8348
      	  ...
      	  Cached:          8916048 kB
      	  HugePages_Free:      156
      	  ...
      	  that's 8192 huge pages allocated which is ~16G accounted as 32M
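      
      The mismatch can be checked with a little arithmetic.  The sketch below
      is illustrative only: the helper names are hypothetical, and the 2 MB
      huge page size is an assumption inferred from the "~16G" figure rather
      than something stated in the text.

```c
#include <stdint.h>

/* Hypothetical helper: huge pages taken from the pool between two
 * /proc/meminfo samples (HugePages_Free before and after). */
uint64_t huge_pages_allocated(uint64_t free_before, uint64_t free_after)
{
    return free_before - free_after;
}

/* Hypothetical helper: total size in kB of `pages` pages of `page_kb` kB. */
uint64_t pages_to_kb(uint64_t pages, uint64_t page_kb)
{
    return pages * page_kb;
}

/* Growth of the Cached: line between the two samples, in kB. */
uint64_t cached_delta_kb(uint64_t cached_before, uint64_t cached_after)
{
    return cached_after - cached_before;
}
```

      With the figures above, 8348 - 156 = 8192 huge pages were allocated.
      Counted as 4 kB pages that is 32768 kB (32M); counted at the assumed
      2048 kB each it is 16777216 kB (16G).  Cached grew by
      8916048 - 8883504 = 32544 kB, which is close to the 4 kB figure.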
      
      There are usually not that many huge pages in the system for this to
      make any visible difference e.g.  by fooling __vm_enough_memory or
      zone_pagecache_reclaimable.
      
      Fix this by special casing huge pages in both __delete_from_page_cache
      and __add_to_page_cache_locked.  replace_page_cache_page is currently
      only used by fuse and that shouldn't touch hugetlb pages AFAICS but it
      is more robust to check for special casing there as well.
      
      Hugetlb pages shouldn't get to any other paths where we do accounting:
      	- migration - we have a special handling via
      	  hugetlbfs_migrate_page
      	- shmem - doesn't handle hugetlb pages directly even for
      	  SHM_HUGETLB resp. MAP_HUGETLB
      	- swapcache - hugetlb is not swappable
      
      This has a user visible effect but I believe it is reasonable because the
      previously exported number is simply bogus.
      
      An alternative would be to account hugetlb pages with their real size and
      treat them similar to shmem.  But this has some drawbacks.
      
      First we would have to special-case in-kernel users of NR_FILE_PAGES, and
      considering how special hugetlb is, we would have to do it everywhere.  We
      do not want Cached exported by /proc/meminfo to include it because the
      value would be even more misleading.
      
      __vm_enough_memory and zone_pagecache_reclaimable would have to do the
      same thing because those pages are simply not reclaimable.  The correction
      is not even trivial because we would have to consider all active hugetlb
      page sizes properly.  Users of the counter outside of the kernel would
      have to do the same.
      
      So the question is why account something that basically needs to be
      excluded for each reasonable usage.  This doesn't make much sense to me.
      
      It seems that this has been broken since hugetlb was introduced but I
      haven't checked the whole history.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4165b9b4
    • mm: page_alloc: inline should_alloc_retry() · 9083905a
      Johannes Weiner authored
      The should_alloc_retry() function was meant to encapsulate retry
      conditions of the allocator slowpath, but there are still checks
      remaining in the main function, and much of how the retrying is
      performed also depends on the OOM killer progress.  The physical
      separation of those conditions makes the code hard to follow.
      
      Inline the should_alloc_retry() checks.  Notes:
      
      - The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
        replace it with looping on OOM killer progress
      
      - The pm_suspended_storage() check is meant to skip the OOM killer
        when reclaim has no IO available, move to __alloc_pages_may_oom()
      
      - The order <= PAGE_ALLOC_COSTLY_ORDER check is re-united with its
        original counterpart of checking whether reclaim actually made any
        progress
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9083905a
    • mm: oom_kill: simplify OOM killer locking · dc56401f
      Johannes Weiner authored
      The zonelist locking and the oom_sem are two overlapping locks that are
      used to serialize global OOM killing against different things.
      
      The historical zonelist locking serializes OOM kills from allocations with
      overlapping zonelists against each other to prevent killing more tasks
      than necessary in the same memory domain.  Only when neither tasklists nor
      zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
      bound to separate nodes) are OOM kills allowed to execute in parallel.
      
      The younger oom_sem is a read-write lock to serialize OOM killing against
      the PM code trying to disable the OOM killer altogether.
      
      However, the OOM killer is a fairly cold error path, there is really no
      reason to optimize for highly performant and concurrent OOM kills.  And
      the oom_sem is just flat-out redundant.
      
      Replace both locking schemes with a single global mutex serializing OOM
      kills regardless of context.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc56401f
    • mm: oom_kill: remove unnecessary locking in exit_oom_victim() · da51b14a
      Johannes Weiner authored
      Disabling the OOM killer needs to exclude allocators from entering, not
      existing victims from exiting.
      
      Right now the only waiter is suspend code, which achieves quiescence by
      disabling the OOM killer.  But later on we want to add waits that hold
      the lock instead to stop new victims from showing up.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da51b14a
    • mm: oom_kill: generalize OOM progress waitqueue · c38f1025
      Johannes Weiner authored
      It turns out that the mechanism to wait for exiting OOM victims is less
      generic than it looks: it won't issue wakeups unless the OOM killer is
      disabled.
      
      The reason this check was added was the thought that, since only the OOM
      disabling code would wait on this queue, wakeup operations could be
      saved when that specific consumer is known to be absent.
      
      However, this is quite the hand grenade.  Later attempts to reuse the
      waitqueue for other purposes will lead to completely unexpected bugs and
      the failure mode will appear seemingly illogical.  Generally, providers
      shouldn't make unnecessary assumptions about consumers.
      
      This could have been replaced with waitqueue_active(), but it only saves
      a few instructions in one of the coldest paths in the kernel.  Simply
      remove it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c38f1025
    • mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear · 46402778
      Johannes Weiner authored
      exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else
      can clear it concurrently.  Use clear_thread_flag() directly.
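      
      The distinction can be illustrated in userspace with C11 atomics.  This
      is only an analogy under stated assumptions: FLAG_MEMDIE_SKETCH and both
      helpers are hypothetical names, not the kernel's thread-flag API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for TIF_MEMDIE. */
#define FLAG_MEMDIE_SKETCH (1u << 0)

/* Atomically clear `bit` and report whether it was set beforehand. */
bool test_and_clear_flag(atomic_uint *flags, unsigned int bit)
{
    unsigned int old = atomic_fetch_and(flags, ~bit);
    return (old & bit) != 0;
}

/* Atomically clear `bit`; the previous value is deliberately ignored. */
void clear_flag(atomic_uint *flags, unsigned int bit)
{
    atomic_fetch_and(flags, ~bit);
}
```

      When the caller already knows the bit is set and nobody else can clear
      it, the old value returned by the test-and-clear variant carries no
      information, so the plain clear is enough.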
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46402778
    • mm: oom_kill: clean up victim marking and exiting interfaces · 16e95196
      Johannes Weiner authored
      Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
      are related in functionality, but the interface is not symmetrical at
      all: one is an internal OOM killer function used during the killing, the
      other is for an OOM victim to signal its own death on exit later on.
      This has locking implications, see follow-up changes.
      
      While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
      is easier on the eye.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16e95196
    • mm: oom_kill: remove unnecessary locking in oom_enable() · 3f5ab8cf
      Johannes Weiner authored
      Setting oom_killer_disabled to false is atomic, there is no need for
      further synchronization with ongoing allocations trying to OOM-kill.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f5ab8cf
    • mm/memory hotplug: init the zone's size when calculating node totalpages · febd5949
      Gu Zheng authored
      Init the zone's size when calculating node totalpages to avoid duplicated
      operations in free_area_init_core().
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      febd5949
    • mm/hugetlb: introduce minimum hugepage order · 641844f5
      Naoya Horiguchi authored
      Currently the initial value of order in dissolve_free_huge_pages() is 64
      or 32, which leads to the following static checker warning:
      
        mm/hugetlb.c:1203 dissolve_free_huge_pages()
        warn: potential right shift more than type allows '9,18,64'
      
      This risks an infinite loop, because 1 << order (== 0) is used as the
      step of a for loop like this:
      
        for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
            ...
      
      So this patch fixes it by using global minimum_order calculated at boot time.
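      
      A minimal userspace sketch of the idea follows.  It is not the kernel
      code: struct hstate_model and both functions are hypothetical, and the
      explicit width check exists only so the sketch never performs an
      undefined shift.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-size hugepage state. */
struct hstate_model {
    unsigned int order;
};

/* Smallest order among the registered sizes; UINT_MAX if n == 0. */
unsigned int min_hugepage_order(const struct hstate_model *h, size_t n)
{
    unsigned int min_order = UINT_MAX;

    for (size_t i = 0; i < n; i++)
        if (h[i].order < min_order)
            min_order = h[i].order;
    return min_order;
}

/* Count iterations over [start_pfn, end_pfn) with a step of 1 << order.
 * Shifting by the type width or more is undefined in C, so such an order
 * is rejected up front and 0 is returned without looping. */
unsigned long count_steps(unsigned long start_pfn, unsigned long end_pfn,
                          unsigned int order)
{
    unsigned long steps = 0;

    if (order >= sizeof(unsigned long) * CHAR_BIT)
        return 0;
    for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn += 1UL << order)
        steps++;
    return steps;
}
```

      With orders 9 and 18 registered (two of the values named in the warning
      above), the minimum is 9, so the loop advances 512 pfns at a time
      instead of using a step that may evaluate to 0.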
      
          text    data     bss     dec     hex filename
         28313     469   84236  113018   1b97a mm/hugetlb.o
         28256     473   84236  112965   1b945 mm/hugetlb.o (patched)
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      641844f5
    • rmap: fix theoretical race between do_wp_page and shrink_active_list · 414e2fb8
      Vladimir Davydov authored
      As noted by Paul the compiler is free to store a temporary result in a
      variable on stack, heap or global unless it is explicitly marked as
      volatile, see:
      
        http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html#sample-optimizations
      
      This can result in a race between do_wp_page() and shrink_active_list()
      as follows.
      
      In do_wp_page() we can call page_move_anon_rmap(), which sets
      page->mapping as follows:
      
        anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
        page->mapping = (struct address_space *) anon_vma;
      
      The page in question may be on an LRU list, because nowhere in
      do_wp_page() do we remove it from the list, nor do we take any
      LRU-related locks.  Although the page is locked, shrink_active_list() can
      still call page_referenced() on it concurrently, because the latter does
      not require an anonymous page to be locked:
      
        CPU0                          CPU1
        ----                          ----
        do_wp_page                    shrink_active_list
         lock_page                     page_referenced
                                        PageAnon->yes, so skip trylock_page
         page_move_anon_rmap
          page->mapping = anon_vma
                                        rmap_walk
                                         PageAnon->no
                                         rmap_walk_file
                                          BUG
          page->mapping += PAGE_MAPPING_ANON
      
      This patch fixes this race by explicitly forbidding the compiler to split
      page->mapping store in page_move_anon_rmap() with the aid of WRITE_ONCE.
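      
      The shape of the fix can be sketched in userspace as follows.  This is
      a simplified model, not the kernel implementation: WRITE_ONCE_SKETCH is
      a reduced stand-in for WRITE_ONCE that relies on the GCC/Clang
      __typeof__ extension, and struct page_model, the tag value and the
      function names are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's WRITE_ONCE(): the store goes
 * through a volatile lvalue, so the compiler may not split it into
 * several partial updates of the destination. */
#define WRITE_ONCE_SKETCH(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Assumed low tag bit marking an anonymous mapping in this model. */
#define PAGE_MAPPING_ANON_SKETCH ((uintptr_t)1)

struct page_model {
    uintptr_t mapping;
};

void move_anon_rmap_model(struct page_model *page, uintptr_t anon_vma)
{
    /* Build the tagged value in a local first ... */
    uintptr_t tagged = anon_vma + PAGE_MAPPING_ANON_SKETCH;

    /* ... then publish it with one volatile store, so a concurrent
     * reader never observes the pointer without its tag. */
    WRITE_ONCE_SKETCH(page->mapping, tagged);
}

int page_is_anon_model(const struct page_model *page)
{
    return (page->mapping & PAGE_MAPPING_ANON_SKETCH) != 0;
}
```

      Without the volatile access the compiler would be allowed to store the
      untagged pointer first and add the tag afterwards, which is the
      intermediate state shown in the CPU0/CPU1 diagram.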
      
      [akpm@linux-foundation.org: tweak comment, per Minchan]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      414e2fb8
    • mm/memory-failure: me_huge_page() does nothing for thp · 2491ffee
      Naoya Horiguchi authored
      memory_failure() is not supposed to handle thp itself, but to split it.
      But if something went wrong and page_action() were called on thp,
      me_huge_page() (the action routine for hugepages) had better take no
      action rather than the wrong action prepared for hugetlb (which
      triggers BUG_ON()).
      
      This change addresses a potential problem only, but makes sense to me
      because thp is an actively developed feature and this code path may
      become reachable in the future.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2491ffee
    • mm: soft-offline: don't free target page in successful page migration · add05cec
      Naoya Horiguchi authored
      Stress testing showed that soft offline events for a process iterating
      "mmap-pagefault-munmap" loop can trigger
      VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
      
        Soft offlining page 0x70fe1 at 0x70100008d000
        Soft offlining page 0x705fb at 0x70300008d000
        page:ffffea0001c3f840 count:0 mapcount:0 mapping:          (null) index:0x2
        flags: 0x1fffff80800000(hwpoison)
        page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
        ------------[ cut here ]------------
        kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
        invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
        CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
        RIP: free_pcppages_bulk+0x52a/0x6f0
        Call Trace:
          drain_pages_zone+0x3d/0x50
          drain_local_pages+0x1d/0x30
          on_each_cpu_mask+0x46/0x80
          drain_all_pages+0x14b/0x1e0
          soft_offline_page+0x432/0x6e0
          SyS_madvise+0x73c/0x780
          system_call_fastpath+0x12/0x17
        Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
        RIP  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
         RSP <ffff88007a117d28>
        ---[ end trace 53926436e76d1f35 ]---
      
      When soft offline successfully migrates a page, the source page is supposed
      to be freed.  But there is a race condition where a source page looks
      isolated (i.e.  the refcount is 0 and PageHWPoison is set) but is still
      somehow linked to a pcplist.  Then another soft offline event calls
      drain_all_pages() and tries to free such a hwpoisoned page, which is
      forbidden.
      
      This odd page state seems to happen due to the race between put_page() in
      putback_lru_page() and __pagevec_lru_add_fn().  But I don't want to play
      with tweaking drain code as done in commit 9ab3b598 "mm: hwpoison:
      drop lru_add_drain_all() in __soft_offline_page()", or to change page
      freeing code for this soft offline's purpose.
      
      Instead, let's think about the difference between hard offline and soft
      offline.  There is an interesting difference in how the two isolate an
      in-use page: hard offline marks PageHWPoison on the target page first and
      doesn't free it, keeping its refcount at 1.  OTOH, soft offline tries to
      free the target page and then marks PageHWPoison.  This difference might
      be the source of complexity and result in bugs like the above.  So making
      soft offline isolate the page while keeping its refcount can be a
      solution for this problem.
      
      We can pass to page migration code the "reason" which shows the caller, so
      let's use this more to avoid calling putback_lru_page() when called from
      soft offline, which effectively does the isolation for soft offline.  With
      this change, target pages of soft offline are never reused without
      changing migratetype, so this patch also removes the related code.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      add05cec
    • mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling · ead07f6a
      Naoya Horiguchi authored
      memory_failure() can run in 2 different modes (selected by
      MF_COUNT_INCREASED) with respect to the page refcount.  When
      MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
      has taken a refcount on the target page.  If it is cleared,
      memory_failure() takes the refcount on its own.
      
      In current code, however, refcounting is done differently in each caller.
      For example, madvise_hwpoison() uses get_user_pages_fast() and
      hwpoison_inject() uses get_page_unless_zero().  This inconsistent
      refcounting causes refcount failures, especially for thp tail pages.
      Typical user visible effects are memory leaks or
      VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
      
      To fix this refcounting issue, this patch introduces get_hwpoison_page()
      to handle thp tail pages in the same manner for each caller of hwpoison
      code.
      
      memory_failure() might fail to split thp and in such case it returns
      without completing page isolation.  This is not good because PageHWPoison
      on the thp is still set and there's no easy way to unpoison such thps.  So
      this patch tries to roll back any action on the thp in the "non anonymous
      thp" case and the "thp split failed" case, expecting that an MCE(SRAR)
      generated by a later access will properly free such thps.
      
      [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ead07f6a
    • mm/memory-failure: split thp earlier in memory error handling · 415c64c1
      Naoya Horiguchi authored
      memory_failure() doesn't handle thp itself at this time and needs to split
      it before doing isolation.  Currently thp is split in the middle of
      hwpoison_user_mappings(), but there are corner cases where memory_failure()
      wrongly tries to handle thp without splitting.
      
      1) "non anonymous" thp, which is not a normal operating mode of thp,
         but a memory error could hit a thp before anon_vma is initialized.  In
         such case, split_huge_page() fails and me_huge_page() (intended for
         hugetlb) is called for thp, which triggers BUG_ON in page_hstate().
      
      2) !PageLRU case, where hwpoison_user_mappings() returns with
         SWAP_SUCCESS and the result is the same as case 1.
      
      memory_failure() can't avoid splitting, so let's split it earlier,
      which also reduces the code that has to handle both normal pages and
      thp.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      415c64c1
    • mm: rename RECLAIM_SWAP to RECLAIM_UNMAP · 95bbc0c7
      Zhihui Zhang authored
      The name SWAP implies that we are dealing with anonymous pages only.  In
      fact, the original patch that introduced the min_unmapped_ratio logic
      was to fix an issue related to file pages.  Rename it to RECLAIM_UNMAP
      to match what it does.
      
      Historically, commit a6dc60f8 ("vmscan: rename sc.may_swap to
      may_unmap") renamed .may_swap to .may_unmap, leaving RECLAIM_SWAP
      behind.  commit 2e2e4259 ("vmscan,memcg: reintroduce sc->may_swap")
      reintroduced .may_swap for memory controller.
      Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95bbc0c7
    • mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages · f012a84a
      Nishanth Aravamudan authored
      Based upon 675becce ("mm: vmscan: do not throttle based on pfmemalloc
      reserves if node has no ZONE_NORMAL") from Mel.
      
      We have a system with the following topology:
      
      # numactl -H
      available: 3 nodes (0,2-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
      23 24 25 26 27 28 29 30 31
      node 0 size: 28273 MB
      node 0 free: 27323 MB
      node 2 cpus:
      node 2 size: 16384 MB
      node 2 free: 0 MB
      node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 3 size: 30533 MB
      node 3 free: 13273 MB
      node distances:
      node   0   2   3
        0:  10  20  20
        2:  20  10  20
        3:  20  20  10
      
      Node 2 has no free memory, because:
      # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
      1
      
      This leads to the following zoneinfo:
      
      Node 2, zone      DMA
        pages free     0
              min      1840
              low      2300
              high     2760
              scanned  0
              spanned  262144
              present  262144
              managed  262144
      ...
        all_unreclaimable: 1
      
      If one then attempts to allocate some normal 16M hugepages via
      
      echo 37 > /proc/sys/vm/nr_hugepages
      
      The echo never returns and kswapd2 consumes CPU cycles.
      
      This is because throttle_direct_reclaim ends up calling
      wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
      pfmemalloc_watermark_ok() in turn checks all zones on the node if there
      are any reserves, and if so, then indicates the watermarks are ok, by
      seeing if there are sufficient free pages.
      
      675becce added a condition already for memoryless nodes.  In this case,
      though, the node has memory, it is just all consumed (and not
      reclaimable).  Effectively, though, the result is the same on this call to
      pfmemalloc_watermark_ok() and thus seems like a reasonable additional
      condition.
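      
      The effect of the additional condition can be modelled in userspace as
      below.  This is a sketch under stated assumptions, not the kernel
      function: struct zone_model and its fields are hypothetical, and the
      "free pages above half the reserve" threshold is an assumption about
      what "sufficient free pages" means.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone summary of the values the check looks at. */
struct zone_model {
    unsigned long free_pages;
    unsigned long min_wmark;
    unsigned long reclaimable_pages;
    bool populated;
};

/* Returns true when the caller should NOT be throttled. */
bool pfmemalloc_watermark_ok_model(const struct zone_model *zones, size_t nr)
{
    unsigned long reserve = 0, free_pages = 0;

    for (size_t i = 0; i < nr; i++) {
        /* Skip unpopulated zones and, as the additional condition,
         * zones that have nothing left to reclaim. */
        if (!zones[i].populated || zones[i].reclaimable_pages == 0)
            continue;
        reserve += zones[i].min_wmark;
        free_pages += zones[i].free_pages;
    }

    /* No usable reserves were counted: do not throttle. */
    if (reserve == 0)
        return true;

    /* Assumed threshold: free pages must exceed half the reserve. */
    return free_pages > reserve / 2;
}
```

      A zone like the Node 2 DMA zone above (0 free pages, min 1840, nothing
      reclaimable) is skipped, so the model reports the watermark as ok and
      the waiter is not left sleeping on a node that can never make progress.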
      
      With this change, the afore-mentioned 16M hugepage allocation attempt
      succeeds and correctly round-robins between Nodes 1 and 3.
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f012a84a