1. 07 Aug, 2014 40 commits
    • mm: softdirty: respect VM_SOFTDIRTY in PTE holes · 68b5a652
      Peter Feiner authored
      After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
      should report that the VMA's virtual pages are soft-dirty until
      VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
      /proc/pid/clear_refs).  However, pagemap ignores the VM_SOFTDIRTY flag
      for virtual addresses that fall in PTE holes (i.e., virtual addresses
      that don't have a PMD, PUD, or PGD allocated yet).
      
      To observe this bug, use mmap to create a VMA large enough such that
      there's a good chance that the VMA will occupy an unused PMD, then test
      the soft-dirty bit on its pages.  In practice, I found that a VMA that
      covered a PMD's worth of address space was big enough.
      
      This patch adds the necessary VMA lookup to the PTE hole callback in
      /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
      VM_SOFTDIRTY flag.
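      
      A hedged way to observe the reported bit from userspace (a sketch, not
      part of the patch; it assumes 4 KiB pages, a 64-bit build, and the
      documented pagemap ABI, where bit 55 of each 64-bit entry is the
      soft-dirty flag):
      
        /* Map a PMD's worth of anonymous memory, leave it untouched so no
         * page tables exist for it, then read its first pagemap entry. */
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>
        
        int main(void)
        {
                size_t len = 2UL << 20;         /* one PMD on x86-64 */
                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                int fd = open("/proc/self/pagemap", O_RDONLY);
                uint64_t entry;
                off_t off = (uintptr_t)buf / 4096 * sizeof(entry);
        
                if (buf == MAP_FAILED || fd < 0)
                        return 1;
                if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
                        return 1;
                printf("soft-dirty for untouched page: %d\n",
                       (int)((entry >> 55) & 1));
                close(fd);
                return 0;
        }
      
      With the fix, the untouched (PTE-hole) page should report 1 until
      VM_SOFTDIRTY is cleared by writing "4" to /proc/self/clear_refs.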
      Signed-off-by: Peter Feiner <pfeiner@google.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mark fault_around_bytes __read_mostly · 3a91053a
      Kirill A. Shutemov authored
      fault_around_bytes can only be changed via debugfs.  Let's mark it
      read-mostly.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: close race between do_fault_around() and fault_around_bytes_set() · aecd6f44
      Kirill A. Shutemov authored
      Things can go wrong if fault_around_bytes is changed under
      do_fault_around(), that is, between the calls to fault_around_mask() and
      fault_around_pages().
      
      Let's read fault_around_bytes only once during do_fault_around() and
      calculate the mask based on that single reading.
      
      Note: fault_around_bytes can only be updated via the debug interface.
      Also, I've tried but was not able to trigger bad behaviour without the
      patch, so I would not consider this patch urgent.
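      
      A minimal sketch of the idea (illustrative only; the names follow the
      3.16-era mm/memory.c helpers and the real patch is shaped differently):
      snapshot the tunable once and derive everything from that single value.
      
        static void do_fault_around_sketch(unsigned long address)
        {
                /* One read: a concurrent debugfs write can no longer hand the
                 * mask and the page count two different values. */
                unsigned long bytes = ACCESS_ONCE(fault_around_bytes);
                unsigned long nr_pages = rounddown_pow_of_two(bytes) >> PAGE_SHIFT;
                unsigned long mask = rounddown_pow_of_two(bytes) - 1;
                unsigned long start = address & ~mask & PAGE_MASK;
        
                /* ... populate up to nr_pages ptes starting at start ... */
        }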
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, vmscan: Fix forced scan of anonymous pages · 2ab051e1
      Jerome Marchand authored
      When memory cgroups are enabled, the code that decides to force a scan
      of anonymous pages in get_scan_count() compares global values (free,
      high_watermark) to a value that is restricted to a memory cgroup (file).
      This makes the code over-eager to force an anon scan.
      
      For instance, it will force an anon scan when scanning a memcg that is
      mainly populated by anonymous pages, even when there are plenty of file
      pages to get rid of in other memcgs, and even when swappiness == 0.  It
      breaks users' expectations about swappiness and hurts performance.
      
      This patch makes sure that a forced anon scan only happens when there are
      not enough file pages for the whole zone, not just for one random memcg.
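      
      A hedged sketch of the intended check (the helpers shown are the
      standard zone statistics API; the exact patch may differ in detail):
      
        /* Force SCAN_ANON only when the *whole zone* is short on file pages,
         * not when one memcg merely happens to contain few of them. */
        if (global_reclaim(sc)) {
                unsigned long zonefree = zone_page_state(zone, NR_FREE_PAGES);
                unsigned long zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
                                         zone_page_state(zone, NR_INACTIVE_FILE);
        
                if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
                        scan_balance = SCAN_ANON;
                        goto out;
                }
        }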
      
      [hannes@cmpxchg.org: cleanups]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: fix an outdated comment still mentioning get_scan_ratio · 7c0db9e9
      Jerome Marchand authored
      Quite a while ago, get_scan_ratio() was renamed get_scan_count(), but a
      comment in shrink_active_list() still mentions it.  This patch fixes the
      outdated comment.
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: remove unnecessary exit_state check · fb794bcb
      David Rientjes authored
      The oom killer scans each process and determines whether it is eligible
      for oom kill or whether the oom killer should abort because of
      concurrent memory freeing.  It will abort when an eligible process is
      found to have TIF_MEMDIE set, meaning it has already been oom killed and
      we're waiting for it to exit.
      
      Processes with task->mm == NULL should not be considered because they
      are either kthreads or have already detached their memory and killing
      them would not lead to memory freeing.  That memory is only freed after
      exit_mm() has returned, however, and not when task->mm is first set to
      NULL.
      
      Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
      is no longer considered for oom kill, but only until exit_mm() has
      returned.  This was fragile in the past because it relied on
      exit_notify() to be reached before no longer considering TIF_MEMDIE
      processes.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix potential infinite loop in dissolve_free_huge_pages() · d0177639
      Li Zhong authored
      It is possible for some platforms, such as powerpc, to set HPAGE_SHIFT
      to 0 to indicate that huge pages are not supported.
      
      When this is the case, hugetlbfs is disabled at boot time:
      
        hugetlbfs: disabling because there are no supported hugepage sizes
      
      Then in dissolve_free_huge_pages(), order is left at its maximum (64 on
      64-bit) and the following for loop never ends:
      
        for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
      
      As suggested by Naoya, the fix below checks hugepages_supported() before
      calling dissolve_free_huge_pages().
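      
      A hedged sketch of the guard at the caller (illustrative; the actual
      call site is in the memory hot-remove path):
      
        /* Bail out instead of looping with a huge stride when no hugepage
         * size is supported and the hstate order is meaningless. */
        if (hugepages_supported())
                dissolve_free_huge_pages(start_pfn, end_pfn);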
      
      [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()]
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: restructure thp avoidance of light synchronous migration · 8fe78048
      David Rientjes authored
      __GFP_NO_KSWAPD, once the way to determine whether an allocation was for
      thp or not, has gained more users.  Their use is not necessarily wrong:
      they are trying to do a memory allocation that can easily fail without
      disturbing kswapd, so the bit has gained additional use cases.
      
      This restructures the check to determine whether MIGRATE_SYNC_LIGHT
      should be used for memory compaction in the page allocator.  Rather than
      testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
      for thp allocations.
      
      This also moves the check to be done only after the page allocator is
      aborted for deferred or contended memory compaction since setting
      migration_mode for this case is pointless.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: rename zonelist locking functions · e972a070
      David Rientjes authored
      try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
      to imply that they require locking semantics to avoid out_of_memory()
      being reordered.
      
      zone_scan_lock is required for both functions to ensure that there is
      proper locking synchronization.
      
      Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
      clear_zonelist_oom() to oom_zonelist_unlock() to imply that proper
      locking semantics are used.
      
      At the same time, convert oom_zonelist_trylock() to return bool instead
      of int since only success and failure are tested.
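      
      A hedged sketch of how a call site reads after the rename (illustrative
      usage, not an exact excerpt):
      
        /* The trylock/unlock names make the zone_scan_lock protocol around
         * out_of_memory() explicit. */
        if (!oom_zonelist_trylock(zonelist, gfp_mask))
                return;                 /* an OOM kill is already in flight */
        
        out_of_memory(zonelist, gfp_mask, order, nodemask, false);
        oom_zonelist_unlock(zonelist, gfp_mask);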
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: ensure memoryless node zonelist always includes zones · 8d060bf4
      David Rientjes authored
      With memoryless node support being worked on, it's possible that, as an
      optimization, a node may not have a non-NULL zonelist.  When CONFIG_NUMA
      is enabled and node 0 is memoryless, this means the zonelist for
      first_online_node may become NULL.
      
      The oom killer requires a zonelist that includes all memory zones for
      the sysrq trigger and pagefault out of memory handler.
      
      Ensure that a non-NULL zonelist is always passed to the oom killer.
      
      [akpm@linux-foundation.org: fix non-numa build]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: sh: suitable memory should go to ZONE_MOVABLE · 6e90b58b
      Wang Nan authored
      This patch introduces zone_for_memory() to arch_add_memory() on sh to
      ensure that new, higher memory is added to ZONE_MOVABLE if the movable
      zone has already been set up.
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: ppc: suitable memory should go to ZONE_MOVABLE · f51202de
      Wang Nan authored
      This patch introduces zone_for_memory() to arch_add_memory() on powerpc
      to ensure that new, higher memory is added to ZONE_MOVABLE if the movable
      zone has already been set up.
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: ia64: suitable memory should go to ZONE_MOVABLE · ed562ae6
      Wang Nan authored
      This patch introduces zone_for_memory() to arch_add_memory() on ia64 to
      ensure that new, higher memory is added to ZONE_MOVABLE if the movable
      zone has already been set up.
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: x86_32: suitable memory should go to ZONE_MOVABLE · 03d4be64
      Wang Nan authored
      This patch introduces zone_for_memory() to arch_add_memory() on x86_32
      to ensure that new, higher memory is added to ZONE_MOVABLE if the movable
      zone has already been set up.
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: x86_64: suitable memory should go to ZONE_MOVABLE · 9bfc4113
      Wang Nan authored
      This patch introduces zone_for_memory() to arch_add_memory() on x86_64
      to ensure that new, higher memory is added to ZONE_MOVABLE if the movable
      zone has already been set up.
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: add zone_for_memory() for selecting zone for new memory · 63264400
      Wang Nan authored
      This series of patches fixes a problem that occurs when memory is added
      in a bad manner.  For example, for an x86_64 machine booted with
      "mem=400M" and with 2GiB of memory installed, the following commands
      cause a problem:
      
        # echo 0x40000000 > /sys/devices/system/memory/probe
       [   28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
        # echo 0x48000000 > /sys/devices/system/memory/probe
       [   28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
        # echo online_movable > /sys/devices/system/memory/memory9/state
        # echo 0x50000000 > /sys/devices/system/memory/probe
       [   29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
        # echo 0x58000000 > /sys/devices/system/memory/probe
       [   29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
        # echo online_movable > /sys/devices/system/memory/memory11/state
        # echo online> /sys/devices/system/memory/memory8/state
        # echo online> /sys/devices/system/memory/memory10/state
        # echo offline> /sys/devices/system/memory/memory9/state
       [   30.558819] Offlined Pages 32768
        # free
                    total       used       free     shared    buffers     cached
       Mem:        780588 18014398509432020     830552          0          0      51180
       -/+ buffers/cache: 18014398509380840     881732
       Swap:            0          0          0
      
      This is because the above commands probe higher memory after onlining a
      section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
      for systems without ZONE_HIGHMEM) to overlap ZONE_MOVABLE.
      
      After the second online_movable, the problem can be observed from
      zoneinfo:
      
        # cat /proc/zoneinfo
        ...
        Node 0, zone  Movable
          pages free     65491
                min      250
                low      312
                high     375
                scanned  0
                spanned  18446744073709518848
                present  65536
                managed  65536
        ...
      
      This series of patches solves the problem by checking ZONE_MOVABLE when
      choosing the zone for new memory.  If the new memory is inside or higher
      than ZONE_MOVABLE, make it go there instead.
      
      After applying this series of patches, the free and zoneinfo results
      (after offlining memory9) are as follows:
      
        bash-4.2# free
                      total       used       free     shared    buffers     cached
         Mem:        780956      80112     700844          0          0      51180
         -/+ buffers/cache:      28932     752024
         Swap:            0          0          0
      
        bash-4.2# cat /proc/zoneinfo
      
        Node 0, zone      DMA
          pages free     3389
                min      14
                low      17
                high     21
                scanned  0
                spanned  4095
                present  3998
                managed  3977
            nr_free_pages 3389
        ...
          start_pfn:         1
          inactive_ratio:    1
        Node 0, zone    DMA32
          pages free     73724
                min      341
                low      426
                high     511
                scanned  0
                spanned  98304
                present  98304
                managed  92958
            nr_free_pages 73724
          ...
          start_pfn:         4096
          inactive_ratio:    1
        Node 0, zone   Normal
          pages free     32630
                min      120
                low      150
                high     180
                scanned  0
                spanned  32768
                present  32768
                managed  32768
            nr_free_pages 32630
        ...
          start_pfn:         262144
          inactive_ratio:    1
        Node 0, zone  Movable
          pages free     65476
                min      241
                low      301
                high     361
                scanned  0
                spanned  98304
                present  65536
                managed  65536
            nr_free_pages 65476
        ...
          start_pfn:         294912
          inactive_ratio:    1
      
      This patch (of 7):
      
      Introduce zone_for_memory() in arch independent code for
      arch_add_memory() use.
      
      Many arch_add_memory() functions simply select ZONE_HIGHMEM or
      ZONE_NORMAL and add new memory to it.  However, with the existence of
      ZONE_MOVABLE, the selection needs to be considered carefully: if new,
      higher memory is added after ZONE_MOVABLE has been set up, the default
      zone and ZONE_MOVABLE may overlap each other.
      
      should_add_memory_movable() checks the status of ZONE_MOVABLE.  If it
      already contains memory, compare the address of the new memory with that
      of the movable memory.  If the new memory is higher than the movable
      memory, it should be added to ZONE_MOVABLE instead of the default zone.
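      
      A hedged sketch of the selection logic described above (simplified; the
      real helpers live in mm/memory_hotplug.c and may differ in detail):
      
        /* If ZONE_MOVABLE already holds memory and the new range starts at or
         * above it, place the new memory in ZONE_MOVABLE so the default zone
         * cannot grow past it and overlap. */
        static int should_add_memory_movable(int nid, u64 start, u64 size)
        {
                unsigned long start_pfn = start >> PAGE_SHIFT;
                struct zone *movable = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
        
                if (zone_is_empty(movable))
                        return 0;
                return movable->zone_start_pfn <= start_pfn;
        }
        
        int zone_for_memory(int nid, u64 start, u64 size, int zone_default)
        {
                if (should_add_memory_movable(nid, start, size))
                        return ZONE_MOVABLE;
                return zone_default;
        }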
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: remove kmemcg id from create_unique_id · aee52cae
      Vladimir Davydov authored
      This function is never called for memcg caches, because they are
      unmergeable, so remove the dead code.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, writeback: prevent race when calculating dirty limits · 9ef0a0ff
      David Rientjes authored
      Setting vm_dirty_bytes and dirty_background_bytes is not protected by
      any serialization.
      
      Therefore, it's possible for either variable to change value after the
      test in global_dirty_limits() to determine whether available_memory
      needs to be initialized or not.
      
      Always ensure that available_memory is properly initialized.
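      
      A hedged sketch of the shape of the fix in global_dirty_limits()
      (simplified; the real function also derives the background limit):
      
        /* Compute available_memory unconditionally, so a concurrent write to
         * vm_dirty_bytes cannot leave it uninitialized on the ratio path. */
        const unsigned long available_memory = global_dirtyable_memory();
        unsigned long dirty;
        
        if (vm_dirty_bytes)
                dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
        else
                dirty = (vm_dirty_ratio * available_memory) / 100;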
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode · 14a4e214
      David Rientjes authored
      Commit 9f1b868a ("mm: thp: khugepaged: add policy for finding target
      node") improved the previous khugepaged logic, which allocated a
      transparent hugepage from the node of the first page being collapsed.
      
      However, it is still possible to collapse pages to remote memory which
      may suffer from additional access latency.  With the current policy, it
      is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
      remotely if the majority are allocated from that node.
      
      When zone_reclaim_mode is enabled, it means the VM should make every
      attempt to allocate locally to prevent NUMA performance degradation.  In
      this case, we do not want to collapse hugepages to remote nodes that
      would suffer from increased access latency.  Thus, when
      zone_reclaim_mode is enabled, only allow collapsing to nodes with
      RECLAIM_DISTANCE or less.
      
      There is no functional change for systems that disable
      zone_reclaim_mode.
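      
      A hedged sketch of the node filter (the helper name here is
      hypothetical; the real check sits in khugepaged's scan loop):
      
        /* With zone_reclaim_mode enabled, refuse to collapse into a node that
         * is farther than RECLAIM_DISTANCE from the node being scanned. */
        static bool khugepaged_node_allowed(int scan_nid, int target_nid)
        {
                if (!zone_reclaim_mode)
                        return true;
                return node_distance(scan_nid, target_nid) <= RECLAIM_DISTANCE;
        }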
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem.c: remove the unused gfp arg to shmem_add_to_page_cache() · fed400a1
      Wang Sheng-Hui authored
      The gfp arg is not used in shmem_add_to_page_cache.  Remove this unused
      arg.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: describe mmap_sem rules for __lock_page_or_retry() and callers · 9a95f3cf
      Paul Cassella authored
      Add a comment describing the circumstances in which
      __lock_page_or_retry() will or will not release the mmap_sem when
      returning 0.
      
      Add comments to lock_page_or_retry()'s callers (filemap_fault(),
      do_swap_page()) noting the impact on VM_FAULT_RETRY returns.
      
      Add comments up the call tree, particularly replacing the false "We
      return with mmap_sem still held" comments.
      Signed-off-by: Paul Cassella <cassella@cray.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: reduce cost of the fair zone allocation policy · 4ffeaf35
      Mel Gorman authored
      The fair zone allocation policy round-robins allocations between zones
      within a node to avoid age inversion problems during reclaim.  If the
      first allocation fails, the batch counts are reset and a second attempt
      made before entering the slow path.
      
      One assumption made with this scheme is that batches expire at roughly
      the same time and the resets each time are justified.  This assumption
      does not hold when zones reach their low watermark as the batches will
      be consumed at uneven rates.  Allocation failure due to watermark
      depletion results in additional zonelist scans for the reset and another
      watermark check before hitting the slowpath.
      
      On UMA, the benefit is negligible -- around 0.25%.  On a 4-socket NUMA
      machine it's variable due to the variability of measuring overhead with
      the vmstat changes.  The system CPU overhead comparison looks like:
      
                3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                   vanilla   vmstat-v5 lowercost-v5
      User          746.94      774.56      802.00
      System      65336.22    32847.27    40852.33
      Elapsed     27553.52    27415.04    27368.46
      
      However it is worth noting that the overall benchmark still completed
      faster and intuitively it makes sense to take as few passes as possible
      through the zonelists.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: abort fair zone allocation policy when remotes nodes are encountered · f7b5d647
      Mel Gorman authored
      The purpose of numa_zonelist_order=zone is to preserve lower zones for
      use with 32-bit devices.  If locality is preferred then the
      numa_zonelist_order=node policy should be used.
      
      Unfortunately, the fair zone allocation policy overrides this by
      skipping zones on remote nodes until the lower one is found.  While this
      makes sense from a page aging and performance perspective, it breaks the
      expected zonelist policy.  This patch restores the expected behaviour
      for zone-list ordering.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: only update per-cpu thresholds for online CPU · bb0b6dff
      Mel Gorman authored
      When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
      to get more accurate counts to avoid breaching watermarks.  This
      threshold update iterates over all possible CPUs which is unnecessary.
      Only online CPUs need to be updated.  If a new CPU is onlined,
      refresh_zone_stat_thresholds() will set the thresholds correctly.
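      
      A hedged sketch of the loop change (field names follow mm/vmstat.c of
      that era):
      
        /* Only touch pagesets of CPUs that are actually online; a CPU onlined
         * later gets its threshold from refresh_zone_stat_thresholds(). */
        for_each_online_cpu(cpu)
                per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold;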
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move zone->pages_scanned into a vmstat counter · 0d5d823a
      Mel Gorman authored
      zone->pages_scanned is a write-intensive cache line during page reclaim
      and it's also updated during page free.  Move the counter into vmstat to
      take advantage of the per-cpu updates and do not update it in the free
      paths unless necessary.
      
      On a small UMA machine running tiobench the difference is marginal.  On
      a 4-node machine the overhead is more noticeable.  Note that automatic
      NUMA balancing was disabled for this test as otherwise the system CPU
      overhead is unpredictable.
      
                3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                   vanilla  rearrange-v5   vmstat-v5
      User          746.94      759.78      774.56
      System      65336.22    58350.98    32847.27
      Elapsed     27553.52    27282.02    27415.04
      
      Note that the overhead reduction will vary depending on where exactly
      pages are allocated and freed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines · 3484b2de
      Mel Gorman authored
      The arrangement of struct zone has changed over time and now it has
      reached the point where there is some inappropriate sharing going on.
      On x86-64 for example
      
      o The zone->node field is shared with the zone lock and zone->node is
        accessed frequently from the page allocator due to the fair zone
        allocation policy.
      
      o span_seqlock is almost never used but shares a cache line with free_area
      
      o Some zone statistics share a cache line with the LRU lock so
        reclaim-intensive and allocator-intensive workloads can bounce the cache
        line on a stat update
      
      This patch rearranges struct zone to put read-only and read-mostly
      fields together and then splits the page allocator intensive fields, the
      zone statistics and the page reclaim intensive fields into their own
      cache lines.  Note that the type of lowmem_reserve changes due to the
      watermark calculations being signed and avoiding a signed/unsigned
      conversion there.
      
      On the test configuration I used, the overall size of struct zone shrank
      by one cache line.  On smaller machines, this is not likely to be
      noticeable.  However, on a 4-node NUMA machine running tiobench the
      system CPU overhead is reduced by this patch.
      
                3.16.0-rc3  3.16.0-rc3
                   vanilla  rearrange-v5r9
      User          746.94      759.78
      System      65336.22    58350.98
      Elapsed     27553.52    27282.02
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated · 24b7e581
      Mel Gorman authored
      This was formerly the series "Improve sequential read throughput" which
      noted some major differences in performance of tiobench since 3.0.
      While there are a number of factors, two that dominated were the
      introduction of the fair zone allocation policy and changes to CFQ.
      
      The behaviour of fair zone allocation policy makes more sense than
      tiobench as a benchmark and CFQ defaults were not changed due to
      insufficient benchmarking.
      
      This series is what's left.  It's one functional fix to the fair zone
      allocation policy when used on NUMA machines and a reduction of overhead
      in general.  tiobench was used for the comparison despite its flaws as
      an IO benchmark as in this case we are primarily interested in the
      overhead of page allocator and page reclaim activity.
      
      On UMA, it makes little difference to overhead
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          383.61      386.77
      System        403.83      401.74
      Elapsed      5411.50     5413.11
      
      On a 4-socket NUMA machine it's a bit more noticeable
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          746.94      802.00
      System      65336.22    40852.33
      Elapsed     27553.52    27368.46
      
      This patch (of 6):
      
      The LRU insertion and activate tracepoints take PFN as a parameter
      forcing the overhead to the caller.  Move the overhead to the tracepoint
      fast-assign method to ensure the cost is only incurred when the
      tracepoint is active.
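      
      A hedged sketch of the tracepoint shape after the change (boilerplate
      trimmed and the event name is illustrative): the struct page to PFN
      conversion happens in TP_fast_assign, which only runs while the
      tracepoint is enabled.
      
        TRACE_EVENT(mm_lru_insertion_sketch,
        
                TP_PROTO(struct page *page, int lru),
                TP_ARGS(page, lru),
        
                TP_STRUCT__entry(
                        __field(unsigned long, pfn)
                        __field(int, lru)
                ),
        
                TP_fast_assign(
                        /* cost paid only when the tracepoint is active */
                        __entry->pfn = page_to_pfn(page);
                        __entry->lru = lru;
                ),
        
                TP_printk("pfn=%lu lru=%d", __entry->pfn, __entry->lru)
        );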
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: trace-vmscan-postprocess.pl: report the number of file/anon pages respectively · 2c51856c
      Chen Yucong authored
      Until now, the reporting from trace-vmscan-postprocess.pl has not been
      very useful because we cannot directly use this script for checking the
      file/anon ratio of scanning.  This patch aims to report separately the
      number of file/anon pages which were scanned/reclaimed by kswapd or
      direct-reclaim.  Sample output is usually something like the following.
      
      Summary
      Direct reclaims:                          8823
      Direct reclaim pages scanned:             2438797
      Direct reclaim file pages scanned:        1315200
      Direct reclaim anon pages scanned:        1123597
      Direct reclaim pages reclaimed:           446139
      Direct reclaim file pages reclaimed:      378668
      Direct reclaim anon pages reclaimed:      67471
      Direct reclaim write file sync I/O:       0
      Direct reclaim write anon sync I/O:       0
      Direct reclaim write file async I/O:      0
      Direct reclaim write anon async I/O:      4240
      Wake kswapd requests:                     122310
      Time stalled direct reclaim:              13.78 seconds
      
      Kswapd wakeups:                           25817
      Kswapd pages scanned:                     170779115
      Kswapd file pages scanned:                162725123
      Kswapd anon pages scanned:                8053992
      Kswapd pages reclaimed:                   129065738
      Kswapd file pages reclaimed:              128500930
      Kswapd anon pages reclaimed:              564808
      Kswapd reclaim write file sync I/O:       0
      Kswapd reclaim write anon sync I/O:       0
      Kswapd reclaim write file async I/O:      36
      Kswapd reclaim write anon async I/O:      730730
      Time kswapd awake:                        1015.50 seconds
      Signed-off-by: Chen Yucong <slaoub@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update the description for vm_total_pages · d0480be4
      Wang Sheng-Hui authored
      vm_total_pages is calculated by nr_free_pagecache_pages(), which counts
      the number of pages which are beyond the high watermark within all
      zones.  So vm_total_pages is not equal to total number of pages which
      the VM controls.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c: don't forget to set softdirty on file mapped fault · 9aed8614
      Cyrill Gorcunov authored
      Otherwise we may not notice that the pte was softdirty, because the
      pte_mksoft_dirty helper _returns_ a new pte but doesn't modify its
      argument.
      
      If a page fault happens on a dirty file mapping, the newly created pte
      may lose the softdirty bit.  Thus, if a userspace program is tracking
      memory changes with the help of a memory tracker (CONFIG_MEM_SOFT_DIRTY),
      it might miss a modification of a memory page (which in the worst case
      may lead to data inconsistency).
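      
      A hedged one-line illustration of the pitfall (the pte construction is
      shown only for context):
      
        pte_t entry = mk_pte(page, vma->vm_page_prot);
        
        /* Wrong: pte_mksoft_dirty() returns a new pte and does not modify its
         * argument, so the result is silently dropped here ... */
        pte_mksoft_dirty(entry);
        
        /* ... right: keep the returned value. */
        entry = pte_mksoft_dirty(entry);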
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/firmware/memmap.c: don't allocate firmware_map_entry of same memory range · f0093ede
      Yasuaki Ishimatsu authored
      When memory is limited by mem= and the ACPI DSDT table has PNP0C80,
      firmware_map_entrys for the same memory range are allocated and memmap X
      sysfs entries which have the same memory range are created as follows:
      
        # cat /sys/firmware/memmap/0/*
        0x407ffffffff
        0x40000000000
        System RAM
        # cat /sys/firmware/memmap/33/*
        0x407ffffffff
        0x40000000000
        System RAM
        # cat /sys/firmware/memmap/35/*
        0x407ffffffff
        0x40000000000
        System RAM
      
      In this case, when memory is hot-removed, a kernel panic occurs, showing
      the following call trace:
      
        BUG: unable to handle kernel paging request at 00000001003e000b
        IP: sysfs_open_file+0x46/0x2b0
        PGD 203a89fe067 PUD 0
        Oops: 0000 [#1] SMP
        ...
        Call Trace:
          do_dentry_open+0x1ef/0x2a0
          finish_open+0x31/0x40
          do_last+0x57c/0x1220
          path_openat+0xc2/0x4c0
          do_filp_open+0x4b/0xb0
          do_sys_open+0xf3/0x1f0
          SyS_open+0x1e/0x20
          system_call_fastpath+0x16/0x1b
      
      The problem occurs as follows:
      
      When e820_reserve_resources() is called, firmware_map_entrys for the
      whole e820 memory map are allocated, and all firmware_map_entrys are
      added to the map_entries list as follows:
      
      map_entries
       -> +--- entry A --------+ -> ...
          | start 0x407ffffffff|
          | end   0x40000000000|
          | type  System RAM   |
          +--------------------+
      
      After that, if the ACPI DSDT table has PNP0C80 and the memory range is
      limited by mem=, the PNP0C80 is hot-added.  Then a firmware_map_entry
      for PNP0C80 is allocated and added to the map_entries list as follows:
      
      map_entries
       -> +--- entry A --------+ -> ... -> +--- entry B --------+
          | start 0x407ffffffff|           | start 0x407ffffffff|
          | end   0x40000000000|           | end   0x40000000000|
          | type  System RAM   |           | type  System RAM   |
          +--------------------+           +--------------------+
      
      Then memmap 0 sysfs for entry B is created.
      
      After that, firmware_memmap_init() creates memmap sysfs entries for all
      firmware_map_entrys in the map_entries list.  As a result, memmap 33
      sysfs for entry A and memmap 35 sysfs for entry B are created.  But the
      kobject of entry B has already been used by memmap 0 sysfs, so when
      memmap 35 sysfs is created, the kobject is broken.
      
      When memory is hot-removed, memmap 0 sysfs is destroyed and the kobject
      of memmap 0 sysfs is freed.  But the kobject can still be accessed via
      memmap 35 sysfs, so when memmap 35 sysfs is opened, a kernel panic
      occurs.
      
      This patch checks whether there is already a firmware_map_entry for the
      same memory range in the map_entries list, and does not allocate another
      firmware_map_entry for the same memory range.
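      
      A hedged sketch of the added check (firmware_map_find_entry() is the
      existing range lookup in drivers/firmware/memmap.c; the end-value
      convention and exact placement follow that file and are assumptions
      here):
      
        /* Skip allocation when an identical range/type is already on
         * map_entries, so a PNP0C80 hot-add cannot duplicate an e820 entry. */
        entry = firmware_map_find_entry(start, end - 1, type);
        if (entry)
                return 0;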
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/firmware/memmap.c: pass the correct argument to firmware_map_find_entry_bootmem() · 49c8b24d
      Yasuaki Ishimatsu authored
      firmware_map_add_hotplug() calls firmware_map_find_entry_bootmem() to
      get a free firmware_map_entry.  But the end argument is not correct, so
      firmware_map_find_entry_bootmem() cannot find the firmware_map_entry.
      
      The patch passes the correct end argument to firmware_map_find_entry_bootmem().
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: clean up map_vm_area third argument · f6f8ed47
      WANG Chao authored
      Currently map_vm_area() takes (struct page ***pages) as its third
      argument, and after mapping, it moves (*pages) to point to (*pages +
      nr_mapped_pages).
      
      It looks like this kind of increment is useless to its caller these
      days.  The callers don't care about the increments and actually they're
      trying to avoid this by passing another copy to map_vm_area().
      
      The caller can always guarantee that all the pages can be mapped into the
      vm_area specified in the first argument, and the caller only cares about
      whether map_vm_area() fails or not.
      
      This patch cleans up the pointer movement in map_vm_area() and updates
      its callers accordingly.
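      
      A hedged before/after of the prototype this implies (callers are
      simplified accordingly):
      
        /* Before: the extra indirection let map_vm_area() advance the
         * caller's cursor, which no caller relied on. */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages);
        
        /* After: pass the page array directly; only success/failure matters. */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages);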
      Signed-off-by: WANG Chao <chaowang@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make copy_pte_range static again · 21bda264
      Jerome Marchand authored
      Commit 71e3aac0 ("thp: transparent hugepage core") added the
      copy_pte_range prototype to huge_mm.h.  I'm not sure why (or whether)
      this function has ever been used outside of memory.c, but it currently
      isn't.  This patch makes copy_pte_range() static again.
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: remove hugetlb_zero and hugetlb_infinity · ed4d4902
      David Rientjes authored
      They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
      passing extra2 == NULL is equivalent to infinity.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: generalize writes to nr_hugepages · 238d3c13
      David Rientjes authored
      Three different interfaces alter the maximum number of hugepages for an
      hstate:
      
       - /proc/sys/vm/nr_hugepages for global number of hugepages of the default
         hstate,
      
       - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages for global number of
         hugepages for a specific hstate, and
      
       - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages/mempolicy for number of
         hugepages for a specific hstate over the set of allowed nodes.
      
      Generalize the code so that a single function handles all of these
      writes instead of duplicating the code in two different functions.
      
      This decreases the number of lines of code, but also reduces the size of
      .text by about half a percent since set_max_huge_pages() can be inlined.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hwpoison: fix race with changing page during offlining · f37d4298
      Andi Kleen authored
      When a hwpoison page is locked it could change state due to parallel
      modifications.  The original compound page can be torn down and then
      this 4k page becomes part of a differently-sized compound page or is a
      standalone regular page.
      
      Check after the lock if the page is still the same compound page.
      
      We could go back, grab the new head page and try again but it should be
      quite rare, so I thought this was safest.  A retry loop would be more
      difficult to test and may have more side effects.
      
      The hwpoison code by design only tries to handle cases that are
      reasonably common in workloads, as visible in page-flags.
      
      I'm not really that concerned about handling this (likely rare case),
      just not crashing on it.
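      
      A hedged sketch of the recheck after taking the page lock (error
      handling and the surrounding context are elided):
      
        lock_page(hpage);
        
        /* The compound page may have been torn down or rebuilt while we slept
         * on the lock; if this 4k page no longer belongs to the compound page
         * we started from, give up rather than act on stale state. */
        if (compound_head(p) != hpage) {
                unlock_page(hpage);
                return -EBUSY;
        }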
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hugetlb: simplify error handling in hugetlb_cow() · ad4404a2
      Davidlohr Bueso authored
      When returning from hugetlb_cow(), we always (1) put back the refcount
      for each referenced page -- always 'old', and 'new' if allocation was
      successful.  And (2) retake the page table lock right before returning,
      as the callers expect.  This logic can be simplified and encapsulated,
      as proposed in this patch.  In addition to cleaner code, we also shave a
      few bytes off the instruction text:
      
         text    data     bss     dec     hex filename
        28399     462   41328   70189   1122d mm/hugetlb.o-baseline
        28367     462   41328   70157   1120d mm/hugetlb.o-patched
      
      Passes libhugetlbfs testcases.
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hugetlb: make unmap_ref_private() return void · 2f4612af
      Davidlohr Bueso authored
      This function always returns 1, thus no need to check return value in
      hugetlb_cow().  By doing so, we can get rid of the unnecessary WARN_ON
      call.  While this logic perhaps existed as a way of identifying future
      unmap_ref_private() mishandling, in reality it serves no apparent
      purpose.
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace init_page_accessed by __SetPageReferenced · eb39d618
      Hugh Dickins authored
      Do we really need an exported alias for __SetPageReferenced()?  Its
      callers had better know what they're doing, in which case the page would
      not already be marked referenced.  Kill init_page_accessed() and just use
      __SetPageReferenced() inline.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>