1. 06 Jul, 2017 40 commits
    • Vlastimil Babka's avatar
      mm, page_alloc: fix more premature OOM due to race with cpuset update · 902b6281
      Vlastimil Babka authored
      I would like to stress that this patchset aims to fix issues and cleanup
      the code *within the existing documented semantics*, i.e.  patch 1
      ignores mempolicy restrictions if the set of allowed nodes has no
      intersection with set of nodes allowed by cpuset.  I believe discussing
      potential changes of the semantics can be better done once we have a
      baseline with no known bugs of the current semantics.
      
      I've recently summarized the cpuset/mempolicy issues in a LSF/MM
      proposal [1] and the discussion itself [2].  I've been trying to rewrite
      the handling as proposed, with the idea that changing semantics to make
      all mempolicies static wrt cpuset updates (and discarding the relative
      and default modes) can be tried on top, as there's a high risk of being
      rejected/reverted because somebody might still care about the removed
      modes.
      
      However I haven't yet figured out how to properly:
      
      1) make mempolicies swappable instead of rebinding in place. I thought
         mbind() already works that way and uses refcounting to avoid
         use-after-free of the old policy by a parallel allocation, but turns
         out true refcounting is only done for shared (shmem) mempolicies, and
         the actual protection for mbind() comes from mmap_sem. Extending the
         refcounting means more overhead in allocator hot path. Also swapping
         whole mempolicies means that we have to allocate the new ones, which
         can fail, and reverting of the partially done work also means
         allocating (note that mbind() doesn't care and will just leave part
         of the range updated and part not updated when returning -ENOMEM...).
      
      2) make cpuset's task->mems_allowed also swappable (after converting it
         from nodemask to zonelist, which is the easy part) for mostly the
         same reasons.
      
      The good news is that while trying to do the above, I've at least
      figured out how to hopefully close the remaining premature OOM's, and do
      a buch of cleanups on top, removing quite some of the code that was also
      supposed to prevent the cpuset update races, but doesn't work anymore
      nowadays.  This should fix the most pressing concerns with this topic
      and give us a better baseline before either proceeding with the original
      proposal, or pushing a change of semantics that removes the problem 1)
      above.  I'd be then fine with trying to change the semantic first and
      rewrite later.
      
      Patchset has been tested with the LTP cpuset01 stress test.
      
      [1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
      [2] https://lwn.net/Articles/717797/
      [3] https://marc.info/?l=linux-mm&m=149191957922828&w=2
      
      This patch (of 6):
      
      Commit e47483bc ("mm, page_alloc: fix premature OOM when racing with
      cpuset mems update") has fixed known recent regressions found by LTP's
      cpuset01 testcase.  I have however found that by modifying the testcase
      to use per-vma mempolicies via bind(2) instead of per-task mempolicies
      via set_mempolicy(2), the premature OOM still happens and the issue is
      much older.
      
      The root of the problem is that the cpuset's mems_allowed and
      mempolicy's nodemask can temporarily have no intersection, thus
      get_page_from_freelist() cannot find any usable zone.  The current
      semantic for empty intersection is to ignore mempolicy's nodemask and
      honour cpuset restrictions.  This is checked in node_zonelist(), but the
      racy update can happen after we already passed the check.  Such races
      should be protected by the seqlock task->mems_allowed_seq, but it
      doesn't work here, because 1) mpol_rebind_mm() does not happen under
      seqlock for write, and doing so would lead to deadlock, as it takes
      mmap_sem for write, while the allocation can have mmap_sem for read when
      it's taking the seqlock for read.  And 2) the seqlock cookie of callers
      of node_zonelist() (alloc_pages_vma() and alloc_pages_current()) is
      different than the one of __alloc_pages_slowpath(), so there's still a
      potential race window.
      
      This patch fixes the issue by having __alloc_pages_slowpath() check for
      empty intersection of cpuset and ac->nodemask before OOM or allocation
      failure.  If it's indeed empty, the nodemask is ignored and allocation
      retried, which mimics node_zonelist().  This works fine, because almost
      all callers of __alloc_pages_nodemask are obtaining the nodemask via
      node_zonelist().  The only exception is new_node_page() from hotplug,
      where the potential violation of nodemask isn't an issue, as there's
      already a fallback allocation attempt without any nodemask.  If there's
      a future caller that needs to have its specific nodemask honoured over
      task's cpuset restrictions, we'll have to e.g.  add a gfp flag for that.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-2-vbabka@suse.czSigned-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      902b6281
    • Punit Agrawal's avatar
      mm: rmap: use correct helper when poisoning hugepages · 5fd27b8e
      Punit Agrawal authored
      Using set_pte_at() does not do the right thing when putting down
      HWPOISON swap entries for hugepages on architectures that support
      contiguous ptes.
      
      Fix this problem by using set_huge_swap_pte_at() which was introduced to
      fix exactly this problem.
      
      Link: http://lkml.kernel.org/r/20170522133604.11392-7-punit.agrawal@arm.comSigned-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarSteve Capper <steve.capper@arm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5fd27b8e
    • Punit Agrawal's avatar
      mm/hugetlb: introduce set_huge_swap_pte_at() helper · e5251fd4
      Punit Agrawal authored
      set_huge_pte_at(), an architecture callback to populate hugepage ptes,
      does not provide the range of virtual memory that is targeted.  This
      leads to ambiguity when dealing with swap entries on architectures that
      support hugepages consisting of contiguous ptes.
      
      Fix the problem by introducing an overridable helper that is called when
      populating the page tables with swap entries.  The size of the targeted
      region is provided to the helper to help determine the number of entries
      to be updated.
      
      Provide a default implementation that maintains the current behaviour.
      
      [punit.agrawal@arm.com: v4]
        Link: http://lkml.kernel.org/r/20170524115409.31309-8-punit.agrawal@arm.com
      [punit.agrawal@arm.com: add an empty definition for set_huge_swap_pte_at()]
        Link: http://lkml.kernel.org/r/20170525171331.31469-1-punit.agrawal@arm.com
      Link: http://lkml.kernel.org/r/20170522133604.11392-6-punit.agrawal@arm.comSigned-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarSteve Capper <steve.capper@arm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e5251fd4
    • Punit Agrawal's avatar
      mm/hugetlb: allow architectures to override huge_pte_clear() · 9386fac3
      Punit Agrawal authored
      When unmapping a hugepage range, huge_pte_clear() is used to clear the
      page table entries that are marked as not present.  huge_pte_clear()
      internally just ends up calling pte_clear() which does not correctly
      deal with hugepages consisting of contiguous page table entries.
      
      Add a size argument to address this issue and allow architectures to
      override huge_pte_clear() by wrapping it in a #ifndef block.
      
      Update s390 implementation with the size parameter as well.
      
      Note that the change only affects huge_pte_clear() - the other generic
      hugetlb functions don't need any change.
      
      Link: http://lkml.kernel.org/r/20170522162555.4313-1-punit.agrawal@arm.comSigned-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	[s390 bits]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9386fac3
    • Punit Agrawal's avatar
      mm/hugetlb: add size parameter to huge_pte_offset() · 7868a208
      Punit Agrawal authored
      A poisoned or migrated hugepage is stored as a swap entry in the page
      tables.  On architectures that support hugepages consisting of
      contiguous page table entries (such as on arm64) this leads to ambiguity
      in determining the page table entry to return in huge_pte_offset() when
      a poisoned entry is encountered.
      
      Let's remove the ambiguity by adding a size parameter to convey
      additional information about the requested address.  Also fixup the
      definition/usage of huge_pte_offset() throughout the tree.
      
      Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.comSigned-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarSteve Capper <steve.capper@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE)
      Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS)
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7868a208
    • Punit Agrawal's avatar
      mm, gup: ensure real head page is ref-counted when using hugepages · d63206ee
      Punit Agrawal authored
      When speculatively taking references to a hugepage using
      page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the
      page returned by pmd_page() is the head page.  Although normally true,
      this assumption doesn't hold when the hugepage comprises of successive
      page table entries such as when using contiguous bit on arm64 at PTE or
      PMD levels.
      
      This can be addressed by ensuring that the page passed to
      page_cache_add_speculative() is the real head or by de-referencing the
      head page within the function.
      
      We take the first approach to keep the usage pattern aligned with
      page_cache_get_speculative() where users already pass the appropriate
      page, i.e., the de-referenced head.
      
      Apply the same logic to fix gup_huge_[pud|pgd]() as well.
      
      [punit.agrawal@arm.com: fix arm64 ltp failure]
        Link: http://lkml.kernel.org/r/20170619170145.25577-5-punit.agrawal@arm.com
      Link: http://lkml.kernel.org/r/20170522133604.11392-3-punit.agrawal@arm.comSigned-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarSteve Capper <steve.capper@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d63206ee
    • Will Deacon's avatar
      mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages · a3e32855
      Will Deacon authored
      When operating on hugepages with DEBUG_VM enabled, the GUP code checks
      the compound head for each tail page prior to calling
      page_cache_add_speculative.  This is broken, because on the fast-GUP
      path (where we don't hold any page table locks) we can be racing with a
      concurrent invocation of split_huge_page_to_list.
      
      split_huge_page_to_list deals with this race by using page_ref_freeze to
      freeze the page and force concurrent GUPs to fail whilst the component
      pages are modified.  This modification includes clearing the
      compound_head field for the tail pages, so checking this prior to a
      successful call to page_cache_add_speculative can lead to false
      positives: In fact, page_cache_add_speculative *already* has this check
      once the page refcount has been successfully updated, so we can simply
      remove the broken calls to VM_BUG_ON_PAGE.
      
      Link: http://lkml.kernel.org/r/20170522133604.11392-2-punit.agrawal@arm.comSigned-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3e32855
    • Steve Capper's avatar
      arm64: hugetlb: remove spurious calls to huge_ptep_offset() · f0b38d65
      Steve Capper authored
      We don't need to call huge_ptep_offset as our accessors are already
      supplied with the pte_t *.  This patch removes those spurious calls.
      
      [punit.agrawal@arm.com: resolve rebase conflicts due to patch re-ordering]
      Link: http://lkml.kernel.org/r/20170524115409.31309-3-punit.agrawal@arm.comSigned-off-by: default avatarSteve Capper <steve.capper@arm.com>
      Signed-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Cc: David Woods <dwoods@mellanox.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0b38d65
    • Steve Capper's avatar
      arm64: hugetlb: refactor find_num_contig() · bb9dd3df
      Steve Capper authored
      Patch series "Support for contiguous pte hugepages", v4.
      
      This patchset updates the hugetlb code to fix issues arising from
      contiguous pte hugepages (such as on arm64).  Compared to v3, This
      version addresses a build failure on arm64 by including two cleanup
      patches.  Other than the arm64 cleanups, the rest are generic code
      changes.  The remaining arm64 support based on these patches will be
      posted separately.  The patches are based on v4.12-rc2.  Previous
      related postings can be found at [0], [1], [2], and [3].
      
      The patches fall into three categories -
      
      * Patch 1-2 - arm64 cleanups required to greatly simplify changing
        huge_pte_offset() prototype in Patch 5.
      
        Catalin, Will - are you happy for these patches to go via mm?
      
      * Patches 3-4 address issues with gup
      
      * Patches 5-8 relate to passing a size argument to hugepage helpers to
        disambiguate the size of the referred page. These changes are
        required to enable arch code to properly handle swap entries for
        contiguous pte hugepages.
      
        The changes to huge_pte_offset() (patch 5) touch multiple
        architectures but I've managed to minimise these changes for the
        other affected functions - huge_pte_clear() and set_huge_pte_at().
      
      These patches gate the enabling of contiguous hugepages support on arm64
      which has been requested for systems using !4k page granule.
      
      The ARM64 architecture supports two flavours of hugepages -
      
      * Block mappings at the pud/pmd level
      
        These are regular hugepages where a pmd or a pud page table entry
        points to a block of memory. Depending on the PAGE_SIZE in use the
        following size of block mappings are supported -
      
                PMD	PUD
                ---	---
        4K:      2M	 1G
        16K:    32M
        64K:   512M
      
        For certain applications/usecases such as HPC and large enterprise
        workloads, folks are using 64k page size but the minimum hugepage size
        of 512MB isn't very practical.
      
      To overcome this ...
      
      * Using the Contiguous bit
      
        The architecture provides a contiguous bit in the translation table
        entry which acts as a hint to the mmu to indicate that it is one of a
        contiguous set of entries that can be cached in a single TLB entry.
      
        We use the contiguous bit in Linux to increase the mapping size at the
        pmd and pte (last) level.
      
        The number of supported contiguous entries varies by page size and
        level of the page table.
      
        Using the contiguous bit allows additional hugepage sizes -
      
                 CONT PTE    PMD    CONT PMD    PUD
                 --------    ---    --------    ---
          4K:         64K     2M         32M     1G
          16K:         2M    32M          1G
          64K:         2M   512M         16G
      
        Of these, 64K with 4K and 2M with 64K pages have been explicitly
        requested by a few different users.
      
      Entries with the contiguous bit set are required to be modified all
      together - which makes things like memory poisoning and migration
      impossible to do correctly without knowing the size of hugepage being
      dealt with - the reason for adding size parameter to a few of the
      hugepage helpers in this series.
      
      This patch (of 8):
      
      As we regularly check for contiguous pte's in the huge accessors, remove
      this extra check from find_num_contig.
      
      [punit.agrawal@arm.com: resolve rebase conflicts due to patch re-ordering]
      Link: http://lkml.kernel.org/r/20170524115409.31309-2-punit.agrawal@arm.comSigned-off-by: default avatarSteve Capper <steve.capper@arm.com>
      Signed-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Cc: David Woods <dwoods@mellanox.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb9dd3df
    • Naoya Horiguchi's avatar
      mm: drop NULL return check of pte_offset_map_lock() · 8bc3c3fe
      Naoya Horiguchi authored
      pte_offset_map_lock() finds and takes ptl, and returns pte.  But some
      callers return without unlocking the ptl when pte == NULL, which seems
      weird.
      
      Git history said that !pte check in change_pte_range() was introduced in
      commit 1ad9f620 ("mm: numa: recheck for transhuge pages under lock
      during protection changes") and still remains after commit 175ad4f1
      ("mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock")
      which partially reverts 1ad9f620.  So I think that it's just dead
      code.
      
      Many other caller of pte_offset_map_lock() never check NULL return, so
      let's do likewise.
      
      Link: http://lkml.kernel.org/r/1495089737-1292-1-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8bc3c3fe
    • Matthias Kaehlcke's avatar
      mm/page_alloc.c: mark bad_range() and meminit_pfn_in_nid() as __maybe_unused · d73d3c9f
      Matthias Kaehlcke authored
      The functions are not used in some configurations.  Adding the attribute
      fixes the following warnings when building with clang:
      
        mm/page_alloc.c:409:19: error: function 'bad_range' is not needed and
            will not be emitted [-Werror,-Wunneeded-internal-declaration]
      
        mm/page_alloc.c:1106:30: error: unused function 'meminit_pfn_in_nid'
            [-Werror,-Wunused-function]
      
      Link: http://lkml.kernel.org/r/20170518182030.165633-1-mka@chromium.orgSigned-off-by: default avatarMatthias Kaehlcke <mka@chromium.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d73d3c9f
    • Aneesh Kumar K.V's avatar
      powerpc/mm/hugetlb: add support for 1G huge pages · 40692eb5
      Aneesh Kumar K.V authored
      POWER9 supports hugepages of size 2M and 1G in radix MMU mode.  This
      patch enables the usage of 1G page size for hugetlbfs.  This also update
      the helper such we can do 1G page allocation at runtime.
      
      We still don't enable 1G page size on DD1 version.  This is to avoid
      doing workaround mentioned in commit 6d3a0379 ("powerpc/mm: Add
      radix__tlb_flush_pte_p9_dd1()").
      
      Link: http://lkml.kernel.org/r/1494995292-4443-2-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      40692eb5
    • Aneesh Kumar K.V's avatar
      mm/hugetlb: clean up ARCH_HAS_GIGANTIC_PAGE · e1073d1e
      Aneesh Kumar K.V authored
      This moves the #ifdef in C code to a Kconfig dependency.  Also we move
      the gigantic_page_supported() function to be arch specific.
      
      This allows architectures to conditionally enable runtime allocation of
      gigantic huge page.  Architectures like ppc64 supports different
      gigantic huge page size (16G and 1G) based on the translation mode
      selected.  This provides an opportunity for ppc64 to enable runtime
      allocation only w.r.t 1G hugepage.
      
      No functional change in this patch.
      
      Link: http://lkml.kernel.org/r/1494995292-4443-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1073d1e
    • Pavel Tatashin's avatar
      mm: adaptive hash table scaling · 9017217b
      Pavel Tatashin authored
      Allow hash tables to scale with memory but at slower pace, when
      HASH_ADAPT is provided every time memory quadruples the sizes of hash
      tables will only double instead of quadrupling as well.  This algorithm
      starts working only when memory size reaches a certain point, currently
      set to 64G.
      
      This is example of dentry hash table size, before and after four various
      memory configurations:
      
      MEMORY	   SCALE	 HASH_SIZE
      	old	new	old	new
          8G	 13	 13      8M      8M
         16G	 13	 13     16M     16M
         32G	 13	 13     32M     32M
         64G	 13	 13     64M     64M
        128G	 13	 14    128M     64M
        256G	 13	 14    256M    128M
        512G	 13	 15    512M    128M
       1024G	 13	 15   1024M    256M
       2048G	 13	 16   2048M    256M
       4096G	 13	 16   4096M    512M
       8192G	 13	 17   8192M    512M
      16384G	 13	 17  16384M   1024M
      32768G	 13	 18  32768M   1024M
      65536G	 13	 18  65536M   2048M
      
      The effect of this change on runtime is undetectable as filesystem
      growth is not proportional to machine memory size as is currently
      assumed.  The change effects only large memory machine.  Additional
      tuning might be needed, but that can be done by the clients of the
      kmem_cache_create interface, not the generic cache allocator itself.
      
      The adaptive hashing is disabled on 32 bit systems to avoid confusion of
      whether base should be different for smaller systems, and to avoid
      overflows.
      
      [mhocko@suse.com: drop HASH_ADAPT]
        Link: http://lkml.kernel.org/r/20170509094607.GG6481@dhcp22.suse.cz
      [pasha.tatashin@oracle.com: UL -> ULL fix]
        Link: http://lkml.kernel.org/r/1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com
      [pasha.tatashin@oracle.com: disable adaptive hash on 32 bit systems]
        Link: http://lkml.kernel.org/r/1495469329-755807-2-git-send-email-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/1488432825-92126-5-git-send-email-pasha.tatashin@oracle.comSigned-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Babu Moger <babu.moger@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9017217b
    • Pavel Tatashin's avatar
      mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin authored
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In case of
      places where HASH_EARLY was used such as in __pv_init_lock_hash the
      zeroed hash table was already assumed, because memblock zeroes the
      memory.
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d375d78
    • Pavel Tatashin's avatar
      mm: zero hash tables in allocator · 3749a8f0
      Pavel Tatashin authored
      Add a new flag HASH_ZERO which when provided grantees that the hash
      table that is returned by alloc_large_system_hash() is zeroed.  In most
      cases that is what is needed by the caller.  Use page level allocator's
      __GFP_ZERO flags to zero the memory.  It is using memset() which is
      efficient method to zero memory and is optimized for most platforms.
      
      Link: http://lkml.kernel.org/r/1488432825-92126-3-git-send-email-pasha.tatashin@oracle.comSigned-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3749a8f0
    • Aneesh Kumar K.V's avatar
      powerpc/hugetlb: enable hugetlb migration for ppc64 · f7fb506f
      Aneesh Kumar K.V authored
      Link: http://lkml.kernel.org/r/1494926612-23928-10-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7fb506f
    • Aneesh Kumar K.V's avatar
      powerpc/mm/hugetlb: remove follow_huge_addr for powerpc · 28c05716
      Aneesh Kumar K.V authored
      With generic code now handling hugetlb entries at pgd level and also
      supporting hugepage directory format, we can now remove the powerpc
      sepcific follow_huge_addr implementation.
      
      Link: http://lkml.kernel.org/r/1494926612-23928-9-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28c05716
    • Aneesh Kumar K.V's avatar
      powerpc/hugetlb: add follow_huge_pd implementation for ppc64 · 50791e6d
      Aneesh Kumar K.V authored
      Link: http://lkml.kernel.org/r/1494926612-23928-8-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      50791e6d
    • Aneesh Kumar K.V's avatar
      mm/follow_page_mask: add support for hugepage directory entry · 4dc71451
      Aneesh Kumar K.V authored
      Architectures like ppc64 supports hugepage size that is not mapped to
      any of of the page table levels.  Instead they add an alternate page
      table entry format called hugepage directory (hugepd).  hugepd indicates
      that the page table entry maps to a set of hugetlb pages.  Add support
      for this in generic follow_page_mask code.  We already support this
      format in the generic gup code.
      
      The default implementation prints warning and returns NULL.  We will add
      ppc64 support in later patches
      
      Link: http://lkml.kernel.org/r/1494926612-23928-7-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4dc71451
    • Aneesh Kumar K.V's avatar
      mm/hugetlb: move default definition of hugepd_t earlier in the header · e2299292
      Aneesh Kumar K.V authored
      This enable to use the hugepd_t type early.  No functional change in
      this patch.
      
      Link: http://lkml.kernel.org/r/1494926612-23928-6-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e2299292
    • Anshuman Khandual's avatar
      mm/follow_page_mask: add support for hugetlb pgd entries · faaa5b62
      Anshuman Khandual authored
      ppc64 supports pgd hugetlb entries.  Add code to handle hugetlb pgd
      entries to follow_page_mask so that ppc64 can switch to it to handle
      hugetlbe entries.
      
      Link: http://lkml.kernel.org/r/1494926612-23928-5-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      faaa5b62
    • Aneesh Kumar K.V's avatar
      mm/hugetlb: export hugetlb_entry_migration helper · d5ed7444
      Aneesh Kumar K.V authored
      We will be using this later from the ppc64 code.  Change the return type
      to bool.
      
      Link: http://lkml.kernel.org/r/1494926612-23928-4-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5ed7444
    • Aneesh Kumar K.V's avatar
      mm/follow_page_mask: split follow_page_mask to smaller functions. · 080dbb61
      Aneesh Kumar K.V authored
      Makes code reading easy.  No functional changes in this patch.  In a
      followup patch, we will be updating the follow_page_mask to handle
      hugetlb hugepd format so that archs like ppc64 can switch to the generic
      version.  This split helps in doing that nicely.
      
      Link: http://lkml.kernel.org/r/1494926612-23928-3-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      080dbb61
    • Aneesh Kumar K.V's avatar
      mm/hugetlb/migration: use set_huge_pte_at instead of set_pte_at · 383321ab
      Aneesh Kumar K.V authored
      Patch series "HugeTLB migration support for PPC64", v2.
      
      This patch (of 9):
      
      The right interface to use to set a hugetlb pte entry is set_huge_pte_at.
      Use that instead of set_pte_at.
      
      Link: http://lkml.kernel.org/r/1494926612-23928-2-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Kravetz <kravetz@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      383321ab
    • Anshuman Khandual's avatar
      mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level · 94310cbc
      Anshuman Khandual authored
      Though migrating gigantic HugeTLB pages does not sound much like real
      world use case, they can be affected by memory errors.  Hence migration
      at the PGD level HugeTLB pages should be supported just to enable soft
      and hard offline use cases.
      
      While allocating the new gigantic HugeTLB page, it should not matter
      whether new page comes from the same node or not.  There would be very
      few gigantic pages on the system afterall, we should not be bothered
      about node locality when trying to save a big page from crashing.
      
      This change renames dequeu_huge_page_node() function as dequeue_huge
      _page_node_exact() preserving it's original functionality.  Now the new
      dequeue_huge_page_node() function scans through all available online nodes
      to allocate a huge page for the NUMA_NO_NODE case and just falls back
      calling dequeu_huge_page_node_exact() for all other cases.
      
      [arnd@arndb.de: make hstate_is_gigantic() inline]
        Link: http://lkml.kernel.org/r/20170522124748.3911296-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170516100509.20122-1-khandual@linux.vnet.ibm.comSigned-off-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94310cbc
    • Mike Rapoport's avatar
      fs/userfaultfd.c: drop dead code · f93ae364
      Mike Rapoport authored
      Calculation of start end end in __wake_userfault function are not used
      and can be removed.
      
      Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f93ae364
    • Mike Rapoport's avatar
      kernel/exit.c: don't include unused userfaultfd_k.h · 57ecbd38
      Mike Rapoport authored
      Commit dd0db88d ("userfaultfd: non-cooperative: rollback
      userfaultfd_exit") removed userfaultfd callback from exit() which makes
      the include of <linux/userfaultfd_k.h> unnecessary.
      
      Link: http://lkml.kernel.org/r/1494930907-3060-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      57ecbd38
    • Michal Hocko's avatar
      mm, memory_hotplug: remove unused cruft after memory hotplug rework · 559bfc7d
      Michal Hocko authored
      zone_for_memory doesn't have any user anymore as well as the whole zone
      shifting infrastructure so drop them all.
      
      This shouldn't introduce any functional changes.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      559bfc7d
    • Michal Hocko's avatar
      mm, memory_hotplug: fix the section mismatch warning · cdf72f25
      Michal Hocko authored
      Tobias has reported following section mismatches introduced by "mm,
      memory_hotplug: do not associate hotadded memory to zones until online".
      
        WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit memmap_init_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of memmap_init_zone is wrong.
      
        WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit init_currently_empty_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of init_currently_empty_zone is wrong.
      
        WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit memmap_init_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of memmap_init_zone is wrong.
      
        WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit init_currently_empty_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of init_currently_empty_zone is wrong.
      
      Both memmap_init_zone and init_currently_empty_zone are marked __meminit
      but move_pfn_range_to_zone is used outside of __meminit sections (e.g.
      devm_memremap_pages) so we have to hide it from the checker by __ref
      annotation.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-14-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTobias Regnery <tobias.regnery@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cdf72f25
    • Michal Hocko's avatar
      mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory · 3d79a728
      Michal Hocko authored
      arch_add_memory gets for_device argument which then controls whether we
      want to create memblocks for created memory sections.  Simplify the
      logic by telling whether we want memblocks directly rather than going
      through pointless negation.  This also makes the api easier to
      understand because it is clear what we want rather than nothing telling
      for_device which can mean anything.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d79a728
    • Michal Hocko's avatar
      mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zone · c246a213
      Michal Hocko authored
      Heiko Carstens has noticed that he can generate overlapping zones for
      ZONE_DMA and ZONE_NORMAL:
      
        DMA      [mem 0x0000000000000000-0x000000007fffffff]
        Normal   [mem 0x0000000080000000-0x000000017fffffff]
      
        $ cat /sys/devices/system/memory/block_size_bytes
        10000000
        $ cat /sys/devices/system/memory/memory5/valid_zones
        DMA
        $ echo 0 > /sys/devices/system/memory/memory5/online
        $ cat /sys/devices/system/memory/memory5/valid_zones
        Normal
        $ echo 1 > /sys/devices/system/memory/memory5/online
        Normal
      
        $ cat /proc/zoneinfo
        Node 0, zone      DMA
        spanned  524288        <-----
        present  458752
        managed  455078
        start_pfn:           0 <-----
      
        Node 0, zone   Normal
        spanned  720896
        present  589824
        managed  571648
        start_pfn:           327680 <-----
      
      The reason is that we assume that the default zone for kernel onlining
      is ZONE_NORMAL.  This was a simplification introduced by the memory
      hotplug rework and it is easily fixable by checking the range overlap in
      the zone order and considering the first matching zone as the default
      one.  If there is no such zone then assume ZONE_NORMAL as we have been
      doing so far.
      
      Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
      Link: http://lkml.kernel.org/r/20170601083746.4924-3-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c246a213
    • Michal Hocko's avatar
      mm, memory_hotplug: fix MMOP_ONLINE_KEEP behavior · a69578a1
      Michal Hocko authored
      Heiko Carstens has noticed that the MMOP_ONLINE_KEEP is broken currently
      
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Normal Movable
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Normal Movable
      
        $ echo online_movable > memory34/state
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Movable
        memory35/valid_zones:Movable
        memory36/valid_zones:Movable
        memory37/valid_zones:Movable
      
        $ echo online > memory36/state
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Movable
        memory36/valid_zones:Normal
        memory37/valid_zones:Movable
      
      so we have effectively punched a hole into the movable zone.
      
      The problem is that move_pfn_range() check for MMOP_ONLINE_KEEP is
      wrong.  It only checks whether the given range is already part of the
      movable zone which is not the case here as only memory34 is in the zone.
      Fix this by using allow_online_pfn_range(..., MMOP_ONLINE_KERNEL) if
      that is false then we can be sure that movable onlining is the right
      thing to do.
      
      Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
      Link: http://lkml.kernel.org/r/20170601083746.4924-2-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a69578a1
    • Michal Hocko's avatar
      mm, memory_hotplug: do not associate hotadded memory to zones until online · f1dd2cd1
      Michal Hocko authored
      The current memory hotplug implementation relies on having all the
      struct pages associate with a zone/node during the physical hotplug
      phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
      vast majority of cases this means that they are added to ZONE_NORMAL.
      This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory
      hotadd without sparsemem") and it wasn't a big deal back then because
      movable onlining didn't exist yet.
      
      Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
      onlining 511c2aba ("mm, memory-hotplug: dynamic configure movable
      memory and portion memory") and then things got more complicated.
      Rather than reconsidering the zone association which was no longer
      needed (because the memory hotplug already depended on SPARSEMEM) a
      convoluted semantic of zone shifting has been developed.  Only the
      currently last memblock or the one adjacent to the zone_movable can be
      onlined movable.  This essentially means that the online type changes as
      the new memblocks are added.
      
      Let's simulate memory hot online manually
        $ echo 0x100000000 > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory32/valid_zones
        Normal Movable
      
        $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        $ echo online_movable > /sys/devices/system/memory/memory34/state
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable Normal
      
      This is an awkward semantic because an udev event is sent as soon as the
      block is onlined and an udev handler might want to online it based on
      some policy (e.g.  association with a node) but it will inherently race
      with new blocks showing up.
      
      This patch changes the physical online phase to not associate pages with
      any zone at all.  All the pages are just marked reserved and wait for
      the onlining phase to be associated with the zone as per the online
      request.  There are only two requirements
      
      	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
      
      	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
      
      the latter one is not an inherent requirement and can be changed in the
      future.  It preserves the current behavior and made the code slightly
      simpler.  This is subject to change in future.
      
      This means that the same physical online steps as above will lead to the
      following state: Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
      
      Implementation:
      The current move_pfn_range is reimplemented to check the above
      requirements (allow_online_pfn_range) and then updates the respective
      zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
      pfn range with the zone/node.  __add_pages is updated to not require the
      zone and only initializes sections in the range.  This allowed to
      simplify the arch_add_memory code (s390 could get rid of quite some of
      code).
      
      devm_memremap_pages is the only user of arch_add_memory which relies on
      the zone association because it only hooks into the memory hotplug only
      half way.  It uses it to associate the new memory with ZONE_DEVICE but
      doesn't allow it to be {on,off}lined via sysfs.  This means that this
      particular code path has to call move_pfn_range_to_zone explicitly.
      
      The original zone shifting code is kept in place and will be removed in
      the follow up patch for an easier review.
      
      Please note that this patch also changes the original behavior when
      offlining a memory block adjacent to another zone (Normal vs.  Movable)
      used to allow to change its movable type.  This will be handled later.
      
      [richard.weiyang@gmail.com: simplify zone_intersects()]
        Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
      [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
        Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
      [akpm@linux-foundation.org: remove unused local `i']
      Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1dd2cd1
    • Michal Hocko's avatar
      mm, vmstat: skip reporting offline pages in pagetypeinfo · d336e94e
      Michal Hocko authored
      pagetypeinfo_showblockcount_print skips over invalid pfns but it would
      report pages which are offline because those have a valid pfn.  Their
      migrate type is misleading at best.
      
      Now that we have pfn_to_online_page() we can use it instead of
      pfn_valid() and fix this.
      
      [mhocko@suse.com: fix build]
        Link: http://lkml.kernel.org/r/20170519072225.GA13041@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170515085827.16474-11-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarJoonsoo Kim <js1304@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d336e94e
    • Michal Hocko's avatar
      mm: __first_valid_page skip over offline pages · 2ce13640
      Michal Hocko authored
      __first_valid_page skips over invalid pfns in the range but it might
      still stumble over offline pages.  At least start_isolate_page_range
      will mark those set_migratetype_isolate.  This doesn't represent any
      immediate AFAICS because alloc_contig_range will fail to isolate those
      pages but it relies on not fully initialized page which will become a
      problem later when we stop associating offline pages to zones.  Use
      pfn_to_online_page to handle this.
      
      This is more a preparatory patch than a fix.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-10-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ce13640
    • Michal Hocko's avatar
      mm, compaction: skip over holes in __reset_isolation_suitable · ccbe1e4d
      Michal Hocko authored
      __reset_isolation_suitable walks the whole zone pfn range and it tries
      to jump over holes by checking the zone for each page.  It might still
      stumble over offline pages, though.  Skip those by checking
      pfn_to_online_page()
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-9-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ccbe1e4d
    • Michal Hocko's avatar
      mm: consider zone which is not fully populated to have holes · 2d070eab
      Michal Hocko authored
      __pageblock_pfn_to_page has two users currently, set_zone_contiguous
      which checks whether the given zone contains holes and
      pageblock_pfn_to_page which then carefully returns a first valid page
      from the given pfn range for the given zone.  This doesn't handle zones
      which are not fully populated though.  Memory pageblocks can be offlined
      or might not have been onlined yet.  In such a case the zone should be
      considered to have holes otherwise pfn walkers can touch and play with
      offline pages.
      
      Current callers of pageblock_pfn_to_page in compaction seem to work
      properly right now because they only isolate PageBuddy
      (isolate_freepages_block) or PageLRU resp.  __PageMovable
      (isolate_migratepages_block) which will be always false for these pages.
      It would be safer to skip these pages altogether, though.
      
      In order to do this patch adds a new memory section state
      (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
      in online_pages_range during the memory hotplug.  Similarly
      offline_mem_sections clears the bit and it is called when the memory
      range is offlined.
      
      pfn_to_online_page helper is then added which check the mem section and
      only returns a page if it is onlined already.
      
      Use the new helper in __pageblock_pfn_to_page and skip the whole page
      block in such a case.
      
      [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
       mark sections online after all struct pages are initialized in
       online_pages_range (Vlastimil)]
        Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d070eab
    • Michal Hocko's avatar
      mm, memory_hotplug: consider offline memblocks removable · 8b0662f2
      Michal Hocko authored
      is_pageblock_removable_nolock() relies on having zone association to
      examine all the page blocks to check whether they are movable or free.
      This is just wasting of cycles when the memblock is offline.  Later
      patch in the series will also change the time when the page is
      associated with a zone so we let's bail out early if the memblock is
      offline.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-7-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b0662f2
    • Michal Hocko's avatar
      mm, memory_hotplug: split up register_one_node() · 9037a993
      Michal Hocko authored
      Memory hotplug (add_memory_resource) has to reinitialize node
      infrastructure if the node is offline (one which went through the
      complete add_memory(); remove_memory() cycle).  That involves node
      registration to the kobj infrastructure (register_node), the proper
      association with cpus (register_cpu_under_node) and finally creation of
      node<->memblock symlinks (link_mem_sections).
      
      The last part requires to know node_start_pfn and node_spanned_pages
      which we currently have but a leter patch will postpone this
      initialization to the onlining phase which happens later.  In fact we do
      not need to rely on the early pgdat initialization even now because the
      currently hot added pfn range is currently known.
      
      Split register_one_node into core which does all the common work for the
      boot time NUMA initialization and the hotplug (__register_one_node).
      register_one_node keeps the full initialization while hotplug calls
      __register_one_node and manually calls link_mem_sections for the proper
      range.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9037a993