1. 28 Dec, 2018 40 commits
    • mm/mmu_notifier: use structure for invalidate_range_start/end callback · 5d6527a7
      Jérôme Glisse authored
      Patch series "mmu notifier contextual informations", v2.
      
      This patchset adds contextual information (why an invalidation is
      happening) to the mmu notifier callbacks.  This is necessary for users of
      mmu notifiers that wish to maintain their own data structures without
      having to add new fields to struct vm_area_struct (vma).
      
      For instance, a device can have its own page table that mirrors the
      process address space.  When a vma is unmapped (munmap() syscall), the
      device driver can free the device page table for the range.
      
      Today we do not have any information on why a mmu notifier callback is
      happening, and thus device drivers have to assume that it is always an
      munmap().  This is inefficient, as it means that the driver needs to
      re-allocate the device page table on the next page fault and rebuild the
      whole device driver data structure for the range.
      
      Other use cases besides munmap() also exist.  For instance, it is
      pointless for a device driver to invalidate the device page table when
      the invalidation is for soft dirtiness tracking.  A device driver can
      also optimize away an mprotect() that changes the page table access
      permissions for the range.
      
      This patchset enables all these optimizations for device drivers.  I do
      not include any of those in this series, but another patchset I am
      posting will leverage this.
      
      The patchset is pretty simple from a code point of view.  The first two
      patches consolidate all mmu notifier arguments into a struct so that it is
      easier to add/change arguments.  The last patch adds the contextual
      information (munmap, protection, soft dirty, clear, ...).
      
      This patch (of 3):
      
      To avoid having to change many callback definitions every time we want
      to add a parameter, use a structure to group all parameters for the
      mmu_notifier invalidate_range_start/end callbacks.  No functional changes
      with this patch.
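
As a rough illustration of the refactoring described above, the following user-space sketch groups the callback arguments into one structure. The names used here (`range_args`, `notifier_ops`, `count_pages`, `noop_end`, `run_invalidate`) are hypothetical stand-ins and not the kernel's actual definitions; the sketch only shows that once callbacks take a single pointer, a field can be added later without touching any callback signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the grouped callback parameters. */
struct range_args {
	unsigned long start;	/* first address of the range */
	unsigned long end;	/* one past the last address */
	bool blockable;		/* may the callback sleep? */
};

/* Callbacks take one pointer, so new fields need no signature change. */
struct notifier_ops {
	int (*invalidate_range_start)(const struct range_args *args);
	void (*invalidate_range_end)(const struct range_args *args);
};

static unsigned long pages_seen;

/* Example callback: counts 4 KiB pages covered by the range. */
static int count_pages(const struct range_args *args)
{
	assert(args->start <= args->end);
	pages_seen += (args->end - args->start) / 4096;
	return 0;
}

static void noop_end(const struct range_args *args)
{
	(void)args;
}

/* The caller builds the structure once and hands it to both callbacks. */
static int run_invalidate(const struct notifier_ops *ops,
			  unsigned long start, unsigned long end)
{
	struct range_args args = {
		.start = start,
		.end = end,
		.blockable = true,
	};
	int ret = ops->invalidate_range_start(&args);

	ops->invalidate_range_end(&args);
	return ret;
}
```

Adding, say, a reason field to `range_args` would only require changes in the callbacks that care about it.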
      
      [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
      Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined · b15c8726
      Michal Hocko authored
      We have received a bug report that an injected MCE about faulty memory
      prevents memory offline from succeeding on a 4.4-based kernel.  The
      underlying reason was that the HWPoison page has an elevated reference
      count and the migration keeps failing.  There are two problems with that.
      First of all, it is dubious to migrate the poisoned page because we know
      that accessing that memory may fail.  Secondly, it doesn't make any sense
      to migrate potentially broken content and preserve the memory corruption
      over to a new location.
      
      Oscar has found out that 4.4 and the current upstream kernels behave
      slightly differently with his simple testcase
      
      ===
      
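      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <unistd.h>
      
      /* PAGE_ALIGN() is not a userspace macro; this assumes 4 KiB pages. */
      #define PAGE_ALIGN(addr) (((addr) + 4095UL) & ~4095UL)
      
      int main(void)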
      {
              int ret;
              int i;
              int fd;
              char *array = malloc(4096);
              char *array_locked = malloc(4096);
      
              fd = open("/tmp/data", O_RDONLY);
              read(fd, array, 4095);
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
              if (ret)
                      perror("mlock");
      
              sleep (20);
      
              ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
              if (ret)
                      perror("madvise");
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              return 0;
      }
      ===
      
      + offline this memory.
      
      In 4.4 kernels he saw the hwpoisoned page being returned to the LRU
      list:
      kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
      kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
      kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
      kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
      kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
      kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
      kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
      kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
      kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
      kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
      kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
      kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
      kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
      kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
      kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
      kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
      kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
      kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
      kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
      kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80
      
      The latter confuses the hotremove path because it attempts to migrate an
      LRU page, and that fails due to an elevated reference count.  It is quite
      possible that the reuse of the HWPoisoned page is some kind of fixed race
      condition, but I am not really sure about that.
      
      With the upstream kernel the failure is slightly different.  The page
      doesn't seem to have the LRU bit set, but isolate_movable_page simply
      fails and do_migrate_range simply puts all the isolated pages back on the
      LRU.  Therefore no progress is made and scan_movable_pages finds the same
      set of pages over and over again.
      
      Fix both cases by explicitly checking for HWPoisoned pages before we even
      try to get a reference on the page, and try to unmap the page if it is
      still mapped.  As explained by Naoya:
      
      : Hwpoison code never unmapped those for no big reason because
      : Ksm pages never dominate memory, so we simply didn't have strong
      : motivation to save the pages.
      
      Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
      HWPoison pages, which shouldn't happen, but I couldn't convince myself
      about that.  Naoya has noted the following:
      
      : Theoretically no such guarantee, because try_to_unmap() doesn't have a
      : guarantee of success and then memory_failure() returns immediately
      : when hwpoison_user_mappings fails.
      : Or the following code (comes after hwpoison_user_mappings block) also
      : implies that the target page can still have PageLRU flag.
      :
      :         /*
      :          * Torn down by someone else?
      :          */
      :         if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
      :                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
      :                 res = -EBUSY;
      :                 goto out;
      :         }
      :
      : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
      : current version of your patch.
      
      Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.com>
      Debugged-by: Oscar Salvador <osalvador@suse.com>
      Tested-by: Oscar Salvador <osalvador@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kmemleak: little optimization while scanning · 9f1eb38e
      Oscar Salvador authored
      kmemleak_scan() goes through all online nodes and tries to scan all used
      pages.
      
      We can do better and use pfn_to_online_page(), so in case we have
      CONFIG_MEMORY_HOTPLUG, offlined pages will be skipped automatically.  For
      boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page()
      will fall back to pfn_valid().
      
      Another little optimization is to check if the page belongs to the node we
      are currently checking, so in case we have nodes interleaved we will not
      check the same pfn multiple times.
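
The two filters described above can be modelled in user space. The sketch below is not the kmemleak_scan() code; `fake_page` and `pages_to_scan` are hypothetical names, and the `online` flag merely stands in for what pfn_to_online_page() would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page frame for this sketch. */
struct fake_page {
	bool online;	/* false models an offlined memory block */
	int nid;	/* node that owns the page */
};

/* Count the pages a per-node scan would actually visit. */
static size_t pages_to_scan(const struct fake_page *pages,
			    size_t start_pfn, size_t end_pfn, int nid)
{
	size_t pfn, visited = 0;

	assert(start_pfn <= end_pfn);
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!pages[pfn].online)
			continue;	/* offlined: nothing to scan */
		if (pages[pfn].nid != nid)
			continue;	/* owned by an interleaved node */
		visited++;
	}
	return visited;
}
```

With interleaved nodes, the node check is what keeps a pfn inside another node's span from being scanned once per overlapping node.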
      
      I ran some tests:
      
      Add some memory to node1 and node2 making it interleaved:
      
      (qemu) object_add memory-backend-ram,id=ram0,size=1G
      (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
      (qemu) object_add memory-backend-ram,id=ram1,size=1G
      (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2
      (qemu) object_add memory-backend-ram,id=ram2,size=1G
      (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1
      
      Then, we offline that memory:
       # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
       # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
       # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done
      
      And we run kmemleak_scan:
      
       # echo "scan" > /sys/kernel/debug/kmemleak
      
      before the patch:
      
      kmemleak: time spend: 41596 us
      
      after the patch:
      
      kmemleak: time spend: 34899 us
      
      [akpm@linux-foundation.org: remove stray newline, per Oscar]
      Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/ioremap: ensure break-before-make is used for huge p4d mappings · 8e2d4340
      Will Deacon authored
      Whilst no architectures actually enable support for huge p4d mappings in
      the vmap area, the code that is implemented should be using
      break-before-make, as we do for pud and pmd huge entries.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-6-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/ioremap: ensure phys_addr actually corresponds to a physical address · 36ddc5a7
      Will Deacon authored
      The current ioremap() code uses a phys_addr variable at each level of page
      table, which is confusingly offset by subtracting the base virtual address
      being mapped so that adding the current virtual address back on when
      iterating through the page table entries gives back the corresponding
      physical address.
      
      This is fairly confusing and results in all users of phys_addr having to
      add the current virtual address back on.  Instead, this patch just updates
      phys_addr when iterating over the page table entries, ensuring that it's
      always up-to-date and doesn't require explicit offsetting.
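
To make the difference concrete, here is a small user-space sketch of the two bookkeeping styles. It is not the lib/ioremap.c code: the function names and the fixed 4 KiB `PAGE_SIZE` are assumptions for illustration. Both loops compute the same physical address for the last page, but only the second keeps `phys_addr` meaningful at every step.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/* Old style: phys_addr is pre-offset by the virtual base, so every use
 * has to add the current virtual address back on. */
static uint64_t last_phys_offset_style(uint64_t addr, uint64_t end,
				       uint64_t phys_addr)
{
	uint64_t last = 0;

	assert(addr <= end);
	phys_addr -= addr;	/* no longer a real physical address */
	for (; addr < end; addr += PAGE_SIZE)
		last = phys_addr + addr;
	return last;
}

/* New style: phys_addr advances with the iteration and is always the
 * physical address of the page being mapped. */
static uint64_t last_phys_updated_style(uint64_t addr, uint64_t end,
					uint64_t phys_addr)
{
	uint64_t last = 0;

	assert(addr <= end);
	for (; addr < end; addr += PAGE_SIZE, phys_addr += PAGE_SIZE)
		last = phys_addr;
	return last;
}
```

The first variant relies on unsigned wrap-around whenever the virtual address is larger than the physical one, which is part of what makes it hard to read.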
      
      Link: http://lkml.kernel.org/r/1544120495-17438-5-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/pgtable: drop pXd_none() checks from pXd_free_pYd_table() · 48e178ab
      Will Deacon authored
      The core code already has a check for pXd_none(), so remove it from the
      architecture implementation.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-4-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arm64: mmu: drop pXd_present() checks from pXd_free_pYd_table() · 9c006972
      Will Deacon authored
      The core code already has a check for pXd_none(), so remove it from the
      architecture implementation.
      
      Link: http://lkml.kernel.org/r/1544120495-17438-3-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ioremap: rework pXd_free_pYd_page() API · d239865a
      Will Deacon authored
      The recently merged API for ensuring break-before-make on page-table
      entries when installing huge mappings in the vmalloc/ioremap region is
      fairly counter-intuitive, resulting in the arch freeing functions (e.g.
      pmd_free_pte_page()) being called even on entries that aren't present.
      This resulted in a minor bug in the arm64 implementation, giving rise to
      spurious VM_WARN messages.
      
      This patch moves the pXd_present() checks out into the core code,
      refactoring the callsites at the same time so that we avoid the complex
      conjunctions when determining whether or not we can put down a huge
      mapping.
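
A minimal user-space model of the reworked control flow might look as follows. All names here (`entry`, `arch_free_table`, `try_huge`) and the 2 MiB `HUGE_SIZE` are hypothetical; the sketch only illustrates the idea that the generic caller checks for presence, so the arch hook is never invoked on an entry that isn't present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HUGE_SIZE (2UL * 1024 * 1024)	/* assumed huge mapping size */

/* Hypothetical model of one page-table entry. */
struct entry {
	bool present;	/* points to a lower-level table */
	bool huge;	/* maps a huge block directly */
	uint64_t phys;
};

static unsigned int arch_free_calls;

/* Stand-in for an arch freeing hook; it may assume a present entry. */
static bool arch_free_table(struct entry *e)
{
	assert(e->present);
	arch_free_calls++;
	e->present = false;
	return true;
}

/* Generic caller: simple early returns instead of one long conjunction. */
static bool try_huge(struct entry *e, uint64_t addr, uint64_t end,
		     uint64_t phys)
{
	if (end - addr != HUGE_SIZE)
		return false;
	if (phys % HUGE_SIZE)
		return false;
	/* The presence check lives here, not in the arch hook. */
	if (e->present && !arch_free_table(e))
		return false;
	e->huge = true;
	e->phys = phys;
	return true;
}
```

In this model an empty entry goes straight to the huge mapping, while an entry with a lower-level table has that table released first.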
      
      Link: http://lkml.kernel.org/r/1544120495-17438-2-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Kirill Tkhai's avatar
    • /proc/kpagecount: return 0 for special pages that are never mapped · 144552ff
      Anthony Yznaga authored
      Certain pages that are never mapped to userspace have a type indicated in
      the page_type field of their struct pages (e.g.  PG_buddy).  page_type
      overlaps with _mapcount so set the count to 0 and avoid calling
      page_mapcount() for these pages.
      
      [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox]
        Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com
      Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com
      Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/vm/page-types.c: fix "kpagecount returned fewer pages than expected" failures · b6fb87b8
      Anthony Yznaga authored
      Because kpagecount_read() fakes success if map counts are not being
      collected, clamp the page count passed to it by walk_pfn() to the pages
      value returned by the preceding call to kpageflags_read().
      
      Link: http://lkml.kernel.org/r/1543962269-26116-1-git-send-email-anthony.yznaga@oracle.com
      Fixes: 7f1d23e6 ("tools/vm/page-types.c: include shared map counts")
      Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: drop uneeded __meminit and __meminitdata · bbe5d993
      Oscar Salvador authored
      Since commit 03e85f9d ("mm/page_alloc: Introduce
      free_area_init_core_hotplug"), some functions changed to only be called
      during system initialization.  Concretely, these are
      free_area_init_node() and the functions that hang from it.
      
      Also, some variables are no longer used after the system has gone
      through initialization.  So this could be considered as a late clean-up
      for that patch.
      
      This patch changes the functions from __meminit to __init, and the
      variables from __meminitdata to __initdata.
      
      In return, we get some KBs back:
      
      Before:
        Freeing unused kernel image memory: 2472K
      
      After:
        Freeing unused kernel image memory: 2480K
      
      Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: don't break integrity writeback on ->writepage() error · 3fa750dc
      Brian Foster authored
      write_cache_pages() is used in both background and integrity writeback
      scenarios by various filesystems.  Background writeback is mostly
      concerned with cleaning a certain number of dirty pages based on various
      mm heuristics.  It may not write the full set of dirty pages or wait for
      I/O to complete.  Integrity writeback is responsible for persisting a set
      of dirty pages before the writeback job completes.  For example, an
      fsync() call must perform integrity writeback to ensure data is on disk
      before the call returns.
      
      write_cache_pages() unconditionally breaks out of its processing loop in
      the event of a ->writepage() error.  This is fine for background
      writeback, which has no strict requirements and will eventually come
      around again.  This can cause problems for integrity writeback on
      filesystems that might need to clean up state associated with failed page
      writeouts.  For example, XFS performs internal delayed allocation
      accounting before returning a ->writepage() error, where applicable.  If
      the current writeback happens to be associated with an unmount and
      write_cache_pages() completes the writeback prematurely due to error, the
      filesystem is unmounted in an inconsistent state if dirty+delalloc pages
      still exist.
      
      To handle this problem, update write_cache_pages() to always process the
      full set of pages for integrity writeback regardless of ->writepage()
      errors.  Save the first encountered error and return it to the caller once
      complete.  This facilitates XFS (or any other fs that expects integrity
      writeback to process the entire set of dirty pages) to clean up its
      internal state completely in the event of persistent mapping errors.
      Background writeback continues to exit on the first error encountered.
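
The error handling described above can be sketched in user space as follows. This is a simplified model and not the write_cache_pages() implementation; `write_pages`, `fail_on_odd`, the `attempted` counter and the use of -5 as a stand-in for -EIO are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page writer: returns 0 or a negative error code. */
typedef int (*writepage_fn)(size_t index);

static size_t attempted;	/* how many pages the loop tried to write */

static int write_pages(size_t nr_pages, writepage_fn writepage,
		       bool integrity)
{
	int first_err = 0;
	size_t i;

	assert(writepage != NULL);
	for (i = 0; i < nr_pages; i++) {
		int err = writepage(i);

		attempted++;
		if (!err)
			continue;
		if (!first_err)
			first_err = err;	/* keep only the first failure */
		if (!integrity)
			break;			/* background writeback stops here */
	}
	return first_err;
}

/* Example writer that fails on every odd page index. */
static int fail_on_odd(size_t index)
{
	return (index % 2) ? -5 : 0;	/* -5 stands in for -EIO */
}
```

Both modes report the same first error; the difference is that the integrity case still visits every remaining page, which gives the filesystem a chance to clean up per-page state.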
      
      [akpm@linux-foundation.org: fix typo in comment]
      Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/show_mem.c: drop pgdat_resize_lock in show_mem() · c3a5c77a
      Wei Yang authored
      show_mem() is used to print the system memory status when the user
      requests it or when a memory allocation fails.  Generally, this is
      best-effort information, so any races with memory hotplug (or, very
      theoretically, an early initialization) should be tolerable, and the
      worst that could happen is printing an imprecise node state.
      
      Drop the resize lock because this is the only place which might hold the
      lock from interrupt context, so all other callers might use a simple
      spinlock.  Even though this doesn't solve any real issue, it makes the
      code easier to follow and a tiny bit more efficient.
      
      Link: http://lkml.kernel.org/r/20181129235532.9328-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hmm.c: remove set but not used variable 'devmem' · 0ecea993
      YueHaibing authored
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      mm/hmm.c: In function 'hmm_devmem_ref_kill':
      mm/hmm.c:995:21: warning:
       variable 'devmem' set but not used [-Wunused-but-set-variable]
      
      It is not used any more since 35d39f953d4e ("mm, hmm: replace
      hmm_devmem_pages_create() with devm_memremap_pages()").
      
      Link: http://lkml.kernel.org/r/1543629971-128377-1-git-send-email-yuehaibing@huawei.com
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection · fa004ab7
      Wei Yang authored
      During the online_pages phase, pgdat->nr_zones will be updated in case
      this zone is empty.
      
      Currently the online_pages phase is protected by the global locks
      (device_hotplug_lock and mem_hotplug_lock), which ensures there is no
      contention during the update of nr_zones.
      
      These global locks introduce scalability issues (especially the second
      one), which slow down code relying on get_online_mems().  This is also a
      preparation for not having to rely on get_online_mems() but instead on
      some more fine grained locks.
      
      The patch moves init_currently_empty_zone under both zone_span_writelock
      and pgdat_resize_lock because both the pgdat state (nr_zones) and the
      zone's start_pfn are changed.  Also this patch changes the documentation
      of node_size_lock to include the protection of nr_zones.
      
      Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, sparse: pass nid instead of pgdat to sparse_add_one_section() · 4e0d2e7e
      Wei Yang authored
      Since the only information sparse_add_one_section() needs is the node id,
      used to allocate memory from the proper node, it is not necessary to pass
      its pgdat.
      
      This patch changes the prototype of sparse_add_one_section() to pass the
      node id directly.  This avoids the misleading impression that
      sparse_add_one_section() would touch pgdat.
      
      Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section() · 83af6588
      Wei Yang authored
      pgdat_resize_lock is used to protect pgdat's memory region information
      like: node_start_pfn, node_present_pages, etc.  In
      sparse_add/remove_one_section(), however, pgdat_resize_lock is used to
      protect initialization/release of one mem_section.  This does not look
      proper.
      
      These code paths are currently protected by mem_hotplug_lock, but should
      there ever be any reason for locking at the sparse layer, a dedicated
      lock should be introduced.
      
      Following is the current call trace of sparse_add/remove_one_section()
      
          mem_hotplug_begin()
          arch_add_memory()
             add_pages()
                 __add_pages()
                     __add_section()
                         sparse_add_one_section()
          mem_hotplug_done()
      
          mem_hotplug_begin()
          arch_remove_memory()
              __remove_pages()
                  __remove_section()
                      sparse_remove_one_section()
          mem_hotplug_done()
      
      The comment above the pgdat_resize_lock also mentions "Holding this will
      also guarantee that any pfn_valid() stays that way.", which is true with
      the current implementation and false after this patch.  But the current
      code does not actually rely on this guarantee: there aren't any pfn
      walkers that take the lock, so this looks like a relic from the past.
      This patch also removes this comment.
      
      [richard.weiyang@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com
      [mhocko@suse.com: changelog suggestion]
      Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove pte_lock_deinit() · 9e247bab
      Yu Zhao authored
      A pagetable page doesn't touch page->mapping or have any used field that
      overlaps with it.  There is no need to clear mapping in the dtor.  In
      fact, doing so might mask problems that otherwise would be detected by
      bad_page().
      
      Link: http://lkml.kernel.org/r/20181128235525.58780-1-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e247bab
    • Minchan Kim's avatar
      zram: writeback throttle · bb416d18
      Minchan Kim authored
      If there is a lot of write IO to a flash device, it could cause a
      wearout problem for the storage.  To overcome the problem, the admin needs
      to design a write limit that guarantees flash health
      for the entire product life.
      
      This patch creates a new knob "writeback_limit" for zram.
      
      writeback_limit's default value is 0 so that it doesn't limit
      any writeback.  If the admin wants to measure the writeback count in a
      certain period, they can read it from the 3rd column of
      /sys/block/zram0/bd_stat.
      
      If the admin wants to limit writeback to 400M per day, they can do it
      like below.
      
      	MB_SHIFT=20
      	PAGE_4K_SHIFT=12
      	echo $((400 << MB_SHIFT >> PAGE_4K_SHIFT)) > \
      		/sys/block/zram0/writeback_limit
      
      If the admin wants to allow further writes again, they can do it like below.
      
      	echo 0 > /sys/block/zram0/writeback_limit
      
      If the admin wants to see the remaining writeback budget:
      
      	cat /sys/block/zram0/writeback_limit
      
      The writeback_limit count is reset whenever you reset zram (e.g., system
      reboot, echo 1 > /sys/block/zramX/reset), so it is the user's job to keep
      track of how much writeback happened before the reset in order to allocate
      the extra writeback budget in the next setting.
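      
      As a rough illustration of the arithmetic in the shell example above, the
      conversion from a budget in MiB to the value written to the knob can be
      sketched in C.  The function and macro names are hypothetical and are not
      part of the patch; the sketch assumes, as the shell example implies, that
      writeback_limit is counted in 4K units.

```c
#include <stdint.h>

/* Assumption for this sketch: the knob counts 4K units (shift of 12). */
#define SKETCH_MB_SHIFT   20
#define SKETCH_PAGE_SHIFT 12

/* Convert a budget in MiB into the number of 4K units for writeback_limit. */
uint64_t writeback_limit_pages(uint64_t budget_mib)
{
	return (budget_mib << SKETCH_MB_SHIFT) >> SKETCH_PAGE_SHIFT;
}
```

      With a 400 MiB budget this yields the same number the shell arithmetic
      computes.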
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb416d18
    • Minchan Kim's avatar
      zram: add bd_stat statistics · 23eddf39
      Minchan Kim authored
      bd_stat represents things that happened in the backing device.  Currently
      it supports bd_counts, bd_reads and bd_writes which are helpful to
      understand wearout of flash and memory saving.
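      
      A userspace reader of these counters might look like the following
      minimal sketch.  It assumes bd_stat is a single line of three
      whitespace-separated unsigned counters in the order named above; the
      struct and function names are hypothetical, and the exact file layout
      should be checked against the zram documentation.

```c
#include <stdio.h>

/* Hypothetical container for the three counters named in the changelog. */
struct bd_stat_sketch {
	unsigned long long count;
	unsigned long long reads;
	unsigned long long writes;
};

/* Parse one bd_stat line; returns 0 on success, -1 if a column is missing. */
int parse_bd_stat(const char *line, struct bd_stat_sketch *out)
{
	int n = sscanf(line, "%llu %llu %llu",
		       &out->count, &out->reads, &out->writes);

	return n == 3 ? 0 : -1;
}
```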
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23eddf39
    • Minchan Kim's avatar
      zram: support idle/huge page writeback · a939888e
      Minchan Kim authored
      Add a new feature "zram idle/huge page writeback".  In the zram-swap use
      case, zram usually has many idle/huge swap pages.  It's pointless to keep
      them in memory (ie, zram).
      
      To solve this problem, this feature introduces idle/huge page writeback to
      the backing device so the goal is to save more memory space on embedded
      systems.
      
      Normal sequence to use idle/huge page writeback feature is as follows,
      
      while (1) {
              # mark all allocated zram slots as idle
              echo all > /sys/block/zram0/idle
              # leave the system working for several hours;
              # blocks on zram that see no access in the meantime
              # are still marked IDLE
      
              echo "idle" > /sys/block/zram0/writeback
              and/or
              echo "huge" > /sys/block/zram0/writeback
              # write the IDLE and/or huge marked slots to the backing device
              # and free the memory
      }
      
      Per the discussion at
      https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
      this patch removes the direct incompressible page writeback feature
      (d2afd25114f4 ("zram: write incompressible pages to backing device")).
      
      Below concerns from Sergey:
      == &< ==
      
      "IDLE writeback" is superior to "incompressible writeback".
      
      "incompressible writeback" is completely unpredictable and uncontrollable;
      it depends on data patterns and compression algorithms.  While "IDLE
      writeback" is predictable.
      
      I even suspect, that, *ideally*, we can remove "incompressible writeback".
      "IDLE pages" is a super set which also includes "incompressible" pages.
      So, technically, we still can do "incompressible writeback" from "IDLE
      writeback" path; but a much more reasonable one, based on a page idling
      period.
      
      I understand that you want to keep "direct incompressible writeback"
      around.  ZRAM is especially popular on devices which do suffer from flash
      wearout, so I can see "incompressible writeback" path becoming a dead
      code, long term.
      
      == &< ==
      
      Below concerns from Minchan:
      == &< ==
      
      My concern is if we enable CONFIG_ZRAM_WRITEBACK in this implementation,
      both hugepage/idlepage writeback will turn on.  However, some users want to
      enable only idlepage writeback, so we need to introduce an on/off knob
      for hugepage or a new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those use cases.  I
      don't want to make it complicated *if possible*.
      
      Long term, I imagine we need to make the VM aware of a new swap hierarchy, a
      little bit different from as-is.  For example, the first high priority swap
      can return -EIO or -ENOCOMP, and swap tries to fall back to the next lower
      priority swap device.  With that, hugepage writeback will work transparently.
      
      So we could regard it as a regression because incompressible pages don't
      go to backing storage automatically.  Instead, the user should do it
      manually via "echo huge > /sys/block/zram/writeback".
      
      == &< ==
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a939888e
    • Minchan Kim's avatar
      zram: introduce ZRAM_IDLE flag · e82592c4
      Minchan Kim authored
      To support idle page writeback with upcoming patches, this patch
      introduces a new ZRAM_IDLE flag.
      
      Userspace can mark zram slots as "idle" via
      	"echo all > /sys/block/zramX/idle"
      which marks every allocated zram slot as ZRAM_IDLE.
      The user can see it via /sys/kernel/debug/zram/zram0/block_state.
      
                300    75.033841 ...i
                301    63.806904 s..i
                302    63.806919 ..hi
      
      Once there is IO for the slot, the mark disappears.
      
                300    75.033841 ...
                301    63.806904 s..i
                302    63.806919 ..hi
      
      Therefore, 300th block is idle zpage.  With this feature, the user can
      see how many idle pages zram has, which are a waste of memory.
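      
      To illustrate how a tool could consume block_state, here is a minimal
      sketch that checks one line for the idle mark.  It assumes the three
      columns shown above (slot index, access time, flag characters) and that
      'i' in the flag column means ZRAM_IDLE; the function name is
      hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the block_state line carries the 'i' (idle) mark,
 * 0 if it does not, and -1 if the line cannot be parsed.
 */
int slot_is_idle(const char *line)
{
	unsigned long index;
	double last_access;
	char flags[16];

	if (sscanf(line, "%lu %lf %15s", &index, &last_access, flags) != 3)
		return -1;

	return strchr(flags, 'i') != NULL;
}
```

      Counting the lines for which this returns 1 gives the number of idle
      slots.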
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e82592c4
    • Minchan Kim's avatar
      zram: refactor flags and writeback stuff · 7e529283
      Minchan Kim authored
      Rename some variables and restructure some code for better readability in
      writeback and zs_free_page.
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-4-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e529283
    • Minchan Kim's avatar
      zram: fix double free backing device · 5547932d
      Minchan Kim authored
      If blkdev_get fails, we shouldn't do blkdev_put.  Otherwise, the kernel
      emits the log below.  This patch fixes it.
      
        WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120
        Modules linked in:
        CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        RIP: 0010:blkdev_put+0x105/0x120
        Call Trace:
          __x64_sys_swapoff+0x46d/0x490
          do_syscall_64+0x5a/0x190
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
        irq event stamp: 4466
        hardirqs last  enabled at (4465):  __free_pages_ok+0x1e3/0x490
        hardirqs last disabled at (4466):  trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (3420):  __do_softirq+0x333/0x446
        softirqs last disabled at (3407):  irq_exit+0xd1/0xe0
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5547932d
    • Minchan Kim's avatar
      zram: fix lockdep warning of free block handling · 3c9959e0
      Minchan Kim authored
      Patch series "zram idle page writeback", v3.
      
      Inherently, a swap device has many idle pages which are rarely touched
      after they are allocated.  That is never a problem if we use a storage
      device as swap.  However, it's just a waste for zram-swap.
      
      This patchset supports zram idle page writeback feature.
      
      * Admin can define what is idle page "no access since X time ago"
      * Admin can define when zram should writeback them
      * Admin can define when zram should stop writeback to prevent wearout
      
      Details are in each patch's description.
      
      This patch (of 7):
      
        ================================
        WARNING: inconsistent lock state
        4.19.0+ #390 Not tainted
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
        00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
        {SOFTIRQ-ON-W} state was registered at:
          _raw_spin_lock+0x2c/0x40
          zram_make_request+0x755/0xdc9
          generic_make_request+0x373/0x6a0
          submit_bio+0x6c/0x140
          __swap_writepage+0x3a8/0x480
          shrink_page_list+0x1102/0x1a60
          shrink_inactive_list+0x21b/0x3f0
          shrink_node_memcg.constprop.99+0x4f8/0x7e0
          shrink_node+0x7d/0x2f0
          do_try_to_free_pages+0xe0/0x300
          try_to_free_pages+0x116/0x2b0
          __alloc_pages_slowpath+0x3f4/0xf80
          __alloc_pages_nodemask+0x2a2/0x2f0
          __handle_mm_fault+0x42e/0xb50
          handle_mm_fault+0x55/0xb0
          __do_page_fault+0x235/0x4b0
          page_fault+0x1e/0x30
        irq event stamp: 228412
        hardirqs last  enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
        hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
        softirqs last  enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
        softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&(&zram->bitmap_lock)->rlock);
          <Interrupt>
            lock(&(&zram->bitmap_lock)->rlock);
      
         *** DEADLOCK ***
      
        no locks held by zram_verify/2095.
      
        stack backtrace:
        CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x67/0x9b
         print_usage_bug+0x1bd/0x1d3
         mark_lock+0x4aa/0x540
         __lock_acquire+0x51d/0x1300
         lock_acquire+0x90/0x180
         _raw_spin_lock+0x2c/0x40
         put_entry_bdev+0x1e/0x50
         zram_free_page+0xf6/0x110
         zram_slot_free_notify+0x42/0xa0
         end_swap_bio_read+0x5b/0x170
         blk_update_request+0x8f/0x340
         scsi_end_request+0x2c/0x1e0
         scsi_io_completion+0x98/0x650
         blk_done_softirq+0x9e/0xd0
         __do_softirq+0xcc/0x427
         irq_exit+0xd1/0xe0
         do_IRQ+0x93/0x120
         common_interrupt+0xf/0xf
         </IRQ>
      
      With the writeback feature, zram_slot_free_notify could be called in softirq
      context by end_swap_bio_read.  However, bitmap_lock is not aware of that,
      so lockdep complains:
      
        get_entry_bdev
        spin_lock(bitmap->lock);
        irq
        softirq
        end_swap_bio_read
        zram_slot_free_notify
        zram_slot_lock <-- deadlock prone
        zram_free_page
        put_entry_bdev
        spin_lock(bitmap->lock); <-- deadlock prone
      
      With akpm's suggestion (i.e.  bitmap operation is already atomic), we
      could remove the bitmap lock.  It might fail to find an empty slot if serious
      contention happens.  However, that's not a severe problem because huge page
      writeback can already fail if there is severe memory
      pressure.  The worst case is just keeping the incompressible page in memory,
      not in storage.
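      
      The idea of claiming a bitmap slot without a lock can be sketched in
      userspace with C11 atomics.  This is only an analogue under stated
      assumptions: the kernel code uses its own bitmap helpers, and the names
      and the 64-slot bitmap here are hypothetical.  Each bit is claimed with
      an atomic test-and-set, so two racing callers can never both win the
      same slot.

```c
#include <stdatomic.h>

#define SKETCH_NR_SLOTS 64

/* One bit per backing-device slot; a set bit means "in use". */
static _Atomic unsigned long long sketch_bitmap;

/* Claim a free slot without a lock; returns its index, or -1 if none. */
int sketch_alloc_slot(void)
{
	for (int i = 0; i < SKETCH_NR_SLOTS; i++) {
		unsigned long long bit = 1ULL << i;

		/* Atomic test-and-set: we own the slot only if the bit was clear. */
		if (!(atomic_fetch_or(&sketch_bitmap, bit) & bit))
			return i;
	}
	return -1;
}

/* Release a slot by atomically clearing its bit. */
void sketch_free_slot(int i)
{
	atomic_fetch_and(&sketch_bitmap, ~(1ULL << i));
}
```

      A caller that gets -1 simply keeps the page in memory, which matches the
      worst case described above.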
      
      The other problem is zram_slot_lock in zram_slot_free_notify.  To
      make it safe, this patch introduces zram_slot_trylock, which
      zram_slot_free_notify uses.  Although it's rare for the lock to be contended,
      this patch adds a new debug stat "miss_free" to keep monitoring how often it
      happens.
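      
      The trylock-and-count pattern can be sketched in userspace as follows.
      This is not the zram code: the struct, the function names and the use of
      a C11 atomic_flag as the slot lock are assumptions made for
      illustration.  The point is that the notify path never spins; if the
      lock is busy it gives up and bumps a miss counter.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a zram slot with its per-slot lock. */
struct sketch_slot {
	atomic_flag lock;
	bool allocated;
};

/* Counts how often the free notification had to give up. */
atomic_ulong sketch_miss_free;

bool sketch_slot_trylock(struct sketch_slot *s)
{
	/* test_and_set returns the old value: false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire);
}

void sketch_slot_unlock(struct sketch_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Called from a context that must not wait; returns true if the slot was freed. */
bool sketch_slot_free_notify(struct sketch_slot *s)
{
	if (!sketch_slot_trylock(s)) {
		atomic_fetch_add(&sketch_miss_free, 1);
		return false;
	}
	s->allocated = false;
	sketch_slot_unlock(s);
	return true;
}
```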
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c9959e0
    • Qian Cai's avatar
      mm/memblock.c: skip kmemleak for kasan_init() · fed84c78
      Qian Cai authored
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), the kmemleak early
      log buffer went from something like 280 to 260000, which caused kmemleak
      to be disabled and the crash dump memory reservation to fail.  The multitude
      of kmemleak_alloc() calls comes from nested loops while KASAN is setting up
      full memory mappings, so let early kmemleak allocations skip the
      memblock_alloc_internal() calls that come from kasan_init(), given that those
      early KASAN memory mappings should not reference other memory.  Hence,
      no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
      Signed-off-by: Qian Cai <cai@gmx.us>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fed84c78
    • Oscar Salvador's avatar
      kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable · 65c78784
      Oscar Salvador authored
      This is a preparation for the next patch.
      
      Currently, we only call release_mem_region_adjustable() in __remove_pages
      if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
      are being released by themselves with devm_release_mem_region.
      
      Since we do not want to touch any zone/page stuff during the removing of
      the memory (but during the offlining), we do not want to check for the
      zone here.  So we need another way to tell release_mem_region_adjustable()
      to not release the resource in case it belongs to HMM/devm.
      
      HMM/devm acquires/releases a resource through
      devm_request_mem_region/devm_release_mem_region.
      
      These resources have the flag IORESOURCE_MEM, while resources acquired by
      hot-add memory path (register_memory_resource()) contain
      IORESOURCE_SYSTEM_RAM.
      
      So, we can check for this flag in release_mem_region_adjustable, and if
      the resource does not contain such flag, we know that we are dealing with
      a HMM/devm resource, so we can back off.
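      
      The flag test described above can be sketched as below.  The numeric
      values are written from memory of include/linux/ioport.h and should be
      treated as assumptions; the SKETCH_ prefix and the helper name are
      hypothetical and not part of the patch.

```c
#include <stdbool.h>

/* Assumed to mirror include/linux/ioport.h; verify against the tree. */
#define SKETCH_IORESOURCE_MEM        0x00000200UL
#define SKETCH_IORESOURCE_SYSRAM     0x01000000UL
#define SKETCH_IORESOURCE_SYSTEM_RAM \
	(SKETCH_IORESOURCE_MEM | SKETCH_IORESOURCE_SYSRAM)

/*
 * Only hot-added system RAM carries the SYSRAM modifier.  A plain
 * IORESOURCE_MEM resource is taken to belong to HMM/devm, so the
 * caller backs off and leaves it alone.
 */
bool sketch_may_release(unsigned long flags)
{
	return (flags & SKETCH_IORESOURCE_SYSRAM) != 0;
}
```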
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65c78784
    • Oscar Salvador's avatar
      mm, memory_hotplug: add nid parameter to arch_remove_memory · 2c2a5af6
      Oscar Salvador authored
      Patch series "Do not touch pages in hot-remove path", v2.
      
      This patchset aims for two things:
      
       1) A better definition about offline and hot-remove stage
       2) Solving bugs where we can access non-initialized pages
          during hot-remove operations [2] [3].
      
      This is achieved by moving all page/zone handling to the offline
      stage, so we do not need to access pages when hot-removing memory.
      
      [1] https://patchwork.kernel.org/cover/10691415/
      [2] https://patchwork.kernel.org/patch/10547445/
      [3] https://www.spinics.net/lists/linux-mm/msg161316.html
      
      This patch (of 5):
      
      This is a preparation for the follow-up patches.  The idea of passing
      the nid is that it will allow us to get rid of the zone parameter
      afterwards.
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c2a5af6
    • Wei Yang's avatar
      mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init() · 23b68cfa
      Wei Yang authored
      When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of
      each node's highest zone is initialized before defer stage.
      
      static_init_pgcnt is used to store the number of pages like this:
      
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                    pgdat->node_spanned_pages);
      
      because we don't want to overflow zone's range.
      
      But this is not necessary, since defer_init() is called like this:
      
        memmap_init_zone()
          for pfn in [start_pfn, end_pfn)
            defer_init(pfn, end_pfn)
      
      In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
      stop before calling defer_init().
      
      BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
      since nr_initialised is zone based instead of node based.  Even if
      node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
      could have fewer pages than PAGES_PER_SECTION.
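      
      The argument that no clamp is needed can be checked with a simplified
      model of the loop.  This sketch deliberately ignores the section
      alignment check and everything else the real defer_init() does; the
      names and the section size are assumptions.  It only shows that when
      the range is shorter than a section, the loop ends before the
      threshold is ever reached.

```c
/* Assumption for this sketch: 32768 pages per section. */
#define SKETCH_PAGES_PER_SECTION 32768UL

/*
 * Simplified model of memmap_init_zone(): initialise pfns eagerly until
 * a section's worth is done, then defer the rest.  Returns how many
 * pages were initialised eagerly.
 */
unsigned long sketch_eager_init_count(unsigned long start_pfn,
				      unsigned long end_pfn)
{
	unsigned long nr_initialised = 0;

	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++) {
		/* Stand-in for defer_init() comparing against PAGES_PER_SECTION. */
		if (nr_initialised >= SKETCH_PAGES_PER_SECTION)
			break;
		nr_initialised++;
	}
	return nr_initialised;
}
```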
      
      Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23b68cfa
    • Hugh Dickins's avatar
      mm: put_and_wait_on_page_locked() while page is migrated · 9a1ea439
      Hugh Dickins authored
      Waiting on a page migration entry has used wait_on_page_locked() all along
      since 2006: but you cannot safely wait_on_page_locked() without holding a
      reference to the page, and that extra reference is enough to make
      migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
      on the entry before migrate_page_move_mapping() gets there.
      
      And that failure is retried nine times, amplifying the pain when trying to
      migrate a popular page.  With a single persistent faulter, migration
      sometimes succeeds; with two or three concurrent faulters, success becomes
      much less likely (and the more the page was mapped, the worse the overhead
      of unmapping and remapping it on each try).
      
      This is especially a problem for memory offlining, where the outer level
      retries forever (or until terminated from userspace), because a heavy
      refault workload can trigger an endless loop of migration failures.
      wait_on_page_locked() is the wrong tool for the job.
      
      David Herrmann (but was he the first?) noticed this issue in 2014:
      https://marc.info/?l=linux-mm&m=140110465608116&w=2
      
      Tim Chen started a thread in August 2017 which appears relevant:
      https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
      on to implicate __migration_entry_wait():
      https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
      up with the v4.14 commits: 2554db91 ("sched/wait: Break up long wake
      list walk") 11a19c7b ("sched/wait: Introduce wakeup boomark in
      wake_up_page_bit")
      
      Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
      https://marc.info/?l=linux-mm&m=154217936431300&w=2
      
      We have all assumed that it is essential to hold a page reference while
      waiting on a page lock: partly to guarantee that there is still a struct
      page when MEMORY_HOTREMOVE is configured, but also to protect against
      reuse of the struct page going to someone who then holds the page locked
      indefinitely, when the waiter can reasonably expect timely unlocking.
      
      But in fact, so long as wait_on_page_bit_common() does the put_page(), and
      is careful not to rely on struct page contents thereafter, there is no
      need to hold a reference to the page while waiting on it.  That does mean
      that this case cannot go back through the loop: but that's fine for the
      page migration case, and even if used more widely, is limited by the "Stop
      walking if it's locked" optimization in wake_page_function().
      
      Add interface put_and_wait_on_page_locked() to do this, using "behavior"
      enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
      No interruptible or killable variant needed yet, but they might follow: I
      have a vague notion that reporting -EINTR should take precedence over
      return from wait_on_page_bit_common() without knowing the page state, so
      arrange it accordingly - but that may be nothing but pedantic.
      
      __migration_entry_wait() still has to take a brief reference to the page,
      prior to calling put_and_wait_on_page_locked(): but now that it is dropped
      before waiting, the chance of impeding page migration is very much
      reduced.  Should we perhaps disable preemption across this?
      
      shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
      survived a lot of testing before that showed up.  PageWaiters may have
      been set by wait_on_page_bit_common(), and the reference dropped, just
      before shrink_page_list() succeeds in freezing its last page reference: in
      such a case, unlock_page() must be used.  Follow the suggestion from
      Michal Hocko, just revert a978d6f5 ("mm: unlockless reclaim") now:
      that optimization predates PageWaiters, and won't buy much these days; but
      we can reinstate it for the !PageWaiters case if anyone notices.
      
      It does raise the question: should vmscan.c's is_page_cache_freeable() and
      __remove_mapping() now treat a PageWaiters page as if an extra reference
      were held?  Perhaps, but I don't think it matters much, since
      shrink_page_list() already had to win its trylock_page(), so waiters are
      not very common there: I noticed no difference when trying the bigger
      change, and it's surely not needed while put_and_wait_on_page_locked() is
      only used for page migration.
      
      [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Baoquan He <bhe@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a1ea439
    • yuzhoujian's avatar
      mm, oom: add oom victim's memcg to the oom context information · f0c867d9
      yuzhoujian authored
      The current oom report doesn't display victim's memcg context during the
      global OOM situation.  While this information is not strictly needed, it
      can be really helpful for containerized environments to locate which
      container has lost a process.  Now that we have a single line for the oom
      context, we can trivially add both the oom memcg (this can be either
      global_oom or a specific memcg which hits its hard limits) and task_memcg
      which is the victim's memcg.
      
      Below is the single line output in the oom report after this patch.
      
      - global oom context information:
      
      oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
      
      - memcg oom context information:
      
      oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
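      
      Since the point of the single line is easy parsing, here is a minimal
      sketch of pulling one key=value field such as task_memcg out of it.
      The function name is hypothetical, and the sketch assumes that values
      contain no commas, which may not hold for every field (a nodemask
      list, for instance).

```c
#include <string.h>

/*
 * Copy the value of "key=" from an oom-kill line into buf.
 * Returns 0 on success, -1 if the key is absent or buf is too small.
 */
int sketch_oom_field(const char *line, const char *key, char *buf, size_t len)
{
	size_t klen = strlen(key);
	const char *p = line;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept the match only at the start of a field. */
		int at_start = (p == line) || p[-1] == ',' || p[-1] == ':';

		if (at_start && p[klen] == '=') {
			const char *val = p + klen + 1;
			size_t n = strcspn(val, ",\n");

			if (n >= len)
				return -1;
			memcpy(buf, val, n);
			buf[n] = '\0';
			return 0;
		}
		p += klen;
	}
	return -1;
}
```

      Looking up oom_memcg on a global oom line fails with -1, which is one
      way for a parser to tell the two formats apart.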
      
      [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()]
        Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp
      Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com
      Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0c867d9
    • yuzhoujian's avatar
      mm, oom: reorganize the oom report in dump_header · ef8444ea
      yuzhoujian authored
      The OOM report contains several sections.  The first one is the allocation
      context that has triggered the OOM.  Then we have the cpuset context followed
      by the stack trace of the OOM path.  The third one is the OOM memory
      information, followed by the current memory state of all system tasks.
      Finally, we show the oom eligible tasks and the information about the
      chosen oom victim.
      
      One thing that makes parsing more awkward than necessary is that we do not
      have a single and easily parsable line about the oom context.  This patch
      reorganizes the oom report into:
      
      1) who invoked oom and what was the allocation request
      
      [  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      
      2) OOM stack trace
      
      [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
      [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
      [  515.906821] Call Trace:
      [  515.908062]  dump_stack+0x5a/0x73
      [  515.909311]  dump_header+0x55/0x28c
      [  515.914260]  oom_kill_process+0x2d8/0x300
      [  515.916708]  out_of_memory+0x145/0x4a0
      [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
      [  515.919157]  __alloc_pages_nodemask+0x277/0x290
      [  515.920367]  filemap_fault+0x3d0/0x6c0
      [  515.921529]  ? filemap_map_pages+0x2b8/0x420
      [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
      [  515.923884]  __do_fault+0x20/0x80
      [  515.925032]  __handle_mm_fault+0xbc0/0xe80
      [  515.926195]  handle_mm_fault+0xfa/0x210
      [  515.927357]  __do_page_fault+0x233/0x4c0
      [  515.928506]  do_page_fault+0x32/0x140
      [  515.929646]  ? page_fault+0x8/0x30
      [  515.930770]  page_fault+0x1e/0x30
      
      3) OOM memory information
      
      [  515.958093] Mem-Info:
      [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
       active_file:4402672 inactive_file:483963 isolated_file:1344
       unevictable:0 dirty:4886753 writeback:0 unstable:0
       slab_reclaimable:148442 slab_unreclaimable:18741
       mapped:1347 shmem:1347 pagetables:58669 bounce:0
       free:88663 free_pcp:0 free_cma:0
      ...
      
      4) current memory state of all system tasks
      
      [  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
      [  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
      [  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
      [  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
      [  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
      [  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
      [  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
      [  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
      [  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
      [  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
      [  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
      [  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
      [  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
      [  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic
      
      5) oom context (constraints and the chosen victim).
      
      oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
      
      An admin can easily get the full oom context on a single line, which
      makes parsing much easier.
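
As a rough illustration of the parsing this enables, the following sketch splits an "oom-kill:" summary line into a dictionary. The function name `parse_oom_kill` and the splitting rule are assumptions made for this example; the field names are the ones quoted in these changelogs. It splits only on commas that introduce a known field, on the assumption that list-valued fields such as mems_allowed may themselves contain commas.

```python
import re

# Field names taken from the line formats quoted in these changelogs.
# "global_oom" is a bare flag that carries no value.
KEYS = ("constraint", "nodemask", "cpuset", "mems_allowed", "oom_memcg",
        "task_memcg", "task", "pid", "uid")

# Split only on commas that start a known field, because a value such as
# mems_allowed may itself contain commas (assumption: e.g. "0,2-3").
_FIELD_SPLIT = re.compile(r",(?=(?:%s)=|global_oom(?:,|$))" % "|".join(KEYS))


def parse_oom_kill(line):
    """Return a dict of the fields found in an "oom-kill:" summary line."""
    prefix = "oom-kill:"
    start = line.find(prefix)
    if start < 0:
        raise ValueError("not an oom-kill summary line")
    fields = {}
    for part in _FIELD_SPLIT.split(line[start + len(prefix):].strip()):
        key, sep, value = part.partition("=")
        # A field without "=" (global_oom) is stored as a boolean flag.
        fields[key] = value if sep else True
    return fields


example = ("oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,"
           "mems_allowed=0-1,task=panic,pid=10737,uid=0")
info = parse_oom_kill(example)
print(info["task"], info["pid"], info["constraint"])
```

Values are kept as strings; a task name or memcg path that itself contains text like ",pid=" would confuse this simple splitter.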
      
      Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
      Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: reclaim small amounts of memory when an external fragmentation event occurs · 1c30844d
      Mel Gorman authored
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation causing
      event occurs.  Boosting stalls allocations that would push free memory
      below the boosted low watermark.  If the calling context allows, kswapd is
      woken to reclaim an amount of memory relative to the size of the high
      watermark and the watermark_boost_factor until the boost is cleared.
      When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
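
The following is a simplified model of how such a boost could accumulate, not kernel code. It assumes details this changelog does not state: that watermark_boost_factor is expressed in fractions of 10,000 of the high watermark, that each fragmentation event adds one pageblock worth of pages, and that a factor of 0 disables boosting. The function name `next_boost` is hypothetical.

```python
PAGEBLOCK_NR_PAGES = 512  # pages in an order-9 pageblock (64-bit x86)


def next_boost(current_boost, high_wmark, boost_factor):
    """Model the boost (in pages) after one more fragmentation event.

    Assumptions for this sketch: boost_factor is in units of 1/10000 of
    the high watermark, and the boost grows by one pageblock per event
    up to that cap.
    """
    if boost_factor == 0:
        return current_boost  # boosting disabled, nothing changes
    max_boost = max(PAGEBLOCK_NR_PAGES, high_wmark * boost_factor // 10000)
    return min(current_boost + PAGEBLOCK_NR_PAGES, max_boost)


boost = 0
for _ in range(3):
    boost = next_boost(boost, high_wmark=10000, boost_factor=15000)
print(boost)  # three events, one pageblock each
```

Under these assumptions the boost rises in pageblock-sized steps and saturates at the configured fraction of the high watermark, which matches the description of reclaim "relative to the size of the high watermark and the watermark_boost_factor".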
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation causing events are massively reduced by
      this patch whether in comparison to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
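
The reduction percentages quoted for this configuration can be reproduced from the raw event counts. The helper name `reduction_pct` is made up for this sketch; the counts are the ones listed above.

```python
def reduction_pct(baseline, patched):
    """Percentage by which `patched` is lower than `baseline`."""
    return 100.0 * (baseline - patched) / baseline


baseline = 804694  # 4.20-rc3 extfrag events < order 9
for label, events in (("patch", 408912), ("patch1-4", 18421)):
    print(f"{label}: {reduction_pct(baseline, events):.0f}% reduction")
# prints:
# patch: 49% reduction
# patch1-4: 98% reduction
```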
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: use alloc_flags to record if kswapd can wake · 0a79cdad
      Mel Gorman authored
      This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
      into alloc_flags.  Doing so avoids having to pass gfp_mask through a long
      callchain in a future patch.
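
The idea of copying one flag from a request mask into a second, internal mask can be sketched as follows. The bit values below are placeholders chosen for this example and are not the kernel's actual values; `record_kswapd` is a hypothetical helper name.

```python
# Placeholder bit values for illustration only; the real kernel constants differ.
GFP_KSWAPD_RECLAIM = 0x1  # stands in for __GFP_KSWAPD_RECLAIM
ALLOC_KSWAPD = 0x2        # stands in for the ALLOC_KSWAPD alloc flag


def record_kswapd(gfp_mask, alloc_flags):
    """Set ALLOC_KSWAPD in alloc_flags when the GFP mask allows kswapd wakeups."""
    if gfp_mask & GFP_KSWAPD_RECLAIM:
        alloc_flags |= ALLOC_KSWAPD
    return alloc_flags


flags = record_kswapd(GFP_KSWAPD_RECLAIM, alloc_flags=0x8)
# Later code only needs alloc_flags, not the original gfp_mask.
print(bool(flags & ALLOC_KSWAPD))
```

Once the bit is recorded, functions deeper in the call chain can test alloc_flags alone, which is the motivation given above.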
      
      Note that the setting in the fast path happens in alloc_flags_nofragment()
      and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
      That's true in this patch but is not true later so it's done now for
      easier review to show where the flag needs to be recorded.
      
      No functional change.
      
      [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
        Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
      Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: move zone watermark accesses behind an accessor · a9214443
      Mel Gorman authored
      This is a preparation patch only, no functional change.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm, page_alloc: spread allocations across zones before introducing fragmentation · 6bb15450
      Mel Gorman authored
      Patch series "Fragmentation avoidance improvements", v5.
      
      It has been noted before that fragmentation avoidance (aka
      anti-fragmentation) is not perfect. Given sufficient time or an adverse
      workload, memory gets fragmented and the long-term success of high-order
      allocations degrades. This series defines an adverse workload and a
      definition of external fragmentation events (including serious ones),
      then reduces the level of those fragmentation events.
      
      The details of the workload and the consequences are described in more
      detail in the changelogs. However, from patch 1, this is a high-level
      summary of the adverse workload. The exact details are found in the
      mmtests implementation.
      
      The broad details of the workload are as follows:
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch)
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed
      3. Warm up a number of fio read-only threads accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll fault back in old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with fewer
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Clean up.
      
      Overall the series reduces external fragmentation causing events by over 94%
      on 1 and 2 socket machines, which in turn impacts high-order allocation
      success rates over the long term. There are differences in latencies and
      high-order allocation success rates. Latencies are a mixed bag as they
      are vulnerable to exact system state and whether allocations succeeded
      so they are treated as a secondary metric.
      
      Patch 1 uses lower zones if they are populated and have free memory
      	instead of fragmenting a higher zone. It's special cased to
      	handle a Normal->DMA32 fallback with the reasons explained
      	in the changelog.
      
      Patch 2-4 boosts watermarks temporarily when an external fragmentation
      	event occurs. kswapd wakes to reclaim a small amount of old memory
      	and then wakes kcompactd on completion to recover the system
      	slightly. This introduces some overhead in the slowpath. The level
      	of boosting can be tuned or disabled depending on the tolerance
      	for fragmentation vs allocation latency.
      
      Patch 5 stalls some movable allocation requests to let kswapd from patch 4
      	make some progress. The duration of the stalls is very low but it
      	is possible to tune the system to avoid fragmentation events if
      	larger stalls can be tolerated.
      
      The bulk of the improvement in fragmentation avoidance is from patches
      1-4 but patch 5 can deal with a rare corner case and provides the option
      of tuning a system for THP allocation success rates in exchange for
      some stalls to control fragmentation.
      
      This patch (of 5):
      
      The page allocator zone lists are iterated based on the watermarks of each
      zone which does not take anti-fragmentation into account.  On x86, node 0
      may have multiple zones while other nodes have one zone.  A consequence is
      that tasks running on node 0 may fragment ZONE_NORMAL even though
      ZONE_DMA32 has plenty of free memory.  This patch special cases the
      allocator fast path such that it'll try an allocation from a lower local
      zone before fragmenting a higher zone.  In this case, stealing of
      pageblocks or orders larger than a pageblock are still allowed in the fast
      path as they are uninteresting from a fragmentation point of view.
      
      This was evaluated using a benchmark designed to fragment memory before
      attempting THP allocations.  It's implemented in mmtests as the following
      configurations
      
      configs/config-global-dhp__workload_thpfioscale
      configs/config-global-dhp__workload_thpfioscale-defrag
      configs/config-global-dhp__workload_thpfioscale-madvhugepage
      
      e.g. from mmtests
      ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
      
      The broad details of the workload are as follows:
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch).
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed.
      3. Warm up a number of fio read-only processes accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll refault old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with fewer
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds.
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Clean up the test files.
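
The thread-count rule in step 4 can be written out as a small helper. This is a sketch: the function name `thp_threads` is hypothetical, and rounding down with integer division is an assumption, since the text only gives the expression (NR_CPUS_SOCKET - NR_FIO_THREADS)/4.

```python
def thp_threads(cpus_per_socket, fio_threads=4):
    """Number of THP-allocating threads for one socket.

    Assumes the division in (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 rounds down.
    """
    return (cpus_per_socket - fio_threads) // 4


# Illustrative inputs only, not the CPU counts of the machines tested.
print(thp_threads(8))   # 1
print(thp_threads(24))  # 5
```

With the default of 4 fio threads, fewer than 8 CPUs per socket would yield zero THP threads under this reading, which is consistent with the note that the test has not been used on smaller machines.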
      
      Note that due to the use of IO and page cache, this benchmark is not
      suitable for running on large machines where the time to fragment memory
      may be excessive.  Also note that while this is one mix that generates
      fragmentation, it's not the only mix that does so.
      Differences in workload that are more slab-intensive or whether SLUB is
      used with high-order pages may yield different results.
      
      When the page allocator fragments memory, it records the event using the
      mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
      a pageblock order (order-9 on 64-bit x86) then it's considered to be an
      "external fragmentation event" that may cause issues in the future.
      Hence, the primary metric here is the number of external fragmentation
      events that occur with order < 9.  The secondary metrics are allocation
      latency and huge page allocation success rates, but note that differences
      in latencies and in the success rate can also affect the number of
      external fragmentation events, which is why they are secondary metrics.
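
The primary metric amounts to counting fallback events whose order is below the pageblock order. A minimal sketch, assuming the fallback_order values have already been extracted from the trace into a list (the extraction step and the sample values are not from the changelog):

```python
PAGEBLOCK_ORDER = 9  # order-9 on 64-bit x86, as stated above


def count_extfrag_events(fallback_orders, pageblock_order=PAGEBLOCK_ORDER):
    """Count fallbacks small enough to be external fragmentation events."""
    return sum(1 for order in fallback_orders if order < pageblock_order)


# Made-up sample of fallback_order values from mm_page_alloc_extfrag events.
sample = [0, 3, 9, 10, 8]
print(count_extfrag_events(sample))  # 3
```

Fallbacks at or above the pageblock order are ignored because they claim whole pageblocks and do not mix page types within one.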
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
      Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
      Fault latencies are slightly reduced while allocation success rates remain
      at zero as this configuration does not make any special effort to allocate
      THP and fio is heavily active at the time and either filling memory or
      keeping pages resident.  However, a 49% reduction of serious fragmentation
      events reduces the chances of external fragmentation being a problem in
      the future.
      
      Vlastimil asked during review for a breakdown of the allocation types
      that are falling back.
      
      vanilla
         3816 MIGRATE_UNMOVABLE
       800845 MIGRATE_MOVABLE
           33 MIGRATE_RECLAIMABLE
      
      patch
          735 MIGRATE_UNMOVABLE
       408135 MIGRATE_MOVABLE
           42 MIGRATE_RECLAIMABLE
      
      The majority of the fallbacks are due to movable allocations.  This is
      consistent for the workload throughout the series, so the breakdown will
      not be presented again.
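
As a quick check on the breakdown, the per-type counts sum to the event totals reported for this configuration, and the movable share can be computed directly. The dictionary keys and the helper `movable_share` are naming choices for this sketch; the third key stands for the third migrate type listed in the breakdown.

```python
def movable_share(counts):
    """Percentage of fallbacks that were movable allocations."""
    return 100.0 * counts["movable"] / sum(counts.values())


# Counts copied from the breakdown above.
vanilla = {"unmovable": 3816, "movable": 800845, "reclaimable": 33}
patched = {"unmovable": 735, "movable": 408135, "reclaimable": 42}

for name, counts in (("vanilla", vanilla), ("patch", patched)):
    total = sum(counts.values())
    print(f"{name}: {total} fallbacks, {movable_share(counts):.1f}% movable")
# prints:
# vanilla: 804694 fallbacks, 99.5% movable
# patch: 408912 fallbacks, 99.8% movable
```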
      
      Movable allocations are sometimes considered "ok" to fall back because they
      can be migrated.  The problem is that they can fill an
      unmovable/reclaimable pageblock causing those allocations to fall back
      later and polluting pageblocks with pages that cannot move.  If there is a
      movable fallback, it is pretty much guaranteed to affect an
      unmovable/reclaimable pageblock and while it might not be enough to
      actually cause an unmovable/reclaimable fallback in the future, we cannot
      know that in advance so the patch takes the only option available to it.
      Hence, it's important to control them.  This point is also consistent
      throughout the series and will not be repeated.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
      Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)
      
      Fragmentation events were reduced quite a bit although this is known
      to be a little variable. The latencies and allocation success rates
      are similar but they were already quite high.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
      Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)
      
      The reduction of external fragmentation events is slight and this is
      partially due to the removal of __GFP_THISNODE in commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
      allocations can now spill over to remote nodes instead of fragmenting
      local memory.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
      Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)
      
      There was a slight reduction in external fragmentation events although the
      latencies were higher.  The allocation success rate is high enough that
      the system is struggling and there is quite a lot of parallel reclaim and
      compaction activity.  There is also a certain degree of luck on whether
      processes start on node 0 or not for this patch but the relevance is
      reduced later in the series.
      
      Overall, the patch reduces the number of external fragmentation causing
      events so the success of THP over long periods of time would be improved
      for this adverse workload.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>