1. 26 Jul, 2016 40 commits
    • af_unix: charge buffers to kmemcg · 3aa9799e
      Vladimir Davydov authored
      Unix sockets can consume a significant amount of system memory, hence
      they should be accounted to kmemcg.
      
      Since unix socket buffers are always allocated from process context, all
      we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
      sock->sk_allocation mask.
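
      Concretely, this amounts to a one-liner at unix socket creation time; a
      rough sketch (context trimmed, not the exact hunk):

        sk->sk_allocation = GFP_KERNEL_ACCOUNT;   /* GFP_KERNEL | __GFP_ACCOUNT */

      With that, every skb allocated with sk->sk_allocation gets charged to the
      sending task's kmemcg.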
      
      Eric asked:
      
      > 1) What happens when a buffer, allocated from socket <A> lands in a
      > different socket <B>, maybe owned by another user/process.
      >
      > Who owns it now, in term of kmemcg accounting ?
      
      We never move memcg charges.  E.g.  if two processes from different
      cgroups are sharing a memory region, each page will be charged to the
      process which touched it first.  Or if two processes are working with
      the same directory tree, inodes and dentries will be charged to the
      first user.  The same holds for unix socket buffers - they will be
      charged to the sender.
      
      > 2) Has performance impact been evaluated ?
      
      I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
      x2 HT box.  The results are below:
      
       # clients            bandwidth (10^6bits/sec)
                          base              patched
               1      67643 +-  725      64874 +-  353    - 4.0 %
               4     193585 +- 2516     186715 +- 1460    - 3.5 %
               8     194820 +-  377     187443 +- 1229    - 3.7 %
      
      So the accounting doesn't come for free - it takes ~4% of performance.
      I believe we could optimize it by using per cpu batching not only on
      charge, but also on uncharge in memcg core, but that's beyond the scope
      of this patch set - I'll take a look at this later.
      
      Anyway, if performance impact is found to be unacceptable, it is always
      possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
      or not use memory cgroups at runtime at all (thanks to jump labels
      there'll be no overhead even if they are compiled in).
      
      Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3aa9799e
    • pipe: account to kmemcg · d86133bd
      Vladimir Davydov authored
      Pipes can consume a significant amount of system memory, hence they
      should be accounted to kmemcg.
      
      This patch marks pipe_inode_info and anonymous pipe buffer page
      allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
      Note, since a pipe buffer page can be "stolen" and get reused for other
      purposes, including mapping to userspace, we clear PageKmemcg thus
      resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
      is introduced by this patch.
      
      A note regarding the anon_pipe_buf_steal implementation.  We allow stealing
      the page if its ref count equals 1.  It looks racy, but it is correct
      for anonymous pipe buffer pages, because:
      
       - We lock out all other pipe users, because ->steal is called with
         pipe_lock held, so the page can't be spliced to another pipe from
         under us.
      
       - The page is not on LRU and it never was.
      
       - Thus a parallel thread can access it only by PFN. Although this is
         quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
         this is not dangerous, because all such functions do is increase page
         ref count, check if the page is the one they are looking for, and
         decrease ref count if it isn't. Since our page is clean except for
         PageKmemcg mark, which doesn't conflict with other _mapcount users,
         the worst that can happen is we see page_count > 2 due to a transient
         ref, in which case we false-positively abort ->steal, which is still
         fine, because ->steal is not guaranteed to succeed.
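
      A minimal sketch of what such a ->steal callback looks like, following the
      description above (condensed; the uncharge detail hides behind the
      PageKmemcg/memcg helpers):

        static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
                                       struct pipe_buffer *buf)
        {
                struct page *page = buf->page;

                /* only our reference left: safe to reuse the page */
                if (page_count(page) == 1) {
                        if (memcg_kmem_enabled())
                                memcg_kmem_uncharge(page, 0);   /* clears PageKmemcg */
                        __SetPageLocked(page);
                        return 0;
                }
                return 1;       /* transient or real extra ref: refuse to steal */
        }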
      
      Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d86133bd
    • arch: x86: charge page tables to kmemcg · 3e79ec7d
      Vladimir Davydov authored
      Page tables can bite a relatively big chunk off system memory and their
      allocations are easy to trigger from userspace, so they should be
      accounted to kmemcg.
      
      This patch marks page table allocations as __GFP_ACCOUNT for x86.  Note
      we must not charge allocations of kernel page tables, because they can
      be shared among processes from different cgroups, so accounting them to
      a particular cgroup could keep it pinned for an indefinitely long time.
      So we clear the __GFP_ACCOUNT flag if a page table is allocated for the
      kernel.
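
      Roughly, the x86 side boils down to the following sketch (names abbreviated
      from arch/x86/mm/pgtable.c, not the literal hunk):

        #define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)   /* user page tables: charged */

        /* kernel page tables must not be charged: strip the accounting flag */
        pte = (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);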
      
      Link: http://lkml.kernel.org/r/7d5c54f6a2bcbe76f03171689440003d87e6c742.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e79ec7d
    • mm: memcontrol: teach uncharge_list to deal with kmem pages · 5e8d35f8
      Vladimir Davydov authored
      Page table pages are batched-freed in release_pages on most
      architectures.  If we want to charge them to kmemcg (this is what is
      done later in this series), we need to teach mem_cgroup_uncharge_list to
      handle kmem pages.
      
      Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e8d35f8
    • mm: charge/uncharge kmemcg from generic page allocator paths · 4949148a
      Vladimir Davydov authored
      Currently, to charge a non-slab allocation to kmemcg one has to use
      alloc_kmem_pages helper with __GFP_ACCOUNT flag.  A page allocated with
      this helper should finally be freed using free_kmem_pages, otherwise it
      won't be uncharged.
      
      This API suits its current users fine, but it turns out to be impossible
      to use along with page reference counting, i.e.  when an allocation is
      supposed to be freed with put_page, as it is the case with pipe or unix
      socket buffers.
      
      To overcome this limitation, this patch moves charging/uncharging to
      generic page allocator paths, i.e.  to __alloc_pages_nodemask and
      free_pages_prepare, and zaps alloc/free_kmem_pages helpers.  This way,
      one can use any of the available page allocation functions to get the
      allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
      just like in case of kmalloc and friends.  A charged page will be
      automatically uncharged on free.
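
      With this in place, charged pages come and go through the ordinary helpers;
      a purely illustrative usage sketch:

        struct page *page = alloc_page(GFP_KERNEL | __GFP_ACCOUNT);
        if (!page)
                return -ENOMEM;

        /* ... pass the page around, take and drop references as usual ... */

        put_page(page);         /* uncharged automatically when the last ref goes */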
      
      To make it possible, we need to mark pages charged to kmemcg somehow.
      To avoid introducing a new page flag, we make use of page->_mapcount for
      marking such pages.  Since pages charged to kmemcg are not supposed to
      be mapped to userspace, it should work just fine.  There are other
      (ab)users of page->_mapcount - buddy and balloon pages - but we don't
      conflict with them.
      
      In case kmemcg is compiled out or not used at runtime, this patch
      introduces no overhead to generic page allocator paths.  If kmemcg is
      used, it adds one gfp flags check on alloc and one page->_mapcount check
      on free, which shouldn't hurt performance, because the data accessed are
      hot.
      
      Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4949148a
    • mm: memcontrol: cleanup kmem charge functions · 45264778
      Vladimir Davydov authored
       - Hand the memcg_kmem_enabled() check out to the caller.  This reduces
         the number of function definitions, making the code easier to follow.
         At the same time it doesn't result in code bloat, because all of these
         functions are used in only one or two places.
      
       - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
         have to dive deep into memcg implementation to see which allocations
         are charged and which are not.
      
       - Refresh comments.
      
      Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45264778
    • mm: clean up non-standard page->_mapcount users · 632c0a1a
      Vladimir Davydov authored
       - Add a proper comment to page->_mapcount.
      
       - Introduce a macro for generating helper functions.
      
       - Place all special page->_mapcount values next to each other so that
         readers can see all possible values and so we don't get duplicates.
      
      Link: http://lkml.kernel.org/r/502f49000e0b63e6c62e338fac6b420bf34fb526.1464079537.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      632c0a1a
    • mm: remove pointless struct in struct page definition · 99691add
      Vladimir Davydov authored
      This patchset implements per kmemcg accounting of page tables
      (x86-only), pipe buffers, and unix socket buffers.
      
      Patches 1-3 are just cleanups that are not supposed to introduce any
      functional changes.  Patches 4 and 5 move charge/uncharge to generic
      page allocator paths for the sake of accounting pipe and unix socket
      buffers.  Patches 6-8 make x86 page tables, pipe buffers, and unix
      socket buffers accountable.
      
      This patch (of 8):
      
      ... to reduce indentation level thus leaving more space for comments.
      
      Link: http://lkml.kernel.org/r/f34ffe70fce2b0b9220856437f77972d67c14275.1464079537.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99691add
    • mm/mmu_gather: track page size with mmu gather and force flush if page size change · e77b0852
      Aneesh Kumar K.V authored
      This allows an arch which needs to do special handling with respect to
      different page sizes when flushing the tlb to implement the same in the
      mmu gather.
      
      Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e77b0852
    • mm: change the interface for __tlb_remove_page() · e9d55e15
      Aneesh Kumar K.V authored
      This updates the generic and arch-specific implementations to return
      true if we need to do a tlb flush.  That means if __tlb_remove_page
      indicates a flush is needed, the page we tried to remove needs to be
      tracked and added again after the flush.  We need to track it because we
      have already updated the pte to none and we can't just loop back.
      
      This change is done to enable us to do a tlb_flush when we try to flush
      a range that consists of different page sizes.  For architectures like
      ppc64, we can do a range-based tlb flush and we need to track the page
      size for that.  When we try to remove a huge page, we will force a tlb
      flush and start a new mmu gather.
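
      A sketch of the resulting caller pattern (modeled on zap_pte_range();
      simplified, not the exact hunk):

        /* returns true when the gather is full and a flush is required */
        if (unlikely(__tlb_remove_page(tlb, page))) {
                force_flush = 1;
                addr += PAGE_SIZE;      /* the pte is already cleared ... */
                break;                  /* ... so flush now and resume from here */
        }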
      
      [aneesh.kumar@linux.vnet.ibm.com: mm-change-the-interface-for-__tlb_remove_page-v3]
        Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9d55e15
    • mm/hugetlb: simplify hugetlb unmap · 31d49da5
      Aneesh Kumar K.V authored
      For hugetlb, like THP (and unlike regular pages), we do the tlb flush
      after dropping the ptl.  Because of that, we don't need to track
      force_flush as we do now.  Instead we can simply call tlb_remove_page(),
      which will do the flush if needed.
      
      No functionality change in this patch.
      
      Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31d49da5
    • mm: thp: check pmd_trans_unstable() after split_huge_pmd() · 337d9abf
      Naoya Horiguchi authored
      split_huge_pmd() doesn't guarantee that the pmd is a normal pmd pointing
      to pte entries, which can be checked with pmd_trans_unstable().  Some
      callers make this check, some do it differently and some don't, so let's
      do it in a unified manner.
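
      The unified pattern this leads to, as a sketch (exact placement and return
      value differ per caller):

        split_huge_pmd(vma, pmd, addr);
        /*
         * The pmd may still be huge (or none) here, e.g. if another thread
         * refaulted a THP; only walk the pte level if the pmd is stable.
         */
        if (pmd_trans_unstable(pmd))
                return 0;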
      
      Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      337d9abf
    • mm/page_isolation: clean up confused code · e3a2713c
      Joonsoo Kim authored
      When there is an isolated_page, post_alloc_hook() is called with page
      but __free_pages() is called with isolated_page.  Since they are the
      same page there is no problem, but it is very confusing.  To reduce the
      confusion, this patch changes isolated_page to a boolean type and uses
      the page variable consistently.
      
      Link: http://lkml.kernel.org/r/1466150259-27727-10-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3a2713c
    • mm/page_alloc: introduce post allocation processing on page allocator · 46f24fd8
      Joonsoo Kim authored
      This patch is motivated from Hugh and Vlastimil's concern [1].
      
      There are two ways to get a freepage from the allocator.  One is using
      the normal memory allocation API and the other is __isolate_free_page(),
      which is used internally for compaction and pageblock isolation.  The
      latter usage is rather tricky since it doesn't do the whole post
      allocation processing done by the normal API.
      
      One problem I already know of is that a poisoned page would not be
      checked if it is allocated by __isolate_free_page().  Perhaps there
      would be more.
      
      We could add more debug logic for allocated pages in the future and this
      separation would cause more problems.  I'd like to fix this situation
      now.  The solution is simple: this patch commonizes some logic for newly
      allocated pages and uses it at all sites.  This will solve the problem.
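
      A sketch of what the commonized hook amounts to, and how the isolation path
      uses it (body condensed from the series):

        /* common processing for every page handed out by the allocator */
        void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags)
        {
                set_page_private(page, 0);
                set_page_refcounted(page);
                kernel_poison_pages(page, 1 << order, 1);       /* poison check */
                kasan_alloc_pages(page, order);
                set_page_owner(page, order, gfp_flags);
        }

        /* compaction / page isolation, instead of open-coding parts of the above */
        post_alloc_hook(page, order, __GFP_MOVABLE);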
      
      [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E
      
      [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
        Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
        Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46f24fd8
    • mm/page_owner: use stackdepot to store stacktrace · f2ca0b55
      Joonsoo Kim authored
      Currently, we store each page's allocation stacktrace in the
      corresponding page_ext structure and it requires a lot of memory.  This
      causes the problem that a memory-tight system doesn't work well if
      page_owner is enabled.  Moreover, even with this large memory
      consumption, we cannot get a full stacktrace because we allocate the
      memory at boot time and maintain just 8 stacktrace slots to balance
      memory consumption.  We could increase it further, but that would make
      the system unusable or change system behaviour.
      
      To solve the problem, this patch uses stackdepot to store stacktrace.
      It obviously provides memory saving but there is a drawback that
      stackdepot could fail.
      
      stackdepot allocates memory at runtime, so it could fail if the system
      does not have enough memory.  But most allocation stacks are generated
      very early, when there is still plenty of memory, so failure would not
      happen easily.  And one failure means that we miss just one page's
      allocation stacktrace, so it would not be a big problem.  In this patch,
      when a memory allocation failure happens, we store a special stacktrace
      handle in the page whose stacktrace failed to be saved.  With it, the
      user can still estimate memory usage properly even if a failure happens.
      
      The memory saving looks as follows (4GB memory system with page_owner,
      before the patch -> after the patch):
      
      static allocation:
      92274688 bytes -> 25165824 bytes
      
      dynamic allocation after boot + kernel build:
      0 bytes -> 327680 bytes
      
      total:
      92274688 bytes -> 25493504 bytes
      
      72% reduction in total.
      
      Note that the implementation looks more complex than one would imagine
      because there is a recursion issue: stackdepot uses the page allocator
      and page_owner is called at page allocation, so using stackdepot in
      page_owner could re-enter the page allocator and then page_owner again.
      To detect and avoid this, whenever we obtain a stacktrace, recursion is
      checked and page_owner is set to dummy information if it is found.
      Dummy information means that this page was allocated for the page_owner
      feature itself (such as stackdepot), which is understandable behaviour
      for the user.
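
      A sketch of the stacktrace-saving path this describes (close to
      mm/page_owner.c but condensed; the returned depot_stack_handle_t is what
      gets stored in page_ext):

        static depot_stack_handle_t save_stack(gfp_t flags)
        {
                unsigned long entries[PAGE_OWNER_STACK_DEPTH];
                struct stack_trace trace = {
                        .entries     = entries,
                        .max_entries = PAGE_OWNER_STACK_DEPTH,
                        .skip        = 2,
                };
                depot_stack_handle_t handle;

                save_stack_trace(&trace);
                if (check_recursive_alloc(&trace, _RET_IP_))
                        return dummy_handle;    /* page allocated for page_owner itself */

                handle = depot_save_stack(&trace, flags);
                if (!handle)
                        handle = failure_handle;        /* stackdepot allocation failed */
                return handle;
        }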
      
      [iamjoonsoo.kim@lge.com: mm-page_owner-use-stackdepot-to-store-stacktrace-v3]
        Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com
        Link: http://lkml.kernel.org/r/1466150259-27727-7-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2ca0b55
    • tools/vm/page_owner: increase temporary buffer size · 37137675
      Joonsoo Kim authored
      Page owner will be changed to store a deeper stacktrace, so the current
      temporary buffer size isn't enough.  Increase it.
      
      Link: http://lkml.kernel.org/r/1464230275-25791-5-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37137675
    • mm/page_owner: introduce split_page_owner and replace manual handling · a9627bc5
      Joonsoo Kim authored
      split_page() calls set_page_owner() to set up page_owner for each page.
      But it has the drawback that the head page and the others end up with
      different stacktraces because the callsite of set_page_owner() is
      slightly different.  To avoid this problem, this patch copies the head
      page's page_owner to the others.  It needs to introduce a new function,
      split_page_owner(), but it also removes the other function,
      get_page_owner_gfp(), so it looks like a good trade.
      
      Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9627bc5
    • mm/page_owner: copy last_migrate_reason in copy_page_owner() · a8efe1c9
      Joonsoo Kim authored
      Currently, copy_page_owner() doesn't copy all the owner information.  It
      skips last_migrate_reason because copy_page_owner() is used for
      migration and it will be properly set soon.  But a following patch will
      use copy_page_owner() and this skip will cause the problem that an
      allocated page has an uninitialized last_migrate_reason.  To prevent
      that, this patch also copies last_migrate_reason in copy_page_owner().
      
      Link: http://lkml.kernel.org/r/1464230275-25791-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8efe1c9
    • mm/page_owner: initialize page owner without holding the zone lock · 83358ece
      Joonsoo Kim authored
      It's not necessary to initialize page_owner while holding the zone lock.
      Doing so causes more contention on the zone lock, and although that's
      not a big problem since it is just a debug feature, it is still better
      to avoid it.  This is also a preparation step for using stackdepot in
      the page owner feature: stackdepot allocates new pages when there is no
      reserved space left, and holding the zone lock in that case would cause
      a deadlock.
      
      Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83358ece
    • mm/compaction: split freepages without holding the zone lock · 66c64223
      Joonsoo Kim authored
      We don't need to split freepages while holding the zone lock.  It causes
      more contention on the zone lock, so it is not desirable.
      
      [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66c64223
    • zsmalloc: use OBJ_TAG_BIT for bit shifter · 3b1d9ca6
      Minchan Kim authored
      A static checker warns about using a tag as a bit shifter.  It doesn't
      break anything currently, but it's not good for readability.  Let's use
      OBJ_TAG_BIT as the bit shifter instead of OBJ_ALLOCATED_TAG.
      
      Link: http://lkml.kernel.org/r/20160607045146.GF26230@bbox
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b1d9ca6
    • zram: use __GFP_MOVABLE for memory allocation · 9bc482d3
      Minchan Kim authored
      Zsmalloc is ready for page migration so zram can use __GFP_MOVABLE from
      now on.
      
      I ran a test to see how it helps to make higher-order pages.  The test
      scenario is as follows.
      
      KVM guest, 1G memory, ext4-formatted zram block device:
      
        for i in `seq 1 8`;
        do
                dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 &
        done
      
        wait `pidof dd`
      
        for i in `seq 1 2 8`;
        do
                rm -rf mnt/test$i.txt
        done
        fstrim -v mnt
      
        echo "init"
        cat /proc/buddyinfo
      
        echo "compaction"
        echo 1 > /proc/sys/vm/compact_memory
        cat /proc/buddyinfo
      
      old:
      
        init
        Node 0, zone      DMA    208    120     51     41     11      0      0      0      0      0      0
        Node 0, zone    DMA32  16380  13777   9184   3805    789     54      3      0      0      0      0
        compaction
        Node 0, zone      DMA    132     82     40     39     16      2      1      0      0      0      0
        Node 0, zone    DMA32   5219   5526   4969   3455   1831    677    139     15      0      0      0
      
      new:
      
        init
        Node 0, zone      DMA    379    115     97     19      2      0      0      0      0      0      0
        Node 0, zone    DMA32  18891  16774  10862   3947    637     21      0      0      0      0      0
        compaction
        Node 0, zone      DMA    214     66     87     29     10      3      0      0      0      0      0
        Node 0, zone    DMA32   1612   3139   3154   2469   1745    990    384     94      7      0      0
      
      As you can see, compaction made so many high-order pages. Yay!
      
      Link: http://lkml.kernel.org/r/1464736881-24886-13-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bc482d3
    • zsmalloc: page migration support · 48b4800a
      Minchan Kim authored
      This patch introduces run-time migration feature for zspage.
      
      For migration, the VM uses the page.lru field, so it would be better not
      to use the page.next field, which is unified with page.lru, for our own
      purpose.  For that, firstly, we can get the first object offset of a
      page via runtime calculation instead of using page.index, so we can use
      page.index as the link for page chaining instead of page.next.
      
      In case of a huge object, it stores the handle in page.index instead of
      the next link of the page chain, because a huge object doesn't need a
      next link for page chaining.  So get_next_page needs to identify a huge
      object in order to return NULL.  For that, this patch uses the
      PG_owner_priv_1 page flag.
      
      For migration, it supports three functions
      
      * zs_page_isolate
      
      It isolates a zspage which includes the subpage the VM wants to migrate
      from its class, so that nobody can allocate new objects from the zspage.
      
      We could try to isolate a zspage once per subpage, so a subsequent
      isolation trial of another subpage of the zspage shouldn't fail.  For
      that, we introduce a zspage.isolated count.  With that, zs_page_isolate
      can know whether the zspage is already isolated for migration; if it is,
      a subsequent isolation trial can succeed without trying further
      isolation.
      
      * zs_page_migrate
      
      First of all, it holds the write-side zspage->lock to prevent migration
      of other subpages in the zspage.  Then, it locks all objects in the page
      the VM wants to migrate.  The reason we should lock all objects in the
      page is the race between zs_map_object and zs_page_migrate:
      
        zs_map_object				zs_page_migrate
      
        pin_tag(handle)
        obj = handle_to_obj(handle)
        obj_to_location(obj, &page, &obj_idx);
      
      					write_lock(&zspage->lock)
      					if (!trypin_tag(handle))
      						goto unpin_object
      
        zspage = get_zspage(page);
        read_lock(&zspage->lock);
      
      If zs_page_migrate didn't do trypin_tag, zs_map_object's page could
      become stale due to migration, and it would crash.
      
      If it locks all of the objects successfully, it copies the content from
      the old page to the new one and, finally, creates a new zspage chain
      with the new page.  And if it's the last isolated subpage in the zspage,
      it puts the zspage back to its class.
      
      * zs_page_putback
      
      It returns an isolated zspage to the right fullness_group list if it
      fails to migrate a page.  If it finds a zspage is ZS_EMPTY, it queues
      the zspage freeing to a workqueue.  See below about async zspage
      freeing.
      
      This patch introduces asynchronous zspage freeing.  The reason we need
      it is that we need the page_lock to clear PG_movable but, unfortunately,
      the zs_free path should be atomic, so the approach is to try to grab the
      page_lock.  If it gets the page_lock of all of the pages successfully,
      it can free the zspage immediately.  Otherwise, it queues a free request
      and frees the zspage via a workqueue in process context.
      
      If zs_free finds the zspage is isolated when it tries to free it, it
      delays the freeing until zs_page_putback finds it, which will finally
      free the zspage.
      
      In this patch, we expand the fullness_list from ZS_EMPTY to ZS_FULL.
      First of all, the ZS_EMPTY list is used for delayed freeing.  And with
      the addition of the ZS_FULL list, it becomes possible to identify
      whether a zspage is isolated or not via a list_empty(&zspage->list)
      test.
      
      [minchan@kernel.org: zsmalloc: keep first object offset in struct page]
        Link: http://lkml.kernel.org/r/1465788015-23195-1-git-send-email-minchan@kernel.org
      [minchan@kernel.org: zsmalloc: zspage sanity check]
        Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox
      Link: http://lkml.kernel.org/r/1464736881-24886-12-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48b4800a
    • zsmalloc: use freeobj for index · bfd093f5
      Minchan Kim authored
      Zsmalloc stores the first free object's <PFN, obj_idx> position into
      freeobj in each zspage.  If we change it to an index from the first_page
      instead of a position, it makes page migration simpler because we don't
      need to correct the other linked-list entries if a page is migrated out.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-11-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfd093f5
    • zsmalloc: separate free_zspage from putback_zspage · 4aa409ca
      Minchan Kim authored
      Currently, putback_zspage frees the zspage under class->lock if the
      fullness becomes ZS_EMPTY, but that makes it hard to implement the
      locking scheme for the new zspage migration.  So this patch separates
      free_zspage from putback_zspage and frees the zspage outside
      class->lock, which is preparation for zspage migration.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-10-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4aa409ca
    • zsmalloc: introduce zspage structure · 3783689a
      Minchan Kim authored
      We have squeezed the metadata of the zspage into the first page's
      descriptor.  So, to get metadata from a subpage, we have to get the
      first page first of all.  But that makes it hard to implement the page
      migration feature of zsmalloc, because any place that gets the first
      page from a subpage can race with first page migration.  IOW, the first
      page it got could be stale.  To prevent that, I tried several approaches
      but they made the code complicated, so finally I concluded to separate
      the metadata from the first page.  Of course, it consumes more memory -
      IOW, 16 bytes per zspage on 32bit at the moment.  It means we lose 1% in
      the *worst case* (40B/4096B), which I think is not bad for the
      maintenance benefit.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3783689a
    • zsmalloc: factor page chain functionality out · bdb0af7c
      Minchan Kim authored
      For page migration, we need to create page chain of zspage dynamically
      so this patch factors it out from alloc_zspage.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-8-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdb0af7c
    • zsmalloc: use accessor · 4f42047b
      Minchan Kim authored
      An upcoming patch will change how zspage metadata is encoded, so for
      easier review this patch wraps metadata access in accessor functions.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-7-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f42047b
    • zsmalloc: use bit_spin_lock · 1b8320b6
      Minchan Kim authored
      Use the kernel's standard bit spin-lock instead of a custom mess.  The
      custom lock even has a bug: it doesn't disable preemption.  The reason
      we haven't had any problem is that we have only used it inside a
      preemption-disabled section, under the class->lock spinlock.  So there
      is no need to send this to stable.
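
      For illustration, the pin helpers then reduce to the generic primitives
      (sketch; HANDLE_PIN_BIT is the zsmalloc-internal bit number):

        static void pin_tag(unsigned long handle)
        {
                bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
        }

        static int trypin_tag(unsigned long handle)
        {
                return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
        }

        static void unpin_tag(unsigned long handle)
        {
                bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
        }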
      
      Link: http://lkml.kernel.org/r/1464736881-24886-6-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b8320b6
    • zsmalloc: keep max_object in size_class · 1fc6e27d
      Minchan Kim authored
      Every zspage in a size_class has the same maximum number of objects, so
      we can move that value into the size_class.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fc6e27d
    • mm: balloon: use general non-lru movable page feature · b1123ea6
      Minchan Kim authored
      Now, VM has a feature to migrate non-lru movable pages so balloon
      doesn't need custom migration hooks in migrate.c and compaction.c.
      
      Instead, this patch implements the page->mapping->a_ops->
      {isolate|migrate|putback} functions.
      
      With that, we could remove hooks for ballooning in general migration
      functions and make balloon compaction simple.
      
      [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
      Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1123ea6
    • mm: migrate: support non-lru movable page migration · bda807d4
      Minchan Kim authored
      We have allowed migration for only LRU pages until now, and it was
      enough to make high-order pages.  But recently, embedded systems (e.g.,
      webOS, android) use lots of non-movable pages (e.g., zram, GPU memory),
      so we have seen several reports about troubles with small high-order
      allocations.  To fix the problem, there have been several efforts (e.g.,
      enhancing the compaction algorithm, SLUB fallback to 0-order pages,
      reserved memory, vmalloc and so on), but if there are lots of
      non-movable pages in the system, these solutions are void in the long
      run.
      
      So, this patch adds a facility to turn non-movable pages into movable
      ones.  For the feature, this patch introduces migration-related
      functions in address_space_operations as well as some page flags.
      
      If a driver wants to make its own pages movable, it should define three
      functions, which are function pointers in struct
      address_space_operations.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      What the VM expects of the driver's isolate_page function is to return
      *true* if the driver isolates the page successfully.  On returning true,
      the VM marks the page as PG_isolated so that concurrent isolation on
      several CPUs skips the page.  If a driver cannot isolate the page, it
      should return *false*.
      
      Once a page is successfully isolated, the VM uses the page.lru fields,
      so the driver shouldn't expect the values in those fields to be
      preserved.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, the VM calls the driver's migratepage with the isolated
      page.  The job of migratepage is to move the content of the old page to
      the new page and set up the fields of struct page newpage.  Keep in mind
      that you should indicate to the VM that the oldpage is no longer movable
      via __ClearPageMovable() under page_lock if you migrated the oldpage
      successfully and return 0.  If the driver cannot migrate the page at the
      moment, it can return -EAGAIN.  On -EAGAIN, the VM will retry page
      migration a short time later because it interprets -EAGAIN as "temporary
      migration failure".  On returning any error except -EAGAIN, the VM will
      give up the page migration without retrying.
      
      The driver shouldn't touch the page.lru field, which the VM uses, in
      these functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on an isolated page, the VM should return the
      isolated page to the driver, so the VM calls the driver's putback_page
      with the page whose migration failed.  In this function, the driver
      should put the isolated page back into its own data structure.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable page.
      
      * PG_movable
      
      Driver should use the below function to make page movable under
      page_lock.
      
      	void __SetPageMovable(struct page *page, struct address_space *mapping)
      
      It needs the address_space argument for registering the migration family
      of functions which will be called by the VM.  Strictly speaking,
      PG_movable is not a real flag of struct page.  Rather, the VM reuses
      page->mapping's lower bits to represent it:
      
      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
      
      so the driver shouldn't access page->mapping directly.  Instead, the
      driver should use page_mapping, which masks off the low two bits of
      page->mapping so it can get the right struct address_space.
      
      For testing for a non-lru movable page, the VM supports the
      __PageMovable function.  However, it doesn't guarantee identification of
      a non-lru movable page because the page->mapping field is unified with
      other variables in struct page.  As well, if the driver releases the
      page after isolation by the VM, page->mapping doesn't have a stable
      value although it has PAGE_MAPPING_MOVABLE (look at __ClearPageMovable).
      But __PageMovable is a cheap way to tell whether a page is LRU or
      non-lru movable once the page has been isolated, because LRU pages can
      never have PAGE_MAPPING_MOVABLE in page->mapping.  It is also good for
      just peeking at non-lru movable pages before the more expensive check
      with lock_page during pfn scanning to select a victim.
      
      To guarantee a non-lru movable page, the VM provides the PageMovable
      function.  Unlike __PageMovable, PageMovable validates page->mapping and
      mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
      page->mapping from being suddenly destroyed.
      
      A driver using __SetPageMovable should clear the flag via
      __ClearPageMovable under page_lock before releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, the VM marks an
      isolated page as PG_isolated under lock_page.  So if a CPU encounters a
      PG_isolated non-lru movable page, it can skip it.  The driver doesn't
      need to manipulate the flag because the VM will set/clear it
      automatically.  Keep in mind that if the driver sees a PG_isolated page,
      it means the page has been isolated by the VM, so it shouldn't touch the
      page.lru field.  PG_isolated is an alias of the PG_reclaim flag, so the
      driver shouldn't use that flag for its own purpose.
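
      Pulling the pieces above together, a hypothetical driver would wire things
      up roughly as follows (all mydrv_* names are illustrative, not from the
      patch):

        static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
        {
                /* detach the page from driver-internal lists under a driver lock */
                return true;            /* VM then marks the page PG_isolated */
        }

        static int mydrv_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *page,
                                     enum migrate_mode mode)
        {
                /* copy contents, repoint driver metadata at newpage */
                __ClearPageMovable(page);       /* old page is no longer movable */
                return MIGRATEPAGE_SUCCESS;     /* or -EAGAIN on transient failure */
        }

        static void mydrv_putback_page(struct page *page)
        {
                /* migration failed: reinsert the page into driver structures */
        }

        static const struct address_space_operations mydrv_aops = {
                .isolate_page = mydrv_isolate_page,
                .migratepage  = mydrv_migratepage,
                .putback_page = mydrv_putback_page,
        };

        /* when a page becomes movable, under lock_page(): */
        /*      __SetPageMovable(page, mapping);  with mapping->a_ops = &mydrv_aops */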
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bda807d4
    • mm: use put_page() to free page instead of putback_lru_page() · c6c919eb
      Minchan Kim authored
      Recently, I got many reports about performance degradation in embedded
      systems (Android mobile phones, webOS TVs and so on) and easy fork
      failures.
      
      The problem was fragmentation caused mainly by zram and GPU drivers.
      Under memory pressure, their pages were spread out over all pageblocks
      and could not be migrated by the current compaction algorithm, which
      supports only LRU pages.  In the end, compaction cannot work well, so
      the reclaimer shrinks all of the working set pages.  It made the system
      very slow and even made fork, which requires order-[2 or 3] allocations,
      fail easily.
      
      Another pain point is that they cannot use CMA memory space, so when an
      OOM kill happens, I can see many free pages in the CMA area, which is
      not memory efficient.  In our product, which has a big CMA area, zones
      are reclaimed too excessively to allocate GPU and zram pages although
      there is lots of free space in CMA, so the system easily becomes very
      slow.
      
      To solve these problems, this patch adds a facility to migrate non-lru
      pages by introducing new functions and page flags to help migration.
      
      struct address_space_operations {
      	..
      	..
      	bool (*isolate_page)(struct page *, isolate_mode_t);
      	void (*putback_page)(struct page *);
      	..
      }
      
      new page flags
      
      	PG_movable
      	PG_isolated
      
      For details, please read description in "mm: migrate: support non-lru
      movable page migration".
      
      Originally, Gioh Kim had tried to support this feature but he moved on,
      so I took over the work.  I took much code from his work and changed it
      a little, and Konstantin Khlebnikov helped Gioh a lot, so he deserves
      much credit, too.
      
      And I should mention Chulmin, who has tested this patchset heavily so
      that I could find many bugs thanks to him.  :)
      
      Thanks, Gioh, Konstantin and Chulmin!
      
      This patchset consists of five parts.
      
      1. clean up migration
        mm: use put_page to free page instead of putback_lru_page
      
      2. add non-lru page migration feature
        mm: migrate: support non-lru movable page migration
      
      3. rework KVM memory-ballooning
        mm: balloon: use general non-lru movable page feature
      
      4. zsmalloc refactoring for preparing page migration
        zsmalloc: keep max_object in size_class
        zsmalloc: use bit_spin_lock
        zsmalloc: use accessor
        zsmalloc: factor page chain functionality out
        zsmalloc: introduce zspage structure
        zsmalloc: separate free_zspage from putback_zspage
        zsmalloc: use freeobj for index
      
      5. zsmalloc page migration
        zsmalloc: page migration support
        zram: use __GFP_MOVABLE for memory allocation
      
      This patch (of 12):
      
      Procedure of page migration is as follows:
      
      First of all, it should isolate a page from LRU and try to migrate the
      page.  If it is successful, it releases the page for freeing.
      Otherwise, it should put the page back to LRU list.
      
      For LRU pages, we have used putback_lru_page for both freeing and
      putting back to the LRU list.  That's okay because put_page is aware of
      the LRU list, so if it releases the last refcount of the page, it
      removes the page from the LRU list.  However, it performs unnecessary
      operations (e.g., lru_cache_add, pagevec and flags operations - not
      significant, but not worth doing) and makes it harder to support new
      non-lru page migration because put_page isn't aware of a non-lru page's
      data structure.
      
      To solve the problem, we could add a new hook in put_page with a
      PageMovable flag check, but that would increase overhead in a hot path
      and would need a new locking scheme to stabilize the flag check against
      put_page.
      
      So, this patch cleans it up by dividing the two semantics (i.e., put and
      putback).  If migration is successful, use put_page instead of
      putback_lru_page, and use putback_lru_page only on failure.  That makes
      the code more readable and doesn't add overhead in put_page.
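
      As a sketch, the resulting convention at the end of the migration path looks
      roughly like this (simplified from unmap_and_move()):

        if (rc == MIGRATEPAGE_SUCCESS) {
                put_page(page);         /* migration done: just drop the isolation ref */
        } else if (rc != -EAGAIN) {
                /* failed for good: return the page to the LRU for reuse */
                putback_lru_page(page);
        }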
      
      Comment from Vlastimil
       "Yeah, and compaction (perhaps also other migration users) has to drain
        the lru pvec...  Getting rid of this stuff is worth even by itself."
      
      Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6c919eb
    • zram: drop gfp_t from zcomp_strm_alloc() · 16d37725
      Sergey Senozhatsky authored
      We now allocate streams from the CPU_UP hot-plug path; there are no
      context-dependent stream allocations anymore, and we can schedule from
      zcomp_strm_alloc().  Use GFP_KERNEL directly and drop the gfp_t
      parameter.
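      
      A minimal sketch of what the simplified allocator could look like,
      assuming a zcomp_strm with a scratch `buffer' field (the exact fields,
      buffer size and error handling in the driver may differ):
      
      	/* needs <linux/slab.h> and <linux/gfp.h> */
      	static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp)
      	{
      		/* always called from sleepable CPU hot-plug context now */
      		struct zcomp_strm *zstrm = kzalloc(sizeof(*zstrm), GFP_KERNEL);
      
      		if (!zstrm)
      			return NULL;
      		/* scratch pages for worst-case (incompressible) output */
      		zstrm->buffer = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1);
      		if (!zstrm->buffer) {
      			kfree(zstrm);
      			return NULL;
      		}
      		return zstrm;
      	}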
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-9-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16d37725
    • Sergey Senozhatsky's avatar
      zram: add more compression algorithms · eb9f56d8
      Sergey Senozhatsky authored
      Add "deflate", "lz4hc", "842" algorithms to the list of known
      compression backends.  The real availability of those algorithms,
      however, depends on the corresponding CONFIG_CRYPTO_FOO config options.
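      
      For illustration, the resulting list of known backends could look
      roughly like the sketch below (a plain array of names; whether each
      entry actually works still depends on the corresponding CONFIG_CRYPTO_FOO
      option):
      
      	static const char * const backends[] = {
      		"lzo",
      		"lz4",
      		"lz4hc",
      		"842",
      		"deflate",
      		NULL		/* sentinel for iteration */
      	};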
      
      [sergey.senozhatsky@gmail.com: zram-add-more-compression-algorithms-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-7-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-8-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb9f56d8
    • Sergey Senozhatsky's avatar
      zram: delete custom lzo/lz4 · ce1ed9f9
      Sergey Senozhatsky authored
      Remove lzo/lz4 backends, we use crypto API now.
      
      [sergey.senozhatsky@gmail.com: zram-delete-custom-lzo-lz4-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-6-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-7-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce1ed9f9
    • Sergey Senozhatsky's avatar
      zram: cosmetic: cleanup documentation · 69a30a8d
      Sergey Senozhatsky authored
      zram documentation is a mix of different styles: spaces, tabs, tabs +
      spaces, etc.  Clean it up.
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-6-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69a30a8d
    • Sergey Senozhatsky's avatar
      zram: use crypto api to check alg availability · 415403be
      Sergey Senozhatsky authored
      There is no way to get a string with all the crypto comp algorithms
      supported by the crypto comp engine, so we need to maintain our own
      backends list.  At the same time we also need to use crypto_has_comp()
      to make sure that the user has requested a compression algorithm that is
      recognized by the crypto comp engine.  Relying on /proc/crypto is not an
      option here, because it does not show not-yet-inserted compression
      modules.
      
      Example:
      
       modprobe zram
       cat /proc/crypto | grep -i lz4
       modprobe lz4
       cat /proc/crypto | grep -i lz4
      name         : lz4
      driver       : lz4-generic
      module       : lz4
      
      So the user can't tell from the /proc/crypto output whether lz4 is
      really supported, unless someone or something has already loaded the
      module.
      
      This patch also adds crypto_has_comp() to zcomp_available_show().  We
      store all of the compression algorithm names in zcomp's `backends'
      array, regardless of the CONFIG_CRYPTO_FOO configuration, but show only
      those that are also supported by the crypto engine.  This helps the user
      know the exact list of compression algorithms that can be used.
      
      Example:
        module lz4 is not loaded yet, but is supported by the crypto
        engine. /proc/crypto has no information on this module, while
        zram's `comp_algorithm' lists it:
      
       cat /proc/crypto | grep -i lz4
      
       cat /sys/block/zram0/comp_algorithm
      [lzo] lz4 deflate lz4hc 842
      
      We still use the `backends' array to determine whether the requested
      compression backend is known to the crypto API.  This array, however,
      may be missing some entries, so as a last step we call crypto_has_comp(),
      which attempts to insmod the requested compression algorithm to
      determine whether the crypto API supports it.  The advantage of this
      method is that we now permit the use of out-of-tree crypto compression
      modules (implementing S/W or H/W compression).
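      
      A minimal sketch of that availability check, assuming the `backends'
      array mentioned above (the helper name and the sysfs printing details
      are illustrative, not the exact driver code):
      
      	/* needs <linux/crypto.h> */
      	static bool zcomp_algo_available(const char *name)
      	{
      		/*
      		 * crypto_has_comp() may trigger a module load, so
      		 * not-yet-inserted (or out-of-tree) algorithms are
      		 * recognized as well.
      		 */
      		return crypto_has_comp(name, 0, 0) != 0;
      	}
      
      	/* in the sysfs show path: print only supported names */
      	for (i = 0; backends[i]; i++) {
      		if (!zcomp_algo_available(backends[i]))
      			continue;
      		sz += scnprintf(buf + sz, PAGE_SIZE - sz, "%s%s%s ",
      				!strcmp(comp, backends[i]) ? "[" : "",
      				backends[i],
      				!strcmp(comp, backends[i]) ? "]" : "");
      	}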
      
      [sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      415403be
    • Sergey Senozhatsky's avatar
      zram: switch to crypto compress API · ebaf9ab5
      Sergey Senozhatsky authored
      We don't have an idle zstreams list anymore and our write path now works
      absolutely differently, preventing preemption during compression.  This
      removes the possibility of read paths preempting writes at the wrong
      places (which could badly affect the performance of both paths) and at
      the same time opens the door for a move from the custom LZO/LZ4
      compression backend implementation to a more generic one, using the
      crypto compress API.
      
      Joonsoo Kim [1] attempted to do this a while ago, but was faced with the
      need to introduce a new crypto API interface.  The root cause was the
      fact that crypto API compression algorithms require a compression stream
      structure (in zram terminology) for both compression and decompression
      ops, while in reality only some compression algorithms really need it.
      This resulted in the concept of context-less crypto API compression
      backends [2].  Both the write and read paths, though, would have been
      executed with preemption enabled, which in the worst case could have
      degraded performance; e.g. consider the following case:
      
      	CPU0
      
      	zram_write()
      	  spin_lock()
      	    take the last idle stream
      	  spin_unlock()
      
      	<< preempted >>
      
      		zram_read()
      		  spin_lock()
      		   no idle streams
      			  spin_unlock()
      			  schedule()
      
      	resuming zram_write compression()
      
      but it took me some time to realize that, and it took even longer to
      evolve zram and make it ready for the crypto API.  The key turned out to
      be to drop the idle streams list entirely.  Without the idle streams
      list we are free to use compression algorithms that require a
      compression stream for decompression (read), because streams are now
      placed in per-cpu data and each write path has to disable preemption for
      the compression op, almost completely eliminating the aforementioned
      case (technically, there is still a small chance, because the write path
      has a fast path and a slow path, and the slow path is executed with
      preemption enabled; but the frequency of a failed fast path is very
      low).
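      
      A minimal sketch of the per-cpu stream scheme described above (the
      struct layout is an assumption for illustration; get_cpu_ptr() disables
      preemption until the matching put_cpu_ptr()):
      
      	/* needs <linux/percpu.h> */
      	struct zcomp {
      		struct zcomp_strm * __percpu *stream;
      		const char *name;
      	};
      
      	/* returns this CPU's stream with preemption disabled */
      	struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
      	{
      		return *get_cpu_ptr(comp->stream);
      	}
      
      	/* re-enables preemption after the compression op */
      	void zcomp_stream_put(struct zcomp *comp)
      	{
      		put_cpu_ptr(comp->stream);
      	}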
      
      TEST
      ====
      
      - 4 CPUs, x86_64 system
      - 3G zram, lzo
      - fio tests: read, randread, write, randwrite, rw, randrw
      
      test script [3] command:
       ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh
      
                         BASE           PATCHED
      jobs1
      READ:           2527.2MB/s	 2482.7MB/s
      READ:           2102.7MB/s	 2045.0MB/s
      WRITE:          1284.3MB/s	 1324.3MB/s
      WRITE:          1080.7MB/s	 1101.9MB/s
      READ:           430125KB/s	 437498KB/s
      WRITE:          430538KB/s	 437919KB/s
      READ:           399593KB/s	 403987KB/s
      WRITE:          399910KB/s	 404308KB/s
      jobs2
      READ:           8133.5MB/s	 7854.8MB/s
      READ:           7086.6MB/s	 6912.8MB/s
      WRITE:          3177.2MB/s	 3298.3MB/s
      WRITE:          2810.2MB/s	 2871.4MB/s
      READ:           1017.6MB/s	 1023.4MB/s
      WRITE:          1018.2MB/s	 1023.1MB/s
      READ:           977836KB/s	 984205KB/s
      WRITE:          979435KB/s	 985814KB/s
      jobs3
      READ:           13557MB/s	 13391MB/s
      READ:           11876MB/s	 11752MB/s
      WRITE:          4641.5MB/s	 4682.1MB/s
      WRITE:          4164.9MB/s	 4179.3MB/s
      READ:           1453.8MB/s	 1455.1MB/s
      WRITE:          1455.1MB/s	 1458.2MB/s
      READ:           1387.7MB/s	 1395.7MB/s
      WRITE:          1386.1MB/s	 1394.9MB/s
      jobs4
      READ:           20271MB/s	 20078MB/s
      READ:           18033MB/s	 17928MB/s
      WRITE:          6176.8MB/s	 6180.5MB/s
      WRITE:          5686.3MB/s	 5705.3MB/s
      READ:           2009.4MB/s	 2006.7MB/s
      WRITE:          2007.5MB/s	 2004.9MB/s
      READ:           1929.7MB/s	 1935.6MB/s
      WRITE:          1926.8MB/s	 1932.6MB/s
      jobs5
      READ:           18823MB/s	 19024MB/s
      READ:           18968MB/s	 19071MB/s
      WRITE:          6191.6MB/s	 6372.1MB/s
      WRITE:          5818.7MB/s	 5787.1MB/s
      READ:           2011.7MB/s	 1981.3MB/s
      WRITE:          2011.4MB/s	 1980.1MB/s
      READ:           1949.3MB/s	 1935.7MB/s
      WRITE:          1940.4MB/s	 1926.1MB/s
      jobs6
      READ:           21870MB/s	 21715MB/s
      READ:           19957MB/s	 19879MB/s
      WRITE:          6528.4MB/s	 6537.6MB/s
      WRITE:          6098.9MB/s	 6073.6MB/s
      READ:           2048.6MB/s	 2049.9MB/s
      WRITE:          2041.7MB/s	 2042.9MB/s
      READ:           2013.4MB/s	 1990.4MB/s
      WRITE:          2009.4MB/s	 1986.5MB/s
      jobs7
      READ:           21359MB/s	 21124MB/s
      READ:           19746MB/s	 19293MB/s
      WRITE:          6660.4MB/s	 6518.8MB/s
      WRITE:          6211.6MB/s	 6193.1MB/s
      READ:           2089.7MB/s	 2080.6MB/s
      WRITE:          2085.8MB/s	 2076.5MB/s
      READ:           2041.2MB/s	 2052.5MB/s
      WRITE:          2037.5MB/s	 2048.8MB/s
      jobs8
      READ:           20477MB/s	 19974MB/s
      READ:           18922MB/s	 18576MB/s
      WRITE:          6851.9MB/s	 6788.3MB/s
      WRITE:          6407.7MB/s	 6347.5MB/s
      READ:           2134.8MB/s	 2136.1MB/s
      WRITE:          2132.8MB/s	 2134.4MB/s
      READ:           2074.2MB/s	 2069.6MB/s
      WRITE:          2087.3MB/s	 2082.4MB/s
      jobs9
      READ:           19797MB/s	 19994MB/s
      READ:           18806MB/s	 18581MB/s
      WRITE:          6878.7MB/s	 6822.7MB/s
      WRITE:          6456.8MB/s	 6447.2MB/s
      READ:           2141.1MB/s	 2154.7MB/s
      WRITE:          2144.4MB/s	 2157.3MB/s
      READ:           2084.1MB/s	 2085.1MB/s
      WRITE:          2091.5MB/s	 2092.5MB/s
      jobs10
      READ:           19794MB/s	 19784MB/s
      READ:           18794MB/s	 18745MB/s
      WRITE:          6984.4MB/s	 6676.3MB/s
      WRITE:          6532.3MB/s	 6342.7MB/s
      READ:           2150.6MB/s	 2155.4MB/s
      WRITE:          2156.8MB/s	 2161.5MB/s
      READ:           2106.4MB/s	 2095.6MB/s
      WRITE:          2109.7MB/s	 2098.4MB/s
      
                                          BASE                       PATCHED
      jobs1                              perfstat
      stalled-cycles-frontend     102,480,595,419 (  41.53%)	  114,508,864,804 (  46.92%)
      stalled-cycles-backend       51,941,417,832 (  21.05%)	   46,836,112,388 (  19.19%)
      instructions                283,612,054,215 (    1.15)	  283,918,134,959 (    1.16)
      branches                     56,372,560,385 ( 724.923)	   56,449,814,753 ( 733.766)
      branch-misses                   374,826,000 (   0.66%)	      326,935,859 (   0.58%)
      jobs2                              perfstat
      stalled-cycles-frontend     155,142,745,777 (  40.99%)	  164,170,979,198 (  43.82%)
      stalled-cycles-backend       70,813,866,387 (  18.71%)	   66,456,858,165 (  17.74%)
      instructions                463,436,648,173 (    1.22)	  464,221,890,191 (    1.24)
      branches                     91,088,733,902 ( 760.088)	   91,278,144,546 ( 769.133)
      branch-misses                   504,460,363 (   0.55%)	      394,033,842 (   0.43%)
      jobs3                              perfstat
      stalled-cycles-frontend     201,300,397,212 (  39.84%)	  223,969,902,257 (  44.44%)
      stalled-cycles-backend       87,712,593,974 (  17.36%)	   81,618,888,712 (  16.19%)
      instructions                642,869,545,023 (    1.27)	  644,677,354,132 (    1.28)
      branches                    125,724,560,594 ( 690.682)	  126,133,159,521 ( 694.542)
      branch-misses                   527,941,798 (   0.42%)	      444,782,220 (   0.35%)
      jobs4                              perfstat
      stalled-cycles-frontend     246,701,197,429 (  38.12%)	  280,076,030,886 (  43.29%)
      stalled-cycles-backend      119,050,341,112 (  18.40%)	  110,955,641,671 (  17.15%)
      instructions                822,716,962,127 (    1.27)	  825,536,969,320 (    1.28)
      branches                    160,590,028,545 ( 688.614)	  161,152,996,915 ( 691.068)
      branch-misses                   650,295,287 (   0.40%)	      550,229,113 (   0.34%)
      jobs5                              perfstat
      stalled-cycles-frontend     298,958,462,516 (  38.30%)	  344,852,200,358 (  44.16%)
      stalled-cycles-backend      137,558,742,122 (  17.62%)	  129,465,067,102 (  16.58%)
      instructions              1,005,714,688,752 (    1.29)	1,007,657,999,432 (    1.29)
      branches                    195,988,773,962 ( 697.730)	  196,446,873,984 ( 700.319)
      branch-misses                   695,818,940 (   0.36%)	      624,823,263 (   0.32%)
      jobs6                              perfstat
      stalled-cycles-frontend     334,497,602,856 (  36.71%)	  387,590,419,779 (  42.38%)
      stalled-cycles-backend      163,539,365,335 (  17.95%)	  152,640,193,639 (  16.69%)
      instructions              1,184,738,177,851 (    1.30)	1,187,396,281,677 (    1.30)
      branches                    230,592,915,640 ( 702.902)	  231,253,802,882 ( 702.356)
      branch-misses                   747,934,786 (   0.32%)	      643,902,424 (   0.28%)
      jobs7                              perfstat
      stalled-cycles-frontend     396,724,684,187 (  37.71%)	  460,705,858,952 (  43.84%)
      stalled-cycles-backend      188,096,616,496 (  17.88%)	  175,785,787,036 (  16.73%)
      instructions              1,364,041,136,608 (    1.30)	1,366,689,075,112 (    1.30)
      branches                    265,253,096,936 ( 700.078)	  265,890,524,883 ( 702.839)
      branch-misses                   784,991,589 (   0.30%)	      729,196,689 (   0.27%)
      jobs8                              perfstat
      stalled-cycles-frontend     440,248,299,870 (  36.92%)	  509,554,793,816 (  42.46%)
      stalled-cycles-backend      222,575,930,616 (  18.67%)	  213,401,248,432 (  17.78%)
      instructions              1,542,262,045,114 (    1.29)	1,545,233,932,257 (    1.29)
      branches                    299,775,178,439 ( 697.666)	  300,528,458,505 ( 694.769)
      branch-misses                   847,496,084 (   0.28%)	      748,794,308 (   0.25%)
      jobs9                              perfstat
      stalled-cycles-frontend     506,269,882,480 (  37.86%)	  592,798,032,820 (  44.43%)
      stalled-cycles-backend      253,192,498,861 (  18.93%)	  233,727,666,185 (  17.52%)
      instructions              1,721,985,080,913 (    1.29)	1,724,666,236,005 (    1.29)
      branches                    334,517,360,255 ( 694.134)	  335,199,758,164 ( 697.131)
      branch-misses                   873,496,730 (   0.26%)	      815,379,236 (   0.24%)
      jobs10                             perfstat
      stalled-cycles-frontend     549,063,363,749 (  37.18%)	  651,302,376,662 (  43.61%)
      stalled-cycles-backend      281,680,986,810 (  19.07%)	  277,005,235,582 (  18.55%)
      instructions              1,901,859,271,180 (    1.29)	1,906,311,064,230 (    1.28)
      branches                    369,398,536,153 ( 694.004)	  370,527,696,358 ( 688.409)
      branch-misses                   967,929,335 (   0.26%)	      890,125,056 (   0.24%)
      
                                  BASE           PATCHED
      seconds elapsed        79.421641008	78.735285546
      seconds elapsed        61.471246133	60.869085949
      seconds elapsed        62.317058173	62.224188495
      seconds elapsed        60.030739363	60.081102518
      seconds elapsed        74.070398362	74.317582865
      seconds elapsed        84.985953007	85.414364176
      seconds elapsed        97.724553255	98.173311344
      seconds elapsed        109.488066758	110.268399318
      seconds elapsed        122.768189405	122.967164498
      seconds elapsed        135.130035105	136.934770801
      
      On my other system (8 x86_64 CPUs, short version of test results):
      
                                  BASE           PATCHED
      seconds elapsed        19.518065994	19.806320662
      seconds elapsed        15.172772749	15.594718291
      seconds elapsed        13.820925970	13.821708564
      seconds elapsed        13.293097816	14.585206405
      seconds elapsed        16.207284118	16.064431606
      seconds elapsed        17.958376158	17.771825767
      seconds elapsed        19.478009164	19.602961508
      seconds elapsed        21.347152811	21.352318709
      seconds elapsed        24.478121126	24.171088735
      seconds elapsed        26.865057442	26.767327618
      
      So performance-wise the numbers are quite similar.
      
      Also update the zcomp interface to be more closely aligned with the
      crypto API.
      
      [1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2
      [2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2
      [3] https://github.com/sergey-senozhatsky/zram-perf-test
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-3-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebaf9ab5
    • Sergey Senozhatsky's avatar
      zram: rename zstrm find-release functions · 2aea8493
      Sergey Senozhatsky authored
      This started as 'add zlib support' work, but after some thinking I saw
      no blockers for a bigger change -- a switch to the crypto API.
      
      We don't have an idle zstreams list anymore and our write path now works
      absolutely differently, preventing preemption during compression.  This
      removes the possibility of read paths preempting writes at the wrong
      places and opens the door for a move from the custom LZO/LZ4 compression
      backend implementation to a more generic one, using the crypto compress
      API.
      
      This patch set also eliminates the need for a new context-less crypto
      API interface, which was quite hard to sell, so we can move along
      faster.
      
      benchmarks:
      
      (x86_64, 4GB, zram-perf script)
      
      The numbers below are the perf-reported fio run times (max jobs=3).  I
      performed the fio test with an increasing number of parallel jobs (up to
      3) on a 3G zram device, using `static' data and the following crypto
      comp algorithms:
      
      	842, deflate, lz4, lz4hc, lzo
      
      The output was:
      
       - test running time (which tells us which algorithms perform faster)
      
      and
      
       - zram mm_stat (which reports the compressed memory size, max used memory, etc.).
      
      It's just for information.  For example, LZ4HC has twice the running
      time of LZO, but the compressed memory size is 23592960 vs 34603008
      bytes.
      
        test-fio-zram-842
           197.907655282 seconds time elapsed
           201.623142884 seconds time elapsed
           226.854291345 seconds time elapsed
        test-fio-zram-DEFLATE
           253.259516155 seconds time elapsed
           258.148563401 seconds time elapsed
           290.251909365 seconds time elapsed
        test-fio-zram-LZ4
            27.022598717 seconds time elapsed
            29.580522717 seconds time elapsed
            33.293463430 seconds time elapsed
        test-fio-zram-LZ4HC
            56.393954615 seconds time elapsed
            74.904659747 seconds time elapsed
           101.940998564 seconds time elapsed
        test-fio-zram-LZO
            28.155948075 seconds time elapsed
            30.390036330 seconds time elapsed
            34.455773159 seconds time elapsed
      
      zram mm_stat-s (max fio jobs=3)
      
        test-fio-zram-842
        mm_stat (jobs1): 3221225472 673185792 690266112        0 690266112        0        0
        mm_stat (jobs2): 3221225472 673185792 690266112        0 690266112        0        0
        mm_stat (jobs3): 3221225472 673185792 690266112        0 690266112        0        0
        test-fio-zram-DEFLATE
        mm_stat (jobs1): 3221225472  24379392  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  24379392  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  24379392  37761024        0  37761024        0        0
        test-fio-zram-LZ4
        mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
        test-fio-zram-LZ4HC
        mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
        test-fio-zram-LZO
        mm_stat (jobs1): 3221225472  34603008  50335744        0  50335744        0        0
        mm_stat (jobs2): 3221225472  34603008  50335744        0  50335744        0        0
        mm_stat (jobs3): 3221225472  34603008  50335744        0  50339840        0        0
      
      This patch (of 8):
      
      We don't perform any zstream idle-list lookup anymore, so the
      zcomp_strm_find()/zcomp_strm_release() names are no longer
      representative.
      
      Rename them to zcomp_stream_get()/zcomp_stream_put().
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2aea8493