1. 17 Mar, 2016 40 commits
    • mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range · 458aa76d
      Aneesh Kumar K.V authored
      We remove one instance of flush_tlb_range here.  That was added by commit
      f714f4f2 ("mm: numa: call MMU notifiers on THP migration").  But the
      pmdp_huge_clear_flush_notify should have done the required flush for us.
      Hence remove the extra flush.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, tracing: refresh __def_vmaflag_names · bcf66917
      Kirill A. Shutemov authored
      Bring the list of VMA flags up to date and sort it to match the VM_*
      definition order.
      
      [vbabka@suse.cz: add a note above vmaflag definitions to update the names when changing]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: deduplicate memory overcommitment code · 39a1aa8e
      Andrey Ryabinin authored
      Currently we have two copies of the same code which implements memory
      overcommitment logic.  Let's move it into mm/util.c and hence avoid
      duplication.  No functional changes here.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move max_map_count bits into mm.h · ea606cf5
      Andrey Ryabinin authored
      The max_map_count sysctl is unrelated to the scheduler.  Move its bits from
      include/linux/sched/sysctl.h to include/linux/mm.h.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, vmstats: count deferred split events · f9719a03
      Kirill A. Shutemov authored
      Count how many times we put a THP on the split queue.  Currently, this
      happens on partial unmap of a THP.

      A rapidly growing value can indicate that an application is unfriendly
      to THP: it often faults in a huge page and then unmaps part of it.
      This leads to unnecessary memory fragmentation, and the application may
      require tuning.

      The event can also help with debugging kernel [mis-]behaviour.
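
      To watch this counter from user space, one can read it out of the
      vmstat text.  The sketch below is only an illustration: it assumes the
      event is exported as "thp_deferred_split_page" in the usual
      "name value" per-line format of /proc/vmstat, and the helper name
      vmstat_lookup() is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text formatted like /proc/vmstat
 * ("name value" per line).  Returns 0 and stores the value on
 * success, -1 if the counter is missing or malformed.
 * vmstat_lookup() is a hypothetical helper, not a kernel or libc API.
 */
static int vmstat_lookup(const char *text, const char *name,
                         unsigned long long *value)
{
    size_t name_len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Require the full name followed by a single space. */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ') {
            if (sscanf(line + name_len + 1, "%llu", value) == 1)
                return 0;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline */
    }
    return -1;
}
```

      Sampling the value twice and comparing the results gives the growth
      rate that the text above suggests watching.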
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: make shadow node shrinker memcg aware · 0a6b76dd
      Vladimir Davydov authored
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: size shadow nodes lru basing on file cache size · cdcbb72e
      Vladimir Davydov authored
      A page is activated on refault if the refault distance stored in the
      corresponding shadow entry is less than the number of active file pages.
      Since active file pages can't occupy more than half memory, we assume
      that the maximal effective refault distance can't be greater than half
      the number of present pages and size the shadow nodes lru list
      appropriately.  Generally speaking, this assumption is correct, but it
      can result in wasting a considerable chunk of memory on stale shadow
      nodes in case the portion of file pages is small, e.g.  if a workload
      mostly uses anonymous memory.
      
      To sort this out, we need to compute the size of the shadow nodes lru
      based not on the maximal possible size of the file cache, but on its
      current size.  We
      could take the size of active file lru for the maximal refault distance,
      but active lru is pretty unstable - it can shrink dramatically at
      runtime possibly disrupting workingset detection logic.
      
      Instead we assume that the maximal refault distance equals half the
      total number of file cache pages.  This will protect us against active
      file lru size fluctuations while still being correct, because the size
      of the active lru is normally kept below the size of the inactive lru.
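
      The sizing argument can be sketched in a few lines.  This is a
      simplified model rather than the code in mm/workingset.c: the function
      names are hypothetical, and the conversion from a page count to a
      number of shadow nodes is left out.

```c
#include <stdbool.h>

/*
 * A refaulted page is activated when the refault distance stored in
 * its shadow entry is below the current number of active file pages.
 */
static bool refault_activates(unsigned long refault_distance,
                              unsigned long nr_active_file)
{
    return refault_distance < nr_active_file;
}

/* Old bound: half of all present pages, however few are file pages. */
static unsigned long max_refault_distance_old(unsigned long present_pages)
{
    return present_pages / 2;
}

/* New bound: half of the file cache (active plus inactive file pages). */
static unsigned long max_refault_distance_new(unsigned long nr_active_file,
                                              unsigned long nr_inactive_file)
{
    return (nr_active_file + nr_inactive_file) / 2;
}
```

      With 1000000 present pages of which only 50000 are file pages, the old
      bound is 500000 while the new one is 25000, so far fewer stale shadow
      nodes need to be kept around.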
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix-tree: account radix_tree_node to memory cgroup · 58e698af
      Vladimir Davydov authored
      Allocation of radix_tree_node objects can be easily triggered from
      userspace, so we should account them to memory cgroup.  Besides, they
      need to be accounted in order to make the shadow node shrinker per
      memcg (see mm/workingset.c).
      
      A tricky thing about accounting radix_tree_node objects is that they are
      mostly allocated through radix_tree_preload(), so we can't just set
      SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
      lot of unrelated cgroups using objects from each other's caches.
      
      One way to overcome this would be making radix tree preloads per memcg,
      but that would probably look cumbersome and overcomplicated.
      
      Instead, we make radix_tree_node_alloc() first try to allocate from the
      cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not,
      and only if it fails fall back on using per cpu preloads.  This should
      make most allocations accounted.
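
      The allocation order described above can be modelled in user space as
      follows.  Everything here is a toy stand-in: struct toy_node,
      struct toy_preload, alloc_accounted() and the integer "budget" are
      hypothetical and only imitate an accounted allocation that may fail,
      followed by a fallback to a preloaded pool.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_node {
    bool accounted;             /* true if charged to the (toy) cgroup */
};

/* Toy stand-in for the per-CPU preload pool. */
struct toy_preload {
    struct toy_node *nodes[4];
    int nr;
};

/* Hypothetical accounted allocator: fails once the budget is used up. */
static struct toy_node *alloc_accounted(int *cgroup_budget)
{
    struct toy_node *node;

    if (*cgroup_budget <= 0)
        return NULL;
    node = malloc(sizeof(*node));
    if (!node)
        return NULL;
    node->accounted = true;
    (*cgroup_budget)--;
    return node;
}

/* Try the accounted path first; only on failure take a preloaded node. */
static struct toy_node *toy_node_alloc(int *cgroup_budget,
                                       struct toy_preload *pre)
{
    struct toy_node *node = alloc_accounted(cgroup_budget);

    if (node)
        return node;
    if (pre->nr > 0)
        return pre->nodes[--pre->nr];
    return NULL;
}
```

      In this model the preload pool is left untouched for as long as the
      accounted path succeeds, which is why most allocations end up
      accounted.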
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: zap memcg_kmem_online helper · b6ecd2de
      Vladimir Davydov authored
      As kmem accounting is now either enabled for all cgroups or disabled
      system-wide, there's no point in having memcg_kmem_online() helper -
      instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
      shrink_slab() now does.
      
      There are only two places left where this helper is used -
      __memcg_kmem_charge() and memcg_create_kmem_cache().  The former can
      only be called if memcg_kmem_enabled() returned true.  Since the cgroup
      it operates on is online, mem_cgroup_is_root() check will be enough.
      
      memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
      of memcg_kmem_online(), because it relies on the fact that in
      memcg_offline_kmem() memcg->kmem_state is changed before
      memcg_deactivate_kmem_caches() is called, but there we can just
      open-code the check.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker · 0fc9f58a
      Vladimir Davydov authored
      It's simply more convenient to implement a memcg aware shrinker when
      you know that shrink_control->memcg != NULL unless memcg_kmem_enabled()
      returns false.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy · b313aeee
      Vladimir Davydov authored
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      
      The actual work is done in patch 6 of the series.  Patches 1 and 2
      prepare memcg/shrinker infrastructure for the change.  Patch 3 is just a
      collateral cleanup.  Patch 4 makes radix_tree_node accounted, which is
      necessary for making shadow node shrinker memcg aware.  Patch 5 reduces
      shadow nodes overhead in case workload mostly uses anonymous pages.
      
      This patch:
      
      Currently, in the legacy hierarchy kmem accounting is off for all
      cgroups by default and must be enabled explicitly by writing something
      to memory.kmem.limit_in_bytes.  Since we don't support reclaim on
      hitting kmem limit, nor do we have any plans to implement it, this is
      likely to be -1, just to enable kmem accounting and limit kernel memory
      consumption by the memory.limit_in_bytes along with user memory.
      
      This user API was introduced when the implementation of kmem accounting
      lacked slab shrinker support and hence was useless in practice.  Things
      have changed since then - slab shrinkers were made memcg aware, the
      accounting overhead seems to be negligible, and a failure to charge a
      kmem allocation should not have critical consequences, because we only
      account those kernel objects that should be safe to fail.  That's why
      kmem accounting is enabled by default for all cgroups in the default
      hierarchy, which will eventually replace the legacy one.
      
      The ability to enable kmem accounting for some cgroups while keeping it
      disabled for others is getting difficult to maintain.  E.g.  to make
      shadow node shrinker memcg aware (see mm/workingset.c), we need to know
      the relationship between the number of shadow nodes allocated for a
      cgroup and the size of its lru list.  If kmem accounting is enabled for
      all cgroups there is no problem, but what should we do if kmem
      accounting is enabled only for half of the cgroups? We've no other choice
      but to use global lru stats while scanning root cgroup's shadow nodes, but
      that would be wrong if kmem accounting was enabled for all cgroups
      (which is the case if the unified hierarchy is used), in which case we
      should use lru stats of the root cgroup's lruvec.
      
      That being said, let's enable kmem accounting for all memory cgroups by
      default.  If one finds it unstable or too costly, it can always be
      disabled system-wide by passing cgroup.memory=nokmem to the kernel at
      boot time.
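
      Whether a given boot command line carries that switch can be checked
      with a small user-space helper.  This is a sketch under assumptions: it
      treats cgroup.memory= as a comma-separated option list, ignores quoted
      parameters, and cmdline_has_nokmem() is a hypothetical name rather than
      a kernel function.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if a space-separated kernel command line contains
 * "cgroup.memory=" with "nokmem" among its comma-separated options.
 * Quoted parameters are not handled in this sketch.
 */
static bool cmdline_has_nokmem(const char *cmdline)
{
    static const char key[] = "cgroup.memory=";
    const size_t key_len = sizeof(key) - 1;
    const char *p = cmdline;

    while (*p) {
        size_t param_len = strcspn(p, " \n");   /* one parameter */

        if (param_len > key_len && strncmp(p, key, key_len) == 0) {
            const char *opt = p + key_len;
            const char *end = p + param_len;

            while (opt < end) {
                size_t opt_len = strcspn(opt, ", \n");

                if (opt_len == 6 && strncmp(opt, "nokmem", 6) == 0)
                    return true;
                opt += opt_len;
                if (opt < end)
                    opt++;                      /* skip the comma */
            }
        }
        p += param_len;
        p += strspn(p, " \n");                  /* skip separators */
    }
    return false;
}
```

      The input would typically be the contents of /proc/cmdline read into a
      NUL-terminated buffer.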
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/page-flags.h: force inlining of selected page flag modifications · 4b0f3261
      Denys Vlasenko authored
      Sometimes gcc mysteriously doesn't inline very small functions we expect
      to be inlined.  See
      
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      
      With this .config:
      http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
      the following functions get deinlined many times.
      Examples of disassembly:
      
      <SetPageUptodate> (43 copies, 141 calls):
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             f0 80 0f 08             lock orb $0x8,(%rdi)
             5d                      pop    %rbp
             c3                      retq
      
      <PagePrivate> (10 copies, 134 calls):
             48 8b 07                mov    (%rdi),%rax
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             48 c1 e8 0b             shr    $0xb,%rax
             83 e0 01                and    $0x1,%eax
             5d                      pop    %rbp
             c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
      
      Code size decrease after the patch is ~7k:
      
          text     data      bss       dec     hex filename
      92125002 20826048 36417536 149368586 8e72f0a vmlinux
      92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after
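
      The substitution can be illustrated outside the kernel with the GCC
      attribute that the kernel macro is built on.  This is a sketch, not the
      page-flags.h code: force_inline, set_flag() and test_flag() are
      hypothetical names, and the atomic builtin is a GCC/Clang extension
      standing in for the kernel's bit operations.

```c
/* Hypothetical macro name; the kernel spells this __always_inline. */
#define force_inline inline __attribute__((__always_inline__))

/* Atomically set one bit, comparable to the "lock orb" shown above. */
static force_inline void set_flag(unsigned long *flags, unsigned int bit)
{
    __atomic_fetch_or(flags, 1UL << bit, __ATOMIC_SEQ_CST);
}

/* Test one bit, comparable to the shr/and sequence shown above. */
static force_inline int test_flag(const unsigned long *flags,
                                  unsigned int bit)
{
    return (int)((*flags >> bit) & 1UL);
}
```

      With plain "inline" the compiler is free to emit an out-of-line copy
      per translation unit, which is where the duplicated bodies and the
      push/pop overhead in the disassembly come from.  For the numbers above,
      the text size shrinks by 92125002 - 92118087 = 6915 bytes.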
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bufferhead: force inlining of buffer head flag operations · ee91ef61
      Denys Vlasenko authored
      With both gcc 4.7.2 and 4.9.2, sometimes gcc mysteriously doesn't inline
      very small functions we expect to be inlined.  See
      
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      
      With this .config:
      http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
      set_buffer_foo(), clear_buffer_foo() and similar functions get deinlined
      about 60 times. Examples of disassembly:
      
      <set_buffer_mapped> (14 copies, 43 calls):
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             f0 80 0f 20             lock orb $0x20,(%rdi)
             5d                      pop    %rbp
             c3                      retq
      <buffer_mapped> (3 copies, 34 calls):
             48 8b 07                mov    (%rdi),%rax
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             48 c1 e8 05             shr    $0x5,%rax
             83 e0 01                and    $0x1,%eax
             5d                      pop    %rbp
             c3                      retq
      <set_buffer_new> (5 copies, 13 calls):
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             f0 80 0f 40             lock orb $0x40,(%rdi)
             5d                      pop    %rbp
             c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
      This decreases vmlinux by about 3 kbytes.
      
          text     data      bss       dec     hex filename
      88200439 19905208 36421632 144527279 89d4faf vmlinux2
      88197239 19905240 36421632 144524111 89d434f vmlinux
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/vm/page-types.c: add memory cgroup dumping and filtering · 075db150
      Konstantin Khlebnikov authored
      This adds two command line keys:
      
       -c|--cgroup path|@inode    Walk only pages owned by this memory cgroup
       -C|--list-cgroup           Show memory cgroup inodes
      
      [vdavydov@virtuozzo.com: opt_cgroup should be uint64_t.  Fix conflicts with "tools/vm/page-types.c: support swap entry"]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kswapd: replace kswapd compaction with waking up kcompactd · accf6242
      Vlastimil Babka authored
      Similarly to direct reclaim/compaction, kswapd attempts to combine
      reclaim and compaction to make a memory allocation of a given order
      available.
      
      The details differ from direct reclaim e.g. in having high watermark as
      a goal.  The code involved in kswapd's reclaim/compaction decisions has
      evolved to be quite complex.
      
      Testing reveals that it doesn't actually work in at least one scenario,
      and closer inspection suggests that it could be greatly simplified
      without compromising on the goal (make high-order page available) or
      efficiency (don't reclaim too much).  The simplification relies on
      doing all compaction in kcompactd, which is simply woken up when high
      watermarks are reached by kswapd's reclaim.
      
      The scenario where kswapd compaction doesn't work was found with mmtests
      test stress-highalloc configured to attempt order-9 allocations without
      direct reclaim, just waking up kswapd.  There was no compaction attempt
      from kswapd during the whole test.  Some added instrumentation shows
      what happens:
      
       - balance_pgdat() sets end_zone to Normal, as it's not balanced
       - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
         it cannot reclaim anything, so sc.nr_reclaimed is 0
       - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
         it merely checks if high watermarks were reached for base pages.
         This is true, so no reclaim is attempted.  For DMA, testorder=0
         wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
       - even though the pgdat_needs_compaction flag wasn't set to false, no
         compaction happens due to the condition sc.nr_reclaimed >
         nr_attempted being false (as 0 < 99)
       - priority-- due to nr_reclaimed being 0, repeat until priority reaches
         0; pgdat_balanced() is false as only the small zone DMA appears
         balanced (curiously in that check, watermark appears OK and
         compaction_suitable() returns COMPACT_PARTIAL, because a lower
         classzone_idx is used there)
      
      Now, even if it was decided that reclaim shouldn't be attempted on the
      DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
      nr_attempted=0) is also false.  The condition really should use >= as
      the comment suggests.  Then there is a mismatch in the check for setting
      pgdat_needs_compaction to false using low watermark, while the rest uses
      high watermark, and who knows what other subtlety.  Hopefully this
      demonstrates that this is unsustainable.
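
      The two cases discussed above can be checked with a tiny stand-alone
      model of the comparison.  The function names are hypothetical; only the
      operators and the values 0 and 99 come from the description.

```c
#include <stdbool.h>

/* The condition as it was written: strictly greater. */
static bool compact_if_gt(unsigned long nr_reclaimed,
                          unsigned long nr_attempted)
{
    return nr_reclaimed > nr_attempted;
}

/* The condition the code comment suggested: greater or equal. */
static bool compact_if_ge(unsigned long nr_reclaimed,
                          unsigned long nr_attempted)
{
    return nr_reclaimed >= nr_attempted;
}
```

      compact_if_gt(0, 99) and compact_if_gt(0, 0) are both false, so
      compaction is never attempted in either scenario.  compact_if_ge(0, 0)
      is true, which covers the case where reclaim on the DMA zone is
      skipped, but compact_if_ge(0, 99) is still false.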
      
      Luckily we can simplify this a lot.  The reclaim/compaction decisions
      make sense for direct reclaim scenario, but in kswapd, our primary goal
      is to reach high watermark in order-0 pages.  Afterwards we can attempt
      compaction just once.  Unlike direct reclaim, we don't reclaim extra
      pages (over the high watermark); the current code already disallows it
      for good reasons.
      
      After this patch, we simply wake up kcompactd to process the pgdat,
      after we have either succeeded or failed to reach the high watermarks in
      kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
      so kcompactd can apply the same criteria to determine which zones are
      worth compacting.  Note that we use the classzone_idx from
      wakeup_kswapd(), not balanced_classzone_idx which can include higher
      zones that kswapd tried to balance too, but didn't consider them in
      pgdat_balanced().
      
      Since kswapd now cannot create high-order pages itself, we need to
      adjust how it determines the zones to be balanced.  The key element here
      is adding a "highorder" parameter to zone_balanced, which, when set to
      false, makes it consider only order-0 watermark instead of the desired
      higher order (this was done previously by kswapd_shrink_zone(), but not
      elsewhere).  This false is passed for example in pgdat_balanced().
      Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
      kcompactd are woken up for a high-order allocation failure.
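
      The effect of the "highorder" parameter can be sketched with a toy
      zone.  This is a deliberately simplified model: struct toy_zone and
      both functions are hypothetical, the watermark check is reduced to a
      free-page comparison plus "is there a free block of this order", and
      details such as the balance gap and classzone_idx are omitted.

```c
#include <stdbool.h>

struct toy_zone {
    unsigned long free_pages;   /* total free base pages */
    unsigned long high_wmark;   /* high watermark in base pages */
    int largest_free_order;     /* order of the largest free block */
};

/* Toy watermark check: enough free pages and a block of this order. */
static bool toy_watermark_ok(const struct toy_zone *z, int order)
{
    if (z->free_pages <= z->high_wmark)
        return false;
    return order <= z->largest_free_order;
}

/* With highorder == false only the order-0 watermark is considered. */
static bool toy_zone_balanced(const struct toy_zone *z, int order,
                              bool highorder)
{
    if (!highorder)
        order = 0;
    return toy_watermark_ok(z, order);
}
```

      For a zone with plenty of free base pages but no free order-9 block,
      toy_zone_balanced(z, 9, false) is true, so kswapd may go to sleep,
      while toy_zone_balanced(z, 9, true) is false, so a wakeup for the
      high-order allocation failure still happens.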
      
      The last thing is to decide what to do with pageblock_skip bitmap
      handling.  Compaction maintains a pageblock_skip bitmap to record
      pageblocks where isolation recently failed.  This bitmap can be reset in
      three ways:
      
      1) direct compaction is restarting after going through the full deferred cycle
      
      2) kswapd goes to sleep, and some other direct compaction has previously
         finished scanning the whole zone and set zone->compact_blockskip_flush.
         Note that a successful direct compaction clears this flag.
      
      3) compaction was invoked manually via trigger in /proc
      
      The case 2) is somewhat fuzzy to begin with, but after introducing
      kcompactd we should update it.  The check for direct compaction in 1),
      and to set the flush flag in 2) use current_is_kswapd(), which doesn't
      work for kcompactd.  Thus, this patch adds bool direct_compaction to
      compact_control to use in 2).  For the case 1) we remove the check
      completely - unlike the former kswapd compaction, kcompactd does use the
      deferred compaction functionality, so flushing tied to restarting from
      deferred compaction makes sense here.
      
      Note that when kswapd goes to sleep, kcompactd is woken up, so it will
      see the flushed pageblock_skip bits.  This is different from when the
      former kswapd compaction observed the bits and I believe it makes more
      sense.  Kcompactd can afford to be more thorough than a direct
      compaction trying to limit allocation latency, or kswapd whose primary
      goal is to reclaim.
      
      For testing, I used stress-highalloc configured to do order-9
      allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
      on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
      phases 1 and 2 work as usual):
      
      stress-highalloc
                              4.5-rc1+before          4.5-rc1+after
                                   -nodirect              -nodirect
      Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
      Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
      Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
      Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
      Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
      Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
      Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
      
      User                          3166.67        3181.09
      System                        1153.37        1158.25
      Elapsed                       1768.53        1799.37
      
                                  4.5-rc1+before   4.5-rc1+after
                                       -nodirect    -nodirect
      Direct pages scanned                32938        32797
      Kswapd pages scanned              2183166      2202613
      Kswapd pages reclaimed            2152359      2143524
      Direct pages reclaimed              32735        32545
      Percentage direct scans                1%           1%
      THP fault alloc                       579          612
      THP collapse alloc                    304          316
      THP splits                              0            0
      THP fault fallback                    793          778
      THP collapse fail                      11           16
      Compaction stalls                    1013         1007
      Compaction success                     92           67
      Compaction failures                   920          939
      Page migrate success               238457       721374
      Page migrate failure                23021        23469
      Compaction pages isolated          504695      1479924
      Compaction migrate scanned         661390      8812554
      Compaction free scanned          13476658     84327916
      Compaction cost                       262          838
      
      After this patch we see improvements in allocation success rate
      (especially for phase 3) along with increased compaction activity.  The
      compaction stalls (direct compaction) in the interfering kernel builds
      (probably THP's) also decreased somewhat thanks to kcompactd activity,
      yet THP alloc successes improved a bit.
      
      Note that elapsed and user time aren't so useful for this benchmark,
      because of the background interference being unpredictable.  It's just
      to quickly spot some major unexpected differences.  System time is
      somewhat more useful and that didn't increase.
      
      Also (after adjusting mmtests' ftrace monitor):
      
      Time kswapd awake               2547781     2269241
      Time kcompactd awake                  0      119253
      Time direct compacting           939937      557649
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      119099
      
      The decrease of overall time spent compacting appears to not match the
      increased compaction stats.  I suspect the tasks get rescheduled and
      since the ftrace monitor doesn't see that, the reported time is wall
      time, not CPU time.  But arguably direct compactors care about overall
      latency anyway, whether busy compacting or waiting for CPU doesn't
      matter.  And that latency seems to have almost halved.
      
      It's also interesting how much time kswapd spent awake just going
      through all the priorities and failing to even try compacting, over and
      over.
      
      We can also configure stress-highalloc to perform both direct
      reclaim/compaction and wakeup kswapd/kcompactd, by using
      GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
      
      stress-highalloc
                              4.5-rc1+before         4.5-rc1+after
                                     -direct               -direct
      Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
      Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
      Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
      Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
      Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
      Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
      Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
      
      User                          3344.73       3246.04
      System                        1194.24       1172.29
      Elapsed                       1838.04       1836.76
      
                                  4.5-rc1+before  4.5-rc1+after
                                         -direct     -direct
      Direct pages scanned               125146      120966
      Kswapd pages scanned              2119757     2135012
      Kswapd pages reclaimed            2073183     2108388
      Direct pages reclaimed             124909      120577
      Percentage direct scans                5%          5%
      THP fault alloc                       599         652
      THP collapse alloc                    323         354
      THP splits                              0           0
      THP fault fallback                    806         793
      THP collapse fail                      17          16
      Compaction stalls                    2457        2025
      Compaction success                    906         518
      Compaction failures                  1551        1507
      Page migrate success              2031423     2360608
      Page migrate failure                32845       40852
      Compaction pages isolated         4129761     4802025
      Compaction migrate scanned       11996712    21750613
      Compaction free scanned         214970969   344372001
      Compaction cost                      2271        2694
      
      In this scenario, this patch doesn't change the overall success rate as
      direct compaction already tries all it can.  There is, however, a
      significant reduction in direct compaction stalls (that is, the number
      of allocations that went into direct compaction).  The number of successes
      (i.e.  direct compaction stalls that ended up with successful
      allocation) is reduced by the same number.  This means the offload to
      kcompactd is working as expected, and direct compaction is reduced
      either due to detecting contention, or compaction deferred by kcompactd.
      In the previous version of this patchset there was some apparent
      reduction of success rate, but the changes in this version (such as
      using sync compaction only), new baseline kernel, and/or averaging
      results from 5 executions (my bet), made this go away.
      
      Ftrace-based stats seem to roughly agree:
      
      Time kswapd awake               2532984     2326824
      Time kcompactd awake                  0      257916
      Time direct compacting           864839      735130
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      257585
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      accf6242
    • Vlastimil Babka's avatar
      mm, memory hotplug: small cleanup in online_pages() · e888ca35
      Vlastimil Babka authored
      We can reuse the nid we've determined instead of repeated pfn_to_nid()
      usages.  Also zone_to_nid() should be a bit cheaper in general than
      pfn_to_nid().
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e888ca35
    • Vlastimil Babka's avatar
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka authored
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attempts
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overall memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more difficult to evaluate.
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      compact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
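
The three wakeup rules above can be modelled as a small decision function. This is a simplified sketch in Python, not the kernel implementation; the `NodeState` class, its fields and the function name are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    kswapd_reclaiming: bool  # True while kswapd is still doing reclaim
    kswapd_order: int        # target order that kswapd was woken up for


def kcompactd_wakeup_order(in_slowpath, node):
    """Return the order to hand to kcompactd, or None if it stays asleep."""
    if not in_slowpath:
        return None              # rule 1: never touch allocator fastpaths
    if node.kswapd_reclaiming:
        return None              # rule 2: reclaim is more important
    return node.kswapd_order     # rule 3: pass kswapd's target order along


print(kcompactd_wakeup_order(True, NodeState(False, 9)))
```

In this model kcompactd is woken only from the slowpath, only after kswapd has stopped reclaiming, and always with the order kswapd was asked for.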
      
      Possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      698b1b30
    • Vlastimil Babka's avatar
      mm, kswapd: remove bogus check of balance_classzone_idx · 81c5857b
      Vlastimil Babka authored
      During work on kcompactd integration I have spotted a confusing check of
      balanced_classzone_idx, which I believe is bogus.
      
      The balanced_classzone_idx is filled by balance_pgdat() as the highest
      zone it attempted to balance.  This was introduced by commit dc83edd9
      ("mm: kswapd: use the classzone idx that kswapd was using for
      sleeping_prematurely()").
      
      The intention is that (as expressed in today's function names), the
      value used for kswapd_shrink_zone() calls in balance_pgdat() is the same
      as for the decisions in kswapd_try_to_sleep().
      
      An unwanted side-effect of that commit was breaking the checks in
      kswapd() whether there was another kswapd_wakeup with a tighter (=lower)
      classzone_idx.  Commits 215ddd66 ("mm: vmscan: only read
      new_classzone_idx from pgdat when reclaiming successfully") and
      d2ebd0f6 ("kswapd: avoid unnecessary rebalance after an unsuccessful
      balancing") tried to fix that, but apparently introduced a bogus check
      that this patch removes.
      
      Consider zone indexes X < Y < Z, where:
      - Z is the value used for the first kswapd wakeup.
      - Y is returned as balanced_classzone_idx, which means zones with index higher
        than Y (including Z) were found to be unreclaimable.
      - X is the value used for the second kswapd wakeup
      
      The new wakeup with value X means that kswapd is now supposed to balance
      harder all zones with index <= X.  But instead, due to Y < Z, it will go
      to sleep and won't read the new value X.  This is subtly wrong.
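
The X < Y < Z scenario can be replayed with a toy model. This Python sketch is derived only from the description above, not from the kernel source; the function names and the concrete index values are hypothetical.

```python
def old_reads_new_request(balanced_idx, used_idx):
    """Model of the removed check: a newer wakeup request is only looked
    at if balancing reached the classzone index kswapd was woken for."""
    return balanced_idx >= used_idx


def new_reads_new_request(balanced_idx, used_idx):
    """Model of the behaviour after the patch: the pending request is
    read regardless of how far balancing got."""
    return True


X, Y, Z = 0, 1, 2  # illustrative zone indexes with X < Y < Z

# First wakeup used Z, balancing only reached Y, second wakeup asks for X.
print(old_reads_new_request(Y, Z))  # old model: request X is missed
print(new_reads_new_request(Y, Z))  # new model: request X is picked up
```

In the old model the comparison fails because Y < Z, which corresponds to kswapd going to sleep without seeing the tighter request X.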
      
      The effect of this patch is that kswapd will react better in some
      situations, e.g.  where the first wakeup is for ZONE_DMA32, the second
      is for ZONE_DMA, and ZONE_NORMAL is unreclaimable.  Before this patch,
      kswapd would go to sleep instead of reclaiming ZONE_DMA harder.  I expect
      these situations are very rare, and more value is in better
      maintainability due to the removal of a confusing and bogus check.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81c5857b
    • Joonsoo Kim's avatar
      tile: query dynamic DEBUG_PAGEALLOC setting · 21c64786
      Joonsoo Kim authored
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21c64786
    • Joonsoo Kim's avatar
      powerpc: query dynamic DEBUG_PAGEALLOC setting · e7df0d88
      Joonsoo Kim authored
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7df0d88
    • Joonsoo Kim's avatar
      sound: query dynamic DEBUG_PAGEALLOC setting · 505f6d22
      Joonsoo Kim authored
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      
      [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Takashi Iwai <tiwai@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      505f6d22
    • Joonsoo Kim's avatar
      mm/slub: query dynamic DEBUG_PAGEALLOC setting · 922d566c
      Joonsoo Kim authored
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      
      [akpm@linux-foundation.org: clean up code, per Christian]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      922d566c
    • Joonsoo Kim's avatar
      mm/vmalloc: query dynamic DEBUG_PAGEALLOC setting · f48d97f3
      Joonsoo Kim authored
      As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel parameters
      we can optimize some cases by checking the enablement state.
      
      This is follow-up work for Christian's Optimize CONFIG_DEBUG_PAGEALLOC:
      
        https://lkml.org/lkml/2016/1/27/194
      
      The remaining work is to make sparc aware of this, but that does not
      look easy to me, so I skip it in this series.
      
      This patch (of 5):
      
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not at runtime.
      
      [akpm@linux-foundation.org: update comment, per David.  Adjust comment to use 80 cols]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f48d97f3
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: support swap entry · 0335ddd3
      Naoya Horiguchi authored
      /proc/pid/pagemap (pte_to_pagemap_entry() internally) already reports
      swap entries, so let's make the in-tree utility aware of them.
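
For context, the swap information carried by a 64-bit pagemap entry can be decoded as sketched below. This is an illustration in Python, not the code in tools/vm/page-types.c; the bit positions (bit 63 present, bit 62 swapped, bits 0-4 swap type, bits 5-54 swap offset) follow the kernel's pagemap documentation, and the function name and sample value are hypothetical.

```python
def decode_pagemap_entry(entry):
    """Split one 64-bit /proc/pid/pagemap entry into its swap fields."""
    info = {
        "present": bool((entry >> 63) & 1),  # bit 63: page present in RAM
        "swapped": bool((entry >> 62) & 1),  # bit 62: page is swapped out
    }
    if info["swapped"]:
        info["swap_type"] = entry & 0x1F                      # bits 0-4
        info["swap_offset"] = (entry >> 5) & ((1 << 50) - 1)  # bits 5-54
    return info


# Synthetic entry: swapped, swap type 3, swap offset 1234.
sample_entry = (1 << 62) | (1234 << 5) | 3
print(decode_pagemap_entry(sample_entry))
```
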
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0335ddd3
    • Naoya Horiguchi's avatar
      /proc/kpageflags: return KPF_SLAB for slab tail pages · 0a71649c
      Naoya Horiguchi authored
      Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail
      pages, which is inconvenient when grasping how slab pages are
      distributed (userspace always needs to check which kind of tail pages by
      itself).  This patch sets KPF_SLAB for such pages.
      
      With this patch:
      
        $ grep Slab /proc/meminfo ; tools/vm/page-types -b slab
        Slab:              64880 kB
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000000080           16220       63  _______S__________________________________ slab
                     total           16220       63
      
      16220 pages equal 64880 kB, so the returned result is consistent with
      the global counter.
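
The consistency check above is plain arithmetic. A minimal sketch in Python, assuming 4 KiB pages (the page size is not stated in the output); the helper names are illustrative.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KiB pages


def pages_to_kb(pages):
    """Convert a page count to kB."""
    return pages * PAGE_SIZE_KB


def pages_to_mb(pages):
    """Convert a page count to whole MB; integer division reproduces the
    MB column for the values shown above."""
    return pages * PAGE_SIZE_KB // 1024


print(pages_to_kb(16220))  # compare with the Slab: line of /proc/meminfo
print(pages_to_mb(16220))  # compare with the MB column of page-types
```
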
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a71649c
    • Naoya Horiguchi's avatar
      /proc/kpageflags: return KPF_BUDDY for "tail" buddy pages · 832fc1de
      Naoya Horiguchi authored
      Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
      is inconvenient when grasping how free pages are distributed.  This
      patch sets KPF_BUDDY for such pages.
      
      With this patch:
      
        $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
        MemFree:         3134992 kB
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000000400          779272     3044  __________B_______________________________ buddy
        0x0000000000000c00            4385       17  __________BM______________________________ buddy,mmap
                     total          783657     3061
      
      783657 pages is 3134628 kB, roughly consistent with the global counter,
      so it's OK.
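
The flag values in the output above can be decoded and summed as follows. This is a small Python sketch: only the three bits that appear in the two page-types outputs are listed, the 4 KiB page size is an assumption, and the helper names are hypothetical.

```python
# Bit positions as they appear in the outputs above:
# 0x80 is slab, 0x400 is buddy, 0xc00 is buddy plus mmap.
KPF_BITS = {7: "slab", 10: "buddy", 11: "mmap"}


def decode_kpf(flags):
    """Return the names of the known flag bits set in a kpageflags value."""
    return [name for bit, name in sorted(KPF_BITS.items()) if flags & (1 << bit)]


def total_kb(page_counts, page_size_kb=4):  # assumption: 4 KiB pages
    """Sum per-flag page counts and convert the total to kB."""
    return sum(page_counts) * page_size_kb


buddy_kb = total_kb([779272, 4385])
print(decode_kpf(0x400), decode_kpf(0xC00))
print(buddy_kb, 3134992 - buddy_kb)  # total and gap to MemFree, in kB
```

With these figures the two buddy rows add up to 3134628 kB, which is 364 kB below the MemFree value.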
      
      [akpm@linux-foundation.org: update comment, per Naoya]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      832fc1de
    • Vladimir Davydov's avatar
      mm: memcontrol: report kernel stack usage in cgroup2 memory.stat · 12580e4b
      Vladimir Davydov authored
      Show how much memory is allocated to kernel stacks.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12580e4b
    • Vladimir Davydov's avatar
      mm: memcontrol: report slab usage in cgroup2 memory.stat · 27ee57c9
      Vladimir Davydov authored
      Show how much memory is used for storing reclaimable and unreclaimable
      in-kernel data structures allocated from slab caches.
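
Userspace can read the new counters by parsing memory.stat as "key value" lines. The Python sketch below makes two assumptions: the key names `slab`, `slab_reclaimable` and `slab_unreclaimable` are taken from the cgroup v2 documentation rather than from this message, and the sample values are made up for illustration.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines from a cgroup2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split()
        stats[key] = int(value)
    return stats


# Illustrative content only; real values come from the cgroup's memory.stat.
sample = """\
slab 1048576
slab_reclaimable 786432
slab_unreclaimable 262144
"""

stats = parse_memory_stat(sample)
print(stats["slab_reclaimable"], stats["slab_unreclaimable"])
```
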
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ee57c9
    • Vladimir Davydov's avatar
      mm: memcontrol: make tree_{stat,events} fetch all stats · 72b54e73
      Vladimir Davydov authored
      Currently, tree_{stat,events} helpers can only get one stat index at a
      time, so when there are a lot of stats to be reported one has to call
      them over and over again (see memory_stat_show).  This is neither
      efficient, nor does it look good.  Instead, let's make these helpers
      take a snapshot of all available counters.
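
The difference between the two helper styles can be illustrated with a toy hierarchy. This Python sketch is a model of the idea only, not the memcg code; the `Group` class, the counter names and the values are hypothetical.

```python
from collections import Counter


class Group:
    """Toy stand-in for a cgroup with local counters and child groups."""

    def __init__(self, counters, children=()):
        self.counters = Counter(counters)
        self.children = list(children)


def tree_stat_one(group, key):
    """Old style: one walk over the subtree per requested counter."""
    return group.counters[key] + sum(tree_stat_one(c, key) for c in group.children)


def tree_stat_all(group):
    """New style: a single walk that accumulates every counter at once."""
    total = Counter(group.counters)
    for child in group.children:
        total.update(tree_stat_all(child))
    return total


root = Group({"cache": 3, "rss": 1}, [Group({"cache": 2}), Group({"rss": 5, "swap": 1})])
snapshot = tree_stat_all(root)
print(dict(snapshot))
```

Reporting N counters with the first helper walks the subtree N times, while the snapshot helper walks it once.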
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b54e73
    • Vladimir Davydov's avatar
      mm: memcontrol: do not bypass slab charge if memcg is offline · fcff7d7e
      Vladimir Davydov authored
      Slab pages are charged in two steps.  First, an appropriate per memcg
      cache is selected (see memcg_kmem_get_cache) based on the current
      context, then the new slab page is charged to the memory cgroup which
      the selected cache was created for (see memcg_charge_slab ->
      __memcg_kmem_charge_memcg).  It is OK to bypass kmemcg charge at step 1,
      but if step 1 succeeded and we successfully allocated a new slab page,
      step 2 must be performed, otherwise we would get a per memcg kmem cache
      which contains a slab that does not hold a reference to the memory
      cgroup owning the cache.  Since per memcg kmem caches are destroyed on
      memcg css free, this could result in freeing a cache while there are
      still active objects in it.
      
      However, currently we will bypass slab page charge if the memory cgroup
      owning the cache is offline (see __memcg_kmem_charge_memcg).  This is
      very unlikely to occur in practice, because for this to happen a process
      must be migrated to a different cgroup and the old cgroup must be
      removed while the process is in kmalloc somewhere between steps 1 and 2
      (e.g.  trying to allocate a new page).  Nevertheless, it's still better
      to eliminate such a possibility.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcff7d7e
    • Johannes Weiner's avatar
      mm: oom_kill: don't ignore oom score on exiting tasks · 6a618957
      Johannes Weiner authored
      When the OOM killer scans tasks and encounters a PF_EXITING one, it
      force-selects that task regardless of the score.  The problem is that if
      that task got stuck waiting for some state the allocation site is
      holding, the OOM reaper can not move on to the next best victim.
      
      Frankly, I don't even know why we check for exiting tasks in the OOM
      killer.  We've tried direct reclaim at least 15 times by the time we
      decide the system is OOM, so there was plenty of time to exit and free
      memory; and a task might exit voluntarily right after we issue a kill.
      This is testing pure noise.  Remove it.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a618957
    • Joshua Hunt's avatar
      watchdog: don't run proc_watchdog_update if new value is same as old · a1ee1932
      Joshua Hunt authored
      While working on a script to restore all sysctl params before a series of
      tests I found that writing any value into
      /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
      causes them to call proc_watchdog_update():
      
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
      
      There doesn't appear to be a reason for doing this work every time a write
      occurs, so only do it when the values change.
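
The idea of skipping the update when the written value equals the old one can be sketched as below. This is a Python model, not the kernel's sysctl handler; the `WatchdogSetting` class and its members are hypothetical, and `_update()` merely stands in for proc_watchdog_update().

```python
class WatchdogSetting:
    """Toy model of one watchdog sysctl value."""

    def __init__(self, value):
        self.value = value
        self.updates = 0  # how many times the expensive update ran

    def _update(self):
        self.updates += 1  # stands in for proc_watchdog_update()

    def write(self, new_value):
        old_value = self.value
        self.value = new_value
        if new_value != old_value:  # skip the work when nothing changed
            self._update()


setting = WatchdogSetting(1)
setting.write(1)  # same value: no update
setting.write(0)  # changed value: one update
print(setting.updates)
```
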
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.1.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1ee1932
    • Aaro Koskinen's avatar
      drivers/firmware/broadcom/bcm47xx_nvram.c: fix incorrect __ioread32_copy · 4c11e554
      Aaro Koskinen authored
      Commit 1f330c32 ("drivers/firmware/broadcom/bcm47xx_nvram.c: use
      __ioread32_copy() instead of open-coding") switched to use a generic
      copy function, but failed to notice that the header pointer is updated
      between the two copies, resulting in bogus data being copied in the
      latter one.  Fix by keeping the old header pointer.
      
      The patch fixes totally broken networking on WRT54GL router (both LAN and
      WLAN interfaces fail to probe).
      
      Fixes: 1f330c32 ("drivers/firmware/broadcom/bcm47xx_nvram.c: use __ioread32_copy() instead of open-coding")
      Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Rafal Milecki <zajec5@gmail.com>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Cc: <stable@vger.kernel.org>	[4.4.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c11e554
    • Luis R. Rodriguez's avatar
      ia64: define ioremap_uc() · b0f84ac3
      Luis R. Rodriguez authored
      All architectures now need ioremap_uc().  ia64 seems to define this
      already through its ioremap_nocache(), which already ensures it *only*
      uses UC.

      This is needed since v4.3 to complete an allyesconfig compile on ia64.
      There were other archs that needed this, and this one seems to have
      fallen through the cracks.
      Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Acked-by: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0f84ac3
    • Linus Torvalds's avatar
      Merge tag 'fbdev-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux · 09fd671c
      Linus Torvalds authored
      Pull fbdev updates from Tomi Valkeinen:
      
       - Miscellaneous small fixes to various fbdev drivers
      
       - Remove fb_rotate, which was never used
      
       - pmag fb improvements
      
      * tag 'fbdev-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (21 commits)
        xen kconfig: don't "select INPUT_XEN_KBDDEV_FRONTEND"
        video: fbdev: sis: remove unused variable
        drivers/video: make fbdev/sunxvr2500.c explicitly non-modular
        drivers/video: make fbdev/sunxvr1000.c explicitly non-modular
        drivers/video: make fbdev/sunxvr500.c explicitly non-modular
        video: exynos: fix modular build
        fbdev: da8xx-fb: fix videomodes of lcd panels
        fbdev: kill fb_rotate
        video: fbdev: bt431: Correct cursor format control macro
        video: fbdev: pmag-ba-fb: Optimize Bt455 colormap addressing
        video: fbdev: pmag-ba-fb: Fix and rework Bt455 colormap handling
        video: fbdev: bt455: Remove unneeded colormap helpers for cursor support
        video: fbdev: pmag-aa-fb: Report video timings
        video: fbdev: pmag-aa-fb: Enable building as a module
        video: fbdev: pmag-aa-fb: Adapt to current APIs
        video: fbdev: pmag-ba-fb: Fix the lower margin size
        fbdev: sh_mobile_lcdc: Use ARCH_RENESAS
        fbdev: n411: check return value
        fbdev: exynos: fix IS_ERR_VALUE usage
        video: Use bool instead int pointer for get_opt_bool() argument
        ...
      09fd671c
    • Linus Torvalds's avatar
      Merge tag 'media/v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · bace3db5
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
       - Added support for some new video formats
       - mn88473 DVB frontend driver got promoted from staging
       - several improvements to the VSP1 driver
       - several cleanups and improvements to the Media Controller
       - added Media Controller support to snd-usb-audio.  Currently, enabled
         only for au0828-based V4L2/DVB boards
       - Several improvements to nuvoton-cir: it now supports wake up codes
       - Add media controller support to em28xx and saa7134 drivers
       - coda driver now accepts NXP distributed firmware files
       - Some legacy SoC camera drivers will be moving to staging, as they're
         outdated and nobody so far is willing to fix and convert them to use
         the current media framework
       - As usual, lots of cleanups, improvements and new board additions.
      
      * tag 'media/v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (381 commits)
        media: au0828 disable tuner to demod link in au0828_media_device_register()
        [media] touptek: cast char types on %x printk
        [media] touptek: don't DMA at the stack
        [media] mceusb: use %*ph for small buffer dumps
        [media] v4l: exynos4-is: Drop unneeded check when setting up fimc-lite links
        [media] v4l: vsp1: Check if an entity is a subdev with the right function
        [media] hide unused functions for !MEDIA_CONTROLLER
        [media] em28xx: fix Terratec Grabby AC97 codec detection
        [media] media: add prefixes to interface types
        [media] media: rc: nuvoton: switch attribute wakeup_data to text
        [media] v4l2-ioctl: fix YUV422P pixel format description
        [media] media: fix null pointer dereference in v4l_vb2q_enable_media_source()
        [media] v4l2-mc.h: fix yet more compiler errors
        [media] staging/media: add missing TODO files
        [media] media.h: always start with 1 for the audio entities
        [media] sound/usb: Use meaninful names for goto labels
        [media] v4l2-mc.h: fix compiler warnings
        [media] media: au0828 audio mixer isn't connected to decoder
        [media] sound/usb: Use Media Controller API to share media resources
        [media] dw2102: add support for TeVii S662
        ...
      bace3db5
    • Linus Torvalds's avatar
      Merge tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 8759957b
      Linus Torvalds authored
      Pull libnvdimm updates from Dan Williams:
      
       - Asynchronous address range scrub:
      
           Given the capacities of next generation persistent memory devices a
           scrub operation to find all poison may take tens of seconds.  We
           want this scrub work to be done asynchronously with the rest of
           system initialization, so we move it out of line from the NFIT
           probing, i.e. acpi_nfit_add().
      
       - Clear poison:
      
           ACPI 6.1 introduces the ability to send "clear error" commands to
           the ACPI0012:00 device representing the root of an "nvdimm bus".
           Similar to relocating a bad block on a disk, this support clears
           media errors in response to a write.
      
       - Persistent memory resource tracking:
      
           A persistent memory range may be designated as simply "reserved" by
           platform firmware in the efi/e820 memory map.  Later, when the NFIT
           driver loads, it discovers that the range is "Persistent Memory".
      
           The NFIT bus driver inserts a resource to advertise that
           "persistent" attribute in the system resource tree for /proc/iomem
           and kernel-internal usages.
      
       - Miscellaneous cleanups and fixes:
      
           Work around section-misaligned pmem ranges when allocating a struct
           page memmap, fix handling of the read-only case in the ioctl path,
           and clean up block device major number allocation.
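
The "clear poison" behaviour described above can be pictured with a toy model. This is only a sketch under stated assumptions: `ToyPmem`, `mark_poisoned`, and the 512-byte sector size are hypothetical names and values chosen for illustration, and nothing here reflects the actual libnvdimm or NFIT driver code. It shows the idea that reading a poisoned sector fails, while a successful write to that sector clears the recorded media error.

```python
import errno


class ToyPmem:
    """Hypothetical in-memory model of a device with per-sector media errors."""

    def __init__(self, sectors, sector_size=512):
        self.sector_size = sector_size
        self._data = [bytes(sector_size) for _ in range(sectors)]
        self._poisoned = set()  # sector numbers with a recorded media error

    def mark_poisoned(self, sector):
        """Record a media error, as a scrub might after finding poison."""
        self._poisoned.add(sector)

    def read(self, sector):
        """Fail with EIO if the sector is poisoned, otherwise return its data."""
        if sector in self._poisoned:
            raise OSError(errno.EIO, "media error")
        return self._data[sector]

    def write(self, sector, payload):
        """Store new data and clear any recorded error for this sector."""
        if len(payload) != self.sector_size:
            raise ValueError("payload must be exactly one sector")
        self._data[sector] = bytes(payload)
        self._poisoned.discard(sector)
```

In this model the write path is the only thing that clears an error, which mirrors the comparison to relocating a bad block on a disk.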
      
      * tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
        libnvdimm, pmem: clear poison on write
        libnvdimm, pmem: fix kmap_atomic() leak in error path
        nvdimm/btt: don't allocate unused major device number
        nvdimm/blk: don't allocate unused major device number
        pmem: don't allocate unused major device number
        ACPI: Change NFIT driver to insert new resource
        resource: Export insert_resource and remove_resource
        resource: Add remove_resource interface
        resource: Change __request_region to inherit from immediate parent
        libnvdimm, pmem: fix ia64 build, use PHYS_PFN
        nfit, libnvdimm: clear poison command support
        libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices
        libnvdimm, pmem: adjust for section collisions with 'System RAM'
        libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
        libnvdimm: Fix security issue with DSM IOCTL.
        libnvdimm: Clean-up access mode check.
        tools/testing/nvdimm: expand ars unit testing
        nfit: disable userspace initiated ars during scrub
        nfit: scrub and register regions in a workqueue
        nfit, libnvdimm: async region scrub workqueue
        ...
      8759957b
    • Linus Torvalds's avatar
      Merge tag 'dm-4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 6968e6f8
      Linus Torvalds authored
      Pull device mapper updates from Mike Snitzer:
      
       - Most attention this cycle went to optimizing blk-mq request-based DM
         (dm-mq), which is used exclusively by DM multipath:
      
           - A stable fix for dm-mq that eliminates excessive context
             switching offers the biggest performance improvement (for both
             IOPS and throughput).
      
           - But more work is needed, during the next cycle, to reduce
             spinlock contention in DM multipath on large NUMA systems.
      
       - A stable fix for a NULL pointer dereference seen when DM stats is
         enabled on a DM multipath device that must requeue an IO due to path
         failure.
      
       - A stable fix for DM snapshot to disallow the COW and origin devices
         from being identical.  This amounts to graceful failure in the face
         of userspace error because these devices shouldn't ever be identical.
      
       - Stable fixes for DM cache and DM thin provisioning to address crashes
         seen if/when their respective metadata device experiences failures
         that cause the transition to 'fail_io' mode.
      
       - The DM cache 'mq' policy is now an alias for the 'smq' policy.  The
         'smq' policy proved to be consistently better than 'mq'.  As such
         'mq', with all its complex user-facing tunables, has been eliminated.
      
       - Improve DM thin provisioning to consistently return -ENOSPC once the
         thin-pool's data volume is out of space.
      
       - Improve DM core to properly handle error propagation if
         bio_integrity_clone() fails in clone_bio().
      
       - Other small cleanups and improvements to DM core.
      
      * tag 'dm-4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (41 commits)
        dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()
        dm thin: consistently return -ENOSPC if pool has run out of data space
        dm cache: bump the target version
        dm cache: make sure every metadata function checks fail_io
        dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO
        dm cache policy smq: clarify that mq registration failure was for 'mq'
        dm: return error if bio_integrity_clone() fails in clone_bio()
        dm thin metadata: don't issue prefetches if a transaction abort has failed
        dm snapshot: disallow the COW and origin devices from being identical
        dm cache: make the 'mq' policy an alias for 'smq'
        dm: drop unnecessary assignment of md->queue
        dm: reorder 'struct mapped_device' members to fix alignment and holes
        dm: remove dummy definition of 'struct dm_table'
        dm: add 'dm_numa_node' module parameter
        dm thin metadata: remove needless newline from subtree_dec() DMERR message
        dm mpath: cleanup reinstate_path() et al based on code review
        dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
        dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
        dm round robin: use percpu 'repeat_count' and 'current_path'
        dm path selector: remove 'repeat_count' return from .select_path hook
        ...
      6968e6f8
    • Linus Torvalds's avatar
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · cae8da04
      Linus Torvalds authored
      Pull SCSI updates from James Bottomley:
       "This pull includes driver updates from the usual suspects (stex, hpsa,
        ncr5380, scsi_dh, qla2xxx, be2iscsi, hisi_sas, cxlflash, aacraid,
        mpt3sas, megaraid_sas, ibmvscsi, ufs) plus an assortment of
        miscellaneous fixes.
      
        The major user-visible change in this pull is that we've moved from
        a monotonically increasing host number to an ida-allocated one
        (meaning the numbers get reused) because someone managed to wrap the
        count in an iscsi system.  We don't believe there will be any adverse
        consequences from this"
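
The difference between the two host-numbering schemes mentioned above can be sketched in a few lines. This is an illustration only: `MonotonicIds` and `ReusingIds` are hypothetical names, and the second class merely approximates the "hand back the lowest free number" behaviour of an ID allocator. It is not the kernel's IDA implementation.

```python
import heapq


class MonotonicIds:
    """Hands out ever-increasing numbers; freed numbers are never reused."""

    def __init__(self):
        self._next = 0

    def alloc(self):
        n = self._next
        self._next += 1
        return n

    def free(self, n):
        # Nothing to do: the counter only moves forward, so it can
        # eventually wrap if hosts are created and destroyed often enough.
        pass


class ReusingIds:
    """Hands out the smallest free number, reusing freed ones first."""

    def __init__(self):
        self._next = 0
        self._released = []  # min-heap of numbers that were freed

    def alloc(self):
        if self._released:
            return heapq.heappop(self._released)
        n = self._next
        self._next += 1
        return n

    def free(self, n):
        # Assumes n is currently allocated; double frees are not detected.
        heapq.heappush(self._released, n)
```

With the reusing scheme, the highest number in use stays bounded by the peak number of simultaneously live hosts, so the wrap-around problem goes away. The visible cost is that a number may later refer to a different host than it did before.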
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (230 commits)
        MAINTAINERS: use new email address for James Bottomley
        mpt3sas: Remove unnecessary synchronize_irq() before free_irq()
        sg: fix dxferp in from_to case
        cxlflash: Increase cmd_per_lun for better throughput
        cxlflash: Fix to avoid unnecessary scan with internal LUNs
        cxlflash: Reorder user context initialization
        cxlflash: Simplify attach path error cleanup
        cxlflash: Split out context initialization
        cxlflash: Unmap problem state area before detaching master context
        cxlflash: Simplify PCI registration
        scsi: storvsc: fix SRB_STATUS_ABORTED handling
        be2iscsi: set the boot_kset pointer to NULL in case of failure
        sd: Fix discard granularity when LBPRZ=1
        be2iscsi: Remove unnecessary synchronize_irq() before free_irq()
        scsi_sysfs: call 'device_add' after attaching device handler
        scsi_dh_emc: update 'access_state' field
        scsi_dh_rdac: update 'access_state' field
        scsi_dh_alua: update 'access_state' field
        scsi_dh_alua: use common definitions for ALUA state
        scsi: Add 'access_state' and 'preferred_path' attribute
        ...
      cae8da04
    • Linus Torvalds's avatar
      Merge branch 'stable/for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft · 7bb7a748
      Linus Torvalds authored
      Pull iscsi_ibft update from Konrad Rzeszutek Wilk:
       "A simple patch that had been rattling around in SuSE repo"
      
      * 'stable/for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
        iscsi_ibft: Add prefix-len attr and display netmask
      7bb7a748