1. 14 Jan, 2020 11 commits
    • Yang Shi's avatar
      mm: khugepaged: add trace status description for SCAN_PAGE_HAS_PRIVATE · 554913f6
      Yang Shi authored
      Commit 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem)
      FS") introduced a new khugepaged scan result: SCAN_PAGE_HAS_PRIVATE, but
      the corresponding description for trace events were not added.
      
      Link: http://lkml.kernel.org/r/1574793844-2914-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      554913f6
    • Adrian Huang's avatar
      mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid · 2fe20210
      Adrian Huang authored
      When booting with amd_iommu=off, the following WARNING message
      appears:
      
        AMD-Vi: AMD IOMMU disabled on kernel command-line
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
        Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
        RIP: 0010:flush_workqueue+0x42e/0x450
        Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
        Call Trace:
         kmem_cache_destroy+0x69/0x260
         iommu_go_to_state+0x40c/0x5ab
         amd_iommu_prepare+0x16/0x2a
         irq_remapping_prepare+0x36/0x5f
         enable_IR_x2apic+0x21/0x172
         default_setup_apic_routing+0x12/0x6f
         apic_intr_mode_init+0x1a1/0x1f1
         x86_late_time_init+0x17/0x1c
         start_kernel+0x480/0x53f
         secondary_startup_64+0xb6/0xc0
        ---[ end trace 30894107c3749449 ]---
        x2apic: IRQ remapping doesn't support X2APIC mode
        x2apic disabled
      
      The warning is caused by the calling of 'kmem_cache_destroy()'
      in free_iommu_resources(). Here is the call path:
      
        free_iommu_resources
          kmem_cache_destroy
            flush_memcg_workqueue
              flush_workqueue
      
      The root cause is that the IOMMU subsystem runs before the workqueue
      subsystem, which the variable 'wq_online' is still 'false'.  This leads
      to the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is
      'true'.
      
      Since the variable 'memcg_kmem_cache_wq' is not allocated during the
      time, it is unnecessary to call flush_memcg_workqueue().  This prevents
      the WARNING message triggered by flush_workqueue().
      
      Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
      Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
      Signed-off-by: default avatarAdrian Huang <ahuang12@lenovo.com>
      Reported-by: default avatarXiaochun Lee <lixc17@lenovo.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2fe20210
    • Wen Yang's avatar
      mm/page-writeback.c: improve arithmetic divisions · 0a5d1a7f
      Wen Yang authored
      Use div64_ul() instead of do_div() if the divisor is unsigned long, to
      avoid truncation to 32-bit on 64-bit platforms.
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.comSigned-off-by: default avatarWen Yang <wenyang@linux.alibaba.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a5d1a7f
    • Wen Yang's avatar
      mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide · d3ac946e
      Wen Yang authored
      The two variables 'numerator' and 'denominator', though they are
      declared as long, they should actually be unsigned long (according to
      the implementation of the fprop_fraction_percpu() function)
      
      And do_div() does a 64-by-32 division, while the divisor 'denominator'
      is unsigned long, thus 64-bit on 64-bit platforms.  Hence the proper
      function to call is div64_ul().
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.comSigned-off-by: default avatarWen Yang <wenyang@linux.alibaba.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3ac946e
    • Wen Yang's avatar
      mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio() · 6d9e8c65
      Wen Yang authored
      Patch series "use div64_ul() instead of div_u64() if the divisor is
      unsigned long".
      
      We were first inspired by commit b0ab99e7 ("sched: Fix possible divide
      by zero in avg_atom () calculation"), then refer to the recently analyzed
      mm code, we found this suspicious place.
      
       201                 if (min) {
       202                         min *= this_bw;
       203                         do_div(min, tot_bw);
       204                 }
      
      And we also disassembled and confirmed it:
      
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
        0xffffffff811c37da <__wb_calc_thresh+234>:      xor    %r10d,%r10d
        0xffffffff811c37dd <__wb_calc_thresh+237>:      test   %rax,%rax
        0xffffffff811c37e0 <__wb_calc_thresh+240>:      je 0xffffffff811c3800 <__wb_calc_thresh+272>
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
        0xffffffff811c37e2 <__wb_calc_thresh+242>:      imul   %r8,%rax
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
        0xffffffff811c37e6 <__wb_calc_thresh+246>:      mov    %r9d,%r10d    ---> truncates it to 32 bits here
        0xffffffff811c37e9 <__wb_calc_thresh+249>:      xor    %edx,%edx
        0xffffffff811c37eb <__wb_calc_thresh+251>:      div    %r10
        0xffffffff811c37ee <__wb_calc_thresh+254>:      imul   %rbx,%rax
        0xffffffff811c37f2 <__wb_calc_thresh+258>:      shr    $0x2,%rax
        0xffffffff811c37f6 <__wb_calc_thresh+262>:      mul    %rcx
        0xffffffff811c37f9 <__wb_calc_thresh+265>:      shr    $0x2,%rdx
        0xffffffff811c37fd <__wb_calc_thresh+269>:      mov    %rdx,%r10
      
      This series uses div64_ul() instead of div_u64() if the divisor is
      unsigned long, to avoid truncation to 32-bit on 64-bit platforms.
      
      This patch (of 3):
      
      The variables 'min' and 'max' are unsigned long and do_div truncates
      them to 32 bits, which means it can test non-zero and be truncated to
      zero for division.  Fix this issue by using div64_ul() instead.
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
      Fixes: 693108a8 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
      Signed-off-by: default avatarWen Yang <wenyang@linux.alibaba.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d9e8c65
    • Vlastimil Babka's avatar
      mm, debug_pagealloc: don't rely on static keys too early · 8e57f8ac
      Vlastimil Babka authored
      Commit 96a2b03f ("mm, debug_pagelloc: use static keys to enable
      debugging") has introduced a static key to reduce overhead when
      debug_pagealloc is compiled in but not enabled.  It relied on the
      assumption that jump_label_init() is called before parse_early_param()
      as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
      it is safe to enable the static key.
      
      However, it turns out multiple architectures call parse_early_param()
      earlier from their setup_arch().  x86 also calls jump_label_init() even
      earlier, so no issue was found while testing the commit, but same is not
      true for e.g.  ppc64 and s390 where the kernel would not boot with
      debug_pagealloc=on as found by our QA.
      
      To fix this without tricky changes to init code of multiple
      architectures, this patch partially reverts the static key conversion
      from 96a2b03f.  Init-time and non-fastpath calls (such as in arch
      code) of debug_pagealloc_enabled() will again test a simple bool
      variable.  Fastpath mm code is converted to a new
      debug_pagealloc_enabled_static() variant that relies on the static key,
      which is enabled in a well-defined point in mm_init() where it's
      guaranteed that jump_label_init() has been called, regardless of
      architecture.
      
      [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
        Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
      Fixes: 96a2b03f ("mm, debug_pagelloc: use static keys to enable debugging")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8e57f8ac
    • Roman Gushchin's avatar
      mm: memcg/slab: fix percpu slab vmstats flushing · 4a87e2a2
      Roman Gushchin authored
      Currently slab percpu vmstats are flushed twice: during the memcg
      offlining and just before freeing the memcg structure.  Each time percpu
      counters are summed, added to the atomic counterparts and propagated up
      by the cgroup tree.
      
      The second flushing is required due to how recursive vmstats are
      implemented: counters are batched in percpu variables on a local level,
      and once a percpu value is crossing some predefined threshold, it spills
      over to atomic values on the local and each ascendant levels.  It means
      that without flushing some numbers cached in percpu variables will be
      dropped on floor each time a cgroup is destroyed.  And with uptime the
      error on upper levels might become noticeable.
      
      The first flushing aims to make counters on ancestor levels more
      precise.  Dying cgroups may resume in the dying state for a long time.
      After kmem_cache reparenting which is performed during the offlining
      slab counters of the dying cgroup don't have any chances to be updated,
      because any slab operations will be performed on the parent level.  It
      means that the inaccuracy caused by percpu batching will not decrease up
      to the final destruction of the cgroup.  By the original idea flushing
      slab counters during the offlining should minimize the visible
      inaccuracy of slab counters on the parent level.
      
      The problem is that percpu counters are not zeroed after the first
      flushing.  So every cached percpu value is summed twice.  It creates a
      small error (up to 32 pages per cpu, but usually less) which accumulates
      on parent cgroup level.  After creating and destroying of thousands of
      child cgroups, slab counter on parent level can be way off the real
      value.
      
      For now, let's just stop flushing slab counters on memcg offlining.  It
      can't be done correctly without scheduling a work on each cpu: reading
      and zeroing it during css offlining can race with an asynchronous
      update, which doesn't expect values to be changed underneath.
      
      With this change, slab counters on parent level will become eventually
      consistent.  Once all dying children are gone, values are correct.  And
      if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
      
      It's not perfect, as slab are reparented, so any updates after the
      reparenting will happen on the parent level.  It means that if a slab
      page was allocated, a counter on child level was bumped, then the page
      was reparented and freed, the annihilation of positive and negative
      counter values will not happen until the child cgroup is released.  It
      makes slab counters different from others, and it might want us to
      implement flushing in a correct form again.  But it's also a question of
      performance: scheduling a work on each cpu isn't free, and it's an open
      question if the benefit of having more accurate counters is worth it.
      
      We might also consider flushing all counters on offlining, not only slab
      counters.
      
      So let's fix the main problem now: make the slab counters eventually
      consistent, so at least the error won't grow with uptime (or more
      precisely the number of created and destroyed cgroups).  And think about
      the accuracy of counters separately.
      
      Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
      Fixes: bee07b33 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4a87e2a2
    • Kirill A. Shutemov's avatar
      mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment · 99158997
      Kirill A. Shutemov authored
      Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
      enabled.  But it doesn't work well with above-47bit hint address.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for allocation from full address space by specifying
      hint address (with or without MAP_FIXED) above 47-bits.  If the
      application doesn't need a particular address, but wants to allocate
      from whole address space it can specify -1 as a hint address.
      
      Unfortunately, this trick breaks THP alignment in shmem/tmp:
      shmem_get_unmapped_area() would not try to allocate PMD-aligned area if
      *any* hint address specified.
      
      This can be fixed by requesting the aligned area if the we failed to
      allocated at user-specified hint address.  The request with inflated
      length will also take the user-specified hint address.  This way we will
      not lose an allocation request from the full address space.
      
      [kirill@shutemov.name: fold in a fixup]
        Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
      Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Willhalm, Thomas" <thomas.willhalm@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99158997
    • Kirill A. Shutemov's avatar
      mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment · 97d3d0f9
      Kirill A. Shutemov authored
      Patch series "Fix two above-47bit hint address vs.  THP bugs".
      
      The two get_unmapped_area() implementations have to be fixed to provide
      THP-friendly mappings if above-47bit hint address is specified.
      
      This patch (of 2):
      
      Filesystems use thp_get_unmapped_area() to provide THP-friendly
      mappings.  For DAX in particular.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for allocation from full address space by specifying
      hint address (with or without MAP_FIXED) above 47-bits.  If the
      application doesn't need a particular address, but wants to allocate
      from whole address space it can specify -1 as a hint address.
      
      Unfortunately, this trick breaks thp_get_unmapped_area(): the function
      would not try to allocate PMD-aligned area if *any* hint address
      specified.
      
      Modify the routine to handle it correctly:
      
       - Try to allocate the space at the specified hint address with length
         padding required for PMD alignment.
       - If failed, retry without length padding (but with the same hint
         address);
       - If the returned address matches the hint address return it.
       - Otherwise, align the address as required for THP and return.
      
      The user specified hint address is passed down to get_unmapped_area() so
      above-47bit hint address will be taken into account without breaking
      alignment requirements.
      
      Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarThomas Willhalm <thomas.willhalm@intel.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97d3d0f9
    • David Hildenbrand's avatar
      mm/memory_hotplug: don't free usage map when removing a re-added early section · 8068df3b
      David Hildenbrand authored
      When we remove an early section, we don't free the usage map, as the
      usage maps of other sections are placed into the same page.  Once the
      section is removed, it is no longer an early section (especially, the
      memmap is freed).  When we re-add that section, the usage map is reused,
      however, it is no longer an early section.  When removing that section
      again, we try to kfree() a usage map that was allocated during early
      boot - bad.
      
      Let's check against PageReserved() to see if we are dealing with an
      usage map that was allocated during boot.  We could also check against
      !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
      cleaner.
      
      Can be triggered using memtrace under ppc64/powernv:
      
        $ mount -t debugfs none /sys/kernel/debug/
        $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
        $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
         ------------[ cut here ]------------
         kernel BUG at mm/slub.c:3969!
         Oops: Exception in kernel mode, sig: 5 [#1]
         LE PAGE_SIZE=3D64K MMU=3DHash SMP NR_CPUS=3D2048 NUMA PowerNV
         Modules linked in:
         CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
         NIP kfree+0x338/0x3b0
         LR section_deactivate+0x138/0x200
         Call Trace:
           section_deactivate+0x138/0x200
           __remove_pages+0x114/0x150
           arch_remove_memory+0x3c/0x160
           try_remove_memory+0x114/0x1a0
           __remove_memory+0x20/0x40
           memtrace_enable_set+0x254/0x850
           simple_attr_write+0x138/0x160
           full_proxy_write+0x8c/0x110
           __vfs_write+0x38/0x70
           vfs_write+0x11c/0x2a0
           ksys_write+0x84/0x140
           system_call+0x5c/0x68
         ---[ end trace 4b053cbd84e0db62 ]---
      
      The first invocation will offline+remove memory blocks.  The second
      invocation will first add+online them again, in order to offline+remove
      them again (usually we are lucky and the exact same memory blocks will
      get "reallocated").
      
      Tested on powernv with boot memory: The usage map will not get freed.
      Tested on x86-64 with DIMMs: The usage map will get freed.
      
      Using Dynamic Memory under a Power DLAPR can trigger it easily.
      
      Triggering removal (I assume after previously removed+re-added) of
      memory from the HMC GUI can crash the kernel with the same call trace
      and is fixed by this patch.
      
      Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
      Fixes: 326e1b8f ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarPingfan Liu <piliu@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8068df3b
    • Vlastimil Babka's avatar
      mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations · cc638f32
      Vlastimil Babka authored
      THP page faults now attempt a __GFP_THISNODE allocation first, which
      should only compact existing free memory, followed by another attempt
      that can allocate from any node using reclaim/compaction effort
      specified by global defrag setting and madvise.
      
      This patch makes the following changes to the scheme:
      
       - Before the patch, the first allocation relies on a check for
         pageblock order and __GFP_IO to prevent excessive reclaim. This
         however affects also the second attempt, which is not limited to
         single node.
      
         Instead of that, reuse the existing check for costly order
         __GFP_NORETRY allocations, and make sure the first THP attempt uses
         __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
         allocations will bail out if compaction needs reclaim, while
         previously they only bailed out when compaction was deferred due to
         previous failures.
      
         This should be still acceptable within the __GFP_NORETRY semantics.
      
       - Before the patch, the second allocation attempt (on all nodes) was
         passing __GFP_NORETRY. This is redundant as the check for pageblock
         order (discussed above) was stronger. It's also contrary to
         madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
         requested.
      
         After this patch, the second attempt doesn't pass __GFP_THISNODE nor
         __GFP_NORETRY.
      
      To sum up, THP page faults now try the following attempts:
      
      1. local node only THP allocation with no reclaim, just compaction.
      2. for madvised VMA's or when synchronous compaction is enabled always - THP
         allocation from any node with effort determined by global defrag setting
         and VMA madvise
      3. fallback to base pages on any node
      
      Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc638f32
  2. 13 Jan, 2020 2 commits
  3. 12 Jan, 2020 3 commits
  4. 11 Jan, 2020 2 commits
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 6327edce
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Two driver bugfixes, a documentation fix, and a removal of a spec
        violation for the bus recovery algorithm in the core"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: fix bus recovery stop mode timing
        i2c: bcm2835: Store pointer to bus clock
        dt-bindings: i2c: at91: fix i2c-sda-hold-time-ns documentation for sam9x60
        i2c: at91: fix clk_offset for sam9x60
      6327edce
    • Linus Torvalds's avatar
      Merge tag 'clone3-tls-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 606e9ad2
      Linus Torvalds authored
      Pull thread fixes from Christian Brauner:
       "This contains a series of patches to fix CLONE_SETTLS when used with
        clone3().
      
        The clone3() syscall passes the tls argument through struct clone_args
        instead of a register. This means, all architectures that do not
        implement copy_thread_tls() but still support CLONE_SETTLS via
        copy_thread() expecting the tls to be located in a register argument
        based on clone() are currently unfortunately broken. Their tls value
        will be garbage.
      
        The patch series fixes this on all architectures that currently define
        __ARCH_WANT_SYS_CLONE3. It also adds a compile-time check to ensure
        that any architecture that enables clone3() in the future is forced to
        also implement copy_thread_tls().
      
        My ultimate goal is to get rid of the copy_thread()/copy_thread_tls()
        split and just have copy_thread_tls() at some point in the not too
        distant future (Maybe even renaming copy_thread_tls() back to simply
        copy_thread() once the old function is ripped from all arches). This
        is dependent now on all arches supporting clone3().
      
        While all relevant arches do that now there are still four missing:
        ia64, m68k, sh and sparc. They have the system call reserved, but not
        implemented. Once they all implement clone3() we can get rid of
        ARCH_WANT_SYS_CLONE3 and HAVE_COPY_THREAD_TLS.
      
        This series also includes a minor fix for the arm64 uapi headers which
        caused __NR_clone3 to be missing from the exported user headers.
      
        Unfortunately the series came in a little late especially given that
        it touches a range of architectures. Due to the holidays not all arch
        maintainers responded in time probably due to their backlog. Will and
        Arnd have thankfully acked the arm specific changes.
      
        Given that the changes are straightforward and rather minimal combined
        with the fact the that clone3() with CLONE_SETTLS is broken I decided
        to send them post rc3 nonetheless"
      
      * tag 'clone3-tls-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        um: Implement copy_thread_tls
        clone3: ensure copy_thread_tls is implemented
        xtensa: Implement copy_thread_tls
        riscv: Implement copy_thread_tls
        parisc: Implement copy_thread_tls
        arm: Implement copy_thread_tls
        arm64: Implement copy_thread_tls
        arm64: Move __ARCH_WANT_SYS_CLONE3 definition to uapi headers
      606e9ad2
  5. 10 Jan, 2020 19 commits
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · ac61145a
      Linus Torvalds authored
      Pull HID fix from Jiri Kosina:
       "A regression fix for EPOLLOUT handling in hidraw and uhid"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
        HID: hidraw, uhid: Always report EPOLLOUT
      ac61145a
    • Linus Torvalds's avatar
      Merge tag 'usb-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 213356fe
      Linus Torvalds authored
      Pull USB/PHY fixes from Greg KH:
       "Here are a number of USB and PHY driver fixes for 5.5-rc6
      
        Nothing all that unusual, just the a bunch of small fixes for a lot of
        different reported issues. The PHY driver fixes are in here as they
        interacted with the usb drivers.
      
        Full details of the patches are in the shortlog, and all of these have
        been in linux-next with no reported issues"
      
      * tag 'usb-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (24 commits)
        usb: missing parentheses in USE_NEW_SCHEME
        usb: ohci-da8xx: ensure error return on variable error is set
        usb: musb: Disable pullup at init
        usb: musb: fix idling for suspend after disconnect interrupt
        usb: typec: ucsi: Fix the notification bit offsets
        USB: Fix: Don't skip endpoint descriptors with maxpacket=0
        USB-PD tcpm: bad warning+size, PPS adapters
        phy/rockchip: inno-hdmi: round clock rate down to closest 1000 Hz
        usb: chipidea: host: Disable port power only if previously enabled
        usb: cdns3: should not use the same dev_id for shared interrupt handler
        usb: dwc3: gadget: Fix request complete check
        usb: musb: dma: Correct parameter passed to IRQ handler
        usb: musb: jz4740: Silence error if code is -EPROBE_DEFER
        usb: udc: tegra: select USB_ROLE_SWITCH
        USB: core: fix check for duplicate endpoints
        phy: cpcap-usb: Drop extra write to usb2 register
        phy: cpcap-usb: Improve host vs docked mode detection
        phy: cpcap-usb: Prevent USB line glitches from waking up modem
        phy: mapphone-mdm6600: Fix uninitialized status value regression
        phy: cpcap-usb: Fix flakey host idling and enumerating of devices
        ...
      213356fe
    • Linus Torvalds's avatar
      Merge tag 'char-misc-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 9fb7007d
      Linus Torvalds authored
      Pull char/misc fix from Greg KH:
       "Here is a single fix, for the chrdev core, for 5.5-rc6
      
        There's been a long-standing race condition triggered by syzbot, and
        occasionally real people, in the chrdev open() path. Will finally took
        the time to track it down and fix it for real before the holidays.
      
        Here's that one patch, it's been in linux-next for a while with no
        reported issues and it does fix the reported problem"
      
      * tag 'char-misc-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        chardev: Avoid potential use-after-free in 'chrdev_open()'
      9fb7007d
    • Linus Torvalds's avatar
      Merge tag 'staging-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 7da37cd0
      Linus Torvalds authored
      Pull staging fixes from Greg KH:
       "Here are some small staging driver fixes for 5.5-rc6.
      
        Nothing major here, just some small fixes for a comedi driver, the
        vt6656 driver, and a new device id for the rtl8188eu driver.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'staging-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: rtl8188eu: Add device code for TP-Link TL-WN727N v5.21
        staging: comedi: adv_pci1710: fix AI channels 16-31 for PCI-1713
        staging: vt6656: set usb_set_intfdata on driver fail.
        staging: vt6656: remove bool from vnt_radio_power_on ret
        staging: vt6656: limit reg output to block size
        staging: vt6656: correct return of vnt_init_registers.
        staging: vt6656: Fix non zero logical return of, usb_control_msg
      7da37cd0
    • Linus Torvalds's avatar
      Merge tag 'tty-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 5a96c0bb
      Linus Torvalds authored
      Pull tty/serial fixes from Greg KH:
       "Here are two tty/serial driver fixes for 5.5-rc6.
      
        The first fixes a much much reported issue with a previous tty port
        link patch that is in your tree, and the second fixes a problem where
        the serdev driver would claim ACPI devices that it shouldn't be
        claiming.
      
        Both have been in linux-next for a while with no reported issues"
      
      * tag 'tty-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serdev: Don't claim unsupported ACPI serial devices
        tty: always relink the port
      5a96c0bb
    • Linus Torvalds's avatar
      Merge tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block · 4e4cd21c
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few fixes that should go into this round.
      
        This pull request contains two NVMe fixes via Keith, removal of a dead
        function, and a fix for the bio op for read truncates (Ming)"
      
      * tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block:
        nvmet: fix per feat data len for get_feature
        nvme: Translate more status codes to blk_status_t
        fs: move guard_bio_eod() after bio_set_op_attrs
        block: remove unused mp_bvec_last_segment
      4e4cd21c
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.5-2020-01-10' of git://git.kernel.dk/linux-block · 30b6487d
      Linus Torvalds authored
      Pull io_uring fix from Jens Axboe:
       "Single fix for this series, fixing a regression with the short read
        handling.
      
        This just removes it, as it cannot safely be done for all cases"
      
      * tag 'io_uring-5.5-2020-01-10' of git://git.kernel.dk/linux-block:
        io_uring: remove punt of short reads to async context
      30b6487d
    • Linus Torvalds's avatar
      Merge tag 'mtd/fixes-for-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux · 4936ce17
      Linus Torvalds authored
      Pull MTD fixes from Miquel Raynal:
       "MTD:
         - sm_ftl: Fix NULL pointer warning.
      
        Raw NAND:
         - Cadence: fix compile testing.
         - STM32: Avoid locking.
      
        Onenand:
         - Fix several sparse/build warnings.
      
        SPI-NOR:
         - Add a flag to fix interaction with Micron parts"
      
      * tag 'mtd/fixes-for-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
        mtd: spi-nor: Fix the writing of the Status Register on micron flashes
        mtd: sm_ftl: fix NULL pointer warning
        mtd: onenand: omap2: Pass correct flags for prep_dma_memcpy
        mtd: onenand: samsung: Fix iomem access with regular memcpy
        mtd: onenand: omap2: Fix errors in style
        mtd: cadence: Fix cast to pointer from integer of different size warning
        mtd: rawnand: stm32_fmc2: avoid to lock the CPU bus
      4936ce17
    • Linus Torvalds's avatar
      Merge tag 'sound-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · b1d198c0
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "A few piled ASoC fixes and usual HD-audio and USB-audio fixups. Some
        of them are for ASoC core error-handling"
      
      * tag 'sound-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda: enable regmap internal locking
        ALSA: hda/realtek - Add quirk for the bass speaker on Lenovo Yoga X1 7th gen
        ALSA: hda/realtek - Set EAPD control to default for ALC222
        ALSA: usb-audio: Apply the sample rate quirk for Bose Companion 5
        ALSA: hda/realtek - Add new codec supported for ALCS1200A
        ASoC: Intel: boards: Fix compile-testing RT1011/RT5682
        ASoC: SOF: imx8: Fix dsp_box offset
        ASoC: topology: Prevent use-after-free in snd_soc_get_pcm_runtime()
        ASoC: fsl_audmix: add missed pm_runtime_disable
        ASoC: stm32: spdifrx: fix input pin state management
        ASoC: stm32: spdifrx: fix race condition in irq handler
        ASoC: stm32: spdifrx: fix inconsistent lock state
        ASoC: core: Fix access to uninitialized list heads
        ASoC: soc-core: Set dpcm_playback / dpcm_capture
        ASoC: SOF: imx8: fix memory allocation failure check on priv->pd_dev
        ASoC: SOF: Intel: hda: hda-dai: fix oops on hda_link .hw_free
        ASoC: SOF: fix fault at driver unload after failed probe
      b1d198c0
    • Linus Torvalds's avatar
      Merge tag 'thermal-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux · 658e1af5
      Linus Torvalds authored
      Pull thermal fix from Daniel Lezcano:
       "Fix backward compatibility with old DTBs on QCOM tsens (Amit
        Kucheria)"
      
      * tag 'thermal-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux:
        drivers: thermal: tsens: Work with old DTBs
      658e1af5
    • Linus Torvalds's avatar
      Merge tag 'pm-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c23e744b
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "Prevent the cpufreq-dt driver from probing Tegra20/30 (Dmitry
        Osipenko) and prevent the Intel RAPL power capping driver from
        crashing during CPU initialization due to a NULL pointer dereference
        if the processor model in use is not known to it (Harry Pan)"
      
      * tag 'pm-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        powercap: intel_rapl: add NULL pointer check to rapl_mmio_cpu_online()
        cpufreq: dt-platdev: Blacklist NVIDIA Tegra20 and Tegra30 SoCs
      c23e744b
    • Amit Engel's avatar
      nvmet: fix per feat data len for get_feature · e17016f6
      Amit Engel authored
      The existing implementation for the get_feature admin-cmd does not
      use per-feature data len. This patch introduces a new helper function
      nvmet_feat_data_len(), which is used to calculate per feature data len.
      Right now we only set data len for fid 0x81 (NVME_FEAT_HOST_ID).
      
      Fixes: commit e9061c39 ("nvmet: Remove the data_len field from the nvmet_req struct")
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAmit Engel <amit.engel@dell.com>
      [endiness, naming, and kernel style fixes]
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e17016f6
    • Keith Busch's avatar
      nvme: Translate more status codes to blk_status_t · 35038bff
      Keith Busch authored
      Decode interrupted command and not ready namespace nvme status codes to
      BLK_STS_TARGET. These are not generic IO errors and should use a non-path
      specific error so that it can use the non-failover retry path.
      Reported-by: default avatarJohn Meneghini <John.Meneghini@netapp.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      35038bff
    • Jiri Kosina's avatar
      HID: hidraw, uhid: Always report EPOLLOUT · 9e635c28
      Jiri Kosina authored
      hidraw and uhid device nodes are always available for writing so we should
      always report EPOLLOUT and EPOLLWRNORM bits, not only in the cases when
      there is nothing to read.
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Fixes: be54e746 ("HID: uhid: Fix returning EPOLLOUT from uhid_char_poll")
      Fixes: 9f3b61dc ("HID: hidraw: Fix returning EPOLLOUT from hidraw_poll")
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      9e635c28
    • Rafael J. Wysocki's avatar
      Merge branch 'powercap' · 10674d97
      Rafael J. Wysocki authored
      * powercap:
        powercap: intel_rapl: add NULL pointer check to rapl_mmio_cpu_online()
      10674d97
    • Linus Torvalds's avatar
      Merge tag 'pstore-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · bef1d882
      Linus Torvalds authored
      Pull pstore fix from Kees Cook:
       "Cengiz Can forwarded a Coverity report about more problems with a rare
        pstore initialization error path, so the allocation lifetime was
        rearranged to avoid needing to share the kfree() responsibilities
        between caller and callee"
      
      * tag 'pstore-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        pstore/ram: Regularize prz label allocation lifetime
      bef1d882
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2020-01-10' of git://anongit.freedesktop.org/drm/drm · 6d25ef77
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Pre-LCA pull request I'm not sure how things will look next week,
        myself and Daniel are at LCA and I'm speaking quite late, so if I get
        my talk finished I'll probably process fixes.
      
        This week has a bunch of i915 fixes, some amdgpu fixes, one sun4i, one
        core MST, and one core fb_helper fix. More details below:
      
        core:
         - mst Fix NO_STOP_BIT bit offset (Wayne)
      
        fb_helper:
         - fb_helper: Fix bits_per_pixel param set behavior to round up
           (Geert)
      
        sun4i:
         - Fix RGB_DIV clock min divider on old hardware (Chen-Yu)
      
        amdgpu:
         - Stability fix for raven
         - Reduce pixel encoding to if max clock is exceeded on HDMI to allow
           additional high res modes
         - enable DRIVER_SYNCOBJ_TIMELINE for amdgpu
      
        i915:
         - Fix GitLab issue #446 causing GPU hangs: Do not restore invalid RS
           state
         - Fix GitLab issue #846: Restore coarse power gating that was
           disabled by initial RC66 context corruption security fixes.
         - Revert f6ec9483 ("drm/i915: extend audio CDCLK>=2*BCLK
           constraint to more platforms") to avoid screen flicker
         - Fix to fill in unitialized uabi_instance in virtual engine uAPI
         - Add two missing W/As for ICL and EHL"
      
      * tag 'drm-fixes-2020-01-10' of git://anongit.freedesktop.org/drm/drm:
        drm/amdgpu: add DRIVER_SYNCOBJ_TIMELINE to amdgpu
        drm/amd/display: Reduce HDMI pixel encoding if max clock is exceeded
        Revert "drm/amdgpu: Set no-retry as default."
        drm/fb-helper: Round up bits_per_pixel if possible
        drm/sun4i: tcon: Set RGB DCLK min. divider based on hardware model
        drm/i915/dp: Disable Port sync mode correctly on teardown
        drm/i915: Add Wa_1407352427:icl,ehl
        drm/i915: Add Wa_1408615072 and Wa_1407596294 to icl,ehl
        drm/i915/gt: Restore coarse power gating
        drm/i915/gt: Do not restore invalid RS state
        drm/i915: Limit audio CDCLK>=2*BCLK constraint back to GLK only
        drm/i915/gt: Mark up virtual engine uabi_instance
        drm/dp_mst: correct the shifting in DP_REMOTE_I2C_READ
      6d25ef77
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 5e7c1b75
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "First RDMA subsystem updates for 5.5-rc. A very small set of fixes,
        most people seem to still be recovering from December!
      
        Five small driver fixes:
      
         - Fix error flow with MR allocation in bnxt_re
      
         - An errata work around for bnxt_re
      
         - Misuse of the workqueue API in hfi1
      
         - Protocol error in hfi1
      
         - Regression in 5.5 related to the mmap rework with i40iw"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        i40iw: Remove setting of VMA private data and use rdma_user_mmap_io
        IB/hfi1: Adjust flow PSN with the correct resync_psn
        IB/hfi1: Don't cancel unused work item
        RDMA/bnxt_re: Fix Send Work Entry state check while polling completions
        RDMA/bnxt_re: Avoid freeing MR resources if dereg fails
      5e7c1b75
    • Dave Airlie's avatar
      Merge tag 'drm-intel-fixes-2020-01-09-1' of... · 023b3b0e
      Dave Airlie authored
      Merge tag 'drm-intel-fixes-2020-01-09-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
      
      - Fix GitLab issue #446 causing GPU hangs: Do not restore invalid RS state
      - Fix GitLab issue #846: Restore coarse power gating that was disabled
        by initial RC66 context corruption security fixes.
      - Revert f6ec9483 ("drm/i915: extend audio CDCLK>=2*BCLK constraint to more platforms")
        to avoid screen flicker
      - Fix to fill in unitialized uabi_instance in virtual engine uAPI
      - Add two missing W/As for ICL and EHL
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20200109133458.GA15558@jlahtine-desk.ger.corp.intel.com
      023b3b0e
  6. 09 Jan, 2020 3 commits