Commits · e033e7d4a8081852b2cca53e530e2c0f4e6769c0 · nexedi / linux

14 Jan, 2020 17 commits

Merge branch 'dhowells' (patches from DavidH) · e033e7d4

Linus Torvalds authored Jan 14, 2020

Merge misc fixes from David Howells.

Two afs fixes and a key refcounting fix.

* dhowells:
  afs: Fix afs_lookup() to not clobber the version on a new dentry
  afs: Fix use-after-loss-of-ref
  keys: Fix request_key() cache

e033e7d4

afs: Fix afs_lookup() to not clobber the version on a new dentry · f52b83b0

David Howells authored Jan 14, 2020

Fix afs_lookup() to not clobber the version set on a new dentry by
afs_do_lookup() - especially as it's using the wrong version of the
version (we need to use the one given to us by whatever op the dir
contents correspond to rather than what's in the afs_vnode).

Fixes: 9dd0b82e ("afs: Fix missing dentry data version updating")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f52b83b0

afs: Fix use-after-loss-of-ref · 40a708bd

David Howells authored Jan 14, 2020

afs_lookup() has a tracepoint to indicate the outcome of
d_splice_alias(), passing it the inode to retrieve the fid from.
However, the function gave up its ref on that inode when it called
d_splice_alias(), which may have failed and dropped the inode.

Fix this by caching the fid.

Fixes: 80548b03 ("afs: Add more tracepoints")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

40a708bd

keys: Fix request_key() cache · 8379bb84

David Howells authored Jan 14, 2020

When the key cached by request_key() and co. is cleaned up on exit(),
the code looks in the wrong task_struct, and so clears the wrong cache.
This leads to anomalies in key refcounting when doing, say, a kernel
build on an afs volume, that then trigger kasan to report a
use-after-free when the key is viewed in /proc/keys.

Fix this by making exit_creds() look in the passed-in task_struct rather
than in current (the task_struct cleanup code is deferred by RCU and
potentially run in another task).

Fixes: 7743c48e ("keys: Cache result of request_key*() temporarily in task_struct")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8379bb84

Merge branch 'akpm' (patches from Andrew) · 3f1f9a9b

Linus Torvalds authored Jan 14, 2020

Merge misc fixes from Andrew Morton:
 "11 mm fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: khugepaged: add trace status description for SCAN_PAGE_HAS_PRIVATE
  mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
  mm/page-writeback.c: improve arithmetic divisions
  mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide
  mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()
  mm, debug_pagealloc: don't rely on static keys too early
  mm: memcg/slab: fix percpu slab vmstats flushing
  mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment
  mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
  mm/memory_hotplug: don't free usage map when removing a re-added early section
  mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations

3f1f9a9b

Merge tag 'Intel-CVE-2019-14615' from bundle by Akeem Abodunrin. · 63d264fe

Linus Torvalds authored Jan 13, 2020

Merge Intel Gen9 graphics fix from Akeem Abodunrin:
 "Insufficient control flow in certain data structures for some Intel
  Processors with Intel Processor Graphics may allow an unauthenticated
  user to potentially enable information disclosure via local access

  This provides mitigation for Gen9 hardware. Note that Gen8 is not
  impacted due to a previously implemented workaround.

  The mitigation involves using an existing hardware feature to forcibly
  clear down all EU state at each context switch"

* tag 'Intel-CVE-2019-14615' of emailed bundle from Akeem G Abodunrin <akeem.g.abodunrin@intel.com>:
  drm/i915/gen9: Clear residual context state on context switch

63d264fe

mm: khugepaged: add trace status description for SCAN_PAGE_HAS_PRIVATE · 554913f6

Yang Shi authored Jan 13, 2020

Commit 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem)
FS") introduced a new khugepaged scan result: SCAN_PAGE_HAS_PRIVATE, but
the corresponding description for trace events were not added.

Link: http://lkml.kernel.org/r/1574793844-2914-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

554913f6

mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid · 2fe20210

Adrian Huang authored Jan 13, 2020

When booting with amd_iommu=off, the following WARNING message
appears:

  AMD-Vi: AMD IOMMU disabled on kernel command-line
  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
  Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
  RIP: 0010:flush_workqueue+0x42e/0x450
  Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
  Call Trace:
   kmem_cache_destroy+0x69/0x260
   iommu_go_to_state+0x40c/0x5ab
   amd_iommu_prepare+0x16/0x2a
   irq_remapping_prepare+0x36/0x5f
   enable_IR_x2apic+0x21/0x172
   default_setup_apic_routing+0x12/0x6f
   apic_intr_mode_init+0x1a1/0x1f1
   x86_late_time_init+0x17/0x1c
   start_kernel+0x480/0x53f
   secondary_startup_64+0xb6/0xc0
  ---[ end trace 30894107c3749449 ]---
  x2apic: IRQ remapping doesn't support X2APIC mode
  x2apic disabled

The warning is caused by the calling of 'kmem_cache_destroy()'
in free_iommu_resources(). Here is the call path:

  free_iommu_resources
    kmem_cache_destroy
      flush_memcg_workqueue
        flush_workqueue

The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, which the variable 'wq_online' is still 'false'.  This leads
to the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is
'true'.

Since the variable 'memcg_kmem_cache_wq' is not allocated during the
time, it is unnecessary to call flush_memcg_workqueue().  This prevents
the WARNING message triggered by flush_workqueue().

Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Xiaochun Lee <lixc17@lenovo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2fe20210

mm/page-writeback.c: improve arithmetic divisions · 0a5d1a7f

Wen Yang authored Jan 13, 2020

Use div64_ul() instead of do_div() if the divisor is unsigned long, to
avoid truncation to 32-bit on 64-bit platforms.

Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.comSigned-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0a5d1a7f

mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide · d3ac946e

Wen Yang authored Jan 13, 2020

The two variables 'numerator' and 'denominator', though they are
declared as long, they should actually be unsigned long (according to
the implementation of the fprop_fraction_percpu() function)

And do_div() does a 64-by-32 division, while the divisor 'denominator'
is unsigned long, thus 64-bit on 64-bit platforms.  Hence the proper
function to call is div64_ul().

Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.comSigned-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d3ac946e

mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio() · 6d9e8c65

Wen Yang authored Jan 13, 2020

Patch series "use div64_ul() instead of div_u64() if the divisor is
unsigned long".

We were first inspired by commit b0ab99e7 ("sched: Fix possible divide
by zero in avg_atom () calculation"), then refer to the recently analyzed
mm code, we found this suspicious place.

 201                 if (min) {
 202                         min *= this_bw;
 203                         do_div(min, tot_bw);
 204                 }

And we also disassembled and confirmed it:

  /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
  0xffffffff811c37da <__wb_calc_thresh+234>:      xor    %r10d,%r10d
  0xffffffff811c37dd <__wb_calc_thresh+237>:      test   %rax,%rax
  0xffffffff811c37e0 <__wb_calc_thresh+240>:      je 0xffffffff811c3800 <__wb_calc_thresh+272>
  /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
  0xffffffff811c37e2 <__wb_calc_thresh+242>:      imul   %r8,%rax
  /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
  0xffffffff811c37e6 <__wb_calc_thresh+246>:      mov    %r9d,%r10d    ---> truncates it to 32 bits here
  0xffffffff811c37e9 <__wb_calc_thresh+249>:      xor    %edx,%edx
  0xffffffff811c37eb <__wb_calc_thresh+251>:      div    %r10
  0xffffffff811c37ee <__wb_calc_thresh+254>:      imul   %rbx,%rax
  0xffffffff811c37f2 <__wb_calc_thresh+258>:      shr    $0x2,%rax
  0xffffffff811c37f6 <__wb_calc_thresh+262>:      mul    %rcx
  0xffffffff811c37f9 <__wb_calc_thresh+265>:      shr    $0x2,%rdx
  0xffffffff811c37fd <__wb_calc_thresh+269>:      mov    %rdx,%r10

This series uses div64_ul() instead of div_u64() if the divisor is
unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

This patch (of 3):

The variables 'min' and 'max' are unsigned long and do_div truncates
them to 32 bits, which means it can test non-zero and be truncated to
zero for division.  Fix this issue by using div64_ul() instead.

Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
Fixes: 693108a8 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6d9e8c65

mm, debug_pagealloc: don't rely on static keys too early · 8e57f8ac

Vlastimil Babka authored Jan 13, 2020

Commit 96a2b03f ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled.  It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.

However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch().  x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g.  ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.

To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f.  Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable.  Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.

[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
  Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8e57f8ac

mm: memcg/slab: fix percpu slab vmstats flushing · 4a87e2a2

Roman Gushchin authored Jan 13, 2020

Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure. Each time percpu
counters are summed, added to the atomic counterparts and propagated up
by the cgroup tree.

The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value is crossing some predefined threshold, it spills
over to atomic values on the local and each ascendant levels. It means
that without flushing some numbers cached in percpu variables will be
dropped on floor each time a cgroup is destroyed. And with uptime the
error on upper levels might become noticeable.

The first flushing aims to make counters on ancestor levels more
precise. Dying cgroups may resume in the dying state for a long time.
After kmem_cache reparenting which is performed during the offlining
slab counters of the dying cgroup don't have any chances to be updated,
because any slab operations will be performed on the parent level. It
means that the inaccuracy caused by percpu batching will not decrease up
to the final destruction of the cgroup. By the original idea flushing
slab counters during the offlining should minimize the visible
inaccuracy of slab counters on the parent level.

The problem is that percpu counters are not zeroed after the first
flushing. So every cached percpu value is summed twice. It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on parent cgroup level. After creating and destroying of thousands of
child cgroups, slab counter on parent level can be way off the real
value.

For now, let's just stop flushing slab counters on memcg offlining. It
can't be done correctly without scheduling a work on each cpu: reading
and zeroing it during css offlining can race with an asynchronous
update, which doesn't expect values to be changed underneath.

With this change, slab counters on parent level will become eventually
consistent. Once all dying children are gone, values are correct. And
if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

It's not perfect, as slab are reparented, so any updates after the
reparenting will happen on the parent level. It means that if a slab
page was allocated, a counter on child level was bumped, then the page
was reparented and freed, the annihilation of positive and negative
counter values will not happen until the child cgroup is released. It
makes slab counters different from others, and it might want us to
implement flushing in a correct form again. But it's also a question of
performance: scheduling a work on each cpu isn't free, and it's an open
question if the benefit of having more accurate counters is worth it.

We might also consider flushing all counters on offlining, not only slab
counters.

So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups). And think about
the accuracy of counters separately.

Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4a87e2a2

mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment · 99158997

Kirill A. Shutemov authored Jan 13, 2020

Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
enabled.  But it doesn't work well with above-47bit hint address.

Normally, the kernel doesn't create userspace mappings above 47-bit,
even if the machine allows this (such as with 5-level paging on x86-64).
Not all user space is ready to handle wide addresses.  It's known that
at least some JIT compilers use higher bits in pointers to encode their
information.

Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits.  If the
application doesn't need a particular address, but wants to allocate
from whole address space it can specify -1 as a hint address.

Unfortunately, this trick breaks THP alignment in shmem/tmp:
shmem_get_unmapped_area() would not try to allocate PMD-aligned area if
*any* hint address specified.

This can be fixed by requesting the aligned area if the we failed to
allocated at user-specified hint address.  The request with inflated
length will also take the user-specified hint address.  This way we will
not lose an allocation request from the full address space.

[kirill@shutemov.name: fold in a fixup]
  Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Willhalm, Thomas" <thomas.willhalm@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

99158997

mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment · 97d3d0f9

Kirill A. Shutemov authored Jan 13, 2020

Patch series "Fix two above-47bit hint address vs.  THP bugs".

The two get_unmapped_area() implementations have to be fixed to provide
THP-friendly mappings if above-47bit hint address is specified.

This patch (of 2):

Filesystems use thp_get_unmapped_area() to provide THP-friendly
mappings.  For DAX in particular.

Normally, the kernel doesn't create userspace mappings above 47-bit,
even if the machine allows this (such as with 5-level paging on x86-64).
Not all user space is ready to handle wide addresses.  It's known that
at least some JIT compilers use higher bits in pointers to encode their
information.

Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits.  If the
application doesn't need a particular address, but wants to allocate
from whole address space it can specify -1 as a hint address.

Unfortunately, this trick breaks thp_get_unmapped_area(): the function
would not try to allocate PMD-aligned area if *any* hint address
specified.

Modify the routine to handle it correctly:

 - Try to allocate the space at the specified hint address with length
   padding required for PMD alignment.
 - If failed, retry without length padding (but with the same hint
   address);
 - If the returned address matches the hint address return it.
 - Otherwise, align the address as required for THP and return.

The user specified hint address is passed down to get_unmapped_area() so
above-47bit hint address will be taken into account without breaking
alignment requirements.

Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Thomas Willhalm <thomas.willhalm@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

97d3d0f9

mm/memory_hotplug: don't free usage map when removing a re-added early section · 8068df3b

David Hildenbrand authored Jan 13, 2020

When we remove an early section, we don't free the usage map, as the
usage maps of other sections are placed into the same page.  Once the
section is removed, it is no longer an early section (especially, the
memmap is freed).  When we re-add that section, the usage map is reused,
however, it is no longer an early section.  When removing that section
again, we try to kfree() a usage map that was allocated during early
boot - bad.

Let's check against PageReserved() to see if we are dealing with an
usage map that was allocated during boot.  We could also check against
!(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
cleaner.

Can be triggered using memtrace under ppc64/powernv:

  $ mount -t debugfs none /sys/kernel/debug/
  $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
  $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
   ------------[ cut here ]------------
   kernel BUG at mm/slub.c:3969!
   Oops: Exception in kernel mode, sig: 5 [#1]
   LE PAGE_SIZE=3D64K MMU=3DHash SMP NR_CPUS=3D2048 NUMA PowerNV
   Modules linked in:
   CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
   NIP kfree+0x338/0x3b0
   LR section_deactivate+0x138/0x200
   Call Trace:
     section_deactivate+0x138/0x200
     __remove_pages+0x114/0x150
     arch_remove_memory+0x3c/0x160
     try_remove_memory+0x114/0x1a0
     __remove_memory+0x20/0x40
     memtrace_enable_set+0x254/0x850
     simple_attr_write+0x138/0x160
     full_proxy_write+0x8c/0x110
     __vfs_write+0x38/0x70
     vfs_write+0x11c/0x2a0
     ksys_write+0x84/0x140
     system_call+0x5c/0x68
   ---[ end trace 4b053cbd84e0db62 ]---

The first invocation will offline+remove memory blocks.  The second
invocation will first add+online them again, in order to offline+remove
them again (usually we are lucky and the exact same memory blocks will
get "reallocated").

Tested on powernv with boot memory: The usage map will not get freed.
Tested on x86-64 with DIMMs: The usage map will get freed.

Using Dynamic Memory under a Power DLAPR can trigger it easily.

Triggering removal (I assume after previously removed+re-added) of
memory from the HMC GUI can crash the kernel with the same call trace
and is fixed by this patch.

Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
Fixes: 326e1b8f ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8068df3b

mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations · cc638f32

Vlastimil Babka authored Jan 13, 2020

THP page faults now attempt a __GFP_THISNODE allocation first, which
should only compact existing free memory, followed by another attempt
that can allocate from any node using reclaim/compaction effort
specified by global defrag setting and madvise.

This patch makes the following changes to the scheme:

 - Before the patch, the first allocation relies on a check for
   pageblock order and __GFP_IO to prevent excessive reclaim. This
   however affects also the second attempt, which is not limited to
   single node.

   Instead of that, reuse the existing check for costly order
   __GFP_NORETRY allocations, and make sure the first THP attempt uses
   __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
   allocations will bail out if compaction needs reclaim, while
   previously they only bailed out when compaction was deferred due to
   previous failures.

   This should be still acceptable within the __GFP_NORETRY semantics.

 - Before the patch, the second allocation attempt (on all nodes) was
   passing __GFP_NORETRY. This is redundant as the check for pageblock
   order (discussed above) was stronger. It's also contrary to
   madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
   requested.

   After this patch, the second attempt doesn't pass __GFP_THISNODE nor
   __GFP_NORETRY.

To sum up, THP page faults now try the following attempts:

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
   allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cc638f32

13 Jan, 2020 2 commits

Linux 5.5-rc6 · b3a987b0
Linus Torvalds authored Jan 12, 2020

b3a987b0

Merge tag 'riscv/for-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 373adb73

Linus Torvalds authored Jan 12, 2020

Pull RISC-V fixes from Paul Walmsley:
 "Two fixes for RISC-V:

   - Clear FP registers during boot when FP support is present, rather
     than when they aren't present

   - Move the header files associated with the SiFive L2 cache
     controller to drivers/soc (where the code was recently moved)"

* tag 'riscv/for-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Fixup obvious bug for fp-regs reset
  riscv: move sifive_l2_cache.h to include/soc

373adb73

12 Jan, 2020 3 commits

riscv: Fixup obvious bug for fp-regs reset · dc6fcba7

Guo Ren authored Jan 05, 2020

CSR_MISA is defined in Privileged Architectures' spec: 3.1.1 Machine
ISA Register misa. Every bit:1 indicate a feature, so we should beqz
reset_done when there is no F/D bit in csr_misa register.
Signed-off-by: Guo Ren <ren_guo@c-sky.com>
[paul.walmsley@sifive.com: fix typo in commit message]
Fixes: 9e806356 ("riscv: clear the instruction cache and all registers when booting")
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>

dc6fcba7

riscv: move sifive_l2_cache.h to include/soc · 13cf4cf0

Yash Shah authored Jan 07, 2020

The commit 9209fb51 ("riscv: move sifive_l2_cache.c to drivers/soc")
moves the sifive L2 cache driver to driver/soc. It did not move the
header file along with the driver. Therefore this patch moves the header
file to driver/soc
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: updated to fix the include guard]
Fixes: 9209fb51 ("riscv: move sifive_l2_cache.c to drivers/soc")
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>

13cf4cf0

Merge tag 'iommu-fixes-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 040a3c33

Linus Torvalds authored Jan 12, 2020

Pull iommu fixes from Joerg Roedel:

 - Two fixes for VT-d and generic IOMMU code to fix teardown on error
   handling code paths.

 - Patch for the Intel VT-d driver to fix handling of non-PCI devices

 - Fix W=1 compile warning in dma-iommu code

* tag 'iommu-fixes-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/dma: fix variable 'cookie' set but not used
  iommu/vt-d: Unlink device if failed to add to group
  iommu: Remove device link to group on failure
  iommu/vt-d: Fix adding non-PCI devices to Intel IOMMU

040a3c33

11 Jan, 2020 2 commits

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 6327edce

Linus Torvalds authored Jan 11, 2020

Pull i2c fixes from Wolfram Sang:
 "Two driver bugfixes, a documentation fix, and a removal of a spec
  violation for the bus recovery algorithm in the core"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: fix bus recovery stop mode timing
  i2c: bcm2835: Store pointer to bus clock
  dt-bindings: i2c: at91: fix i2c-sda-hold-time-ns documentation for sam9x60
  i2c: at91: fix clk_offset for sam9x60

6327edce

Merge tag 'clone3-tls-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 606e9ad2

Linus Torvalds authored Jan 11, 2020

Pull thread fixes from Christian Brauner:
"This contains a series of patches to fix CLONE_SETTLS when used with
clone3().

The clone3() syscall passes the tls argument through struct clone_args
instead of a register. This means, all architectures that do not
implement copy_thread_tls() but still support CLONE_SETTLS via
copy_thread() expecting the tls to be located in a register argument
based on clone() are currently unfortunately broken. Their tls value
will be garbage.

The patch series fixes this on all architectures that currently define
__ARCH_WANT_SYS_CLONE3. It also adds a compile-time check to ensure
that any architecture that enables clone3() in the future is forced to
also implement copy_thread_tls().

My ultimate goal is to get rid of the copy_thread()/copy_thread_tls()
split and just have copy_thread_tls() at some point in the not too
distant future (Maybe even renaming copy_thread_tls() back to simply
copy_thread() once the old function is ripped from all arches). This
is dependent now on all arches supporting clone3().

While all relevant arches do that now there are still four missing:
ia64, m68k, sh and sparc. They have the system call reserved, but not
implemented. Once they all implement clone3() we can get rid of
ARCH_WANT_SYS_CLONE3 and HAVE_COPY_THREAD_TLS.

This series also includes a minor fix for the arm64 uapi headers which
caused __NR_clone3 to be missing from the exported user headers.

Unfortunately the series came in a little late especially given that
it touches a range of architectures. Due to the holidays not all arch
maintainers responded in time probably due to their backlog. Will and
Arnd have thankfully acked the arm specific changes.

Given that the changes are straightforward and rather minimal combined
with the fact the that clone3() with CLONE_SETTLS is broken I decided
to send them post rc3 nonetheless"

* tag 'clone3-tls-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
um: Implement copy_thread_tls
clone3: ensure copy_thread_tls is implemented
xtensa: Implement copy_thread_tls
riscv: Implement copy_thread_tls
parisc: Implement copy_thread_tls
arm: Implement copy_thread_tls
arm64: Implement copy_thread_tls
arm64: Move __ARCH_WANT_SYS_CLONE3 definition to uapi headers

606e9ad2

10 Jan, 2020 16 commits

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · ac61145a

Linus Torvalds authored Jan 10, 2020

Pull HID fix from Jiri Kosina:
 "A regression fix for EPOLLOUT handling in hidraw and uhid"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: hidraw, uhid: Always report EPOLLOUT

ac61145a

Merge tag 'usb-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 213356fe

Linus Torvalds authored Jan 10, 2020

Pull USB/PHY fixes from Greg KH:
 "Here are a number of USB and PHY driver fixes for 5.5-rc6

  Nothing all that unusual, just the a bunch of small fixes for a lot of
  different reported issues. The PHY driver fixes are in here as they
  interacted with the usb drivers.

  Full details of the patches are in the shortlog, and all of these have
  been in linux-next with no reported issues"

* tag 'usb-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (24 commits)
  usb: missing parentheses in USE_NEW_SCHEME
  usb: ohci-da8xx: ensure error return on variable error is set
  usb: musb: Disable pullup at init
  usb: musb: fix idling for suspend after disconnect interrupt
  usb: typec: ucsi: Fix the notification bit offsets
  USB: Fix: Don't skip endpoint descriptors with maxpacket=0
  USB-PD tcpm: bad warning+size, PPS adapters
  phy/rockchip: inno-hdmi: round clock rate down to closest 1000 Hz
  usb: chipidea: host: Disable port power only if previously enabled
  usb: cdns3: should not use the same dev_id for shared interrupt handler
  usb: dwc3: gadget: Fix request complete check
  usb: musb: dma: Correct parameter passed to IRQ handler
  usb: musb: jz4740: Silence error if code is -EPROBE_DEFER
  usb: udc: tegra: select USB_ROLE_SWITCH
  USB: core: fix check for duplicate endpoints
  phy: cpcap-usb: Drop extra write to usb2 register
  phy: cpcap-usb: Improve host vs docked mode detection
  phy: cpcap-usb: Prevent USB line glitches from waking up modem
  phy: mapphone-mdm6600: Fix uninitialized status value regression
  phy: cpcap-usb: Fix flakey host idling and enumerating of devices
  ...

213356fe

Merge tag 'char-misc-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 9fb7007d

Linus Torvalds authored Jan 10, 2020

Pull char/misc fix from Greg KH:
 "Here is a single fix, for the chrdev core, for 5.5-rc6

  There's been a long-standing race condition triggered by syzbot, and
  occasionally real people, in the chrdev open() path. Will finally took
  the time to track it down and fix it for real before the holidays.

  Here's that one patch, it's been in linux-next for a while with no
  reported issues and it does fix the reported problem"

* tag 'char-misc-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  chardev: Avoid potential use-after-free in 'chrdev_open()'

9fb7007d

Merge tag 'staging-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 7da37cd0

Linus Torvalds authored Jan 10, 2020

Pull staging fixes from Greg KH:
 "Here are some small staging driver fixes for 5.5-rc6.

  Nothing major here, just some small fixes for a comedi driver, the
  vt6656 driver, and a new device id for the rtl8188eu driver.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'staging-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8188eu: Add device code for TP-Link TL-WN727N v5.21
  staging: comedi: adv_pci1710: fix AI channels 16-31 for PCI-1713
  staging: vt6656: set usb_set_intfdata on driver fail.
  staging: vt6656: remove bool from vnt_radio_power_on ret
  staging: vt6656: limit reg output to block size
  staging: vt6656: correct return of vnt_init_registers.
  staging: vt6656: Fix non zero logical return of, usb_control_msg

7da37cd0

Merge tag 'tty-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 5a96c0bb

Linus Torvalds authored Jan 10, 2020

Pull tty/serial fixes from Greg KH:
 "Here are two tty/serial driver fixes for 5.5-rc6.

  The first fixes a much much reported issue with a previous tty port
  link patch that is in your tree, and the second fixes a problem where
  the serdev driver would claim ACPI devices that it shouldn't be
  claiming.

  Both have been in linux-next for a while with no reported issues"

* tag 'tty-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serdev: Don't claim unsupported ACPI serial devices
  tty: always relink the port

5a96c0bb

Merge tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block · 4e4cd21c

Linus Torvalds authored Jan 10, 2020

Pull block fixes from Jens Axboe:
 "A few fixes that should go into this round.

  This pull request contains two NVMe fixes via Keith, removal of a dead
  function, and a fix for the bio op for read truncates (Ming)"

* tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block:
  nvmet: fix per feat data len for get_feature
  nvme: Translate more status codes to blk_status_t
  fs: move guard_bio_eod() after bio_set_op_attrs
  block: remove unused mp_bvec_last_segment

4e4cd21c

Merge tag 'io_uring-5.5-2020-01-10' of git://git.kernel.dk/linux-block · 30b6487d

Linus Torvalds authored Jan 10, 2020

Pull io_uring fix from Jens Axboe:
 "Single fix for this series, fixing a regression with the short read
  handling.

  This just removes it, as it cannot safely be done for all cases"

* tag 'io_uring-5.5-2020-01-10' of git://git.kernel.dk/linux-block:
  io_uring: remove punt of short reads to async context

30b6487d

Merge tag 'mtd/fixes-for-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux · 4936ce17

Linus Torvalds authored Jan 10, 2020

Pull MTD fixes from Miquel Raynal:
 "MTD:
   - sm_ftl: Fix NULL pointer warning.

  Raw NAND:
   - Cadence: fix compile testing.
   - STM32: Avoid locking.

  Onenand:
   - Fix several sparse/build warnings.

  SPI-NOR:
   - Add a flag to fix interaction with Micron parts"

* tag 'mtd/fixes-for-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
  mtd: spi-nor: Fix the writing of the Status Register on micron flashes
  mtd: sm_ftl: fix NULL pointer warning
  mtd: onenand: omap2: Pass correct flags for prep_dma_memcpy
  mtd: onenand: samsung: Fix iomem access with regular memcpy
  mtd: onenand: omap2: Fix errors in style
  mtd: cadence: Fix cast to pointer from integer of different size warning
  mtd: rawnand: stm32_fmc2: avoid to lock the CPU bus

4936ce17

Merge tag 'sound-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · b1d198c0

Linus Torvalds authored Jan 10, 2020

Pull sound fixes from Takashi Iwai:
 "A few piled ASoC fixes and usual HD-audio and USB-audio fixups. Some
  of them are for ASoC core error-handling"

* tag 'sound-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda: enable regmap internal locking
  ALSA: hda/realtek - Add quirk for the bass speaker on Lenovo Yoga X1 7th gen
  ALSA: hda/realtek - Set EAPD control to default for ALC222
  ALSA: usb-audio: Apply the sample rate quirk for Bose Companion 5
  ALSA: hda/realtek - Add new codec supported for ALCS1200A
  ASoC: Intel: boards: Fix compile-testing RT1011/RT5682
  ASoC: SOF: imx8: Fix dsp_box offset
  ASoC: topology: Prevent use-after-free in snd_soc_get_pcm_runtime()
  ASoC: fsl_audmix: add missed pm_runtime_disable
  ASoC: stm32: spdifrx: fix input pin state management
  ASoC: stm32: spdifrx: fix race condition in irq handler
  ASoC: stm32: spdifrx: fix inconsistent lock state
  ASoC: core: Fix access to uninitialized list heads
  ASoC: soc-core: Set dpcm_playback / dpcm_capture
  ASoC: SOF: imx8: fix memory allocation failure check on priv->pd_dev
  ASoC: SOF: Intel: hda: hda-dai: fix oops on hda_link .hw_free
  ASoC: SOF: fix fault at driver unload after failed probe

b1d198c0

Merge tag 'thermal-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux · 658e1af5

Linus Torvalds authored Jan 10, 2020

Pull thermal fix from Daniel Lezcano:
 "Fix backward compatibility with old DTBs on QCOM tsens (Amit
  Kucheria)"

* tag 'thermal-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux:
  drivers: thermal: tsens: Work with old DTBs

658e1af5

Merge tag 'pm-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c23e744b

Linus Torvalds authored Jan 10, 2020

Pull power management fixes from Rafael Wysocki:
 "Prevent the cpufreq-dt driver from probing Tegra20/30 (Dmitry
  Osipenko) and prevent the Intel RAPL power capping driver from
  crashing during CPU initialization due to a NULL pointer dereference
  if the processor model in use is not known to it (Harry Pan)"

* tag 'pm-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  powercap: intel_rapl: add NULL pointer check to rapl_mmio_cpu_online()
  cpufreq: dt-platdev: Blacklist NVIDIA Tegra20 and Tegra30 SoCs

c23e744b

nvmet: fix per feat data len for get_feature · e17016f6

Amit Engel authored Jan 08, 2020

The existing implementation for the get_feature admin-cmd does not
use per-feature data len. This patch introduces a new helper function
nvmet_feat_data_len(), which is used to calculate per feature data len.
Right now we only set data len for fid 0x81 (NVME_FEAT_HOST_ID).

Fixes: commit e9061c39 ("nvmet: Remove the data_len field from the nvmet_req struct")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amit Engel <amit.engel@dell.com>
[endiness, naming, and kernel style fixes]
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e17016f6

nvme: Translate more status codes to blk_status_t · 35038bff

Keith Busch authored Dec 06, 2019

Decode interrupted command and not ready namespace nvme status codes to
BLK_STS_TARGET. These are not generic IO errors and should use a non-path
specific error so that it can use the non-failover retry path.
Reported-by: John Meneghini <John.Meneghini@netapp.com>
Cc: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

35038bff

HID: hidraw, uhid: Always report EPOLLOUT · 9e635c28

Jiri Kosina authored Jan 10, 2020

hidraw and uhid device nodes are always available for writing so we should
always report EPOLLOUT and EPOLLWRNORM bits, not only in the cases when
there is nothing to read.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: be54e746 ("HID: uhid: Fix returning EPOLLOUT from uhid_char_poll")
Fixes: 9f3b61dc ("HID: hidraw: Fix returning EPOLLOUT from hidraw_poll")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

9e635c28

Merge branch 'powercap' · 10674d97

Rafael J. Wysocki authored Jan 10, 2020

* powercap:
  powercap: intel_rapl: add NULL pointer check to rapl_mmio_cpu_online()

10674d97

Merge tag 'pstore-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · bef1d882

Linus Torvalds authored Jan 09, 2020

Pull pstore fix from Kees Cook:
 "Cengiz Can forwarded a Coverity report about more problems with a rare
  pstore initialization error path, so the allocation lifetime was
  rearranged to avoid needing to share the kfree() responsibilities
  between caller and callee"

* tag 'pstore-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Regularize prz label allocation lifetime

bef1d882