Commits · cefc7ef3c87d02fc9307835868ff721ea12cc597 · Kirill Smelkov / linux

01 Feb, 2019 26 commits

mm, oom: fix use-after-free in oom_kill_process · cefc7ef3

Shakeel Butt authored Feb 01, 2019

Syzbot instance running on upstream kernel found a use-after-free bug in
oom_kill_process.  On further inspection it seems like the process
selected to be oom-killed has exited even before reaching
read_lock(&tasklist_lock) in oom_kill_process().  More specifically the
tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
and the put_task_struct within for_each_thread() frees the tsk and
for_each_thread() tries to access the tsk.  The easiest fix is to do
get/put across the for_each_thread() on the selected task.

Now the next question is should we continue with the oom-kill as the
previously selected task has exited? However before adding more
complexity and heuristics, let's answer why we even look at the children
of oom-kill selected task? The select_bad_process() has already selected
the worst process in the system/memcg.  Due to race, the selected
process might not be the worst at the kill time but does that matter?
The userspace can use the oom_score_adj interface to prefer children to
be killed before the parent.  I looked at the history but it seems like
this is there before git history.

Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
Fixes: 6b0c81b3 ("mm, oom: reduce dependency on tasklist_lock")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cefc7ef3

mm/hotplug: invalid PFNs from pfn_to_online_page() · b13bc351

Qian Cai authored Feb 01, 2019

On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is
not directly mapped (MEMBLOCK_NOMAP).  Hence, the page->flags is
uninitialized.

This is due to the commit 9f1eb38e ("mm, kmemleak: little
optimization while scanning") starts to use pfn_to_online_page() instead
of pfn_valid().  However, in the CONFIG_MEMORY_HOTPLUG=y case,
pfn_to_online_page() does not call memblock_is_map_memory() while
pfn_valid() does.

Historically, the commit 68709f45 ("arm64: only consider memblocks
with NOMAP cleared for linear mapping") causes pages marked as nomap
being no long reassigned to the new zone in memmap_init_zone() by
calling __init_single_page().

Since the commit 2d070eab ("mm: consider zone which is not fully
populated to have holes") introduced pfn_to_online_page() and was
designed to return a valid pfn only, but it is clearly broken on arm64.

Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
handle nomap thanks to the commit f52bb98f ("arm64: mm: always
enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
architectures where have no HOLES_IN_ZONE.

[1]
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
  Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
  Internal error: Oops: 96000005 [#1] SMP
  CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
  pstate: 60400009 (nZCv daif +PAN -UAO)
  pc : page_mapping+0x24/0x144
  lr : __dump_page+0x34/0x3dc
  sp : ffff00003a5cfd10
  x29: ffff00003a5cfd10 x28: 000000000000802f
  x27: 0000000000000000 x26: 0000000000277d00
  x25: ffff000010791f56 x24: ffff7fe000000000
  x23: ffff000010772f8b x22: ffff00001125f670
  x21: ffff000011311000 x20: ffff000010772f8b
  x19: fffffffffffffffe x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000
  x15: 0000000000000000 x14: ffff802698b19600
  x13: ffff802698b1a200 x12: ffff802698b16f00
  x11: ffff802698b1a400 x10: 0000000000001400
  x9 : 0000000000000001 x8 : ffff00001121a000
  x7 : 0000000000000000 x6 : ffff0000102c53b8
  x5 : 0000000000000000 x4 : 0000000000000003
  x3 : 0000000000000100 x2 : 0000000000000000
  x1 : ffff000010772f8b x0 : ffffffffffffffff
  Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____))
  Call trace:
   page_mapping+0x24/0x144
   __dump_page+0x34/0x3dc
   dump_page+0x28/0x4c
   kmemleak_scan+0x4ac/0x680
   kmemleak_scan_thread+0xb4/0xdc
   kthread+0x12c/0x13c
   ret_from_fork+0x10/0x18
  Code: d503201f f9400660 36000040 d1000413 (f9400661)
  ---[ end trace 4d4bd7f573490c8e ]---
  Kernel panic - not syncing: Fatal exception
  SMP: stopping secondary CPUs
  Kernel Offset: disabled
  CPU features: 0x002,20000c38
  Memory Limit: none
  ---[ end Kernel panic - not syncing: Fatal exception ]---

Link: http://lkml.kernel.org/r/20190122132916.28360-1-cai@lca.pw
Fixes: 9f1eb38e ("mm, kmemleak: little optimization while scanning")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b13bc351

mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages · eeb0efd0

Oscar Salvador authored Feb 01, 2019

This is the same sort of error we saw in commit 17e2e7d7 ("mm,
page_alloc: fix has_unmovable_pages for HugePages").

Gigantic hugepages cross several memblocks, so it can be that the page
we get in scan_movable_pages() is a page-tail belonging to a
1G-hugepage.  If that happens, page_hstate()->size_to_hstate() will
return NULL, and we will blow up in hugepage_migration_supported().

The splat is as follows:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  #PF error: [normal kernel read fault]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 1 PID: 1350 Comm: bash Tainted: G            E     5.0.0-rc1-mm1-1-default+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  RIP: 0010:__offline_pages+0x6ae/0x900
  Call Trace:
   memory_subsys_offline+0x42/0x60
   device_offline+0x80/0xa0
   state_store+0xab/0xc0
   kernfs_fop_write+0x102/0x180
   __vfs_write+0x26/0x190
   vfs_write+0xad/0x1b0
   ksys_write+0x42/0x90
   do_syscall_64+0x5b/0x180
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)

[akpm@linux-foundation.org: fix brace layout, per David.  Reduce indentation]
Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.deSigned-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eeb0efd0

psi: fix aggregation idle shut-off · 1b69ac6b

Johannes Weiner authored Feb 01, 2019

psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1b69ac6b

mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone · 24feb47c

Mikhail Zaslonko authored Feb 01, 2019

If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized.  This may lead
to VM_BUG_ON due to uninitialized struct pages access from
test_pages_in_a_zone() function triggered by memory_hotplug sysfs
handlers.

Here are the the panic examples:
 CONFIG_DEBUG_VM_PGFLAGS=y
 kernel parameter mem=2050M
 --------------------------
 page:000003d082008000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
   test_pages_in_a_zone+0xde/0x160
   show_valid_zones+0x5c/0x190
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   test_pages_in_a_zone+0xde/0x160
 Kernel panic - not syncing: Fatal exception: panic_on_oops

Fix this by checking whether the pfn to check is within the zone.

[mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org

[mhocko@suse.com: separated this change from
http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24feb47c

mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone · efad4e47

Michal Hocko authored Feb 01, 2019

Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.

Mikhail Zaslonko has posted fixes for the two bugs quite some time ago
[1].  I have pushed back on those fixes because I believed that it is
much better to plug the problem at the initialization time rather than
play whack-a-mole all over the hotplug code and find all the places
which expect the full memory section to be initialized.

We have ended up with commit 2830bf6f ("mm, memory_hotplug:
initialize struct pages for the full memory section") merged and cause a
regression [2][3].  The reason is that there might be memory layouts
when two NUMA nodes share the same memory section so the merged fix is
simply incorrect.

In order to plug this hole we really have to be zone range aware in
those handlers.  I have split up the original patch into two.  One is
unchanged (patch 2) and I took a different approach for `removable'
crash.

[1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
[3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz

This patch (of 2):

Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
removable state of a memory block:

 page:000003d08300c000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
   is_mem_section_removable+0xb4/0x190
   show_mem_removable+0x9a/0xd8
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   is_mem_section_removable+0xb4/0x190
 Kernel panic - not syncing: Fatal exception: panic_on_oops

The reason is that the memory block spans the zone boundary and we are
stumbling over an unitialized struct page.  Fix this by enforcing zone
range in is_mem_section_removable so that we never run away from a zone.

Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.orgSigned-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Debugged-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

efad4e47

oom, oom_reaper: do not enqueue same task twice · 9bcdeb51

Tetsuo Handa authored Feb 01, 2019

Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0.  It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.

This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

  T1@P1     |T2@P1     |T3@P1     |OOM reaper
  ----------+----------+----------+------------
                                   # Processing an OOM victim in a different memcg domain.
                        try_charge()
                          mem_cgroup_out_of_memory()
                            mutex_lock(&oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(&oom_lock)
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(&oom_lock)
                            out_of_memory()
                              oom_kill_process(P1)
                                do_send_sig_info(SIGKILL, @P1)
                                mark_oom_victim(T1@P1)
                                wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                            mutex_unlock(&oom_lock)
                 out_of_memory()
                   mark_oom_victim(T2@P1)
                   wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                 mutex_unlock(&oom_lock)
      out_of_memory()
        mark_oom_victim(T1@P1)
        wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
      mutex_unlock(&oom_lock)
                                   # Completed processing an OOM victim in a different memcg domain.
                                   spin_lock(&oom_reaper_lock)
                                   # T1P1 is dequeued.
                                   spin_unlock(&oom_reaper_lock)

but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.

Fix this bug using an approach used by commit 855b0183 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.

Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9bcdeb51

mm: migrate: make buffer_migrate_page_norefs() actually succeed · 80409c65

Jan Kara authored Feb 01, 2019

Currently, buffer_migrate_page_norefs() was constantly failing because
buffer_migrate_lock_buffers() grabbed reference on each buffer.  In
fact, there's no reason for buffer_migrate_lock_buffers() to grab any
buffer references as the page is locked during all our operation and
thus nobody can reclaim buffers from the page.

So remove grabbing of buffer references which also makes
buffer_migrate_page_norefs() succeed.

Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz
Fixes: 89cb0888 "mm: migrate: provide buffer_migrate_page_norefs()"
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

80409c65

kernel/exit.c: release ptraced tasks before zap_pid_ns_processes · 8fb335e0

Andrei Vagin authored Feb 01, 2019

Currently, exit_ptrace() adds all ptraced tasks in a dead list, then
zap_pid_ns_processes() waits on all tasks in a current pidns, and only
then are tasks from the dead list released.

zap_pid_ns_processes() can get stuck on waiting tasks from the dead
list.  In this case, we will have one unkillable process with one or
more dead children.

Thanks to Oleg for the advice to release tasks in find_child_reaper().

Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
Fixes: 7c8bd232 ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8fb335e0

x86_64: increase stack size for KASAN_EXTRA · a8e911d1

Qian Cai authored Feb 01, 2019

If the kernel is configured with KASAN_EXTRA, the stack size is
increasted significantly because this option sets "-fstack-reuse" to
"none" in GCC [1].  As a result, it triggers stack overrun quite often
with 32k stack size compiled using GCC 8.  For example, this reproducer

  https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise06.c

triggers a "corrupted stack end detected inside scheduler" very reliably
with CONFIG_SCHED_STACK_END_CHECK enabled.

There are just too many functions that could have a large stack with
KASAN_EXTRA due to large local variables that have been called over and
over again without being able to reuse the stacks.  Some noticiable ones
are

  size
  7648 shrink_page_list
  3584 xfs_rmap_convert
  3312 migrate_page_move_mapping
  3312 dev_ethtool
  3200 migrate_misplaced_transhuge_page
  3168 copy_process

There are other 49 functions are over 2k in size while compiling kernel
with "-Wframe-larger-than=" even with a related minimal config on this
machine.  Hence, it is too much work to change Makefiles for each object
to compile without "-fsanitize-address-use-after-scope" individually.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23

Although there is a patch in GCC 9 to help the situation, GCC 9 probably
won't be released in a few months and then it probably take another
6-month to 1-year for all major distros to include it as a default.
Hence, the stack usage with KASAN_EXTRA can be revisited again in 2020
when GCC 9 is everywhere.  Until then, this patch will help users avoid
stack overrun.

This has already been fixed for arm64 for the same reason via
6e883067 ("arm64: kasan: Increase stack size for KASAN_EXTRA").

Link: http://lkml.kernel.org/r/20190109215209.2903-1-cai@lca.pwSigned-off-by: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a8e911d1

mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT · 1ac25013

Andrea Arcangeli authored Feb 01, 2019

hugetlb needs the same fix as faultin_nopage (which was applied in
commit 96312e61 ("mm/gup.c: teach get_user_pages_unlocked to handle
FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already
released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't
in the FOLL_NOWAIT case.

Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com
Fixes: ce53053c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1ac25013

arch: unexport asm/shmparam.h for all architectures · 36c0f7f0

Masahiro Yamada authored Feb 01, 2019

Most architectures do not export shmparam.h to user-space.

  $ find arch -name shmparam.h  | sort
  arch/alpha/include/asm/shmparam.h
  arch/arc/include/asm/shmparam.h
  arch/arm64/include/asm/shmparam.h
  arch/arm/include/asm/shmparam.h
  arch/csky/include/asm/shmparam.h
  arch/ia64/include/asm/shmparam.h
  arch/mips/include/asm/shmparam.h
  arch/nds32/include/asm/shmparam.h
  arch/nios2/include/asm/shmparam.h
  arch/parisc/include/asm/shmparam.h
  arch/powerpc/include/asm/shmparam.h
  arch/s390/include/asm/shmparam.h
  arch/sh/include/asm/shmparam.h
  arch/sparc/include/asm/shmparam.h
  arch/x86/include/asm/shmparam.h
  arch/xtensa/include/asm/shmparam.h

Strangely, some users of the asm-generic wrapper export shmparam.h

  $ git grep 'generic-y += shmparam.h'
  arch/c6x/include/uapi/asm/Kbuild:generic-y += shmparam.h
  arch/h8300/include/uapi/asm/Kbuild:generic-y += shmparam.h
  arch/hexagon/include/uapi/asm/Kbuild:generic-y += shmparam.h
  arch/m68k/include/uapi/asm/Kbuild:generic-y += shmparam.h
  arch/microblaze/include/uapi/asm/Kbuild:generic-y += shmparam.h
  arch/openrisc/include/uapi/asm/Kbuild:generic-y += shmparam.h
  arch/riscv/include/asm/Kbuild:generic-y += shmparam.h
  arch/unicore32/include/uapi/asm/Kbuild:generic-y += shmparam.h

The newly added riscv correctly creates the asm-generic wrapper
in the kernel space, but the others (c6x, h8300, hexagon, m68k,
microblaze, openrisc, unicore32) create the one in the uapi directory.

Digging into the git history, now I guess fcc8487d ("uapi:
export all headers under uapi directories") was the misconversion.
Prior to that commit, no architecture exported to shmparam.h
As its commit description said, that commit exported shmparam.h
for c6x, h8300, hexagon, m68k, openrisc, unicore32.

83f0124a ("microblaze: remove asm-generic wrapper headers")
accidentally exported shmparam.h for microblaze.

This commit unexports shmparam.h for those architectures.

There is no more reason to export include/uapi/asm-generic/shmparam.h,
so it has been moved to include/asm-generic/shmparam.h

Link: http://lkml.kernel.org/r/1546904307-11124-1-git-send-email-yamada.masahiro@socionext.comSigned-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Stafford Horne <shorne@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Mark Salter <msalter@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Vincent Chen <deanbo422@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

36c0f7f0

proc: fix /proc/net/* after setns(2) · 1fde6f21

Alexey Dobriyan authored Feb 01, 2019

/proc entries under /proc/net/* can't be cached into dcache because
setns(2) can change current net namespace.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: avoid vim miscolorization]
[adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time]
  Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2
Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2
Fixes: 1da4d377 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: Mateusz Stępień <mateusz.stepien@netrounds.com>
Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1fde6f21

mm, memory_hotplug: don't bail out in do_migrate_range() prematurely · 1723058e

Oscar Salvador authored Feb 01, 2019

do_migrate_range() takes a memory range and tries to isolate the pages
to put them into a list.  This list will be later on used in
migrate_pages() to know the pages we need to migrate.

Currently, if we fail to isolate a single page, we put all already
isolated pages back to their LRU and we bail out from the function.
This is quite suboptimal, as this will force us to start over again
because scan_movable_pages will give us the same range.  If there is no
chance that we can isolate that page, we will loop here forever.

Issue debugged in [1] has proved that.  During the debugging of that
issue, it was noticed that if do_migrate_ranges() fails to isolate a
single page, we will just discard the work we have done so far and bail
out, which means that scan_movable_pages() will find again the same set
of pages.

Instead, we can just skip the error, keep isolating as much pages as
possible and then proceed with the call to migrate_pages().

This will allow us to do as much work as possible at once.

[1] https://lkml.org/lkml/2018/12/6/324

Michal said:

: I still think that this doesn't give us a whole picture.  Looping for
: ever is a bug.  Failing the isolation is quite possible and it should
: be a ephemeral condition (e.g.  a race with freeing the page or
: somebody else isolating the page for whatever reason).  And here comes
: the disadvantage of the current implementation.  We simply throw
: everything on the floor just because of a ephemeral condition.  The
: racy page_count check is quite dubious to prevent from that.

Link: http://lkml.kernel.org/r/20181211135312.27034-1-osalvador@suse.deSigned-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1723058e

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 5eeb6335

Linus Torvalds authored Feb 01, 2019

Pull rdma fixes from Jason Gunthorpe:
 "Still not much going on, the usual set of oops and driver fixes this
  time:

   - Fix two uapi breakage regressions in mlx5 drivers

   - Various oops fixes in hfi1, mlx4, umem, uverbs, and ipoib

   - A protocol bug fix for hfi1 preventing it from implementing the
     verbs API properly, and a compatability fix for EXEC STACK user
     programs

   - Fix missed refcounting in the 'advise_mr' patches merged this
     cycle.

   - Fix wrong use of the uABI in the hns SRQ patches merged this cycle"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/uverbs: Fix OOPs in uverbs_user_mmap_disassociate
  IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start
  IB/uverbs: Fix ioctl query port to consider device disassociation
  RDMA/mlx5: Fix flow creation on representors
  IB/uverbs: Fix OOPs upon device disassociation
  RDMA/umem: Add missing initialization of owning_mm
  RDMA/hns: Update the kernel header file of hns
  IB/mlx5: Fix how advise_mr() launches async work
  RDMA/device: Expose ib_device_try_get(()
  IB/hfi1: Add limit test for RC/UC send via loopback
  IB/hfi1: Remove overly conservative VM_EXEC flag check
  IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM
  IB/mlx4: Fix using wrong function to destroy sqp AHs under SRIOV
  RDMA/mlx5: Fix check for supported user flags when creating a QP

5eeb6335

Merge tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 9ace868a

Linus Torvalds authored Feb 01, 2019

Pull iomap fixes from Darrick Wong:
 "A couple of iomap fixes to eliminate some memory corruption and hang
  problems that were reported:

   - fix page migration when using iomap for pagecache management

   - fix a use-after-free bug in the directio code"

* tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: fix a use after free in iomap_dio_rw
  iomap: get/put the page in iomap_page_create/release()

9ace868a

Merge tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 3325254c

Linus Torvalds authored Feb 01, 2019

Pull power management fixes from Rafael Wysocki:
 "These fix a PM-runtime framework regression introduced by the recent
  switch-over of device autosuspend to hrtimers and a mistake in the
  "poll idle state" code introduced by a recent change in it.

  Specifics:

   - Since ktime_get() turns out to be problematic for device
     autosuspend in the PM-runtime framework, make it use
     ktime_get_mono_fast_ns() instead (Vincent Guittot).

   - Fix an initial value of a local variable in the "poll idle state"
     code that makes it behave not exactly as expected when all idle
     states except for the "polling" one are disabled (Doug Smythies)"

* tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle: poll_state: Fix default time limit
  PM-runtime: Fix deadlock with ktime_get()

3325254c

Merge tag 'acpi-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 4771eec1

Linus Torvalds authored Feb 01, 2019

Pull ACPI Kconfig fixes from Rafael Wysocki:
 "Prevent invalid configurations from being created (e.g. by randconfig)
  due to some ACPI-related Kconfig options' dependencies that are not
  specified directly (Sinan Kaya)"

* tag 'acpi-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  platform/x86: Fix unmet dependency warning for SAMSUNG_Q10
  platform/x86: Fix unmet dependency warning for ACPI_CMPC
  mfd: Fix unmet dependency warning for MFD_TPS68470

4771eec1

Merge tag 'mmc-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · cca2e06a

Linus Torvalds authored Feb 01, 2019

Pull MMC host fixes from Ulf Hansson:

 - mediatek: Fix incorrect register write for tunings

 - bcm2835: Fixup leakage of DMA channel on probe errors

* tag 'mmc-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: mediatek: fix incorrect register setting of hs400_cmd_int_delay
  mmc: bcm2835: Fix DMA channel leak on probe error

cca2e06a

Merge tag 'i3c/fixes-for-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · 520fac05

Linus Torvalds authored Feb 01, 2019

Pull i3c fixes from Boris Brezillon:

 - Fix a deadlock in the designware driver

 - Fix the error path in i3c_master_add_i3c_dev_locked()

* tag 'i3c/fixes-for-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
  i3c: master: dw: fix deadlock
  i3c: fix missing detach if failed to retrieve i3c dev

520fac05

x86: explicitly align IO accesses in memcpy_{to,from}io · c228d294

Linus Torvalds authored Jan 31, 2019

In commit 170d13ca ("x86: re-introduce non-generic memcpy_{to,from}io")
I made our copy from IO space use a separate copy routine rather than
rely on the generic memcpy.  I did that because our generic memory copy
isn't actually well-defined when it comes to internal access ordering or
alignment, and will in fact depend on various CPUID flags.

In particular, the default memcpy() for a modern Intel CPU will
generally be just a "rep movsb", which works reasonably well for
medium-sized memory copies of regular RAM, since the CPU will turn it
into fairly optimized microcode.

However, for non-cached memory and IO, "rep movs" ends up being
horrendously slow and will just do the architectural "one byte at a
time" accesses implied by the movsb.

At the other end of the spectrum, if you _don't_ end up using the "rep
movsb" code, you'd likely fall back to the software copy, which does
overlapping accesses for the tail, and may copy things backwards.
Again, for regular memory that's fine, for IO memory not so much.

The thinking was that clearly nobody really cared (because things
worked), but some people had seen horrible performance due to the byte
accesses, so let's just revert back to our long ago version that dod
"rep movsl" for the bulk of the copy, and then fixed up the potentially
last few bytes of the tail with "movsw/b".

Interestingly (and perhaps not entirely surprisingly), while that was
our original memory copy implementation, and had been used before for
IO, in the meantime many new users of memcpy_*io() had come about.  And
while the access patterns for the memory copy weren't well-defined (so
arguably _any_ access pattern should work), in practice the "rep movsb"
case had been very common for the last several years.

In particular Jarkko Sakkinen reported that the memcpy_*io() change
resuled in weird errors from his Geminilake NUC TPM module.

And it turns out that the TPM TCG accesses according to spec require
that the accesses be

 (a) done strictly sequentially

 (b) be naturally aligned

otherwise the TPM chip will abort the PCI transaction.

And, in fact, the tpm_crb.c driver did this:

	memcpy_fromio(buf, priv->rsp, 6);
	...
	memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);

which really should never have worked in the first place, but back
before commit 170d13ca it *happened* to work, because the
memcpy_fromio() would be expanded to a regular memcpy, and

 (a) gcc would expand the first memcpy in-line, and turn it into a
     4-byte and a 2-byte read, and they happened to be in the right
     order, and the alignment was right.

 (b) gcc would call "memcpy()" for the second one, and the machines that
     had this TPM chip also apparently ended up always having ERMS
     ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep
     movbs" for that copy.

In other words, basically by pure luck, the code happened to use the
right access sizes in the (two different!) memcpy() implementations to
make it all work.

But after commit 170d13ca, both of the memcpy_fromio() calls
resulted in a call to the routine with the consistent memory accesses,
and in both cases it started out transferring with 4-byte accesses.
Which worked for the first copy, but resulted in the second copy doing a
32-bit read at an address that was only 2-byte aligned.

Jarkko is actually fixing the fragile code in the TPM driver, but since
this is an excellent example of why we absolutely must not use a generic
memcpy for IO accesses, _and_ an IO-specific one really should strive to
align the IO accesses, let's do exactly that.

Side note: Jarkko also noted that the driver had been used on ARM
platforms, and had worked.  That was because on 32-bit ARM, memcpy_*io()
ends up always doing byte accesses, and on 64-bit ARM it first does byte
accesses to align to 8-byte boundaries, and then does 8-byte accesses
for the bulk.

So ARM actually worked by design, and the x86 case worked by pure luck.

We *might* want to make x86-64 do the 8-byte case too.  That should be a
pretty straightforward extension, but let's do one thing at a time.  And
generally MMIO accesses aren't really all that performance-critical, as
shown by the fact that for a long time we just did them a byte at a
time, and very few people ever noticed.
Reported-and-tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Tested-by: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: David Laight <David.Laight@aculab.com>
Fixes: 170d13ca ("x86: re-introduce non-generic memcpy_{to,from}io")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c228d294

Merge branch 'acpi-misc' · b473406a

Rafael J. Wysocki authored Feb 01, 2019

* acpi-misc:
  platform/x86: Fix unmet dependency warning for SAMSUNG_Q10
  platform/x86: Fix unmet dependency warning for ACPI_CMPC

b473406a

Merge branch 'pm-cpuidle-fixes' · cbffab68
Rafael J. Wysocki authored Feb 01, 2019
```
* pm-cpuidle-fixes:
  cpuidle: poll_state: Fix default time limit
```
cbffab68

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 5b4746a0

Linus Torvalds authored Jan 31, 2019

Pull clk fixes from Stephen Boyd:
 "Mostly driver fixes, but there's a core framework fix in here too:

   - Revert the commits that introduce clk management for the SP clk on
     MMP2 SoCs (used for OLPC). Turns out it wasn't a good idea and
     there isn't any need to manage this clk, it just causes more
     headaches.

   - A performance regression that went unnoticed for many years where
     we would traverse the entire clk tree looking for a clk by name
     when we already have the pointer to said clk that we're looking for

   - A parent linkage fix for the qcom SDM845 clk driver

   - An i.MX clk driver rate miscalculation fix where order of
     operations were messed up

   - One error handling fix from the static checkers"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: qcom: gcc: Use active only source for CPUSS clocks
  clk: ti: Fix error handling in ti_clk_parse_divider_data()
  clk: imx: Fix fractional clock set rate computation
  clk: Remove global clk traversal on fetch parent index
  Revert "dt-bindings: marvell,mmp2: Add clock id for the SP clock"
  Revert "clk: mmp2: add SP clock"
  Revert "Input: olpc_apsp - enable the SP clock"

5b4746a0

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 52107c54

Linus Torvalds authored Jan 31, 2019

Pull crypto fix from Herbert Xu:
 "This fixes a bug in cavium/nitrox where the callback is invoked prior
  to the DMA unmap"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: cavium/nitrox - Invoke callback after DMA unmap

52107c54

Merge tag 'pci-v5.0-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 44e56f32

Linus Torvalds authored Jan 31, 2019

Pull PCI fixes from Bjorn Helgaas:

 - Revert armada8k GPIO reset change that broke Macchiatobin booting
   (Baruch Siach)

 - Use actual size config reads on ARM cns3xxx (Koen Vandeputte)

 - Fix ARM cns3xxx config write alignment issue (Koen Vandeputte)

 - Fix imx6 PHY device link error checking (Leonard Crestez)

 - Fix imx6 probe failure on chips without separate PCI power domain
   (Leonard Crestez)

* tag 'pci-v5.0-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  Revert "PCI: armada8k: Add support for gpio controlled reset signal"
  ARM: cns3xxx: Use actual size reads for PCIe
  ARM: cns3xxx: Fix writing to wrong PCI config registers after alignment
  PCI: imx: Fix checking pd_pcie_phy device link addition
  PCI: imx: Fix probe failure without power domain

44e56f32

31 Jan, 2019 9 commits

Revert "PCI: armada8k: Add support for gpio controlled reset signal" · f14bcc0a

Baruch Siach authored Jan 31, 2019

Revert commit 3d71746c ("PCI: armada8k: Add support for gpio controlled
reset signal").

That commit breaks boot on Macchiatobin board when a Mellanox NIC is
present in the PCIe slot.

It turns out that full reset cycle requires first comphy serdes
initialization. Reset signal toggle without comphy initialization makes
access to PCI configuration registers stall indefinitely. U-Boot toggles
the Macchiatobin PCIe reset line already at boot, after initializing the
comphy serdes.

So while commit 3d71746c ("PCI: armada8k: Add support for gpio controlled
reset signal") enables PCIe on platforms that U-Boot does not touch the
reset line (like Clearfog GT-8K), it breaks PCIe (and boot) on the
Macchiatobin board.

Revert commit 3d71746c ("PCI: armada8k: Add support for gpio controlled
reset signal") entirely to fix the Macchiatobin regression.
Reported-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

f14bcc0a

ARM: cns3xxx: Use actual size reads for PCIe · 432dd706

Koen Vandeputte authored Jan 31, 2019

commit 802b7c06 ("ARM: cns3xxx: Convert PCI to use generic config
accessors") reimplemented cns3xxx_pci_read_config() using
pci_generic_config_read32(), which preserved the property of only doing
32-bit reads.

It also replaced cns3xxx_pci_write_config() with pci_generic_config_write(),
so it changed writes from always being 32 bits to being the actual size,
which works just fine.

Given that:

- The documentation does not mention that only 32 bit access is allowed.
- Writes are already executed using the actual size
- Extensive testing shows that 8b, 16b and 32b reads work as intended

Allow read access of any size by replacing pci_generic_config_read32()
with the pci_generic_config_read() accessors.

Fixes: 802b7c06 ("ARM: cns3xxx: Convert PCI to use generic config accessors")
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Koen Vandeputte <koen.vandeputte@ncentric.com>
[lorenzo.pieralisi@arm.com: updated commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Krzysztof Halasa <khalasa@piap.pl>
Acked-by: Arnd Bergmann <arnd@arndb.de>
CC: Krzysztof Halasa <khalasa@piap.pl>
CC: Olof Johansson <olof@lixom.net>
CC: Robin Leblon <robin.leblon@ncentric.com>
CC: Rob Herring <robh@kernel.org>
CC: Russell King <linux@armlinux.org.uk>
CC: Tim Harvey <tharvey@gateworks.com>

432dd706

ARM: cns3xxx: Fix writing to wrong PCI config registers after alignment · 65dbb423

Koen Vandeputte authored Jan 31, 2019

Originally, cns3xxx used its own functions for mapping, reading and
writing config registers.

Commit 802b7c06 ("ARM: cns3xxx: Convert PCI to use generic config
accessors") removed the internal PCI config write function in favor of
the generic one:

  cns3xxx_pci_write_config() --> pci_generic_config_write()

cns3xxx_pci_write_config() expected aligned addresses, being produced by
cns3xxx_pci_map_bus() while the generic one pci_generic_config_write()
actually expects the real address as both the function and hardware are
capable of byte-aligned writes.

This currently leads to pci_generic_config_write() writing to the wrong
registers.

For instance, upon ath9k module loading:

- driver ath9k gets loaded
- The driver wants to write value 0xA8 to register PCI_LATENCY_TIMER,
  located at 0x0D
- cns3xxx_pci_map_bus() aligns the address to 0x0C
- pci_generic_config_write() effectively writes 0xA8 into register 0x0C
  (CACHE_LINE_SIZE)

Fix the bug by removing the alignment in the cns3xxx mapping function.

Fixes: 802b7c06 ("ARM: cns3xxx: Convert PCI to use generic config accessors")
Signed-off-by: Koen Vandeputte <koen.vandeputte@ncentric.com>
[lorenzo.pieralisi@arm.com: updated commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Krzysztof Halasa <khalasa@piap.pl>
Acked-by: Tim Harvey <tharvey@gateworks.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
CC: stable@vger.kernel.org	# v4.0+
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Olof Johansson <olof@lixom.net>
CC: Robin Leblon <robin.leblon@ncentric.com>
CC: Rob Herring <robh@kernel.org>
CC: Russell King <linux@armlinux.org.uk>

65dbb423

PCI: imx: Fix checking pd_pcie_phy device link addition · a4ace4fa

Leonard Crestez authored Jan 31, 2019

The check on the device_link_add() return value is wrong;
this leads to erroneous code execution, so fix it.

Fixes: 3f7cceea ("PCI: imx: Add multi-pd support")
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
[lorenzo.pieralisi@arm.com: updated commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

a4ace4fa

PCI: imx: Fix probe failure without power domain · a6093ad7

Leonard Crestez authored Jan 31, 2019

On chips without a separate power domain for PCI (such as 6q/6qp) the
imx6_pcie_attach_pd() function incorrectly returns an error.

Fix by returning 0 if dev_pm_domain_attach_by_name() does not find
anything.

Fixes: 3f7cceea ("PCI: imx: Add multi-pd support")
Reported-by: Lukas F.Hartmann <lukas@mntmn.com>
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
[lorenzo.pieralisi@arm.com: updated commit log]
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

a6093ad7

gfs2: Revert "Fix loop in gfs2_rbm_find" · e74c98ca

Andreas Gruenbacher authored Jan 30, 2019

This reverts commit 2d29f6b9.

It turns out that the fix can lead to a ~20 percent performance regression
in initial writes to the page cache according to iozone.  Let's revert this
for now to have more time for a proper fix.

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e74c98ca

Merge tag 'linux-kselftest-5.0-rc5' of... · 9f789567

Linus Torvalds authored Jan 31, 2019

Merge tag 'linux-kselftest-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fixes from Shuah Khan:
 "This consists of run-time fixes to cpu-hotplug, and seccomp tests,
  compile fixes to ir, net, and timers Makefiles"

* tag 'linux-kselftest-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: timers: use LDLIBS instead of LDFLAGS
  selftests: net: use LDLIBS instead of LDFLAGS
  selftests/seccomp: Enhance per-arch ptrace syscall skip tests
  selftests: Use lirc.h from kernel tree, not from system
  selftests: cpu-hotplug: fix case where CPUs offline > CPUs present

9f789567

Merge tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfs · 937108b0

Linus Torvalds authored Jan 31, 2019

Pull NFS client fixes from Anna Schumaker:
 "This addresses two bugs, one in the error code handling of
  nfs_page_async_flush() and one to fix a potential NULL pointer
  dereference in nfs_parse_devname().

  Stable bugfix:
   - Fix up return value on fatal errors in nfs_page_async_flush()

  Other bugfix:
   - Fix NULL pointer dereference of dev_name"

* tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix up return value on fatal errors in nfs_page_async_flush()
  nfs: Fix NULL pointer dereference of dev_name

937108b0

Merge tag 'sound-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 83f4997a

Linus Torvalds authored Jan 31, 2019

Pull sound fixes from Takashi Iwai:
 "Only three fixes.

  The fix for Realtek HD-audio looks lengthy, but it's just a code
  shuffling, and the actual changes are fairly small.

  The rest are a PCM core fix for a long-standing bug that was recently
  scratched by syzkaller, and a trivial USB-audio quirk for DSD support"

* tag 'sound-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek - Fixed hp_pin no value
  ALSA: pcm: Fix tight loop of OSS capture stream
  ALSA: usb-audio: Add Opus #3 to quirks for native DSD support

83f4997a

30 Jan, 2019 5 commits

cpuidle: poll_state: Fix default time limit · 1617971c

Doug Smythies authored Jan 30, 2019

The default time is declared in units of microsecnds,
but is used as nanoseconds, resulting in significant
accounting errors for idle state 0 time when all idle
states deeper than 0 are disabled.

Under these unusual conditions, we don't really care
about the poll time limit anyhow.

Fixes: 800fb34a ("cpuidle: poll_state: Disregard disable idle states")
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

1617971c

PM-runtime: Fix deadlock with ktime_get() · 15efb47d

Vincent Guittot authored Jan 30, 2019

A deadlock has been seen when swicthing clocksources which use
PM-runtime.  The call path is:

change_clocksource
    ...
    write_seqcount_begin
    ...
    timekeeping_update
        ...
        sh_cmt_clocksource_enable
            ...
            rpm_resume
                pm_runtime_mark_last_busy
                    ktime_get
                        do
                            read_seqcount_begin
                        while read_seqcount_retry
    ....
    write_seqcount_end

Although we should be safe because we haven't yet changed the
clocksource at that time, we can't do that because of seqcount
protection.

Use ktime_get_mono_fast_ns() instead which is lock safe for such
cases.

With ktime_get_mono_fast_ns, the timestamp is not guaranteed to be
monotonic across an update and as a result can goes backward.
According to update_fast_timekeeper() description: "In the worst
case, this can result is a slightly wrong timestamp (a few
nanoseconds)". For PM-runtime autosuspend, this means only that
the suspend decision may be slightly suboptimal.

Fixes: 8234f673 ("PM-runtime: Switch autosuspend over to using hrtimers")
Reported-by: Biju Das <biju.das@bp.renesas.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

15efb47d

fs/dcache: Track & report number of negative dentries · af0c9af1

Waiman Long authored Jan 30, 2019

The current dentry number tracking code doesn't distinguish between
positive & negative dentries.  It just reports the total number of
dentries in the LRU lists.

As excessive number of negative dentries can have an impact on system
performance, it will be wise to track the number of positive and
negative dentries separately.

This patch adds tracking for the total number of negative dentries in
the system LRU lists and reports it in the 5th field in the
/proc/sys/fs/dentry-state file.  The number, however, does not include
negative dentries that are in flight but not in the LRU yet as well as
those in the shrinker lists which are on the way out anyway.

The number of positive dentries in the LRU lists can be roughly found by
subtracting the number of negative dentries from the unused count.

Matthew Wilcox had confirmed that since the introduction of the
dentry_stat structure in 2.1.60, the dummy array was there, probably for
future extension.  They were not replacements of pre-existing fields.
So no sane applications that read the value of /proc/sys/fs/dentry-state
will do dummy thing if the last 2 fields of the sysctl parameter are not
zero.  IOW, it will be safe to use one of the dummy array entry for
negative dentry count.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

af0c9af1

fs: Don't need to put list_lru into its own cacheline · 7d10f70f

Waiman Long authored Jan 30, 2019

The list_lru structure is essentially just a pointer to a table of
per-node LRU lists.  Even if CONFIG_MEMCG_KMEM is defined, the list
field is just used for LRU list registration and shrinker_id is set at
initialization.  Those fields won't need to be touched that often.

So there is no point to make the list_lru structures to sit in their own
cachelines.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7d10f70f

fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() · 1dbd449c

Waiman Long authored Jan 30, 2019

The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
lists and the shrink lists where the DCACHE_LRU_LIST bit is set.

The shrink_dcache_sb() function moves dentries from the LRU list to a
shrink list and subtracts the dentry count from nr_dentry_unused.  This
is incorrect as the nr_dentry_unused count will also be decremented in
shrink_dentry_list() via d_shrink_del().

To fix this double decrement, the decrement in the shrink_dcache_sb()
function is taken out.

Fixes: 4e717f5c ("list_lru: remove special case function list_lru_dispose_all."
Cc: stable@kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1dbd449c