Commits · 305c1fac3225dfa7eeb89bfe91b7335a6edd5172 · nexedi / linux

25 Jul, 2018 9 commits

sched/numa: Evaluate move once per node · 305c1fac

Srikar Dronamraju authored Jun 20, 2018

task_numa_compare() helps choose the best CPU to move or swap the
selected task. To achieve this task_numa_compare() is called for every
CPU in the node. Currently it evaluates if the task can be moved/swapped
for each of the CPUs. However the move evaluation is mostly independent
of the CPU. Evaluating the move logic once per node, provides scope for
simplifying task_numa_compare().

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25705.2 25058.2 -2.51
1 74433 72950 -1.99

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 96589.6 105930 9.670
1 181830 178624 -1.76

(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 440.65 941.32 758.98 189.17
numa01.sh Sys: 183.48 320.07 258.42 50.09
numa01.sh User: 37384.65 71818.14 60302.51 13798.96
numa02.sh Real: 61.24 65.35 62.49 1.49
numa02.sh Sys: 16.83 24.18 21.40 2.60
numa02.sh User: 5219.59 5356.34 5264.03 49.07
numa03.sh Real: 822.04 912.40 873.55 37.35
numa03.sh Sys: 118.80 140.94 132.90 7.60
numa03.sh User: 62485.19 70025.01 67208.33 2967.10
numa04.sh Real: 690.66 872.12 778.49 65.44
numa04.sh Sys: 459.26 563.03 494.03 42.39
numa04.sh User: 51116.44 70527.20 58849.44 8461.28
numa05.sh Real: 418.37 562.28 525.77 54.27
numa05.sh Sys: 299.45 481.00 392.49 64.27
numa05.sh User: 34115.09 41324.02 39105.30 2627.68

Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 516.14 892.41 739.84 151.32 2.587%
numa01.sh Sys: 153.16 192.99 177.70 14.58 45.42%
numa01.sh User: 39821.04 69528.92 57193.87 10989.48 5.435%
numa02.sh Real: 60.91 62.35 61.58 0.63 1.477%
numa02.sh Sys: 16.47 26.16 21.20 3.85 0.943%
numa02.sh User: 5227.58 5309.61 5265.17 31.04 -0.02%
numa03.sh Real: 739.07 917.73 795.75 64.45 9.776%
numa03.sh Sys: 94.46 136.08 109.48 14.58 21.39%
numa03.sh User: 57478.56 72014.09 61764.48 5343.69 8.813%
numa04.sh Real: 442.61 715.43 530.31 96.12 46.79%
numa04.sh Sys: 224.90 348.63 285.61 48.83 72.97%
numa04.sh User: 35836.84 47522.47 40235.41 3985.26 46.26%
numa05.sh Real: 386.13 489.17 434.94 43.59 20.88%
numa05.sh Sys: 144.29 438.56 278.80 105.78 40.77%
numa05.sh User: 33255.86 36890.82 34879.31 1641.98 12.11%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

305c1fac

sched/numa: Remove redundant field · 6e303967

Srikar Dronamraju authored Jun 20, 2018

'numa_entry' is a struct list_head defined in task_struct, but never used.

No functional change.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

6e303967

sched/debug: Show the sum wait time of a task group · 3d6c50c2

Yun Wang authored Jul 04, 2018

Although we can rely on cpuacct to present the CPU usage of task
groups, it is hard to tell how intense the competition is between
these groups on CPU resources.

Monitoring the wait time or sched_debug of each process could be
very expensive, and there is no good way to accurately represent the
conflict with these info, we need the wait time on group dimension.

Thus we introduce group's wait_sum to represent the resource conflict
between task groups, which is simply the sum of the wait time of
the group's cfs_rq.

The 'cpu.stat' is modified to show the statistic, like:

   nr_periods 0
   nr_throttled 0
   throttled_time 0
   wait_sum 2035098795584

Now we can monitor the changes of wait_sum to tell how much a
a task group is suffering in the fight of CPU resources.

For example:

   (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%

means the task group paid X percentage of period on waiting
for the CPU.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

3d6c50c2

sched/fair: Remove #ifdefs from scale_rt_capacity() · 2e62c474

Vincent Guittot authored Jul 19, 2018

Reuse cpu_util_irq() that has been defined for schedutil and set irq util
to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.

But the compiler is not able to optimize the sequence (at least with
aarch64 GCC 7.2.1):

	free *= (max - irq);
	free /= max;

when irq is fixed to 0

Add a new inline function scale_irq_capacity() that will scale utilization
when irq is accounted. Reuse this funciton in schedutil which applies
similar formula.
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: Ingo Molnar <mingo@kernel.org>

2e62c474

Merge branch 'sched/urgent' into sched/core, to pick up fixes · 4765096f
Ingo Molnar authored Jul 25, 2018
```
Signed-off-by: Ingo Molnar <mingo@kernel.org>
```
4765096f

sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE · f3d133ee

Hailong Liu authored Jul 18, 2018

NO_RT_RUNTIME_SHARE feature is used to prevent a CPU borrow enough
runtime with a spin-rt-task.

However, if RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowd
enough rt_runtime at the beginning, rt_runtime can't be restored to
its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.

E.g. on my PC with 4 cores, procedure to reproduce:
1) Make sure  RT_RUNTIME_SHARE is enabled
 cat /sys/kernel/debug/sched_features
  GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
  CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
  LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
  NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
  ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
2) Start a spin-rt-task
 ./loop_rr &
3) set affinity to the last cpu
 taskset -p 8 $pid_of_loop_rr
4) Observe that last cpu have borrowed enough runtime.
 cat /proc/sched_debug | grep rt_runtime
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 900.000000
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 1000.000000
5) Disable RT_RUNTIME_SHARE
 echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
6) Observe that rt_runtime can not been restored
 cat /proc/sched_debug | grep rt_runtime
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 900.000000
  .rt_runtime                    : 950.000000
  .rt_runtime                    : 1000.000000

This patch help to restore rt_runtime after we disable
RT_RUNTIME_SHARE.
Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cnSigned-off-by: Ingo Molnar <mingo@kernel.org>

f3d133ee

sched/deadline: Update rq_clock of later_rq when pushing a task · 840d7196

Daniel Bristot de Oliveira authored Jul 20, 2018

Daniel Casini got this warn while running a DL task here at RetisLab:

  [  461.137582] ------------[ cut here ]------------
  [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
  [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20
      [a ton of modules]
  [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
  [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
  [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
  [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
  [  461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082
  [  461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006
  [  461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0
  [  461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339
  [  461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20
  [  461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940
  [  461.137678] FS:  00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000
  [  461.137679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0
  [  461.137680] Call Trace:
  [  461.137684]  push_dl_task.part.46+0x3bc/0x460
  [  461.137686]  task_woken_dl+0x60/0x80
  [  461.137689]  ttwu_do_wakeup+0x4f/0x150
  [  461.137690]  ttwu_do_activate+0x77/0x80
  [  461.137692]  try_to_wake_up+0x1d6/0x4c0
  [  461.137693]  wake_up_q+0x32/0x70
  [  461.137696]  do_futex+0x7e7/0xb50
  [  461.137698]  __x64_sys_futex+0x8b/0x180
  [  461.137701]  do_syscall_64+0x5a/0x110
  [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [  461.137705] RIP: 0033:0x7efe4918ca26
  [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
  [  461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
  [  461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26
  [  461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24
  [  461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001
  [  461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000
  [  461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10
  [  461.137743] ---[ end trace 187df4cad2bf7649 ]---

This warning happened in the push_dl_task(), because
__add_running_bw()->cpufreq_update_util() is getting the rq_clock of
the later_rq before its update, which takes place at activate_task().
The fix then is to update the rq_clock before calling add_running_bw().

To avoid double rq_clock_update() call, we set ENQUEUE_NOCLOCK flag to
activate_task().
Reported-by: Daniel Casini <daniel.casini@santannapisa.it>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
Fixes: e0367b12 sched/deadline: Move CPU frequency selection triggering points
Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

840d7196

stop_machine: Disable preemption after queueing stopper threads · 2610e889

Isaac J. Manjarres authored Jul 17, 2018

This commit:

  9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")

does not fully address the race condition that can occur
as follows:

On one CPU, call it CPU 3, thread 1 invokes
cpu_stop_queue_two_works(2, 3,...), and the execution is such
that thread 1 queues the works for migration/2 and migration/3,
and is preempted after releasing the locks for migration/2 and
migration/3, but before waking the threads.

Then, On CPU 2, a kworker, call it thread 2, is running,
and it invokes cpu_stop_queue_two_works(1, 2,...), such that
thread 2 queues the works for migration/1 and migration/2.
Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
migration/2 and migration/3. This means that when CPU 2
releases the locks for migration/1 and migration/2, but before
it wakes those threads, it can be preempted by migration/2.

If thread 2 is preempted by migration/2, then migration/2 will
execute the first work item successfully, since migration/3
was woken up by CPU 3, but when it goes to execute the second
work item, it disables preemption, calls multi_cpu_stop(),
and thus, CPU 2 will wait forever for migration/1, which should
have been woken up by thread 2. However migration/1 cannot be
woken up by thread 2, since it is a kworker, so it is affine to
CPU 2, but CPU 2 is running migration/2 with preemption
disabled, so thread 2 will never run.

Disable preemption after queueing works for stopper threads
to ensure that the operation of queueing the works and waking
the stopper threads is atomic.
Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bigeasy@linutronix.de
Cc: gregkh@linuxfoundation.org
Cc: matt@codeblueprint.co.uk
Fixes: 9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.orgSigned-off-by: Ingo Molnar <mingo@kernel.org>

2610e889

sched/topology: Check variable group before dereferencing it · 6cd0c583

Yi Wang authored Jul 23, 2018

The 'group' variable in sched_domain_debug_one() is not checked
when firstly used in cpumask_test_cpu(cpu, sched_group_span(group)),
but it might be NULL (it is checked later in the following while loop)
and may cause NULL pointer dereference.

We need to check it before using to avoid NULL dereference.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cnSigned-off-by: Ingo Molnar <mingo@kernel.org>

6cd0c583

22 Jul, 2018 8 commits

Linux 4.18-rc6 · d72e90f3
Linus Torvalds authored Jul 22, 2018

d72e90f3

Merge tag 'nvme-for-4.18' of git://git.infradead.org/nvme · 74413084

Linus Torvalds authored Jul 22, 2018

Pull NVMe fixes from Christoph Hellwig:

 - fix a regression in 4.18 that causes a memory leak on probe failure
   (Keith Bush)

 - fix a deadlock in the passthrough ioctl code (Scott Bauer)

 - don't enable AENs if not supported (Weiping Zhang)

 - fix an old regression in metadata handling in the passthrough ioctl
   code (Roland Dreier)

* tag 'nvme-for-4.18' of git://git.infradead.org/nvme:
  nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD
  nvme: don't enable AEN if not supported
  nvme: ensure forward progress during Admin passthru
  nvme-pci: fix memory leak on probe failure

74413084

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 165ea0d1

Linus Torvalds authored Jul 22, 2018

Pull vfs fixes from Al Viro:
 "Fix several places that screw up cleanups after failures halfway
  through opening a file (one open-coding filp_clone_open() and getting
  it wrong, two misusing alloc_file()). That part is -stable fodder from
  the 'work.open' branch.

  And Christoph's regression fix for uapi breakage in aio series;
  include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
  definition of sigset_t, the reason for doing so in the first place had
  been bogus - there's no need to expose struct __aio_sigset in
  aio_abi.h at all"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: don't expose __aio_sigset in uapi
  ocxlflash_getfile(): fix double-iput() on alloc_file() failures
  cxl_getfile(): fix double-iput() on alloc_file() failures
  drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()

165ea0d1

alpha: fix osf_wait4() breakage · f88a333b

Al Viro authored Jul 22, 2018

kernel_wait4() expects a userland address for status - it's only
rusage that goes as a kernel one (and needs a copyout afterwards)

[ Also, fix the prototype of kernel_wait4() to have that __user
  annotation   - Linus ]

Fixes: 92ebce5a ("osf_wait4: switch to kernel_wait4()")
Cc: stable@kernel.org # v4.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f88a333b

Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 45ae4df9

Linus Torvalds authored Jul 21, 2018

Pull ARM SoC fixes from Olof Johansson:

 - Fix interrupt type on ethernet switch for i.MX-based RDU2

 - GPC on i.MX exposed too large a register window which resulted in
   userspace being able to crash the machine.

 - Fixup of bad merge resolution moving GPIO DT nodes under pinctrl on
   droid4.

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: dts: imx6: RDU2: fix irq type for mv88e6xxx switch
  soc: imx: gpc: restrict register range for regmap access
  ARM: dts: omap4-droid4: fix dts w.r.t. pwm

45ae4df9

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ef81e63e

Linus Torvalds authored Jul 21, 2018

Pull x86 fix from Ingo Molnar:
 "A single fix for a MCE-polling regression, which prevented the
  disabling of polling"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/MCE: Remove min interval polling limitation

ef81e63e

Merge branch 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 43227e09

Linus Torvalds authored Jul 21, 2018

Pull x86 pti fixes from Ingo Molnar:
 "An APM fix, and a BTS hardware-tracing fix related to PTI changes"

* 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apm: Don't access __preempt_count with zeroed fs
  x86/events/intel/ds: Fix bts_interrupt_threshold alignment

43227e09

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 48b1db7c

Linus Torvalds authored Jul 21, 2018

Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a stop-machine preemption fix and a SCHED_DEADLINE fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix switched_from_dl() warning
  stop_machine: Disable preemption when waking two stopper threads

48b1db7c

21 Jul, 2018 12 commits

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ea75a2c7

Linus Torvalds authored Jul 21, 2018

Pull core kernel fixes from Ingo Molnar:
 "This is mostly the copy_to_user_mcsafe() related fixes from Dan
  Williams, and an ORC fix for Clang"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling
  lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()
  lib/iov_iter: Document _copy_to_iter_flushcache()
  lib/iov_iter: Document _copy_to_iter_mcsafe()
  objtool: Use '.strtab' if '.shstrtab' doesn't exist, to support ORC tables on Clang

ea75a2c7

Merge tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · ffb48e79

Linus Torvalds authored Jul 21, 2018

Pull powerpc fixes from Michael Ellerman:
 "Two regression fixes, one for xmon disassembly formatting and the
  other to fix the E500 build.

  Two commits to fix a potential security issue in the VFIO code under
  obscure circumstances.

  And finally a fix to the Power9 idle code to restore SPRG3, which is
  user visible and used for sched_getcpu().

  Thanks to: Alexey Kardashevskiy, David Gibson. Gautham R. Shenoy,
  James Clarke"

* tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
  powerpc/Makefile: Assemble with -me500 when building for E500
  KVM: PPC: Check if IOMMU page is contained in the pinned physical page
  vfio/spapr: Use IOMMU pageshift rather than pagesize
  powerpc/xmon: Fix disassembly since printf changes

ffb48e79

Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 55b636b4

Linus Torvalds authored Jul 21, 2018

Pull btrfs fix from David Sterba:
 "A fix of a corruption regarding fsync and clone, under some very
  specific conditions explained in the patch.

  The fix is marked for stable 3.16+ so I'd like to get it merged now
  given the impact"

* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix file data corruption after cloning a range and fsync

55b636b4

mm: make vm_area_alloc() initialize core fields · 490fc053

Linus Torvalds authored Jul 21, 2018

Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.

The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

490fc053

mm: make vm_area_dup() actually copy the old vma data · 95faf699

Linus Torvalds authored Jul 21, 2018

.. and re-initialize th eanon_vma_chain head.

This removes some boiler-plate from the users, and also makes it clear
why it didn't need use the 'zalloc()' version.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

95faf699

mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5

Linus Torvalds authored Jul 21, 2018

The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.

We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.

Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3928d4f5

Merge branch 'akpm' (patches from Andrew) · 191a3afa

Linus Torvalds authored Jul 21, 2018

Merge fixes from Andrew Morton:
 "5 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: memcg: fix use after free in mem_cgroup_iter()
  mm/huge_memory.c: fix data loss when splitting a file pmd
  fat: fix memory allocation failure handling of match_strdup()
  MAINTAINERS: Peter has moved
  mm/memblock: add missing include <linux/bootmem.h>

191a3afa

mm: memcg: fix use after free in mem_cgroup_iter() · 9f15bde6

Jing Xia authored Jul 20, 2018

It was reported that a kernel crash happened in mem_cgroup_iter(), which
can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.

Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
......
Call trace:
  mem_cgroup_iter+0x2e0/0x6d4
  shrink_zone+0x8c/0x324
  balance_pgdat+0x450/0x640
  kswapd+0x130/0x4b8
  kthread+0xe8/0xfc
  ret_from_fork+0x10/0x20

  mem_cgroup_iter():
      ......
      if (css_tryget(css))    <-- crash here
	    break;
      ......

The crashing reason is that mem_cgroup_iter() uses the memcg object whose
pointer is stored in iter->position, which has been freed before and
filled with POISON_FREE(0x6b).

And the root cause of the use-after-free issue is that
invalidate_reclaim_iterators() fails to reset the value of iter->position
to NULL when the css of the memcg is released in non- hierarchical mode.

Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
Fixes: 6df38689 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <chunyan.zhang@unisoc.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9f15bde6

mm/huge_memory.c: fix data loss when splitting a file pmd · e1f1b157

Hugh Dickins authored Jul 20, 2018

__split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
and propagate that to PageDirty: otherwise, data may be lost when a huge
tmpfs page is modified then split then reclaimed.

How has this taken so long to be noticed?  Because there was no problem
when the huge page is written by a write system call (shmem_write_end()
calls set_page_dirty()), nor when the page is allocated for a write fault
(fault_dirty_shared_page() calls set_page_dirty()); but when allocated for
a read fault (which MAP_POPULATE simulates), no set_page_dirty().

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
Fixes: d21b9e57 ("thp: handle file pages in split_huge_pmd()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Ashwin Chaugule <ashwinch@google.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e1f1b157

fat: fix memory allocation failure handling of match_strdup() · 35033ab9

OGAWA Hirofumi authored Jul 20, 2018

In parse_options(), if match_strdup() failed, parse_options() leaves
opts->iocharset in unexpected state (i.e. still pointing the freed
string). And this can be the cause of double free.

To fix, this initialize opts->iocharset always when freeing.

Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jpSigned-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

35033ab9

MAINTAINERS: Peter has moved · 5a696494

Peter Senna Tschudin authored Jul 20, 2018

Update my E-mail address in the MAINTAINERS file.

Link: http://lkml.kernel.org/r/20180710144702.1308-1-peter.senna@gmail.comSigned-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.co.uk>
Acked-by: Martyn Welch <martyn.welch@collabora.co.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Martin Donnelly <martin.donnelly@ge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5a696494

mm/memblock: add missing include <linux/bootmem.h> · 19373672

Mathieu Malaterre authored Jul 20, 2018

Commit 26f09e9b ("mm/memblock: add memblock memory allocation apis")
introduced two new function definitions:

  memblock_virt_alloc_try_nid_nopanic()
  memblock_virt_alloc_try_nid()

and commit ea1f5f37 ("mm: define memblock_virt_alloc_try_nid_raw")
introduced the following function definition:

  memblock_virt_alloc_try_nid_raw()

This commit adds an include of header file <linux/bootmem.h> to provide
the missing function prototypes.  This silences the following gcc warning
(W=1):

  mm/memblock.c:1334:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_raw' [-Wmissing-prototypes]
  mm/memblock.c:1371:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_nopanic' [-Wmissing-prototypes]
  mm/memblock.c:1407:15: warning: no previous prototype for `memblock_virt_alloc_try_nid' [-Wmissing-prototypes]

Also adds #ifdef blockers to prevent compilation failure on mips/ia64
where CONFIG_NO_BOOTMEM=n as could be seen in commit commit 6cc22dc0
("revert "mm/memblock: add missing include <linux/bootmem.h>"").

Because Makefile already does:

  obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o

The #ifdef has been simplified from:

  #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)

to simply:

  #if defined(CONFIG_NO_BOOTMEM)

Link: http://lkml.kernel.org/r/20180626184422.24974-1-malat@debian.orgSigned-off-by: Mathieu Malaterre <malat@debian.org>
Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

19373672

20 Jul, 2018 11 commits

Merge tag 'vfio-v4.18-rc6' of git://github.com/awilliam/linux-vfio · 48e5aee8

Linus Torvalds authored Jul 20, 2018

Pull VFIO fix from Alex Williamson:
 "Harden potential Spectre v1 issue (Gustavo A. R. Silva)"

* tag 'vfio-v4.18-rc6' of git://github.com/awilliam/linux-vfio:
  vfio/pci: Fix potential Spectre v1

48e5aee8

Merge tag 'for-4.18/dm-fixes-2' of... · b4460a95

Linus Torvalds authored Jul 20, 2018

Merge tag 'for-4.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mike Snitzer:
 "Fix DM writecache target to allow an optional offset to the start of
  the data and metadata area.

  This allows userspace tools (e.g. LVM2) to place a header and metadata
  at the front of the writecache device for its use"

* tag 'for-4.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm writecache: support optional offset for start of device

b4460a95

Merge tag 'imx-fixes-4.18-4' of... · 5858610f

Olof Johansson authored Jul 20, 2018

Merge tag 'imx-fixes-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into fixes

i.MX fixes for 4.18, round 4:
 - A fix for i.MX6 RDU2 board on the wrong IRQ type of Marvell switch,
   which might result in a race condition in the interrupt handler and
   cause the OS to miss all future events.

* tag 'imx-fixes-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  ARM: dts: imx6: RDU2: fix irq type for mv88e6xxx switch
Signed-off-by: Olof Johansson <olof@lixom.net>

5858610f

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 18cadf9f

Linus Torvalds authored Jul 20, 2018

Pull SCSI fixes from James Bottomley:
 "A set of 8 obvious fixes.

  Three (2 qla2xxx and the cxlflash oopses) are regressions, two from
  4.17 and one from the merge window. The hpsa change is user visible,
  but it fixes an error users have complained about"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: cxlflash: fix assignment of the backend operations
  scsi: qedi: Send driver state to MFW
  scsi: qedf: Send the driver state to MFW
  scsi: hpsa: correct enclosure sas address
  scsi: sd_zbc: Fix variable type and bogus comment
  scsi: qla2xxx: Fix NULL pointer dereference for fcport search
  scsi: qla2xxx: Fix kernel crash due to late workqueue allocation
  scsi: qla2xxx: Fix inconsistent DMA mem alloc/free

18cadf9f

Merge tag 'iommu-fixes-v4.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 47736af3

Linus Torvalds authored Jul 20, 2018

Pull IOMMU fix from Joerg Roedel:
 "Only one revert, for an an Intel VT-d patch that caused issues with
  the i915 GPU driver"

* tag 'iommu-fixes-v4.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  Revert "iommu/vt-d: Clean up pasid quirk for pre-production devices"

47736af3

Merge tag 'platform-drivers-x86-v4.18-2' of git://git.infradead.org/linux-platform-drivers-x86 · de87dcde

Linus Torvalds authored Jul 20, 2018

Pull x86 platform driver fixes from Andy Shevchenko:
 "The Dell laptop ACPI video brightness control is now back after fixing
  a regression brought by SMM refactoring"

* tag 'platform-drivers-x86-v4.18-2' of git://git.infradead.org/linux-platform-drivers-x86:
  platform/x86: dell-laptop: Fix backlight detection

de87dcde

Merge tag 'arc-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 2a0ea7df

Linus Torvalds authored Jul 20, 2018

Pull ARC fixes from Vineet Gupta:
 "ARC is back after radio silence in 4.17:

   - Fix CONFIG_SWAP [Alexey]

   - Robustify cmpxchg emulation for systems w/o atomics [Alexey /
     PeterZ]

   - Allow mprotext(PROT_EXEC) for stack mappings [Vineet]

   - HSDK platform enable PCIe, APG GPIO [Gustavo]

   - miscll other fixes, config updates etc"

* tag 'arc-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARCv2: [plat-hsdk]: Save accl reg pair by default
  ARC: mm: allow mprotect to make stack mappings executable
  ARC: Fix CONFIG_SWAP
  ARC: [arcompact] entry.S: minor code movement
  ARC: configs: Remove CONFIG_INITRAMFS_SOURCE from defconfigs
  ARC: configs: remove no longer needed CONFIG_DEVPTS_MULTIPLE_INSTANCES
  ARC: Improve cmpxchg syscall implementation
  ARC: [plat-hsdk]: Configure APB GPIO controller on ARC HSDK platform
  ARC: [plat-hsdk] Add PCIe support
  ARC: Enable machine_desc->init_per_cpu for !CONFIG_SMP
  ARC: Explicitly add -mmedium-calls to CFLAGS

2a0ea7df

Merge tag 'nds32-for-linus-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux · 293bccc5

Linus Torvalds authored Jul 20, 2018

Pull nds32 updates from Greentime Hu:
 "Bug fixes and build ixes for nds32"

* tag 'nds32-for-linus-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux:
  nds32: fix build error "relocation truncated to fit: R_NDS32_25_PCREL_RELA" when make allyesconfig
  nds32: To simplify the implementation of update_mmu_cache()
  nds32: Fix the dts pointer is not passed correctly issue.
  nds32: To implement these icache invalidation APIs since nds32 cores don't snoop data cache. This issue is found by Guo Ren. Based on the Documentation/core-api/cachetlb.rst and it says:
  nds32: Fix build error caused by configuration flag rename
  nds32: define __NDS32_E[BL]__ for sparse

293bccc5

Merge tag 'pm-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · bfa54a3a

Linus Torvalds authored Jul 20, 2018

Pull power management fix from Rafael Wysocki:
 "Fix a relatively old initialization issue in intel_pstate causing the
  pcc-cpufreq driver to be used instead of it on some HP Proliant
  systems.

  This turned into a functional regression during the 4.17 cycle,
  because pcc-cpufreq is a scalability disaster and that was amplified
  by the idle loop rework done at that time (Rafael Wysocki).

* tag 'pm-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Register when ACPI PCCH is present

bfa54a3a

Merge tag 'acpi-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 73894397

Linus Torvalds authored Jul 20, 2018

Pull ACPI fix from Rafael Wysocki:
 "Extend the recently added suspend-to-idle quirk for Thinkpad X1 Carbon
  6th to other systems from that familiy which turned out to need it too
  (Robin Johnson)"

* tag 'acpi-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / EC: Use ec_no_wakeup on more Thinkpad X1 Carbon 6th systems

73894397

nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD · 9b382768

Roland Dreier authored Jul 19, 2018

The old code in nvme_user_cmd() passed the userspace virtual address
from nvme_passthru_cmd.metadata as the length of the metadata buffer
as well as the address to nvme_submit_user_cmd().

Fixes: 63263d60 ("nvme: Use metadata for passthrough commands")
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

9b382768