Commits · 87261de30fd8e5ebd441cd2f05df73ddf04c2af2 · Kirill Smelkov / linux

04 May, 2016 40 commits

mtd: spi-nor: remove micron_quad_enable() · 87261de3

Cyrille Pitchen authored Feb 03, 2016

commit 3b5394a3 upstream.

This patch remove the micron_quad_enable() function which force the Quad
SPI mode. However, once this mode is enabled, the Micron memory expect ALL
commands to use the SPI 4-4-4 protocol. Hence a failure does occur when
calling spi_nor_wait_till_ready() right after the update of the Enhanced
Volatile Configuration Register (EVCR) in the micron_quad_enable() as
the SPI controller driver is not aware about the protocol change.

Since there is almost no performance increase using Fast Read 4-4-4
commands instead of Fast Read 1-1-4 commands, we rather keep on using the
Extended SPI mode than enabling the Quad SPI mode.

Let's take the example of the pretty standard use of 8 dummy cycles during
Fast Read operations on 64KB erase sectors:

Fast Read 1-1-4 requires 8 cycles for the command, then 24 cycles for the
3byte address followed by 8 dummy clock cycles and finally 65536*2 cycles
for the read data; so 131112 clock cycles.

On the other hand the Fast Read 4-4-4 would require 2 cycles for the
command, then 6 cycles for the 3byte address followed by 8 dummy clock
cycles and finally 65536*2 cycles for the read data. So 131088 clock
cycles. The theorical bandwidth increase is 0.0%.

Now using Fast Read operations on 512byte pages:
Fast Read 1-1-4 needs 8+24+8+(512*2) = 1064 clock cycles whereas Fast
Read 4-4-4 would requires 2+6+8+(512*2) = 1040 clock cycles. Hence the
theorical bandwidth increase is 2.3%.
Consecutive reads for non sequential pages is not a relevant use case so
The Quad SPI mode is not worth it.

mtd_speedtest seems to confirm these figures.
Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com>
Fixes: 548cd3ab ("mtd: spi-nor: Add quad I/O support for Micron SPI NOR")
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

87261de3

serial: sh-sci: Remove cpufreq notifier to fix crash/deadlock · 447ea0a3

Geert Uytterhoeven authored Jan 05, 2016

commit ff1cab37 upstream.

The BSP team noticed that there is spin/mutex lock issue on sh-sci when
CPUFREQ is used.  The issue is that the notifier function may call
mutex_lock() while the spinlock is held, which can lead to a BUG().
This may happen if CPUFREQ is changed while another CPU calls
clk_get_rate().

Taking the spinlock was added to the notifier function in commit
e552de24 ("sh-sci: add platform device private data"), to
protect the list of serial ports against modification during traversal.
At that time the Common Clock Framework didn't exist yet, and
clk_get_rate() just returned clk->rate without taking a mutex.
Note that since commit d535a230 ("serial: sh-sci: Require a
device per port mapping."), there's no longer a list of serial ports to
traverse, and taking the spinlock became superfluous.

To fix the issue, just remove the cpufreq notifier:
  1. The notifier doesn't work correctly: all it does is update stored
     clock rates; it does not update the divider in the hardware.
     The divider will only be updated when calling sci_set_termios().
     I believe this was broken back in 2004, when the old
     drivers/char/sh-sci.c driver (where the notifier did update the
     divider) was replaced by drivers/serial/sh-sci.c (where the
     notifier just updated port->uartclk).
     Cfr. full-history-linux commits 6f8deaef2e9675d9 ("[PATCH] sh: port
     sh-sci driver to the new API") and 3f73fe878dc9210a ("[PATCH]
     Remove old sh-sci driver").
  2. On modern SoCs, the sh-sci parent clock rate is no longer related
     to the CPU clock rate anyway, so using a cpufreq notifier is
     futile.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

447ea0a3

ext4: fix NULL pointer dereference in ext4_mark_inode_dirty() · c745297b

Eryu Guan authored Mar 12, 2016

commit 5e1021f2 upstream.

ext4_reserve_inode_write() in ext4_mark_inode_dirty() could fail on
error (e.g. EIO) and iloc.bh can be NULL in this case. But the error is
ignored in the following "if" condition and ext4_expand_extra_isize()
might be called with NULL iloc.bh set, which triggers NULL pointer
dereference.

This is uncovered by commit 8b4953e1 ("ext4: reserve code points for
the project quota feature"), which enlarges the ext4_inode size, and
run the following script on new kernel but with old mke2fs:

  #/bin/bash
  mnt=/mnt/ext4
  devname=ext4-error
  dev=/dev/mapper/$devname
  fsimg=/home/fs.img

  trap cleanup 0 1 2 3 9 15

  cleanup()
  {
          umount $mnt >/dev/null 2>&1
          dmsetup remove $devname
          losetup -d $backend_dev
          rm -f $fsimg
          exit 0
  }

  rm -f $fsimg
  fallocate -l 1g $fsimg
  backend_dev=`losetup -f --show $fsimg`
  devsize=`blockdev --getsz $backend_dev`

  good_tab="0 $devsize linear $backend_dev 0"
  error_tab="0 $devsize error $backend_dev 0"

  dmsetup create $devname --table "$good_tab"

  mkfs -t ext4 $dev
  mount -t ext4 -o errors=continue,strictatime $dev $mnt

  dmsetup load $devname --table "$error_tab" && dmsetup resume $devname
  echo 3 > /proc/sys/vm/drop_caches
  ls -l $mnt
  exit 0

[ Patch changed to simplify the function a tiny bit. -- Ted ]
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

c745297b

x86/mm/kmmio: Fix mmiotrace for hugepages · 8481fdf6

Karol Herbst authored Mar 03, 2016

commit cfa52c0c upstream.

Because Linux might use bigger pages than the 4K pages to handle those mmio
ioremaps, the kmmio code shouldn't rely on the pade id as it currently does.

Using the memory address instead of the page id lets us look up how big the
page is and what its base address is, so that we won't get a page fault
within the same page twice anymore.
Tested-by: Pierre Moreau <pierre.morrow@free.fr>
Signed-off-by: Karol Herbst <nouveau@karolherbst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Cc: linux-x86_64@vger.kernel.org
Cc: nouveau@lists.freedesktop.org
Cc: pq@iki.fi
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1456966991-6861-1-git-send-email-nouveau@karolherbst.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

8481fdf6

perf evlist: Reference count the cpu and thread maps at set_maps() · 36828721

Arnaldo Carvalho de Melo authored Feb 17, 2016

commit a55e5663 upstream.

We were dropping the reference we possibly held but not obtaining one
for the new maps, which we will drop at perf_evlist__delete(), fix it.

This was caught by Steven Noonan in some of the machines which would
produce this output when caught by glibc debug mechanisms:

  $ sudo perf test 21
  21: Test object code reading                                 :***
  Error in `perf': corrupted double-linked list: 0x00000000023ffcd0 ***
  ======= Backtrace: =========
  /usr/lib/libc.so.6(+0x72055)[0x7f25be0f3055]
  /usr/lib/libc.so.6(+0x779b6)[0x7f25be0f89b6]
  /usr/lib/libc.so.6(+0x7a0ed)[0x7f25be0fb0ed]
  /usr/lib/libc.so.6(__libc_calloc+0xba)[0x7f25be0fceda]
  perf(parse_events_lex_init_extra+0x38)[0x4cfff8]
  perf(parse_events+0x55)[0x4a0615]
  perf(perf_evlist__config+0xcf)[0x4eeb2f]
  perf[0x479f82]
  perf(test__code_reading+0x1e)[0x47ad4e]
  perf(cmd_test+0x5dd)[0x46452d]
  perf[0x47f4e3]
  perf(main+0x603)[0x42c723]
  /usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f25be0a1610]
  perf(_start+0x29)[0x42c859]

Further investigation using valgrind led to the reference count imbalance fixed
in this patch.
Reported-and-Tested-by: Steven Noonan <steven@uplinklabs.net>
Report-Link: http://lkml.kernel.org/r/CAKbGBLjC2Dx5vshxyGmQkcD+VwiAQLbHoXA9i7kvRB2-2opHZQ@mail.gmail.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: f30a79b0 ("perf tools: Add reference counting for cpu_map object")
Link: http://lkml.kernel.org/n/tip-j0u1bdhr47sa511sgg76kb8h@git.kernel.orgSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

36828721

drivers/misc/ad525x_dpot: AD5274 fix RDAC read back errors · ab2c82dc

Michael Hennerich authored Feb 22, 2016

commit f3df53e4 upstream.

Fix RDAC read back errors caused by a typo. Value must shift by 2.

Fixes: a4bd3949 ("drivers/misc/ad525x_dpot.c: new features")
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ab2c82dc

rtc: max77686: Properly handle regmap_irq_get_virq() error code · f5513114

Krzysztof Kozlowski authored Feb 04, 2016

commit fb166ba1 upstream.

The regmap_irq_get_virq() can return 0 or -EINVAL in error conditions
but driver checked only for value of 0.

This could lead to a cast of -EINVAL to an unsigned int used as a
interrupt number for devm_request_threaded_irq(). Although this is not
yet fatal (devm_request_threaded_irq() will just fail with -EINVAL) but
might be a misleading when diagnosing errors.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Fixes: 6f1c1e71 ("mfd: max77686: Convert to use regmap_irq")
Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

f5513114

rtc: rx8025: remove rv8803 id · 11dd7f9a

Alexandre Belloni authored Jan 21, 2016

commit aaa3cee5 upstream.

The rv8803 has its own driver that should be used. Remove its id from
the rx8025 driver.

Fixes: b1f9d790Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

11dd7f9a

rtc: ds1685: passing bogus values to irq_restore · 83fe55ba

Dan Carpenter authored Mar 02, 2016

commit 8c09b9fd upstream.

We call spin_lock_irqrestore with "flags" set to zero instead of to the
value from spin_lock_irqsave().

Fixes: aaaf5fbf ('rtc: add driver for DS1685 family of real time clocks')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

83fe55ba

rtc: vr41xx: Wire up alarm_irq_enable · 041f2ca3

Geert Uytterhoeven authored Mar 01, 2016

commit a25f4a95 upstream.

drivers/rtc/rtc-vr41xx.c:229: warning: ‘vr41xx_rtc_alarm_irq_enable’ defined but not used

Apparently the conversion to alarm_irq_enable forgot to wire up the
callback.

Fixes: 16380c15 ("RTC: Convert rtc drivers to use the alarm_irq_enable method")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

041f2ca3

rtc: hym8563: fix invalid year calculation · 1392ec2a

Alexander Kochetkov authored Mar 06, 2016

commit d5861262 upstream.

Year field must be in BCD format, according to
hym8563 datasheet.

Due to the bug year 2016 became 2010.

Fixes: dcaf0384 ("rtc: add hym8563 rtc-driver")
Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

1392ec2a

PM / Domains: Fix removal of a subdomain · cab4c949

Jon Hunter authored Mar 04, 2016

commit beda5fc1 upstream.

Commit 30e7a65b (PM / Domains: Ensure subdomain is not in use
before removing) added a test to ensure that a subdomain is not a
master to another subdomain or if any devices are using the subdomain
before removing. This change incorrectly used the "slave_links" list to
determine if the subdomain is a master to another subdomain, where it
should have been using the "master_links" list instead. The
"slave_links" list will never be empty for a subdomain and so a
subdomain can never be removed. Fix this by testing if the
"master_links" list is empty instead.

Fixes: 30e7a65b (PM / Domains: Ensure subdomain is not in use before removing)
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

cab4c949

PM / OPP: Initialize u_volt_min/max to a valid value · dabe1416

Viresh Kumar authored Feb 15, 2016

commit c88c395f upstream.

We kept u_volt_min/max initialized to 0, when only the target voltage is
present in DT, instead of the target/min/max triplet.

This didn't go well with the regulator framework, as on few calls the
min voltage was set to target and max was set to 0 and so resulted in a
kernel crash like below:

kernel BUG at ../drivers/regulator/core.c:216!

[<c0684af4>] (regulator_check_voltage) from [<c06857ac>] (regulator_set_voltage_unlocked+0x58/0x230)
[<c06857ac>] (regulator_set_voltage_unlocked) from [<c06859ac>] (regulator_set_voltage+0x28/0x54)
[<c06859ac>] (regulator_set_voltage) from [<c0775b28>] (_set_opp_voltage+0x30/0x98)
[<c0775b28>] (_set_opp_voltage) from [<c0776630>] (dev_pm_opp_set_rate+0xf0/0x28c)
[<c0776630>] (dev_pm_opp_set_rate) from [<c096f784>] (__cpufreq_driver_target+0x184/0x2b4)
[<c096f784>] (__cpufreq_driver_target) from [<c0973760>] (dbs_check_cpu+0x1b0/0x1f4)
[<c0973760>] (dbs_check_cpu) from [<c0973f30>] (cpufreq_governor_dbs+0x324/0x5c4)
[<c0973f30>] (cpufreq_governor_dbs) from [<c0970958>] (__cpufreq_governor+0xe4/0x1ec)
[<c0970958>] (__cpufreq_governor) from [<c09711e0>] (cpufreq_init_policy+0x64/0x8c)
[<c09711e0>] (cpufreq_init_policy) from [<c09718cc>] (cpufreq_online+0x2fc/0x708)
[<c09718cc>] (cpufreq_online) from [<c0765ff0>] (subsys_interface_register+0x94/0xd8)
[<c0765ff0>] (subsys_interface_register) from [<c0970530>] (cpufreq_register_driver+0x14c/0x19c)
[<c0970530>] (cpufreq_register_driver) from [<c09746dc>] (dt_cpufreq_probe+0x70/0xec)
[<c09746dc>] (dt_cpufreq_probe) from [<c076907c>] (platform_drv_probe+0x4c/0xb0)
[<c076907c>] (platform_drv_probe) from [<c07678e0>] (driver_probe_device+0x214/0x2c0)
[<c07678e0>] (driver_probe_device) from [<c0767a18>] (__driver_attach+0x8c/0x90)
[<c0767a18>] (__driver_attach) from [<c0765c2c>] (bus_for_each_dev+0x68/0x9c)
[<c0765c2c>] (bus_for_each_dev) from [<c0766d78>] (bus_add_driver+0x1a0/0x218)
[<c0766d78>] (bus_add_driver) from [<c076810c>] (driver_register+0x78/0xf8)
[<c076810c>] (driver_register) from [<c0301d74>] (do_one_initcall+0x90/0x1d8)
[<c0301d74>] (do_one_initcall) from [<c1100e14>] (kernel_init_freeable+0x15c/0x1fc)
[<c1100e14>] (kernel_init_freeable) from [<c0b27a0c>] (kernel_init+0x8/0xf0)
[<c0b27a0c>] (kernel_init) from [<c0307d78>] (ret_from_fork+0x14/0x3c)
Code: e1550004 baffffeb e3a00000 e8bd8070 (e7f001f2)

Fix that by initializing u_volt_min/max to the target voltage in such cases.
Reported-and-tested-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Fixes: 27465902 (PM / OPP: Add support to parse "operating-points-v2" bindings)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

dabe1416

misc: mic/scif: fix wrap around tests · 4f8e29e7

Dan Carpenter authored Oct 19, 2015

commit 7b64dbf8 upstream.

Signed integer overflow is undefined.  Also I added a check for
"(offset < 0)" in scif_unregister() because that makes it match the
other conditions and because I didn't want to subtract a negative.

Fixes: ba612aa8 ('misc: mic: SCIF memory registration and unregistration')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

4f8e29e7

misc/bmp085: Enable building as a module · 01c8261c

Ben Hutchings authored Dec 14, 2015

commit 50e6315d upstream.

Commit 985087db 'misc: add support for bmp18x chips to the bmp085
driver' changed the BMP085 config symbol to a boolean.  I see no
reason why the shared code cannot be built as a module, so change it
back to tristate.

Fixes: 985087db ("misc: add support for bmp18x chips to the bmp085 driver")
Cc: Eric Andersson <eric.andersson@unixphere.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

01c8261c

lib/mpi: Endianness fix · 81b3a56e

Michal Marek authored Feb 17, 2016

commit 3ee0cb5f upstream.

The limbs are integers in the host endianness, so we can't simply
iterate over the individual bytes. The current code happens to work on
little-endian, because the order of the limbs in the MPI array is the
same as the order of the bytes in each limb, but it breaks on
big-endian.

Fixes: 0f74fbf7 ("MPI: Fix mpi_read_buffer")
Signed-off-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

81b3a56e

fbdev: da8xx-fb: fix videomodes of lcd panels · 0658e8c5

Sushaanth Srirangapathi authored Feb 29, 2016

commit 713fced8 upstream.

Commit 028cd86b ("video: da8xx-fb: fix the polarities of the
hsync/vsync pulse") fixes polarities of HSYNC/VSYNC pulse but
forgot to update known_lcd_panels[] which had sync values
according to old logic. This breaks LCD at least on DA850 EVM.

This patch fixes this issue and I have tested this for panel
"Sharp_LK043T1DG01" using DA850 EVM board.

Fixes: 028cd86b ("video: da8xx-fb: fix the polarities of the hsync/vsync pulse")
Signed-off-by: Sushaanth Srirangapathi <sushaanth.s@ti.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

0658e8c5

scsi_dh: force modular build if SCSI is a module · 9d9fefc8

Arnd Bergmann authored Jan 27, 2016

commit 0c994c03 upstream.

When the scsi_dh core was moved into the scsi core module,
CONFIG_SCSI_DH became a 'bool' option, and now anything depending on it
can be built-in even when CONFIG_SCSI=m. This of course cannot link
successfully:

drivers/scsi/built-in.o: In function `rdac_init':
scsi_dh_alua.c:(.init.text+0x14): undefined reference to `scsi_register_device_handler'
scsi_dh_alua.c:(.init.text+0x64): undefined reference to `scsi_unregister_device_handler'
drivers/scsi/built-in.o: In function `alua_init':
scsi_dh_alua.c:(.init.text+0xb0): undefined reference to `scsi_register_device_handler'

As a workaround, this adds an extra dependency on CONFIG_SCSI, so
Kconfig can figure out whether built-in is allowed or not.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 086b91d0 ("scsi_dh: integrate into the core SCSI code")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

9d9fefc8

paride: make 'verbose' parameter an 'int' again · aea6995a

Arnd Bergmann authored Mar 15, 2016

commit dec63a4d upstream.

gcc-6.0 found an ancient bug in the paride driver, which had a
"module_param(verbose, bool, 0);" since before 2.6.12, but actually uses
it to accept '0', '1' or '2' as arguments:

  drivers/block/paride/pd.c: In function 'pd_init_dev_parms':
  drivers/block/paride/pd.c:298:29: warning: comparison of constant '1' with boolean expression is always false [-Wbool-compare]
   #define DBMSG(msg) ((verbose>1)?(msg):NULL)

In 2012, Rusty did a cleanup patch that also changed the type of the
variable to 'bool', which introduced what is now a gcc warning.

This changes the type back to 'int' and adapts the module_param() line
instead, so it should work as documented in case anyone ever cares about
running the ancient driver with debugging.

Fixes: 90ab5ee9 ("module_param: make bool parameters really bool (drivers & misc)")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Rusty Russell <rusty@rustcorp.com.au>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

aea6995a

regulator: s5m8767: fix get_register() error handling · 72291d61

Arnd Bergmann authored Feb 16, 2016

commit e07ff943 upstream.

The s5m8767_pmic_probe() function calls s5m8767_get_register() to
read data without checking the return code, which produces a compile-time
warning when that data is accessed:

drivers/regulator/s5m8767.c: In function 's5m8767_pmic_probe':
drivers/regulator/s5m8767.c:924:7: error: 'enable_reg' may be used uninitialized in this function [-Werror=maybe-uninitialized]
drivers/regulator/s5m8767.c:944:30: error: 'enable_val' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This changes the s5m8767_get_register() function to return a -EINVAL
not just for an invalid register number but also for an invalid
regulator number, as both would result in returning uninitialized
data. The s5m8767_pmic_probe() function is then changed accordingly
to fail on a read error, as all the other callers of s5m8767_get_register()
already do.

In practice this probably cannot happen, as we don't call
s5m8767_get_register() with invalid arguments, but the gcc
warning seems valid in principle, in terms writing safe
error checking.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9c4c6055 ("regulator: s5m8767: Convert to use regulator_[enable|disable|is_enabled]_regmap")
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

72291d61

irqchip/mxs: Fix error check of of_io_request_and_map() · e60711a1

Vladimir Zapolskiy authored Mar 09, 2016

commit edf8fcdc upstream.

The of_io_request_and_map() returns a valid pointer in iomem region or
ERR_PTR(), check for NULL always fails and may cause a NULL pointer
dereference on error path.

Fixes: 25e34b44 ("irqchip/mxs: Prepare driver for hardware with different offsets")
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Oleksij Rempel <linux@rempel-privat.de>
Cc: Sascha Hauer <kernel@pengutronix.de>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1457486500-10237-1-git-send-email-vz@mleia.comSigned-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

e60711a1

irqchip/sunxi-nmi: Fix error check of of_io_request_and_map() · fd66dc5d

Vladimir Zapolskiy authored Mar 09, 2016

commit cfe199af upstream.

The of_io_request_and_map() returns a valid pointer in iomem region or
ERR_PTR(), check for NULL always fails and may cause a NULL pointer
dereference on error path.

Fixes: 0e841b04 ("irqchip/sunxi-nmi: Switch to of_io_request_and_map() from of_iomap()")
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1457486489-10189-1-git-send-email-vz@mleia.comSigned-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

fd66dc5d

spi/rockchip: Make sure spi clk is on in rockchip_spi_set_cs · 791e8462

Huibin Hong authored Feb 24, 2016

commit b920cc31 upstream.

Rockchip_spi_set_cs could be called by spi_setup, but
spi_setup may be called by device driver after runtime suspend.
Then the spi clock is closed, rockchip_spi_set_cs may access the
spi registers, which causes cpu block in some socs.

Fixes: 64e36824 ("spi/rockchip: add driver for Rockchip RK3xxx")
Signed-off-by: Huibin Hong <huibin.hong@rock-chips.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

791e8462

locking/mcs: Fix mcs_spin_lock() ordering · 23a67ddd

Peter Zijlstra authored Feb 01, 2016

commit 920c720a upstream.

Similar to commit b4b29f94 ("locking/osq: Fix ordering of node
initialisation in osq_lock") the use of xchg_acquire() is
fundamentally broken with MCS like constructs.

Furthermore, it turns out we rely on the global transitivity of this
operation because the unlock path observes the pointer with a
READ_ONCE(), not an smp_load_acquire().

This is non-critical because the MCS code isn't actually used and
mostly serves as documentation, a stepping stone to the more complex
things we've build on top of the idea.
Reported-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 3552a07a ("locking/mcs: Use acquire/release semantics")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

23a67ddd

regulator: core: Fix nested locking of supplies · 5a58f809

Thierry Reding authored Dec 02, 2015

commit 70a7fb80 upstream.

Commit fa731ac7 ("regulator: core: avoid unused variable warning")
introduced a subtle change in how supplies are locked. Where previously
code was always locking the regulator of the current iteration, the new
implementation only locks the regulator if it has a supply. For any
given power tree that means that the root will never get locked.

On the other hand the regulator_unlock_supply() will still release all
the locks, which in turn causes the lock debugging code to warn about a
mutex being unlocked which wasn't locked.

Cc: Mark Brown <broonie@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Fixes: fa731ac7 ("regulator: core: avoid unused variable warning")
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

5a58f809

regulator: core: Ensure we lock all regulators · 29c9f634

Mark Brown authored Dec 01, 2015

commit 49a6bb7a upstream.

The latest workaround for the lockdep interface's not using the second
argument of mutex_lock_nested() changed the loop missed locking the last
regulator due to a thinko with the loop termination condition exiting
one regulator too soon.
Reported-by: Tyler Baker <tyler.baker@linaro.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

29c9f634

regulator: core: fix regulator_lock_supply regression · f500da32

Arnd Bergmann authored Nov 27, 2015

commit bb41897e upstream.

As noticed by Geert Uytterhoeven, my patch to avoid a harmless build warning
in regulator_lock_supply() was total crap and introduced a real bug:

> [ BUG: bad unlock balance detected! ]
> kworker/u4:0/6 is trying to release lock (&rdev->mutex) at:
> [<c0247b84>] regulator_set_voltage+0x38/0x50

we still lock the regulator supplies, but not the actual regulators,
so we are missing a lock, and the unlock is unbalanced.

This rectifies it by first locking the regulator device itself before
using the same loop as before to lock its supplies.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 716fec9d1965 ("[SUBMITTED] regulator: core: avoid unused variable warning")
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

f500da32

Revert "regulator: core: Fix nested locking of supplies" · 34af67eb

Greg Kroah-Hartman authored May 02, 2016

This reverts commit b1999fa6 which was
commit 70a7fb80 upstream.

It causes run-time breakage in the 4.4-stable tree and more patches are
needed to be applied first before this one in order to resolve the
issue.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

34af67eb

videobuf2-v4l2: Verify planes array in buffer dequeueing · 19a4e46b

Sakari Ailus authored Apr 03, 2016

commit 2c1f6951 upstream.

When a buffer is being dequeued using VIDIOC_DQBUF IOCTL, the exact buffer
which will be dequeued is not known until the buffer has been removed from
the queue. The number of planes is specific to a buffer, not to the queue.

This does lead to the situation where multi-plane buffers may be requested
and queued with n planes, but VIDIOC_DQBUF IOCTL may be passed an argument
struct with fewer planes.

__fill_v4l2_buffer() however uses the number of planes from the dequeued
videobuf2 buffer, overwriting kernel memory (the m.planes array allocated
in video_usercopy() in v4l2-ioctl.c)  if the user provided fewer
planes than the dequeued buffer had. Oops!

Fixes: b0e0e1f8 ("[media] media: videobuf2: Prepare to divide videobuf2")
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

19a4e46b

videobuf2-core: Check user space planes array in dqbuf · 3a4b3d18

Sakari Ailus authored Apr 03, 2016

commit e7e0c3e2 upstream.

The number of planes in videobuf2 is specific to a buffer. In order to
verify that the planes array provided by the user is long enough, a new
vb2_buf_op is required.

Call __verify_planes_array() when the dequeued buffer is known. Return an
error to the caller if there was one, otherwise remove the buffer from the
done list.
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3a4b3d18

USB: usbip: fix potential out-of-bounds write · 4a1bb501

Ignat Korchagin authored Mar 17, 2016

commit b348d7dd upstream.

Fix potential out-of-bounds write to urb->transfer_buffer
usbip handles network communication directly in the kernel. When receiving a
packet from its peer, usbip code parses headers according to protocol. As
part of this parsing urb->actual_length is filled. Since the input for
urb->actual_length comes from the network, it should be treated as untrusted.
Any entity controlling the network may put any value in the input and the
preallocated urb->transfer_buffer may not be large enough to hold the data.
Thus, the malicious entity is able to write arbitrary data to kernel memory.
Signed-off-by: Ignat Korchagin <ignat.korchagin@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

4a1bb501

cgroup: make sure a parent css isn't freed before its children · 3c6266d5

Tejun Heo authored Jan 21, 2016

commit 8bb5ef79 upstream.

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

The previous patch updated ordering for css_offline() which fixes the
cpu controller issue.  While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.

css_free() ordering can be trivially fixed by moving putting of the
parent css below css_free() invocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3c6266d5

mm/hwpoison: fix wrong num_poisoned_pages accounting · 36abe727

Minchan Kim authored Apr 28, 2016

commit d7e69488 upstream.

Currently, migration code increses num_poisoned_pages on *failed*
migration page as well as successfully migrated one at the trial of
memory-failure.  It will make the stat wrong.  As well, it marks the
page as PG_HWPoison even if the migration trial failed.  It would mean
we cannot recover the corrupted page using memory-failure facility.

This patches fixes it.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

36abe727

mm: vmscan: reclaim highmem zone if buffer_heads is over limit · 87c855f1

Minchan Kim authored Apr 28, 2016

commit 7bf52fb8 upstream.

We have been reclaimed highmem zone if buffer_heads is over limit but
commit 6b4f7799 ("mm: vmscan: invoke slab shrinkers from
shrink_zone()") changed the behavior so it doesn't reclaim highmem zone
although buffer_heads is over the limit.  This patch restores the logic.

Fixes: 6b4f7799 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

87c855f1

numa: fix /proc/<pid>/numa_maps for THP · e513b90a

Gerald Schaefer authored Apr 28, 2016

commit 28093f9f upstream.

In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture.  On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available.  In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped.  On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

e513b90a

mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check · be591a68

Konstantin Khlebnikov authored Apr 28, 2016

commit 3486b85a upstream.

Khugepaged detects own VMAs by checking vm_file and vm_ops but this way
it cannot distinguish private /dev/zero mappings from other special
mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap.

This fixes false-positive VM_BUG_ON and prevents installing THP where
they are not expected.

Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com
Fixes: 78f11a25 ("mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups")
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

be591a68

memcg: relocate charge moving from ->attach to ->post_attach · 52526076

Tejun Heo authored Apr 21, 2016

commit 264a0ae1 upstream.

Hello,

So, this ended up a lot simpler than I originally expected.  I tested
it lightly and it seems to work fine.  Petr, can you please test these
two patches w/o the lru drain drop patch and see whether the problem
is gone?

Thanks.
------ 8< ------
If charge moving is used, memcg performs relabeling of the affected
pages from its ->attach callback which is called under both
cgroup_threadgroup_rwsem and thus can't create new kthreads.  This is
fragile as various operations may depend on workqueues making forward
progress which relies on the ability to create new kthreads.

There's no reason to perform charge moving from ->attach which is deep
in the task migration path.  Move it to ->post_attach which is called
after the actual migration is finished and cgroup_threadgroup_rwsem is
dropped.

* move_charge_struct->mm is added and ->can_attach is now responsible
  for pinning and recording the target mm.  mem_cgroup_clear_mc() is
  updated accordingly.  This also simplifies mem_cgroup_move_task().

* mem_cgroup_move_task() is now called from ->post_attach instead of
  ->attach.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Debugged-and-tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: Cyril Hrubis <chrubis@suse.cz>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Fixes: 1ed13287 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

52526076

cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · d5209747

Tejun Heo authored Apr 21, 2016

commit 5cf1cacb upstream.

Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

d5209747

slub: clean up code for kmem cgroup support to kmem_cache_free_bulk · a4e25ff3

Jesper Dangaard Brouer authored Mar 15, 2016

commit 376bf125 upstream.

This change is primarily an attempt to make it easier to realize the
optimizations the compiler performs in-case CONFIG_MEMCG_KMEM is not
enabled.

Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the
overhead is zero.  This is because, as long as no process have enabled
kmem cgroups accounting, the assignment is replaced by asm-NOP
operations.  This is possible because memcg_kmem_enabled() uses a
static_key_false() construct.

It also helps readability as it avoid accessing the p[] array like:
p[size - 1] which "expose" that the array is processed backwards inside
helper function build_detached_freelist().

Lastly this also makes the code more robust, in error case like passing
NULL pointers in the array.  Which were previously handled before commit
03374518 ("slub: add missing kmem cgroup support to
kmem_cache_free_bulk").

Fixes: 03374518 ("slub: add missing kmem cgroup support to kmem_cache_free_bulk")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

a4e25ff3

workqueue: fix ghost PENDING flag while doing MQ IO · 2da9606a

Roman Pen authored Apr 26, 2016

commit 346c09f8 upstream.

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  reqeust A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result request B pended forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2da9606a