- 31 Jan, 2017 2 commits
-
-
Vitaly Kuznetsov authored
BugLink: http://bugs.launchpad.net/bugs/1630924

There is a feature in Hyper-V ('Debug-VM --InjectNonMaskableInterrupt') which injects an NMI into the guest. We may want to crash the guest and do kdump on this NMI by enabling unknown_nmi_panic. To make kdump succeed we need to allow the kdump kernel to re-establish the VMBus connection so it will see VMBus devices (storage, network, ...).

To properly unload VMBus, making it possible to start over during kdump, we need to do the following:

- Send an 'unload' message to the hypervisor. This can be done on any CPU, so we do this on the crashing CPU.
- Receive the 'unload finished' reply message. WS2012R2 delivers this message to the CPU which was used to establish the VMBus connection during module load, and this CPU may differ from the CPU sending 'unload'.

Receiving a VMBus message means the following:

- There is a per-CPU slot in memory for one message. This slot can in theory be accessed by any CPU.
- We get an interrupt on the CPU when a message is placed into the slot.
- When we read the message we need to clear the slot and signal the fact to the hypervisor. In case there are more messages pending for this CPU, the hypervisor will deliver the next message. The signaling is done by writing to an MSR, so this can only be done on the appropriate CPU.

To avoid doing cross-CPU work on crash we have the vmbus_wait_for_unload() function, which checks the message slots for all CPUs in a loop, waiting for the 'unload finished' message. However, an issue arises when these conditions are met:

- We're crashing on a CPU which is different from the one which was used to initially contact the hypervisor.
- The CPU which was used for the initial contact is blocked with interrupts disabled and there is a message pending in the message slot.

In this case we won't be able to read the 'unload finished' message on the crashing CPU. This is reproducible when we receive unknown NMIs on all CPUs simultaneously: the first CPU entering panic() will proceed to crash and all other CPUs will stop themselves with interrupts disabled.

The suggested solution is to handle unknown NMIs for Hyper-V guests only on the first CPU which gets them. This allows us to rely on the VMBus interrupt handler being able to receive the 'unload finished' message in case it is delivered to a different CPU.

The issue is not reproducible on WS2016, as Debug-VM delivers the NMI to the boot CPU only; WS2012R2 and earlier Hyper-V versions are affected.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: devel@linuxdriverproject.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Link: http://lkml.kernel.org/r/20161202100720.28121-1-vkuznets@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 59107e2f)
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
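The fix amounts to registering an unknown-NMI handler that lets only the first CPU to receive the NMI proceed to panic. A minimal sketch of the idea, assuming the x86 NMI handler API (register_nmi_handler()/NMI_UNKNOWN); the function names and registration flags here are illustrative rather than the verbatim upstream patch:

#include <linux/nmi.h>
#include <linux/smp.h>
#include <asm/nmi.h>

/*
 * Sketch: when unknown_nmi_panic is set, only the first CPU that
 * receives the unknown NMI proceeds to panic; every other CPU claims
 * the NMI as handled, so it stays able to service the VMBus interrupt
 * that delivers the 'unload finished' message.
 */
static unsigned int hv_unknown_nmi_handler(unsigned int val,
					   struct pt_regs *regs)
{
	static atomic_t nmi_cpu = ATOMIC_INIT(-1);

	if (!unknown_nmi_panic)
		return NMI_DONE;

	/* First CPU in claims the NMI and falls through to panic. */
	if (atomic_cmpxchg(&nmi_cpu, -1, raw_smp_processor_id()) == -1)
		return NMI_DONE;

	/* Another CPU already owns it: swallow the NMI here. */
	return NMI_HANDLED;
}

static void __init hv_nmi_init(void)
{
	register_nmi_handler(NMI_UNKNOWN, hv_unknown_nmi_handler, 0,
			     "hv_unknown_nmi");
}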
-
Joseph Salisbury authored
BugLink: http://bugs.launchpad.net/bugs/1630238

Set CONFIG_PINCTRL_CHERRYVIEW=y (it must be y, not m) in addition to CONFIG_TOUCHSCREEN_ELAN to get the touchscreen working on CYAN. The change of the Intel-related PINCTRLs (specifically the CHERRYVIEW one) to m instead of y causes at least the touchscreen regression.

Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
- 30 Jan, 2017 3 commits
-
-
Colin Ian King authored
BugLink: http://bugs.launchpad.net/bugs/1652132

Quirking the following AMI USB device with ALWAYS_POLL fixes an AMI virtual keyboard and mouse that stop responding and time out when attached to a ppc64el Power 8 system while there are rapid open/closes on the mouse device.

usb 1-3: new high-speed USB device number 2 using xhci_hcd
usb 1-3: New USB device found, idVendor=046b, idProduct=ff01
usb 1-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-3: Product: Virtual Hub
usb 1-3: Manufacturer: American Megatrends Inc.
usb 1-3: SerialNumber: serial
usb 1-3.3: new high-speed USB device number 3 using xhci_hcd
usb 1-3.3: New USB device found, idVendor=046b, idProduct=ff31
usb 1-3.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-3.3: Product: Virtual HardDisk Device
usb 1-3.3: Manufacturer: American Megatrends Inc.
usb 1-3.4: new low-speed USB device number 4 using xhci_hcd
usb 1-3.4: New USB device found, idVendor=046b, idProduct=ff10
usb 1-3.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 1-3.4: Product: Virtual Keyboard and Mouse
usb 1-3.4: Manufacturer: American Megatrends Inc.

With the quirk I have not been able to trigger the issue with half an hour of saturation soak testing.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
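In kernels of this vintage such quirks are entries in the usbhid quirk table in drivers/hid/usbhid/hid-quirks.c. A hedged sketch of what the entry for the 046b:ff10 device looks like; the two macro names are illustrative placeholders (the real identifiers live in drivers/hid/hid-ids.h):

/* Illustrative IDs for the AMI virtual keyboard and mouse (046b:ff10). */
#define USB_VENDOR_ID_AMI			0x046b
#define USB_DEVICE_ID_AMI_VIRT_KBD_MOUSE	0xff10

static const struct hid_blacklist {
	__u16 idVendor;
	__u16 idProduct;
	__u32 quirks;
} hid_blacklist[] = {
	/* ... existing entries ... */
	{ USB_VENDOR_ID_AMI, USB_DEVICE_ID_AMI_VIRT_KBD_MOUSE,
	  HID_QUIRK_ALWAYS_POLL },
	{ 0, 0 }
};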
-
Kamal Mostafa authored
This long-standing oversight in our debian rules hardcodes the string "linux" instead of using $(src_pkg_name) for just one of the generated .deb package names: linux-headers-x.x.x-x. Let's fix it in the generic branches (T,X,Y,Z,unstable) so that we won't have to keep applying this patch to each of the derivative/custom kernels.

-----8<-----
Ignore: yes

Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Ben Romer <ben.romer@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Alexander Duyck authored
BugLink: http://bugs.launchpad.net/bugs/1658491

When I was adding the code for enabling VLAN promiscuous mode with SR-IOV enabled, I had inadvertently left the VLNCTRL.VFE bit unchanged, as I had assumed there was code in another path that was setting it when we enabled SR-IOV. This wasn't the case, and as a result we were apparently just disabling VLAN filtering for all the VFs. Also, the previous patches were always clearing CFIEN, which was always set to 0 by the hardware anyway, so I am dropping the redundant bit clearing.

Fixes: 16369564 ("ixgbe: Add support for VLAN promiscuous with SR-IOV")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry-picked from commit f60439bc)
Fixes: 6c39aa93 ("ixgbe: Add support for VLAN promiscuous with SR-IOV")
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
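The substance of the fix is a read-modify-write of the VLNCTRL register that re-enables the VLAN filter (VFE) bit whenever SR-IOV is active. A hedged sketch using the standard ixgbe register accessors; the surrounding function and condition are simplified from the real driver code:

/* Sketch: keep VLAN filtering enabled so VF traffic is filtered. */
static void ixgbe_enable_vlan_filtering(struct ixgbe_adapter *adapter)
{
	struct ixgbe_hw *hw = &adapter->hw;
	u32 vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);

	/* CFIEN is left alone: the hardware already holds it at 0. */
	vlnctrl |= IXGBE_VLNCTRL_VFE;
	IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
}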
-
- 26 Jan, 2017 23 commits
-
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

There have been several reports about premature OOM killer invocation in the 4.7 kernel, where an order-2 allocation request (for the kernel stack) invoked the OOM killer even during basic workloads (light IO or even a kernel compile on some filesystems). In all reported cases the memory is fragmented and there are no order-2+ pages available. There is usually a large amount of slab memory (usually dentries/inodes), and further debugging has shown that there are way too many unmovable blocks which are skipped during the compaction. Multiple reporters have confirmed that the current linux-next, which includes [1] and [2], helped and OOMs are not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the compaction feedback and retry as long as there is reclaim progress and we are not getting OOM for order-0 pages. We already do that for CONFIG_COMPACTION=n, so let's reuse the same code when compaction is enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

Fixes: 0a0337e0 ("mm, oom: rework oom detection")
Link: http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Olaf Hering <olaf@aepfle.de>
Tested-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Cc: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6b4e3181)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

Joonsoo has reported that he is able to trigger OOM for !costly high order requests (a heavy fork() workload close to OOM) with the new oom detection rework. This is because we rely only on should_reclaim_retry when compaction is disabled, and it only checks watermarks for the requested order, so we might trigger OOM when there is a lot of free memory.

It is not very clear what the usual workloads are when compaction is disabled. Relying heavily on high order allocations without any mechanism to create those orders except for an unbound amount of reclaim is certainly not a good idea. To prevent potential regressions, let's give this configuration some help. We have to sacrifice determinism though, because there simply is none possible here. The should_compact_retry implementation for !CONFIG_COMPACTION, which was empty so far, will do a watermark check for order-0 on all eligible zones. This will cause retrying until either the reclaim cannot make any further progress or all the zones are depleted even for order-0 pages. This means that the number of retries is basically unbounded for !costly orders, but that was the case before the rework as well, so this shouldn't regress.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.org
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(backported from commit 31e49bfd)
[cascardo: fixup ac_classzone_idx to ac->classzone_idx]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
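The previously-empty !CONFIG_COMPACTION stub of should_compact_retry() gains an order-0 watermark walk over the eligible zones. A hedged sketch of the shape of that helper; the argument list is abridged and the precise watermark helper may differ in the backport:

#ifndef CONFIG_COMPACTION
/* Sketch: retry as long as at least one eligible zone could satisfy
 * an order-0 request, i.e. reclaim still has a chance to help. */
static inline bool
should_compact_retry(struct alloc_context *ac, unsigned int order,
		     int alloc_flags, enum compact_result compact_result,
		     int compaction_retries)
{
	struct zone *zone;
	struct zoneref *z;

	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
		return false;

	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
				      ac->classzone_idx, alloc_flags))
			return true;
	}
	return false;
}
#endif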
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

"mm: consider compaction feedback also for costly allocation" has removed the upper bound for the reclaim/compaction retries based on the number of reclaimed pages for costly orders. While this is desirable, the patch missed an unwanted interaction between reclaim, compaction and the retry logic.

The direct reclaim tries to get zones over the min watermark, while compaction backs off and returns COMPACT_SKIPPED when all zones are below the low watermark + 1<<order gap. If we are getting really close to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED for a high order request (e.g. hugetlb order-9) while the reclaim is not able to release enough pages to get us over the low watermark. The reclaim is still able to make some progress (usually thrashing over the few remaining pages), so we are not able to break out from the loop. I have seen this happening with the same test described in "mm: consider compaction feedback also for costly allocation" on a swapless system. The original problem got resolved by "vmscan: consider classzone_idx in compaction_ready", but it shows how things might go wrong when we approach the OOM event horizon.

The reason why compaction requires being over the low rather than the min watermark is not clear to me. This check has been there essentially since 56de7263 ("mm: compaction: direct compact when a high-order allocation fails"). It is clearly an implementation detail though, and we shouldn't pull it into the generic retry logic; we should be able to cope with such an eventuality. The only place in should_compact_retry where we retry without any upper bound is the compaction_withdrawn() case.

Introduce a compaction_zonelist_suitable function which checks the given zonelist and returns true only if there is at least one zone which would unblock __compaction_suitable if more memory got reclaimed. In this implementation it checks __compaction_suitable with NR_FREE_PAGES plus part of the reclaimable memory as the target for the watermark check. The reclaimable memory is reduced linearly by the allocation order. The idea is that we do not want to reclaim all the remaining memory for a single allocation request just to unblock __compaction_suitable, which doesn't guarantee we will make further progress.

The new helper is then used if compaction_withdrawn() feedback was provided, so we do not retry if there is no outlook for further progress. !costly requests shouldn't be affected much - e.g. order-2 pages would require at least 64kB on the reclaimable LRUs, while order-9 would need at least 32M, which should be enough to not lock up.

[vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
[akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(backported from commit 86a294a8)
[cascardo: fixup ac_classzone_idx to ac->classzone_idx]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

The PAGE_ALLOC_COSTLY_ORDER retry logic is currently mostly handled inside should_reclaim_retry, where we decide not to retry after at least order worth of pages were reclaimed, or the watermark check for at least one zone would succeed after reclaiming all pages if the reclaim hasn't made any progress. Compaction feedback is mostly ignored and we just try to make sure that the compaction did at least something before giving up.

The first condition was added by a41f24ea ("page allocator: smarter retry of costly-order allocations") and it assumed that lumpy reclaim could have created a page of the sufficient order. Lumpy reclaim has been removed quite some time ago, so the assumption doesn't hold anymore. Remove the check for the number of reclaimed pages and rely on the compaction feedback solely. should_reclaim_retry now only makes sure that we keep retrying reclaim for high order pages if they are hidden by watermarks, so that order-0 reclaim really makes sense.

should_compact_retry now keeps retrying even for costly allocations. The number of retries is reduced compared to !costly requests because they are less important and harder to grant, so their pressure shouldn't cause contention for other requests or cause over-reclaim. We also do not reset no_progress_loops for costly requests to make sure we do not keep reclaiming too aggressively.

This has been tested by running a process which fragments memory:
- compact memory
- mmap a large portion of the memory (1920M on a 2G RAM machine with 2G of swapspace)
- MADV_DONTNEED a single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1) steps until a certain amount of memory is freed (250M in my test), reducing the step to (step / 2) + 1 after reaching the end of the mapping
- then run a script which populates the page cache with 2G (MemTotal) from /dev/zero to a new file

and then tries to allocate nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo) huge pages.

root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory; echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M

Node 0, zone DMA 31 28 31 10 2 0 2 1 2 3 1
Node 0, zone DMA32 437 319 171 50 28 25 20 16 16 14 437
* This is /proc/buddyinfo after the compaction

Done fragmenting. size=2013265920 freed=262144000
Node 0, zone DMA 165 48 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 35109 14575 185 51 41 12 6 0 0 0 0
* /proc/buddyinfo after memory got fragmented

Executing "/root/alloc_hugepages.sh"
Eating some pagecache
508623+0 records in
508623+0 records out
2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 111 344 153 20 24 10 3 0 0 0 0
* /proc/buddyinfo after the page cache got eaten

Trying to allocate 129
129
* 129 hugepages requested and all of them granted.
Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 127 97 30 99 11 6 2 1 4 0 0
* /proc/buddyinfo after hugetlb allocation

10 runs will behave as follows:
Trying to allocate 130 130 -- Trying to allocate 129 129 -- Trying to allocate 128 128 -- Trying to allocate 129 129 -- Trying to allocate 128 128 -- Trying to allocate 129 129 -- Trying to allocate 132 132 -- Trying to allocate 129 129 -- Trying to allocate 128 128 -- Trying to allocate 129 129

So basically 100% success for all 10 attempts. Without the patch the numbers looked much worse:
Trying to allocate 128 12 -- Trying to allocate 129 14 -- Trying to allocate 129 7 -- Trying to allocate 129 16 -- Trying to allocate 129 30 -- Trying to allocate 129 38 -- Trying to allocate 129 19 -- Trying to allocate 129 37 -- Trying to allocate 129 28 -- Trying to allocate 129 37

Just for completeness, the base kernel without the oom detection rework looks as follows:
Trying to allocate 127 30 -- Trying to allocate 129 12 -- Trying to allocate 129 52 -- Trying to allocate 128 32 -- Trying to allocate 129 12 -- Trying to allocate 129 10 -- Trying to allocate 129 32 -- Trying to allocate 128 14 -- Trying to allocate 128 16 -- Trying to allocate 129 8

As we can see, the success rate is much more volatile and smaller without this patch. So the patch not only makes the retry logic for costly requests more sensible, the success rate is even higher.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7854ea6c)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

should_reclaim_retry will give up retries for higher order allocations if none of the eligible zones has any requested or higher order pages available, even if we pass the watermark check for order-0. This is done because there is no guarantee that the reclaimable and currently free pages will form the required order.

This can, however, lead to situations where the high-order request (e.g. order-2 required for the stack allocation during fork) will trigger OOM too early - e.g. after the first reclaim/compaction round. Such a system would have to be highly fragmented, and there is no guarantee further reclaim/compaction attempts would help, but at least make sure that the compaction was active before we go OOM and keep retrying even if should_reclaim_retry tells us to OOM if
- the last compaction round backed off or
- we haven't completed at least MAX_COMPACT_RETRIES active compaction rounds.

The first rule ensures that the very last attempt for compaction was not ignored, while the second guarantees that the compaction has done some work. Multiple retries might be needed to prevent occasional piggybacking of other contexts to steal the compacted pages before the current context manages to retry to allocate them.

compaction_failed() is taken as a final word from the compaction that the retry doesn't make much sense. We have to be careful though, because the first compaction round is MIGRATE_ASYNC, which is rather weak as it ignores pages under writeback and gives up too easily in other situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode has been used before we give up. With this logic in place we do not have to increase the migration mode unconditionally and rather do it only if the compaction failed for the weaker mode. A nice side effect is that the stronger migration mode is used only when really needed, so this has a potential of smaller latencies in some cases.

Please note that the compaction doesn't tell us much about how successful it was when returning compaction_made_progress, so we just have to blindly trust that another retry is worthwhile and cap the number to something reasonable to guarantee a convergence. If the given number of successful retries is not sufficient for reasonable workloads we should focus on the collected compaction tracepoints data and try to address the issue in the compaction code. If this is not feasible we can increase the retries limit.

[mhocko@suse.com: fix warning]
Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 33c2d214)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

Compaction can provide a wild variety of feedback to the caller. Much of it is implementation specific, and the caller of the compaction (especially the page allocator) shouldn't be bound to the specifics of the current implementation. This patch abstracts the feedback into three basic types:
- compaction_made_progress - compaction was active and made some progress.
- compaction_failed - compaction failed and further attempts to invoke it would most probably fail, therefore it is not worth retrying.
- compaction_withdrawn - compaction wasn't invoked for implementation specific reasons. In the current implementation it means that the compaction was deferred, contended, or the page scanners met too early without any progress. Retrying is still worthwhile.

[vbabka@suse.cz: do not change thp back off behavior]
[akpm@linux-foundation.org: fix typo in comment, per Hillf]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(backported from commit cab1802b)
[cascardo: fixup as we don't have kcompactd]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
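The abstraction boils down to three small predicates over the compact_result enum. A hedged sketch of their shape; the exact set of enum values each helper tests is abridged (the backport differs slightly since it lacks kcompactd):

/* Sketch of the three feedback predicates. */
static inline bool compaction_made_progress(enum compact_result result)
{
	/* A page of the requested order was freed up. */
	return result == COMPACT_PARTIAL;
}

static inline bool compaction_failed(enum compact_result result)
{
	/* The whole zone was scanned and nothing suitable was formed. */
	return result == COMPACT_COMPLETE;
}

static inline bool compaction_withdrawn(enum compact_result result)
{
	/* Compaction backed off; retrying may still help. */
	return result == COMPACT_SKIPPED ||
	       result == COMPACT_DEFERRED ||
	       result == COMPACT_CONTENDED ||
	       result == COMPACT_PARTIAL_SKIPPED;
}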
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

COMPACT_COMPLETE now means that compaction and the free scanner met. This is not very useful information if somebody just wants to use this feedback and make any decisions based on it. The current caller might be a poor guy who just happened to scan a tiny portion of the zone, and that could be the reason no suitable pages were compacted. Make sure we distinguish the full and partial zone walks. Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success and be optimistic in retrying.

The existing users of COMPACT_COMPLETE are conservatively changed to use COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be reconsidered, deferring the compaction only for COMPACT_COMPLETE with the new semantic.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c8f7de0b)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

__alloc_pages_direct_compact communicates potential back off by two variables:
- deferred_compaction tells us that the compaction returned COMPACT_DEFERRED
- contended_compaction is set when there is contention on the zone->lock or zone->lru_lock locks

__alloc_pages_slowpath then backs off for THP allocation requests to prevent long stalls. This is rather messy, and it would be much cleaner to return a single compact result value and hide all the nasty details in __alloc_pages_direct_compact.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(backported from commit c5d01d0d)
[cascardo: small cleanup as we backported 0a0337e0 before]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

try_to_compact_pages() can currently return COMPACT_SKIPPED even when the compaction is deferred for some zone, just because zone DMA is skipped in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED basically unusable for the page allocator as a feedback mechanism.

Make sure we distinguish those two states properly and switch their ordering in the enum. This would mean that COMPACT_SKIPPED will be returned only when all eligible zones are skipped. As a result, COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath will be more precise and we would bail out rather than reclaim.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1d4746d3)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Michal Hocko authored
BugLink: https://bugs.launchpad.net/bugs/1655842

Compaction code is doing weird dances between COMPACT_FOO -> int -> unsigned long, but there doesn't seem to be any reason for that. All functions which return/use one of those constants are not expecting any other value, so it really makes sense to define an enum for them and make it clear that no other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ea7ab982)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
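For reference, a hedged sketch of the resulting enum; the member set is abridged to the states referenced in this series, and the COMPACT_DEFERRED/COMPACT_SKIPPED ordering reflects the reordering discussed in the entry above:

/* Sketch of enum compact_result (abridged). */
enum compact_result {
	/* compaction deferred due to past failures */
	COMPACT_DEFERRED,
	/* compaction was not possible or direct reclaim was more suitable */
	COMPACT_SKIPPED,
	/* compaction should continue to another pageblock */
	COMPACT_CONTINUE,
	/* whole zone was scanned but no suitable page was formed */
	COMPACT_COMPLETE,
	/* only a partial zone walk happened before backing off */
	COMPACT_PARTIAL_SKIPPED,
	/* compaction terminated prematurely due to lock contention */
	COMPACT_CONTENDED,
	/* a suitable page was freed */
	COMPACT_PARTIAL,
};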
-
Mel Gorman authored
BugLink: https://bugs.launchpad.net/bugs/1655842

alloc_flags is a bitmask of flags but it is signed, which does not necessarily generate the best code depending on the compiler. Even without an impact, it makes more sense for this to be unsigned.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(backported from commit c603844b)
[cascardo: keep ALLOC_CPUSET at __alloc_pages_nodemask]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
-
Bryant G. Ly authored
BugLink: http://bugs.launchpad.net/bugs/1657194

If srp_transfer_data fails within ibmvscsis_write_pending, then the most likely scenario is that the client timed out the op and removed the TCE mapping. Thus it will loop forever, retrying an op that is pretty much guaranteed to keep failing. A better return code would be EIO instead of EAGAIN.

Cc: stable@vger.kernel.org
Reported-by: Steven Royer <seroyer@linux.vnet.ibm.com>
Tested-by: Steven Royer <seroyer@linux.vnet.ibm.com>
Signed-off-by: Bryant G. Ly <bgly@us.ibm.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Bryant G. Ly authored
BugLink: http://bugs.launchpad.net/bugs/1657194

Currently, dma_alloc_coherent is being called with a GFP_KERNEL flag, which allows it to sleep; since the call can happen in an interrupt context, where sleeping is not allowed, it needs to change to GFP_ATOMIC.

Cc: stable@vger.kernel.org
Tested-by: Steven Royer <seroyer@linux.vnet.ibm.com>
Reviewed-by: Michael Cyr <mikecyr@linux.vnet.ibm.com>
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
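A minimal sketch of the change; the wrapper function below is illustrative, and only the gfp argument to dma_alloc_coherent() is the point:

/* Sketch: an allocation that may run in interrupt context must not
 * sleep, so GFP_ATOMIC replaces GFP_KERNEL. */
static void *ibmvscsis_alloc_coherent(struct device *dev, size_t len,
				      dma_addr_t *dma_handle)
{
	return dma_alloc_coherent(dev, len, dma_handle, GFP_ATOMIC);
}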
-
Bryant G. Ly authored
BugLink: http://bugs.launchpad.net/bugs/1657194

The current code incorrectly calculates the max transfer length, since it assumes a 4k page size, but ppc64 kernels all run with a 64k page size.

Cc: stable@vger.kernel.org
Reported-by: Steven Royer <seroyer@linux.vnet.ibm.com>
Tested-by: Steven Royer <seroyer@linux.vnet.ibm.com>
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Guilherme G. Piccoli authored
BugLink: http://bugs.launchpad.net/bugs/1656913

Commit 54adc010 ("nvme/quirk: Add a delay before checking for adapter readiness") introduced a quirk for adapters that cannot read the bit NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters need a delay, or else the action of reading the bit NVME_CSTS_RDY could somehow corrupt the adapter's register state and it never recovers.

When this quirk was added, we checked ctrl->tagset in order to avoid quirking at probe time, supposing we would never require such a delay during probe. Well, that was too optimistic; we in fact need this quirk at probe time in some cases, like after a kexec.

In some experiments, after an abnormal shutdown of the machine (aka power cord unplug), we booted into our bootloader on Power, which is a Linux kernel, and kexec'ed into another distro. If this kexec is too quick, we end up reaching the probe of the NVMe adapter in that distro while the adapter is in a bad state (not fully initialized by our bootloader). What happens next is that nvme_wait_ready() is unable to complete, except if the quirk is enabled.

So, this patch removes the original ctrl->tagset verification in order to enable the quirk even at probe time.

Fixes: 54adc010 ("nvme/quirk: Add a delay before checking for adapter readiness")
Reported-by: Andrew Byrne <byrneadw@ie.ibm.com>
Reported-by: Jaime A. H. Gomez <jahgomez@mx1.ibm.com>
Reported-by: Zachary D. Myers <zdmyers@us.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Acked-by: Jeffrey Lien <Jeff.Lien@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit b5a10c5f)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
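The quirk is a delay in the controller disable/ready-wait path before polling NVME_CSTS_RDY. A hedged sketch of the check after the fix; before it, the condition also tested ctrl->tagset, which is NULL during probe. The helper name is illustrative:

/* Sketch of the quirk path checked before polling CSTS.RDY. */
static void nvme_quirk_delay(struct nvme_ctrl *ctrl)
{
	/* The '&& ctrl->tagset' guard was dropped so the delay also
	 * applies at probe time, e.g. right after a kexec. */
	if (ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY)
		msleep(NVME_QUIRK_DELAY_AMOUNT);
}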
-
Bjorn Helgaas authored
BugLink: http://bugs.launchpad.net/bugs/1625318

pci_update_resource() updates a hardware BAR so its address matches the kernel's struct resource UNLESS it's a disabled ROM BAR. We only update those when we enable the ROM.

It's not obvious from the code why ROM BARs should be handled specially. Apparently there are Matrox devices with defective ROM BARs that read as zero when disabled. That means that if pci_enable_rom() reads the disabled BAR, sets PCI_ROM_ADDRESS_ENABLE (without re-inserting the address), and writes it back, it would enable the ROM at address zero.

Add comments and references to explain why we can't make the code look more rational. The code changes are from 755528c8 ("Ignore disabled ROM resources at setup") and 8085ce08 ("[PATCH] Fix PCI ROM mapping").

Link: https://lkml.org/lkml/2005/8/30/138
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
(backported from commit 0b457dde)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>

Conflicts:
	drivers/pci/rom.c
-
Bjorn Helgaas authored
BugLink: http://bugs.launchpad.net/bugs/1625318 Remove the assumption that IORESOURCE_ROM_ENABLE == PCI_ROM_ADDRESS_ENABLE. PCI_ROM_ADDRESS_ENABLE is the ROM enable bit defined by the PCI spec, so if we're reading or writing a BAR register value, that's what we should use. IORESOURCE_ROM_ENABLE is a corresponding bit in struct resource flags. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> (cherry picked from commit 7a6d312b) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Bjorn Helgaas authored
BugLink: http://bugs.launchpad.net/bugs/1625318 pci_std_update_resource() only deals with standard BARs, so we don't have to worry about the complications of VF BARs in an SR-IOV capability. Compute the BAR address inline and remove pci_resource_bar(). That makes pci_iov_resource_bar() unused, so remove that as well. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> (cherry picked from commit 286c2378) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Bjorn Helgaas authored
BugLink: http://bugs.launchpad.net/bugs/1625318 If we update a VF BAR while it's enabled, there are two potential problems: 1) Any driver that's using the VF has a cached BAR value that is stale after the update, and 2) We can't update 64-bit BARs atomically, so the intermediate state (new lower dword with old upper dword) may conflict with another device, and an access by a driver unrelated to the VF may cause a bus error. Warn about attempts to update VF BARs while they are enabled. This is a programming error, so use dev_WARN() to get a backtrace. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> (cherry picked from commit 546ba9f8) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Bjorn Helgaas authored
BugLink: http://bugs.launchpad.net/bugs/1625318

Previously pci_update_resource() used the same code path for updating standard BARs and VF BARs in SR-IOV capabilities. Split the VF BAR update into a new pci_iov_update_resource() internal interface, which makes it simpler to compute the BAR address (we can get rid of pci_resource_bar() and pci_iov_resource_bar()).

This patch:
- renames pci_update_resource() to pci_std_update_resource(),
- adds pci_iov_update_resource(),
- makes pci_update_resource() a wrapper that calls the appropriate one.

No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
(cherry picked from commit 6ffa2489)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
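The resulting public entry point is a thin dispatcher keyed on the resource index. A sketch closely following the upstream split; bounds handling abridged:

/* Sketch: route BAR updates to the standard or SR-IOV variant. */
void pci_update_resource(struct pci_dev *dev, int resno)
{
	if (resno <= PCI_ROM_RESOURCE)
		pci_std_update_resource(dev, resno);
#ifdef CONFIG_PCI_IOV
	else if (resno >= PCI_IOV_RESOURCES && resno <= PCI_IOV_RESOURCE_END)
		pci_iov_update_resource(dev, resno);
#endif
}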
-
Bjorn Helgaas authored
BugLink: http://bugs.launchpad.net/bugs/1625318 The BAR property bits (0-3 for memory BARs, 0-1 for I/O BARs) are supposed to be read-only, but we do save them in res->flags and include them when updating the BAR. Mask the I/O property bits with ~PCI_BASE_ADDRESS_IO_MASK (0x3) instead of PCI_REGION_FLAG_MASK (0xf) to make it obvious that we can't corrupt bits 2-3 of I/O addresses. Use PCI_ROM_ADDRESS_MASK for ROM BARs. This means we'll only check the top 21 bits (instead of the 28 bits we used to check) of a ROM BAR to see if the update was successful. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 45d004f4) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Bjorn Helgaas authored
BugLink: http://bugs.launchpad.net/bugs/1625318 VF BARs are read-only zero, so updating VF BARs will not have any effect. See the SR-IOV spec r1.1, sec 3.4.1.11. We already ignore these updates because of 70675e0b ("PCI: Don't try to restore VF BARs"); this merely restructures it slightly to make it easier to split updates for standard and SR-IOV BARs. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> (cherry picked from commit 63880b23) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Gavin Shan authored
BugLink: http://bugs.launchpad.net/bugs/1625318

Previously we enabled VFs and enabled their memory space before calling pcibios_sriov_enable(). But pcibios_sriov_enable() may update the VF BARs: for example, on PPC PowerNV we may change them to manage the association of VFs to PEs. Because 64-bit BARs cannot be updated atomically, it's unsafe to update them while they're enabled. The half-updated state may conflict with other devices in the system.

Call pcibios_sriov_enable() before enabling the VFs so any BAR updates happen while the VF BARs are disabled.

[bhelgaas: changelog]
Tested-by: Carol Soto <clsoto@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit f40ec3c7)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
- 25 Jan, 2017 3 commits
-
-
Joerg Roedel authored
BugLink: http://bugs.launchpad.net/bugs/1649718

Using the vector stored at interrupt delivery makes the EOI matching safe against IRQ migration in the ioapic.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 4d99ba89)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Joerg Roedel authored
BugLink: http://bugs.launchpad.net/bugs/1649718

This allows backtracking later, in case the RTC IRQ has been moved to another vCPU/vector.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 9daa5007)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
-
Joerg Roedel authored
BugLink: http://bugs.launchpad.net/bugs/1649718 Currently this is a bitmap which tracks which CPUs we expect an EOI from. Move this bitmap to a struct so that we can track additional information there. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit 9e4aabe2) Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Andy Whitcroft <apw@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
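The bitmap moves into a small struct so that a per-vCPU vector can sit next to it, which is what the two follow-up patches above rely on for EOI backtracking. A hedged sketch of the combined structure after this mini-series:

/* Sketch: per-interrupt destination tracking in the KVM ioapic. */
struct dest_map {
	/* vcpu bitmap where the IRQ has been sent */
	DECLARE_BITMAP(map, KVM_MAX_VCPUS);
	/* vector sent to a given vcpu; only valid when the
	 * corresponding bit in 'map' is set */
	u8 vectors[KVM_MAX_VCPUS];
};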
-
- 20 Jan, 2017 9 commits
-
-
Greg Kroah-Hartman authored
BugLink: http://bugs.launchpad.net/bugs/1658091
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
-
Niklas Söderlund authored
BugLink: http://bugs.launchpad.net/bugs/1658091

commit 5d7400c4 upstream.

Always stating PIN_CONFIG_BIAS_DISABLE is supported gives untrue output when examining /sys/kernel/debug/pinctrl/e6060000.pfc/pinconf-pins if the operation get_bias() is implemented but the pin is not handled by the get_bias() implementation. In that case the output will state "input bias disabled", indicating that this pin has bias control support.

Make support for PIN_CONFIG_BIAS_DISABLE depend on the pin supporting either SH_PFC_PIN_CFG_PULL_UP or SH_PFC_PIN_CFG_PULL_DOWN. This also solves the issue where SoC-specific implementations print error messages if their particular implementation of {set,get}_bias() is called with a pin it does not know about.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
-
Johan Hovold authored
BugLink: http://bugs.launchpad.net/bugs/1658091 commit fe0f3168 upstream. Make sure to drop any reference taken by bus_find_device() in the sysfs callbacks that are used to create and destroy devices based on device-tree entries. Fixes: 6bccf755 ("[POWERPC] ibmebus: dynamic addition/removal of adapters, some code cleanup") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
-
Johan Hovold authored
BugLink: http://bugs.launchpad.net/bugs/1658091 commit 815a7141 upstream. Make sure to drop any reference taken by bus_find_device() when creating devices during init and driver registration. Fixes: 55347cc9 ("[POWERPC] ibmebus: Add device creation and bus probing based on of_device") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
-
Johan Hovold authored
BugLink: http://bugs.launchpad.net/bugs/1658091 commit c090959b upstream. Make sure to drop the reference to the parent device taken by class_find_device() after populating the bus. Fixes: 3b9334ac ("mfd: vexpress: Convert custom func API to regmap") Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
-
Gabriel Krisman Bertazi authored
BugLink: http://bugs.launchpad.net/bugs/1658091 commit c02ebfdd upstream. Commit 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU") attempts to avoid triggering the WARN_ON in __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the last batch execution before round robin, blk_mq_hctx_next_cpu can schedule a dead CPU and also update next_cpu to the next alive CPU in the mask, which will trigger the WARN_ON despite the previous workaround. The following patch fixes this scenario by always scheduling the value in hctx->next_cpu. This changes the moment when we round-robin the CPU running the hctx, but it really doesn't matter, since it still executes BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU. Fixes: 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU") Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
-
Prarit Bhargava authored
BugLink: http://bugs.launchpad.net/bugs/1658091

commit a545715d upstream.

When removing and adding cpu 0 on a system with GHES NMI, the following stack trace is seen when re-adding the cpu:

WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:1349 setup_local_APIC+
Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache coretemp intel_ra
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc6+ #2
Call Trace:
 dump_stack+0x63/0x8e
 __warn+0xd1/0xf0
 warn_slowpath_null+0x1d/0x20
 setup_local_APIC+0x275/0x370
 apic_ap_setup+0xe/0x20
 start_secondary+0x48/0x180
 set_init_arg+0x55/0x55
 early_idt_handler_array+0x120/0x120
 x86_64_start_reservations+0x2a/0x2c
 x86_64_start_kernel+0x13d/0x14c

During the cpu bringup, wakeup_cpu_via_init_nmi() is called and issues an NMI on CPU 0. The GHES NMI handler, ghes_notify_nmi(), runs the ghes_proc_irq_work work queue, which ends up setting IRQ_WORK_VECTOR (0xf6). The "faulty" IR line set at arch/x86/kernel/apic/apic.c:1349 is also 0xf6 (specifically APIC IRR for irqs 255 to 224 is 0x400000), which confirms that something has set the IRQ_WORK_VECTOR line prior to the APIC being initialized.

Commit 2383844d ("GHES: Elliminate double-loop in the NMI handler") incorrectly modified the behavior such that the handler returns NMI_HANDLED only if an error was processed, and incorrectly runs the ghes work queue for every NMI.

This patch modifies ghes_proc_irq_work() to run as it did prior to 2383844d ("GHES: Elliminate double-loop in the NMI handler") by properly returning NMI_HANDLED and only calling the work queue if NMI_HANDLED has been set.

Fixes: 2383844d (GHES: Elliminate double-loop in the NMI handler)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
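The corrected control flow returns NMI_HANDLED (and queues the irq_work) only when an error source actually fired. A hedged sketch with the source iteration and severity handling abridged; this is not the verbatim handler:

/* Sketch: only claim the NMI and schedule work when an error was seen. */
static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
{
	struct ghes *ghes;
	int ret = NMI_DONE;

	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
		if (ghes_read_estatus(ghes, 1))
			continue;	/* no error record pending */
		ret = NMI_HANDLED;	/* this source fired */
		/* ... severity handling, panic on fatal errors ... */
	}

	if (ret == NMI_HANDLED)
		irq_work_queue(&ghes_proc_irq_work);

	return ret;
}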
-
Tejun Heo authored
BugLink: http://bugs.launchpad.net/bugs/1658091

commit ebc4ff66 upstream.

cfq_cpd_alloc(), which is the cpd_alloc_fn implementation for cfq, was incorrectly hard-coding GFP_KERNEL instead of using the mask specified through the @gfp parameter. This currently doesn't cause any actual issues because all current callers specify GFP_KERNEL. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e4a9bde9 ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
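The fix is a one-liner: honor the caller-supplied mask. A sketch of the corrected allocator, following the cpd_alloc_fn signature; the error-path details are abridged:

/* Sketch: cpd_alloc_fn must allocate with the mask it was given. */
static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
{
	struct cfq_group_data *cgd;

	cgd = kzalloc(sizeof(*cgd), gfp);	/* was: GFP_KERNEL */
	if (!cgd)
		return NULL;
	return &cgd->cpd;
}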
-
Denis Kirjanov authored
BugLink: http://bugs.launchpad.net/bugs/1658091

commit 8a10c06a upstream.

With preemption turned on we can read an incorrect throttling state while being switched to a CPU on a different chip.

BUG: using smp_processor_id() in preemptible [00000000] code: cat/7343
caller is .powernv_cpufreq_throttle_check+0x2c/0x710
CPU: 13 PID: 7343 Comm: cat Not tainted 4.8.0-rc5-dirty #1
Call Trace:
[c0000007d25b75b0] [c000000000971378] .dump_stack+0xe4/0x150 (unreliable)
[c0000007d25b7640] [c0000000005162e4] .check_preemption_disabled+0x134/0x150
[c0000007d25b76e0] [c0000000007b63ac] .powernv_cpufreq_throttle_check+0x2c/0x710
[c0000007d25b7790] [c0000000007b6d18] .powernv_cpufreq_target_index+0x288/0x360
[c0000007d25b7870] [c0000000007acee4] .__cpufreq_driver_target+0x394/0x8c0
[c0000007d25b7920] [c0000000007b22ac] .cpufreq_set+0x7c/0xd0
[c0000007d25b79b0] [c0000000007adf50] .store_scaling_setspeed+0x80/0xc0
[c0000007d25b7a40] [c0000000007ae270] .store+0xa0/0x100
[c0000007d25b7ae0] [c0000000003566e8] .sysfs_kf_write+0x88/0xb0
[c0000007d25b7b70] [c0000000003553b8] .kernfs_fop_write+0x178/0x260
[c0000007d25b7c10] [c0000000002ac3cc] .__vfs_write+0x3c/0x1c0
[c0000007d25b7cf0] [c0000000002ad584] .vfs_write+0xc4/0x230
[c0000007d25b7d90] [c0000000002aeef8] .SyS_write+0x58/0x100
[c0000007d25b7e30] [c00000000000bfec] system_call+0x38/0xfc

Fixes: 09a972d1 (cpufreq: powernv: Report cpu frequency throttling)
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
-