Commits · 441a92de847fcc2550a72898f4a6b554fe825514 · nexedi / linux

10 Jun, 2016 40 commits

mmc: longer timeout for long read time quirk · 441a92de

Matt Gumbel authored May 20, 2016

commit 32ecd320 upstream.

008GE0 Toshiba mmc in some Intel Baytrail tablets responds to
MMC_SEND_EXT_CSD in 450-600ms.

This patch will...

() Increase the long read time quirk timeout from 300ms to 600ms. Original
   author of that quirk says 300ms was only a guess and that the number
   may need to be raised in the future.

() Add this specific MMC to the quirk
Signed-off-by: Matt Gumbel <matthew.k.gumbel@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

441a92de

drm/i915: Don't leave old junk in ilk active watermarks on readout · 6ebe723a

Ville Syrjälä authored May 13, 2016

commit 7045c368 upstream.

When we read out the watermark state from the hardware we're supposed to
transfer that into the active watermarks, but currently we fail to any
part of the active watermarks that isn't explicitly written. Let's clear
it all upfront.

Looks like this has been like this since the beginning, when I added the
readout. No idea why I didn't clear it up.

Cc: Matt Roper <matthew.d.roper@intel.com>
Fixes: 243e6a44 ("drm/i915: Init HSW watermark tracking in intel_modeset_setup_hw_state()")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1463151318-14719-2-git-send-email-ville.syrjala@linux.intel.com
(cherry picked from commit 15606534)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

6ebe723a

locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait() · a6f015aa

Peter Zijlstra authored May 20, 2016

commit 54cf809b upstream.

Similar to commits:

  51d7d520 ("powerpc: Add smp_mb() to arch_spin_is_locked()")
  d86b8da0 ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")

qspinlock suffers from the fact that the _Q_LOCKED_VAL store is
unordered inside the ACQUIRE of the lock.

And while this is not a problem for the regular mutual exclusive
critical section usage of spinlocks, it breaks creative locking like:

	spin_lock(A)			spin_lock(B)
	spin_unlock_wait(B)		if (!spin_is_locked(A))
	do_something()			  do_something()

In that both CPUs can end up running do_something at the same time,
because our _Q_LOCKED_VAL store can drop past the spin_unlock_wait()
spin_is_locked() loads (even on x86!!).

To avoid making the normal case slower, add smp_mb()s to the less used
spin_unlock_wait() / spin_is_locked() side of things to avoid this
problem.
Reported-and-tested-by: Davidlohr Bueso <dave@stgolabs.net>
Reported-by: Giovanni Gherdovich <ggherdovich@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

a6f015aa

mm: use phys_addr_t for reserve_bootmem_region() arguments · aca45bc7

Stefan Bader authored May 20, 2016

commit 4b50bcc7 upstream.

Since commit 92923ca3 ("mm: meminit: only set page reserved in the
memblock region") the reserved bit is set on reserved memblock regions.
However start and end address are passed as unsigned long.  This is only
32bit on i386, so it can end up marking the wrong pages reserved for
ranges at 4GB and above.

This was observed on a 32bit Xen dom0 which was booted with initial
memory set to a value below 4G but allowing to balloon in memory
(dom0_mem=1024M for example).  This would define a reserved bootmem
region for the additional memory (for example on a 8GB system there was
a reverved region covering the 4GB-8GB range).  But since the addresses
were passed on as unsigned long, this was actually marking all pages
from 0 to 4GB as reserved.

Fixes: 92923ca3 ("mm: meminit: only set page reserved in the memblock region")
Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.comSigned-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

aca45bc7

PM / sleep: Handle failures in device_suspend_late() consistently · 8fb9c1b6

Rafael J. Wysocki authored May 20, 2016

commit 3a17fb32 upstream.

Grygorii Strashko reports:

 The PM runtime will be left disabled for the device if its
 .suspend_late() callback fails and async suspend is not allowed
 for this device. In this case device will not be added in
 dpm_late_early_list and dpm_resume_early() will ignore this
 device, as result PM runtime will be disabled for it forever
 (side effect: after 8 subsequent failures for the same device
 the PM runtime will be reenabled due to disable_depth overflow).

To fix this problem, add devices to dpm_late_early_list regardless
of whether or not device_suspend_late() returns errors for them.

That will ensure failures in there to be handled consistently for
all devices regardless of their async suspend/resume status.
Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

8fb9c1b6

Input: uinput - handle compat ioctl for UI_SET_PHYS · f958317e

Ricky Liang authored May 20, 2016

commit affa80bd upstream.

When running a 32-bit userspace on a 64-bit kernel, the UI_SET_PHYS
ioctl needs to be treated with special care, as it has the pointer
size encoded in the command.
Signed-off-by: Ricky Liang <jcliang@chromium.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

f958317e

kvm: arm64: Fix EC field in inject_abt64 · 67f12f82

Matt Evans authored May 16, 2016

commit e4fe9e7d upstream.

The EC field of the constructed ESR is conditionally modified by ORing in
ESR_ELx_EC_DABT_LOW for a data abort.  However, ESR_ELx_EC_SHIFT is missing
from this condition.
Signed-off-by: Matt Evans <matt.evans@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

67f12f82

ALSA: hda - Fix headphone noise on Dell XPS 13 9360 · 4ecb0e5f

Kai-Heng Feng authored May 20, 2016

commit 423cd785 upstream.

The headphone has noise when playing sound or switching microphone sources.
It uses the same codec on XPS 13 9350, but with different subsystem ID.
Applying the fixup can solve the issue.
Also, changing the model name to better differentiate models.

v2: Reorder by device ID.
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

4ecb0e5f

cifs: Create dedicated keyring for spnego operations · e58a0f5d

Sachin Prabhu authored May 17, 2016

commit b74cb9a8 upstream.

The session key is the default keyring set for request_key operations.
This session key is revoked when the user owning the session logs out.
Any long running daemon processes started by this session ends up with
revoked session keyring which prevents these processes from using the
request_key mechanism from obtaining the krb5 keys.

The problem has been reported by a large number of autofs users. The
problem is also seen with multiuser mounts where the share may be used
by processes run by a user who has since logged out. A reproducer using
automount is available on the Red Hat bz.

The patch creates a new keyring which is used to cache cifs spnego
upcalls.

Red Hat bz: 1267754
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
[ kamal: backport to 4.2-stable: keyring_alloc takes no restrict_link param ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

e58a0f5d

ASoC: ak4642: Enable cache usage to fix crashes on resume · 32e56a92

Mark Brown authored May 18, 2016

commit d3030d11 upstream.

The ak4642 driver is using a regmap cache sync to restore the
configuration of the chip on resume but (as Peter observed) does not
actually define a register cache which means that the resume is never
going to work and we trigger asserts in regmap.  Fix this by enabling
caching.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Mark Brown <broonie@kernel.org>
[ kamal: backport to 4.2-stable: no separate ak4643_regmap ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

32e56a92

KVM: MTRR: remove MSR 0x2f8 · 62ab3693

Andy Honig authored May 17, 2016

commit 9842df62 upstream.

MSR 0x2f8 accessed the 124th Variable Range MTRR ever since MTRR support
was introduced by 9ba075a6 ("KVM: MTRR support").

0x2f8 became harmful when 910a6aae ("KVM: MTRR: exactly define the
size of variable MTRRs") shrinked the array of VR MTRRs from 256 to 8,
which made access to index 124 out of bounds.  The surrounding code only
WARNs in this situation, thus the guest gained a limited read/write
access to struct kvm_arch_vcpu.

0x2f8 is not a valid VR MTRR MSR, because KVM has/advertises only 16 VR
MTRR MSRs, 0x200-0x20f.  Every VR MTRR is set up using two MSRs, 0x2f8
was treated as a PHYSBASE and 0x2f9 would be its PHYSMASK, but 0x2f9 was
not implemented in KVM, therefore 0x2f8 could never do anything useful
and getting rid of it is safe.

This fixes CVE-2016-3713.

Fixes: 910a6aae ("KVM: MTRR: exactly define the size of variable MTRRs")
Reported-by: David Matlack <dmatlack@google.com>
Signed-off-by: Andy Honig <ahonig@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

62ab3693

xfs: skip stale inodes in xfs_iflush_cluster · 0cb2d772

Dave Chinner authored May 18, 2016

commit 7d3aa7fe upstream.

We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

0cb2d772

xfs: fix inode validity check in xfs_iflush_cluster · 87994ff4

Dave Chinner authored May 18, 2016

commit 51b07f30 upstream.

Some careless idiot(*) wrote crap code in commit 1a3e8f3d ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.

(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

87994ff4

xfs: xfs_iflush_cluster fails to abort on error · 9b63d6af

Dave Chinner authored May 18, 2016

commit b1438f47 upstream.

When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.

Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.
Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

9b63d6af

cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter() · 3f53052e

Daniel Lezcano authored May 17, 2016

commit e7387da5 upstream.

Commit 0b89e9aa (cpuidle: delay enabling interrupts until all
coupled CPUs leave idle) rightfully fixed a regression by letting
the coupled idle state framework to handle local interrupt enabling
when the CPU is exiting an idle state.

The current code checks if the idle state is coupled and, if so, it
will let the coupled code to enable interrupts. This way, it can
decrement the ready-count before handling the interrupt. This
mechanism prevents the other CPUs from waiting for a CPU which is
handling interrupts.

But the check is done against the state index returned by the back
end driver's ->enter functions which could be different from the
initial index passed as parameter to the cpuidle_enter_state()
function.

 entered_state = target_state->enter(dev, drv, index);

 [ ... ]

 if (!cpuidle_state_is_coupled(drv, entered_state))
	local_irq_enable();

 [ ... ]

If the 'index' is referring to a coupled idle state but the
'entered_state' is *not* coupled, then the interrupts are enabled
again. All CPUs blocked on the sync barrier may busy loop longer
if the CPU has interrupts to handle before decrementing the
ready-count. That's consuming more energy than saving.

Fixes: 0b89e9aa (cpuidle: delay enabling interrupts until all coupled CPUs leave idle)
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ kamal: backport to 4.2-stable: context ]
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

3f53052e

remove directory incorrectly tries to set delete on close on non-empty directories · b2d06481

Steve French authored May 12, 2016

commit 897fba11 upstream.

Wrong return code was being returned on SMB3 rmdir of
non-empty directory.

For SMB3 (unlike for cifs), we attempt to delete a directory by
set of delete on close flag on the open. Windows clients set
this flag via a set info (SET_FILE_DISPOSITION to set this flag)
which properly checks if the directory is empty.

With this patch on smb3 mounts we correctly return
 "DIRECTORY NOT EMPTY"
on attempts to remove a non-empty directory.
Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

b2d06481

fs/cifs: correctly to anonymous authentication for the NTLM(v2) authentication · 03161ff5

Stefan Metzmacher authored May 03, 2016

commit 1a967d6c upstream.

Only server which map unknown users to guest will allow
access using a non-null NTLMv2_Response.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

03161ff5

fs/cifs: correctly to anonymous authentication for the NTLM(v1) authentication · 0430e786

Stefan Metzmacher authored May 03, 2016

commit 777f69b8 upstream.

Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

0430e786

fs/cifs: correctly to anonymous authentication for the LANMAN authentication · b1b5741b

Stefan Metzmacher authored May 03, 2016

commit fa8f3a35 upstream.

Only server which map unknown users to guest will allow
access using a non-null LMChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

b1b5741b

fs/cifs: correctly to anonymous authentication via NTLMSSP · 051868ea

Stefan Metzmacher authored May 03, 2016

commit cfda35d9 upstream.

See [MS-NLMP] 3.2.5.1.2 Server Receives an AUTHENTICATE_MESSAGE from the Client:

   ...
   Set NullSession to FALSE
   If (AUTHENTICATE_MESSAGE.UserNameLen == 0 AND
      AUTHENTICATE_MESSAGE.NtChallengeResponse.Length == 0 AND
      (AUTHENTICATE_MESSAGE.LmChallengeResponse == Z(1)
       OR
       AUTHENTICATE_MESSAGE.LmChallengeResponse.Length == 0))
       -- Special case: client requested anonymous authentication
       Set NullSession to TRUE
   ...

Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

051868ea

drm/fb_helper: Fix references to dev->mode_config.num_connector · a31acb15

Lyude authored May 12, 2016

commit 255f0e7c upstream.

During boot, MST hotplugs are generally expected (even if no physical
hotplugging occurs) and result in DRM's connector topology changing.
This means that using num_connector from the current mode configuration
can lead to the number of connectors changing under us. This can lead to
some nasty scenarios in fbcon:

- We allocate an array to the size of dev->mode_config.num_connectors.
- MST hotplug occurs, dev->mode_config.num_connectors gets incremented.
- We try to loop through each element in the array using the new value
  of dev->mode_config.num_connectors, and end up going out of bounds
  since dev->mode_config.num_connectors is now larger then the array we
  allocated.

fb_helper->connector_count however, will always remain consistent while
we do a modeset in fb_helper.

Note: This is just polish for 4.7, Dave Airlie's drm_connector
refcounting fixed these bugs for real. But it's good enough duct-tape
for stable kernel backporting, since backporting the refcounting
changes is way too invasive.
Signed-off-by: Lyude <cpaul@redhat.com>
[danvet: Clarify why we need this. Also remove the now unused "dev"
local variable to appease gcc.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/1463065021-18280-3-git-send-email-cpaul@redhat.comSigned-off-by: Kamal Mostafa <kamal@canonical.com>

a31acb15

drm/i915/fbdev: Fix num_connector references in intel_fb_initial_config() · 29b35727

Lyude authored May 12, 2016

commit 14a3842a upstream.

During boot time, MST devices usually send a ton of hotplug events
irregardless of whether or not any physical hotplugs actually occurred.
Hotplugs mean connectors being created/destroyed, and the number of DRM
connectors changing under us. This isn't a problem if we use
fb_helper->connector_count since we only set it once in the code,
however if we use num_connector from struct drm_mode_config we risk it's
value changing under us. On top of that, there's even a chance that
dev->mode_config.num_connector != fb_helper->connector_count. If the
number of connectors happens to increase under us, we'll end up using
the wrong array size for memcpy and start writing beyond the actual
length of the array, occasionally resulting in kernel panics.

Note: This is just polish for 4.7, Dave Airlie's drm_connector
refcounting fixed these bugs for real. But it's good enough duct-tape
for stable kernel backporting, since backporting the refcounting
changes is way too invasive.
Signed-off-by: Lyude <cpaul@redhat.com>
[danvet: Clarify why we need this.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/1463065021-18280-2-git-send-email-cpaul@redhat.comSigned-off-by: Kamal Mostafa <kamal@canonical.com>

29b35727

MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC · 8a51e226

Maciej W. Rozycki authored May 17, 2016

commit e49d3848 upstream.

Fix a build regression from commit c9017757 ("MIPS: init upper 64b
of vector registers when MSA is first used"):

arch/mips/built-in.o: In function `enable_restore_fp_context':
traps.c:(.text+0xbb90): undefined reference to `_init_msa_upper'
traps.c:(.text+0xbb90): relocation truncated to fit: R_MIPS_26 against `_init_msa_upper'
traps.c:(.text+0xbef0): undefined reference to `_init_msa_upper'
traps.c:(.text+0xbef0): relocation truncated to fit: R_MIPS_26 against `_init_msa_upper'

to !CONFIG_CPU_HAS_MSA configurations with older GCC versions, which are
unable to figure out that calls to `_init_msa_upper' are indeed dead.
Of the many ways to tackle this failure choose the approach we have
already taken in `thread_msa_context_live'.

[ralf@linux-mips.org: Drop patch segment to junk file.]
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13271/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

8a51e226

PCI: Disable all BAR sizing for devices with non-compliant BARs · bcdc7d4a

Prarit Bhargava authored May 11, 2016

commit ad67b437 upstream.

b84106b4 ("PCI: Disable IO/MEM decoding for devices with non-compliant
BARs") disabled BAR sizing for BARs 0-5 of devices that don't comply with
the PCI spec.  But it didn't do anything for expansion ROM BARs, so we
still try to size them, resulting in warnings like this on Broadwell-EP:

  pci 0000:ff:12.0: BAR 6: failed to assign [mem size 0x00000001 pref]

Move the non-compliant BAR check from __pci_read_base() up to
pci_read_bases() so it applies to the expansion ROM BAR as well as
to BARs 0-5.

Note that direct callers of __pci_read_base(), like sriov_init(), will now
bypass this check.  We haven't had reports of devices with broken SR-IOV
BARs yet.

[bhelgaas: changelog]
Fixes: b84106b4 ("PCI: Disable IO/MEM decoding for devices with non-compliant BARs")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

bcdc7d4a

mmc: mmc: Fix partition switch timeout for some eMMCs · 05daa23f

Adrian Hunter authored May 05, 2016

commit 1c447116 upstream.

Some eMMCs set the partition switch timeout too low.

Now typically eMMCs are considered a critical component (e.g. because
they store the root file system) and consequently are expected to be
reliable.  Thus we can neglect the use case where eMMCs can't switch
reliably and we might want a lower timeout to facilitate speedy
recovery.

Although we could employ a quirk for the cards that are affected (if
we could identify them all), as described above, there is little
benefit to having a low timeout, so instead simply set a minimum
timeout.

The minimum is set to 300ms somewhat arbitrarily - the examples that
have been seen had a timeout of 10ms but were sometimes taking 60-70ms.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

05daa23f

ring-buffer: Prevent overflow of size in ring_buffer_resize() · 5be13df6

Steven Rostedt (Red Hat) authored May 13, 2016

commit 59643d15 upstream.

If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
then the DIV_ROUND_UP() will return zero.

Here's the details:

  # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb

tracing_entries_write() processes this and converts kb to bytes.

 18014398509481980 << 10 = 18446744073709547520

and this is passed to ring_buffer_resize() as unsigned long size.

 size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);

Where DIV_ROUND_UP(a, b) is (a + b - 1)/b

BUF_PAGE_SIZE is 4080 and here

 18446744073709547520 + 4080 - 1 = 18446744073709551599

where 18446744073709551599 is still smaller than 2^64

 2^64 - 18446744073709551599 = 17

But now 18446744073709551599 / 4080 = 4521260802379792

and size = size * 4080 = 18446744073709551360

This is checked to make sure its still greater than 2 * 4080,
which it is.

Then we convert to the number of buffer pages needed.

 nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)

but this time size is 18446744073709551360 and

 2^64 - (18446744073709551360 + 4080 - 1) = -3823

Thus it overflows and the resulting number is less than 4080, which makes

  3823 / 4080 = 0

an nr_pages is set to this. As we already checked against the minimum that
nr_pages may be, this causes the logic to fail as well, and we crash the
kernel.

There's no reason to have the two DIV_ROUND_UP() (that's just result of
historical code changes), clean up the code and fix this bug.

Fixes: 83f40318 ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

5be13df6

ring-buffer: Use long for nr_pages to avoid overflow failures · de2c7afc

Steven Rostedt (Red Hat) authored May 12, 2016

commit 9b94a8fb upstream.

The size variable to change the ring buffer in ftrace is a long. The
nr_pages used to update the ring buffer based on the size is int. On 64 bit
machines this can cause an overflow problem.

For example, the following will cause the ring buffer to crash:

 # cd /sys/kernel/debug/tracing
 # echo 10 > buffer_size_kb
 # echo 8556384240 > buffer_size_kb

Then you get the warning of:

 WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260

Which is:

  RB_WARN_ON(cpu_buffer, nr_removed);

Note each ring buffer page holds 4080 bytes.

This is because:

 1) 10 causes the ring buffer to have 3 pages.
    (10kb requires 3 * 4080 pages to hold)

 2) (2^31 / 2^10  + 1) * 4080 = 8556384240
    The value written into buffer_size_kb is shifted by 10 and then passed
    to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760

 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
    which is 4080. 8761737461760 / 4080 = 2147484672

 4) nr_pages is subtracted from the current nr_pages (3) and we get:
    2147484669. This value is saved in a signed integer nr_pages_to_update

 5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int
    turns into the value of -2147482627

 6) As the value is a negative number, in update_pages_handler() it is
    negated and passed to rb_remove_pages() and 2147482627 pages will
    be removed, which is much larger than 3 and it causes the warning
    because not all the pages asked to be removed were removed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001

Fixes: 7a8e76a3 ("tracing: unified trace buffer")
Reported-by: Hao Qin <QEver.cn@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

de2c7afc

MIPS: Force CPUs to lose FP context during mode switches · dd08b329

Paul Burton authored Apr 21, 2016

commit 6b832257 upstream.

Commit 9791554b ("MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options
for MIPS") added support for the PR_SET_FP_MODE prctl, which allows a
userland program to modify its FP mode at runtime. This is most notably
required if dynamic linking leads to the FP mode requirement changing at
runtime from that indicated in the initial executable's ELF header. In
order to avoid overhead in the general FP context restore code, it aimed
to have threads in the process become unable to enable the FPU during a
mode switch & have the thread calling the prctl syscall wait for all
other threads in the process to be context switched at least once. Once
that happens we can know that no thread in the process whose mode will
be switched has live FP context, and it's safe to perform the mode
switch. However in the (rare) case of modeswitches occurring in
multithreaded programs this can lead to indeterminate delays for the
thread invoking the prctl syscall, and the code monitoring for those
context switches was woefully inadequate for all but the simplest cases.

Fix this by broadcasting an IPI if other CPUs may have live FP context
for an affected thread, with a handler causing those CPUs to relinquish
their FPU ownership. Threads will then be allowed to continue running
but will stall on the wait_on_atomic_t in enable_restore_fp_context if
they attempt to use FP again whilst the mode switch is still in
progress. The end result is less fragile poking at scheduler context
switch counts & a more expedient completion of the mode switch.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: 9791554b ("MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options for MIPS")
Reviewed-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13145/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

dd08b329

MIPS: Disable preemption during prctl(PR_SET_FP_MODE, ...) · 0b1fe3dc

Paul Burton authored Apr 21, 2016

commit bd239f1e upstream.

Whilst a PR_SET_FP_MODE prctl is performed there are decisions made
based upon whether the task is executing on the current CPU. This may
change if we're preempted, so disable preemption to avoid such changes
for the lifetime of the mode switch.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: 9791554b ("MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options for MIPS")
Reviewed-by: Maciej W. Rozycki <macro@imgtec.com>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13144/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

0b1fe3dc

MIPS: ptrace: Prevent writes to read-only FCSR bits · 79b09819

Maciej W. Rozycki authored May 12, 2016

commit abf378be upstream.

Correct the cases missed with commit 9b26616c ("MIPS: Respect the
ISA level in FCSR handling") and prevent writes to read-only FCSR bits
there.

This in particular applies to FP context initialisation where any IEEE
754-2008 bits preset by `mips_set_personality_nan' are cleared before
the relevant ptrace(2) call takes effect and the PTRACE_POKEUSR request
addressing FPC_CSR where no masking of read-only FCSR bits is done.

Remove the FCSR clearing from FP context initialisation then and unify
PTRACE_POKEUSR/FPC_CSR and PTRACE_SETFPREGS handling, by factoring out
code from `ptrace_setfpregs' and calling it from both places.

This mostly matters to soft float configurations where the emulator can
be switched this way to a mode which should not be accessible and cannot
be set with the CTC1 instruction.  With hard float configurations any
effect is transient anyway as read-only bits will retain their values at
the time the FP context is restored.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13239/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

79b09819

MIPS: ptrace: Fix FP context restoration FCSR regression · 0df22462

Maciej W. Rozycki authored May 12, 2016

commit 42495484 upstream.

Fix a floating-point context restoration regression introduced with
commit 9b26616c ("MIPS: Respect the ISA level in FCSR handling")
that causes a Floating Point exception and consequently a kernel oops
with hard float configurations when one or more FCSR Enable and their
corresponding Cause bits are set both at a time via a ptrace(2) call.

To do so reinstate Cause bit masking originally introduced with commit
b1442d39 ("MIPS: Prevent user from setting FCSR cause bits") to
address this exact problem and then inadvertently removed from the
PTRACE_SETFPREGS request with the commit referred above.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13238/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

0df22462

MIPS: math-emu: Fix jalr emulation when rd == $0 · 4051f5a8

Paul Burton authored Apr 21, 2016

commit ab4a92e6 upstream.

When emulating a jalr instruction with rd == $0, the code in
isBranchInstr was incorrectly writing to GPR $0 which should actually
always remain zeroed. This would lead to any further instructions
emulated which use $0 operating on a bogus value until the task is next
context switched, at which point the value of $0 in the task context
would be restored to the correct zero by a store in SAVE_SOME. Fix this
by not writing to rd if it is $0.

Fixes: 102cedc3 ("MIPS: microMIPS: Floating point support.")
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Maciej W. Rozycki <macro@imgtec.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13160/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

4051f5a8

MIPS: Fix uapi include in exported asm/siginfo.h · 8f97af21

James Hogan authored Feb 08, 2016

commit 987e5b83 upstream.

Since commit 8cb48fe1 ("MIPS: Provide correct siginfo_t.si_stime"),
MIPS' uapi/asm/siginfo.h has included uapi/asm-generic/siginfo.h
directly before defining MIPS' struct siginfo, in order to get the
necessary definitions needed for the siginfo struct without the generic
copy_siginfo() hitting compiler errors due to struct siginfo not yet
being defined.

Now that the generic copy_siginfo() is moved out to linux/signal.h we
can safely include asm-generic/siginfo.h before defining the MIPS
specific struct siginfo, which avoids the uapi/ include as well as
breakage due to generic copy_siginfo() being defined before struct
siginfo.
Reported-by: Christopher Ferris <cferris@google.com>
Fixes: 8cb48fe1 ("MIPS: Provide correct siginfo_t.si_stime")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Petr Malat <oss@malat.biz>
Cc: linux-mips@linux-mips.org
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

8f97af21

SIGNAL: Move generic copy_siginfo() to signal.h · 009bf182

James Hogan authored Feb 08, 2016

commit ca9eb49a upstream.

The generic copy_siginfo() is currently defined in
asm-generic/siginfo.h, after including uapi/asm-generic/siginfo.h which
defines the generic struct siginfo. However this makes it awkward for an
architecture to use it if it has to define its own struct siginfo (e.g.
MIPS and potentially IA64), since it means that asm-generic/siginfo.h
can only be included after defining the arch-specific siginfo, which may
be problematic if the arch-specific definition needs definitions from
uapi/asm-generic/siginfo.h.

It is possible to work around this by first including
uapi/asm-generic/siginfo.h to get the constants before defining the
arch-specific siginfo, and include asm-generic/siginfo.h after. However
uapi headers can't be included by other uapi headers, so that first
include has to be in an ifdef __kernel__, with the non __kernel__ case
including the non-UAPI header instead.

Instead of that mess, move the generic copy_siginfo() definition into
linux/signal.h, which allows an arch-specific uapi/asm/siginfo.h to
include asm-generic/siginfo.h and define the arch-specific siginfo, and
for the generic copy_siginfo() to see that arch-specific definition.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Petr Malat <oss@malat.biz>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Christopher Ferris <cferris@google.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12478/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

009bf182

MIPS: Sync icache & dcache in set_pte_at · c1bab318

Paul Burton authored Mar 01, 2016

commit 37d22a0d upstream.

It's possible for pages to become visible prior to update_mmu_cache
running if a thread within the same address space preempts the current
thread or runs simultaneously on another CPU. That is, the following
scenario is possible:

    CPU0                            CPU1

    write to page
    flush_dcache_page
    flush_icache_page
    set_pte_at
                                    map page
    update_mmu_cache

If CPU1 maps the page in between CPU0's set_pte_at, which marks it valid
& visible, and update_mmu_cache where the dcache flush occurs then CPU1s
icache will fill from stale data (unless it fills from the dcache, in
which case all is good, but most MIPS CPUs don't have this property).
Commit 4d46a67a ("MIPS: Fix race condition in lazy cache flushing.")
attempted to fix that by performing the dcache flush in
flush_icache_page such that it occurs before the set_pte_at call makes
the page visible. However it has the problem that not all code that
writes to pages exposed to userland call flush_icache_page. There are
many callers of set_pte_at under mm/ and only 2 of them do call
flush_icache_page. Thus the race window between a page becoming visible
& being coherent between the icache & dcache remains open in some cases.

To illustrate some of the cases, a WARN was added to __update_cache with
this patch applied that triggered in cases where a page about to be
flushed from the dcache was not the last page provided to
flush_icache_page. That is, backtraces were obtained for cases in which
the race window is left open without this patch. The 2 standout examples
follow.

When forking a process:

[   15.271842] [<80417630>] __update_cache+0xcc/0x188
[   15.277274] [<80530394>] copy_page_range+0x56c/0x6ac
[   15.282861] [<8042936c>] copy_process.part.54+0xd40/0x17ac
[   15.289028] [<80429f80>] do_fork+0xe4/0x420
[   15.293747] [<80413808>] handle_sys+0x128/0x14c

When exec'ing an ELF binary:

[   14.445964] [<80417630>] __update_cache+0xcc/0x188
[   14.451369] [<80538d88>] move_page_tables+0x414/0x498
[   14.457075] [<8055d848>] setup_arg_pages+0x220/0x318
[   14.462685] [<805b0f38>] load_elf_binary+0x530/0x12a0
[   14.468374] [<8055ec3c>] search_binary_handler+0xbc/0x214
[   14.474444] [<8055f6c0>] do_execveat_common+0x43c/0x67c
[   14.480324] [<8055f938>] do_execve+0x38/0x44
[   14.485137] [<80413808>] handle_sys+0x128/0x14c

These code paths write into a page, call flush_dcache_page then call
set_pte_at without flush_icache_page inbetween. The end result is that
the icache can become corrupted & userland processes may execute
unexpected or invalid code, typically resulting in a reserved
instruction exception, a trap or a segfault.

Fix this race condition fully by performing any cache maintenance
required to keep the icache & dcache in sync in set_pte_at, before the
page is made valid. This has the added bonus of ensuring the cache
maintenance always happens in one location, rather than being duplicated
in flush_icache_page & update_mmu_cache. It also matches the way other
architectures solve the same problem (see arm, ia64 & powerpc).
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Reported-by: Ionela Voinescu <ionela.voinescu@imgtec.com>
Cc: Lars Persson <lars.persson@axis.com>
Fixes: 4d46a67a ("MIPS: Fix race condition in lazy cache flushing.")
Cc: Steven J. Hill <sjhill@realitydiluted.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Huacai Chen <chenhc@lemote.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12722/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

c1bab318

MIPS: Handle highmem pages in __update_cache · 3ef36d5c

Paul Burton authored Mar 01, 2016

commit f4281bba upstream.

The following patch will expose __update_cache to highmem pages. Handle
them by mapping them in for the duration of the cache maintenance, just
like in __flush_dcache_page. The code for that isn't shared because we
need the page address in __update_cache so sharing became messy. Given
that the entirity is an extra 5 lines, just duplicate it.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12721/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

3ef36d5c

Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" · 6a78d8e9

Guilherme G. Piccoli authored Apr 11, 2016

commit c2078d9e upstream.

This reverts commit 89a51df5.

The function eeh_add_device_early() is used to perform EEH
initialization in devices added later on the system, like in
hotplug/DLPAR scenarios. Since the commit 89a51df5 ("powerpc/eeh:
Fix crash in eeh_add_device_early() on Cell") a new check was introduced
in this function - Cell has no EEH capabilities which led to kernel oops
if hotplug was performed, so checking for eeh_enabled() was introduced
to avoid the issue.

However, in architectures that EEH is present like pSeries or PowerNV,
we might reach a case in which no PCI devices are present on boot time
and so EEH is not initialized. Then, if a device is added via DLPAR for
example, eeh_add_device_early() fails because eeh_enabled() is false,
and EEH end up not being enabled at all.

This reverts the aforementioned patch since a new verification was
introduced by the commit d91dafc0 ("powerpc/eeh: Delay probing EEH
device during hotplug") and so the original Cell issue does not happen
anymore.
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

6a78d8e9

powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover() · 3f992124

Gavin Shan authored Apr 27, 2016

commit 5a0cdbfd upstream.

The function eeh_pe_reset_and_recover() is used to recover EEH
error when the passthrou device are transferred to guest and
backwards. The content in the device's config space will be lost
on PE reset issued in the middle of the recovery. The function
saves/restores it before/after the reset. However, config access
to some adapters like Broadcom BCM5719 at this point will causes
fenced PHB. The config space is always blocked and we save 0xFF's
that are restored at late point. The memory BARs are totally
corrupted, causing another EEH error upon access to one of the
memory BARs.

This restores the config space on those adapters like BCM5719
from the content saved to the EEH device when it's populated,
to resolve above issue.

Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

3f992124

powerpc/eeh: Don't report error in eeh_pe_reset_and_recover() · 4bf7e15e

Gavin Shan authored Apr 27, 2016

commit affeb0f2 upstream.

The function eeh_pe_reset_and_recover() is used to recover EEH
error when the passthrough device are transferred to guest and
backwards, meaning the device's driver is vfio-pci or none.
When the driver is vfio-pci that provides error_detected() error
handler only, the handler simply stops the guest and it's not
expected behaviour. On the other hand, no error handlers will
be called if we don't have a bound driver.

This ignores the error handler in eeh_pe_reset_and_recover()
that reports the error to device driver to avoid the exceptional
behaviour.

Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

4bf7e15e

sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems · fada1ef9

Vik Heyndrickx authored Apr 28, 2016

commit 20878232 upstream.

Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
have no load at all.

Uptime and /proc/loadavg on all systems with kernels released during the
last five years up until kernel version 4.6-rc5, show a 5- and 15-minute
minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on
idle systems, but the way the kernel calculates this value prevents it
from getting lower than the mentioned values.

Likewise but not as obviously noticeable, a fully loaded system with no
processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95
(multiplied by number of cores).

Once the (old) load becomes 93 or higher, it mathematically can never
get lower than 93, even when the active (load) remains 0 forever.
This results in the strange 0.00, 0.01, 0.05 uptime values on idle
systems.  Note: 93/2048 = 0.0454..., which rounds up to 0.05.

It is not correct to add a 0.5 rounding (=1024/2048) here, since the
result from this function is fed back into the next iteration again,
so the result of that +0.5 rounding value then gets multiplied by
(2048-2037), and then rounded again, so there is a virtual "ghost"
load created, next to the old and active load terms.

By changing the way the internally kept value is rounded, that internal
value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon
increasing load, the internally kept load value is rounded up, when the
load is decreasing, the load value is rounded down.

The modified code was tested on nohz=off and nohz kernels. It was tested
on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was
tested on single, dual, and octal cores system. It was tested on virtual
hosts and bare hardware. No unwanted effects have been observed, and the
problems that the patch intended to fix were indeed gone.
Tested-by: Damien Wyart <damien.wyart@free.fr>
Signed-off-by: Vik Heyndrickx <vik.heyndrickx@veribox.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Doug Smythies <dsmythies@telus.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0f004f5a ("sched: Cure more NO_HZ load average woes")
Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.netSigned-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>

fada1ef9