Commits · 68713815af3535461593c1329263c7109eb67d07 · Kirill Smelkov / linux

13 Mar, 2018 38 commits

UBUNTU: [Packaging] disable zfs module checks when zfs is disabled · 68713815

Andy Whitcroft authored Dec 08, 2017

We currently disable the zfs module checks when we disable zfs
builds as part of cross-compilation.  We should disable the zfs
module checks whenever zfs itself is disabled.

Pull the zfs module disablement support such that it is always
present.

BugLink: http://bugs.launchpad.net/bugs/1737176

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Colin King <colin.king@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

integrity: convert digsig to akcipher api · d6fcdd2c

Tadeusz Struk authored Dec 07, 2017

BugLink: http://bugs.launchpad.net/bugs/1735977

Convert asymmetric_verify to akcipher api.

Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit eb5798f2)
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

netfilter: xt_osf: Add missing permission checks · ee9d741f

Kevin Cernekee authored Jan 31, 2018

CVE-2017-17450

The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, xt_osf_fingers is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    vpnns -- nfnl_osf -f /tmp/pf.os

    vpnns -- nfnl_osf -f /tmp/pf.os -d

These non-root operations successfully modify the systemwide OS
fingerprint list.  Add new capable() checks so that they can't.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 916a2790)
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
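
A minimal Python model of the gap this commit describes, not kernel code.
The Task class and the three functions are hypothetical stand-ins:
netlink_net_capable_model() mimics a check against the namespace that owns
the socket, and capable_model() mimics a check against the initial namespace.
The namespace names are made up for illustration.

```python
from dataclasses import dataclass, field

INIT_NS = "init"  # stand-in for the initial user namespace


@dataclass
class Task:
    # Maps a namespace name to the set of capabilities held in it.
    caps: dict = field(default_factory=dict)

    def has_cap(self, ns: str, cap: str) -> bool:
        return cap in self.caps.get(ns, set())


def netlink_net_capable_model(task: Task, socket_ns: str) -> bool:
    """Namespace-relative check: privileged in the socket's own namespace?"""
    return task.has_cap(socket_ns, "CAP_NET_ADMIN")


def capable_model(task: Task) -> bool:
    """Global check: privileged in the initial namespace?"""
    return task.has_cap(INIT_NS, "CAP_NET_ADMIN")


def may_modify_fingerprints(task: Task, socket_ns: str) -> bool:
    # The fingerprint list is shared system-wide, so the namespace-relative
    # check alone is not enough; the global check must pass as well.
    return netlink_net_capable_model(task, socket_ns) and capable_model(task)


# An unprivileged user who created their own user and net namespaces.
unprivileged = Task(caps={"userns-1": {"CAP_NET_ADMIN"}})
root = Task(caps={INIT_NS: {"CAP_NET_ADMIN"}})
```

In this model the unprivileged task passes the namespace-relative check but
is still refused, which is the behaviour the added checks aim for.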

net: Fix double free and memory corruption in get_net_ns_by_id() · 8d095ae3

Eric W. Biederman authored Jan 30, 2018

CVE-2017-15129

(I can trivially verify that that idr_remove in cleanup_net happens
 after the network namespace count has dropped to zero --EWB)

Function get_net_ns_by_id() does not check for net::count
after it has found a peer in netns_ids idr.

It may dereference a peer after its count has already been
decremented for the final time. This leads to a double free and memory
corruption:

put_net(peer)                                   rtnl_lock()
atomic_dec_and_test(&peer->count) [count=0]     ...
__put_net(peer)                                 get_net_ns_by_id(net, id)
  spin_lock(&cleanup_list_lock)
  list_add(&net->cleanup_list, &cleanup_list)
  spin_unlock(&cleanup_list_lock)
queue_work()                                      peer = idr_find(&net->netns_ids, id)
  |                                               get_net(peer) [count=1]
  |                                               ...
  |                                               (use after final put)
  v                                               ...
  cleanup_net()                                   ...
    spin_lock(&cleanup_list_lock)                 ...
    list_replace_init(&cleanup_list, ..)          ...
    spin_unlock(&cleanup_list_lock)               ...
    ...                                           ...
    ...                                           put_net(peer)
    ...                                             atomic_dec_and_test(&peer->count) [count=0]
    ...                                               spin_lock(&cleanup_list_lock)
    ...                                               list_add(&net->cleanup_list, &cleanup_list)
    ...                                               spin_unlock(&cleanup_list_lock)
    ...                                             queue_work()
    ...                                           rtnl_unlock()
    rtnl_lock()                                   ...
    for_each_net(tmp) {                           ...
      id = __peernet2id(tmp, peer)                ...
      spin_lock_irq(&tmp->nsid_lock)              ...
      idr_remove(&tmp->netns_ids, id)             ...
      ...                                         ...
      net_drop_ns()                               ...
        net_free(peer)                            ...
    }                                             ...
  |
  v
  cleanup_net()
    ...
    (Second free of peer)

Also, put_net() on the right CPU may be reordered with
list_replace_init(&cleanup_list, ..) on the left CPU, and then
cleanup_list will be corrupted.

Since cleanup_net() is executed in a worker thread, while
put_net(peer) can happen anywhere, there should be
enough time for a concurrent get_net_ns_by_id() to pick
the peer up, so the race does not seem unlikely.
The patch fixes the problem in the standard way.

(Also, there is a possible problem in peernet2id_alloc(), which requires
a check for net::count under nsid_lock and maybe_get_net(peer), but
in the current stable kernel it's used under rtnl_lock() and it has to be
safe. Open vSwitch has begun to use peernet2id_alloc(), and possibly it should
be fixed too. This is not in the stable kernel yet, so I'll send
a separate message to netdev@ later.)

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: 0c7aecd4 "netns: add rtnl cmd to add and get peer netns ids"
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit 21b59443)
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
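
A minimal Python sketch of the "take a reference only if the count is still
non-zero" pattern, which is assumed here to be the "standard way" the message
refers to (it mentions maybe_get_net() in that role). RefCounted and
lookup_peer() are hypothetical names, and the lock merely stands in for an
atomic increment-unless-zero.

```python
import threading


class RefCounted:
    """Toy reference count; the lock stands in for an atomic operation."""

    def __init__(self, count: int = 1):
        self._count = count
        self._lock = threading.Lock()

    def get(self) -> None:
        # Unconditional increment: would revive an object whose count
        # already dropped to zero, which is the bug pattern described.
        with self._lock:
            self._count += 1

    def get_unless_zero(self) -> bool:
        # Increment only while the object is still live.
        with self._lock:
            if self._count == 0:
                return False
            self._count += 1
            return True

    def put(self) -> bool:
        # Returns True when this call dropped the last reference.
        with self._lock:
            self._count -= 1
            return self._count == 0

    @property
    def count(self) -> int:
        return self._count


def lookup_peer(table: dict, peer_id: int):
    """Hand out a peer only if a reference could actually be taken."""
    peer = table.get(peer_id)
    if peer is not None and peer.get_unless_zero():
        return peer
    return None
```

Once the final put() has happened, lookup_peer() returns None instead of
handing out an object that is already queued for cleanup.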

loop: fix concurrent lo_open/lo_release · 58993897

Linus Torvalds authored Jan 29, 2018

CVE-2018-5344

范龙飞 reports that KASAN can report a use-after-free in __lock_acquire.
The reason is due to insufficient serialization in lo_release(), which
will continue to use the loop device even after it has decremented the
lo_refcnt to zero.

In the meantime, another process can come in, open the loop device
again as it is being shut down. Confusion ensues.

Reported-by: 范龙飞 <long7573@126.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit ae665016)
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

KVM: x86: lower default for halt_poll_ns · e14cf4c6

Paolo Bonzini authored Jan 26, 2018

BugLink: https://bugs.launchpad.net/bugs/1724614

In some fio benchmarks, halt_poll_ns=400000 caused CPU utilization to
increase heavily even in cases where the performance improvement was
small.  In particular, bandwidth divided by CPU usage was as much as
60% lower.

To some extent this is the expected effect of the patch, and the
additional CPU utilization is only visible when running the
benchmarks.  However, halving the threshold also halves the extra
CPU utilization (from +30-130% to +20-70%) and has no negative
effect on performance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(backported from commit b401ee0b)
Signed-off-by: Victor Tapia <victor.tapia@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
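
A toy Python model of why the polling cost scales with the threshold. It
assumes a vCPU polls until either a wakeup arrives or halt_poll_ns expires,
and it ignores any dynamic adjustment of the polling window. The workload is
hypothetical, and 200000 is simply the halved value of the 400000 quoted in
the message.

```python
def polled_ns(idle_gaps_ns, halt_poll_ns):
    """Total time spent busy-polling across a list of idle gaps.

    For each gap the vCPU polls until the wakeup arrives or the
    threshold expires, then blocks for whatever remains of the gap.
    """
    return sum(min(gap, halt_poll_ns) for gap in idle_gaps_ns)


# Hypothetical workload: every idle gap outlasts both thresholds,
# so every poll runs to its full length and is wasted.
gaps = [1_000_000] * 10

old = polled_ns(gaps, 400_000)
new = polled_ns(gaps, 200_000)
```

With gaps this long the polling time is exactly proportional to the
threshold; shorter gaps would narrow the difference.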

UBUNTU: [Debian] pass LOCAL_ENV_CC and LOCAL_ENV_DISTCC_HOSTS properly · cdc00788

Wen-chien Jesse Sung authored Jan 18, 2018

BugLink: https://launchpad.net/bugs/1744077

Signed-off-by: Wen-chien Jesse Sung <jesse.sung@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

UBUNTU: SAUCE: Redpine: fix wowlan issue · 73f23199

Prameela Rani Garnepudi authored Jan 09, 2018

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742090
BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742094

Two issues were observed: a kernel warning at S4 restore, and
occasional failure to wake up.
The kernel warning occurs because, at hibernate resume, the driver
issues a mac80211 detach while mac80211 is still resuming. The warning
is as follows:
[  374.972073] WARNING: CPU: 1 PID: 3725 at linux-4.4.0/net/mac80211/iface.c:1000 ieee80211_do_stop+0x6ea/0x810 [mac80211]()
....
[  374.972211] CPU: 1 PID: 3725 Comm: kworker/u4:44 Tainted: G        W       4.4.0-98-generic #121-Ubuntu
[  374.972213] Hardware name: Dell Inc. Edge Gateway 3002/      , BIOS 01.00.05 11/22/2017
[  374.972223] Workqueue: events_unbound async_run_entry_fn
[  374.972230]  0000000000000286 bf3948ba9db4c154 ffff88005a733ad8 ffffffff813fb2c3
[  374.972235]  0000000000000000 ffffffffc04b8ac8 ffff88005a733b10 ffffffff810812e2
[  374.972239]  ffff8800787c0840 ffff88006f18e700 0000000000000000 ffff88006f18ee90
[  374.972240] Call Trace:
[  374.972249]  [<ffffffff813fb2c3>] dump_stack+0x63/0x90
[  374.972256]  [<ffffffff810812e2>] warn_slowpath_common+0x82/0xc0
[  374.972260]  [<ffffffff8108142a>] warn_slowpath_null+0x1a/0x20
[  374.972305]  [<ffffffffc045915a>] ieee80211_do_stop+0x6ea/0x810 [mac80211]
[  374.972312]  [<ffffffff818441ee>] ? _raw_spin_unlock_bh+0x1e/0x20
[  374.972317]  [<ffffffff817608ba>] ? dev_deactivate_many+0x20a/0x240
[  374.972359]  [<ffffffffc045929a>] ieee80211_stop+0x1a/0x20 [mac80211]
[  374.972365]  [<ffffffff81732a39>] __dev_close_many+0x99/0x100
[  374.972369]  [<ffffffff81732b31>] dev_close_many+0x91/0x140
[  374.972374]  [<ffffffff810e6171>] ? synchronize_sched_expedited+0x4e1/0x880
[  374.972379]  [<ffffffff81734e2a>] dev_close.part.79+0x4a/0x70
[  374.972383]  [<ffffffff81734e6a>] dev_close+0x1a/0x20
[  374.972425]  [<ffffffffc035fac1>] cfg80211_shutdown_all_interfaces+0x41/0xa0 [cfg80211]
[  374.972467]  [<ffffffffc045a6c6>] ieee80211_remove_interfaces+0x56/0x1f0 [mac80211]
[  374.972506]  [<ffffffffc0441bca>] ieee80211_unregister_hw+0x4a/0x120 [mac80211]

This is avoided by calling ieee80211_restart_hw and reinitializing
the device as usual in sdio restore, and by waiting in mac80211_resume
until the device is ready.
The other issue may be due to a firmware assertion, observed at times, on
the length of the bgscan probe request at restore. To avoid this,
unnecessary IEs are cut from the end of the frame.

Signed-off-by: Prameela Rani Garnepudi <prameela.j04cs@gmail.com>
Signed-off-by: Amitkumar Karwar <amit.karwar@redpinesignals.com>
Acked-by: Kai Heng Feng <kai.heng.feng@canonical.com>
Acked-by: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

UBUNTU: SAUCE: Redpine: fix reset card issue · 837f4087

Prameela Rani Garnepudi authored Jan 09, 2018

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742090
BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742094

Sometimes we don't get a response to SDIO commands during reset.
An additional parameter, 'expected response', is added to cmd52readbyte()
and cmd52writebyte(). This parameter is false during reset (to avoid
waiting for a response) and true while disabling or enabling SDIO
interrupts.

Signed-off-by: Prameela Rani Garnepudi <prameela.j04cs@gmail.com>
Signed-off-by: Amitkumar Karwar <amit.karwar@redpinesignals.com>
Acked-by: Kai Heng Feng <kai.heng.feng@canonical.com>
Acked-by: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

UBUNTU: SAUCE: Redpine: fix data issue with non-uapsd APs · d4450763

Prameela Rani Garnepudi authored Jan 09, 2018

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742090
BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742094

UAPSD parameter configuration in the power save request should be
guarded by the UAPSD bitmap check. Otherwise a data block issue occurs
with non-UAPSD APs.

Signed-off-by: Prameela Rani Garnepudi <prameela.j04cs@gmail.com>
Signed-off-by: Amitkumar Karwar <amit.karwar@redpinesignals.com>
Acked-by: Kai Heng Feng <kai.heng.feng@canonical.com>
Acked-by: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

UBUNTU: SAUCE: Redpine: fix for wowlan wakeup failure · 71b07f7b

Pavani Muthyala authored Jan 09, 2018

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742090
BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742094

It is observed that the magic packet is sometimes missed by the firmware,
which results in wakeup failure. This happens only in coex mode
when power save is enabled. The issue is resolved by disabling power
save to avoid radio loss for wlan.

Signed-off-by: Pavani Muthyala <pavanimuthyala1992@gmail.com>
Signed-off-by: Amitkumar Karwar <amit.karwar@redpinesignals.com>
Acked-by: Kai Heng Feng <kai.heng.feng@canonical.com>
Acked-by: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme-pci: disable APST on Samsung SSD 960 EVO + ASUS PRIME B350M-A · 86f9f48d

Kai-Heng Feng authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

The NVMe device in question drops off the PCIe bus after system suspend.
I've tried several approaches to work around this issue, but none of them
work:
- NVME_QUIRK_DELAY_BEFORE_CHK_RDY
- NVME_QUIRK_NO_DEEPEST_PS
- Disable APST before controller shutdown
- Delay between controller shutdown and system suspend
- Explicitly set power state to 0 before controller shutdown

Fortunately it's a desktop, so disabling APST won't hurt the battery.

Also, change the quirk function name to reflect it's for vendor
combination quirks.

BugLink: https://bugs.launchpad.net/bugs/1705748

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit 8427bbc2)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme: Quirk APST on Intel 600P/P3100 devices · 675f069f

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

They have known firmware bugs.  A fix is apparently in the works --
once fixed firmware is available, someone from Intel (Hi, Keith!)
can adjust the quirk accordingly.

Cc: stable@vger.kernel.org # v4.11
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
(backported from commit 50af47d0)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme: relax APST default max latency to 100ms · f8546b38

Kai-Heng Feng authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

Christoph Hellwig suggests we should make APST work out of the box.
Hence relax the default max latency so that devices can enter the
deepest power state by default.

Here are id-ctrl excerpts from two high latency NVMes:

vid     : 0x14a4
ssvid   : 0x1b4b
mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

vid     : 0x15b7
ssvid   : 0x1b4b
mn      : A400 NVMe SanDisk 512GB
ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit 9947d6a0)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
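
A small Python sketch that reads the "ps" lines quoted in this commit and
checks them against a 100 ms limit. The parser and the usable() rule are
illustrative assumptions, not driver code: the rule assumes a state qualifies
when its exit latency (exlat, in microseconds) does not exceed the limit.

```python
import re

MAX_LATENCY_US = 100_000  # 100 ms, the relaxed default named in the title

PS_LINE = re.compile(
    r"ps\s+(?P<index>\d+)\s*:.*?enlat:(?P<enlat>\d+)\s+exlat:(?P<exlat>\d+)"
)


def parse_ps(line: str) -> dict:
    """Extract the state index and both latencies from one id-ctrl line."""
    m = PS_LINE.search(line)
    if m is None:
        raise ValueError(f"not a power state line: {line!r}")
    return {key: int(value) for key, value in m.groupdict().items()}


def usable(state: dict, max_latency_us: int = MAX_LATENCY_US) -> bool:
    # Assumption: a state qualifies when its exit latency fits the limit.
    return state["exlat"] <= max_latency_us


# The two deepest states from the excerpts in the commit message.
liteon_ps4 = parse_ps(
    "ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4"
)
sandisk_ps4 = parse_ps(
    "ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0"
)
```

Both devices report exlat:100000 for their deepest state, so under this rule
a limit even one microsecond lower would exclude it.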

nvme: only consider exit latency when choosing useful non-op power states · fab335a3

Kai-Heng Feng authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

When an NVMe device is in a non-operational state, the latency is exlat.
The latency will be enlat + exlat only when the device tries to transition
back to an operational state right after it begins to transition to a
non-operational state, which should be a rare case.

Therefore, as Andy Lutomirski suggests, use exlat only when deciding which
power states to transition to.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
(backported from commit da87591b)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
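
A short Python sketch contrasting the two latency metrics discussed in this
commit. pick_deepest() and the example states are hypothetical, and the
function assumes states are ordered from shallowest to deepest.

```python
def pick_deepest(states, max_latency_us, exit_only):
    """Return the index of the deepest state whose latency fits the limit.

    `states` is a list of (enlat_us, exlat_us) tuples ordered from
    shallowest to deepest; returns None when no state qualifies.
    """
    chosen = None
    for index, (enlat, exlat) in enumerate(states):
        latency = exlat if exit_only else enlat + exlat
        if latency <= max_latency_us:
            chosen = index
    return chosen


# Hypothetical non-operational states, latencies in microseconds.
states = [(500, 500), (5_000, 5_000), (50_000, 60_000)]

by_exit_latency = pick_deepest(states, 100_000, exit_only=True)
by_total_latency = pick_deepest(states, 100_000, exit_only=False)
```

With the same limit, counting only exit latency admits the deepest example
state, while counting entry plus exit latency stops one state earlier.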

nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA" · ef6a4d76

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

There's a report that it malfunctions with APST on.

See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit be56945c)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme: Adjust the Samsung APST quirk · 44fee752

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

I got a couple more reports: the Samsung APST issues appear to
affect multiple 950-series devices in Dell XPS 15 9550 and Precision
5510 laptops.  Change the quirk: rather than blacklisting the
firmware on the first problematic SSD that was reported, disable
APST on all 144d:a802 devices if they're installed in the two
affected Dell models.  While we're at it, disable only the deepest
sleep state instead of all of them -- the reporters say that this is
sufficient to fix the problem.

(I have a device that appears to be entirely identical to one of the
affected devices, but I have a different Dell laptop, so it's not
the case that all Samsung devices with firmware BXW75D0Q are broken
under all circumstances.)

Samsung engineers have an affected system, and hopefully they'll
give us a better workaround some time soon.  In the mean time, this
should minimize regressions.

See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
(backported from commit ff5350a8)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme: Enable autonomous power state transitions · 03ef54e6

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

NVMe devices can advertise multiple power states.  These states can
be either "operational" (the device is fully functional but possibly
slow) or "non-operational" (the device is asleep until woken up).
Some devices can automatically enter a non-operational state when
idle for a specified amount of time and then automatically wake back
up when needed.

The hardware configuration is a table.  For each state, an entry in
the table indicates the next deeper non-operational state, if any,
to autonomously transition to and the idle time required before
transitioning.

This patch teaches the driver to program APST so that each successive
non-operational state will be entered after an idle time equal to 100%
of the total latency (entry plus exit) associated with that state.
The maximum acceptable latency is controlled using dev_pm_qos
(e.g. power/pm_qos_latency_tolerance_us in sysfs); non-operational
states with total latency greater than this value will not be used.
As a special case, setting the latency tolerance to 0 will disable
APST entirely.  On hardware without APST support, the sysfs file will
not be exposed.

The latency tolerance for newly-probed devices is set by the module
parameter nvme_core.default_ps_max_latency_us.

In theory, the device can expose a "default" APST table, but this
doesn't seem to function correctly on my device (Samsung 950), nor
does it seem particularly useful.  There is also an optional
mechanism by which a configuration can be "saved" so it will be
automatically loaded on reset.  This can be configured from
userspace, but it doesn't seem useful to support in the driver.

On my laptop, enabling APST seems to save nearly 1W.

The hardware tables can be decoded in userspace with nvme-cli.
'nvme id-ctrl /dev/nvmeN' will show the power state table and
'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
configuration.

This feature is quirked off on a known-buggy Samsung device.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
(backported from commit c5552fde)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
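
A simplified Python model of the table-building rules this commit message
describes: each entry points at the next deeper usable non-operational state,
the idle time equals that state's total latency, states above the tolerance
are skipped, and a tolerance of 0 disables everything. PowerState,
build_apst_table() and the deepest-to-shallowest walk are assumptions made
for illustration; the real driver encodes entries in a hardware format that
is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PowerState:
    operational: bool
    enlat_us: int
    exlat_us: int

    @property
    def total_latency_us(self) -> int:
        return self.enlat_us + self.exlat_us


# One entry per state: (target state index, idle time in us), or None.
Entry = Optional[Tuple[int, int]]


def build_apst_table(states: List[PowerState], tolerance_us: int) -> List[Entry]:
    table: List[Entry] = [None] * len(states)
    if tolerance_us == 0:
        return table  # a tolerance of 0 disables APST entirely

    target: Entry = None
    # Walk from the deepest state to the shallowest so that each entry
    # can point at the nearest deeper state that is usable.
    for index in range(len(states) - 1, -1, -1):
        table[index] = target
        state = states[index]
        if state.operational:
            continue  # only non-operational states can be targets
        if state.total_latency_us > tolerance_us:
            continue  # too slow for the configured tolerance
        # Idle time before entering = 100% of the state's total latency.
        target = (index, state.total_latency_us)
    return table


# Hypothetical device: two operational and three non-operational states.
example = [
    PowerState(True, 0, 0),
    PowerState(True, 0, 0),
    PowerState(False, 1_000, 1_000),
    PowerState(False, 5_000, 45_000),
    PowerState(False, 500_000, 500_000),
]
```

With a tolerance of 100000 us the deepest example state is never used, and
the state just above it has no deeper target to move to.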

nvme: Add a quirk mechanism that uses identify_ctrl · 73b4d470

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

Currently, all NVMe quirks are based on PCI IDs.  Add a mechanism to
define quirks based on identify_ctrl's vendor id, model number,
and/or firmware revision.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
(backported from commit bd4da3ab)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
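
A minimal Python sketch of a quirk table keyed on identify data (vendor id,
model number, firmware revision). QuirkEntry, quirks_for(), the flag value
and the table contents are all hypothetical; trailing spaces are stripped on
the assumption that the identify strings are space-padded.

```python
from dataclasses import dataclass
from typing import Optional

QUIRK_NO_APST = 1 << 0  # hypothetical flag value, for illustration only


@dataclass(frozen=True)
class QuirkEntry:
    vid: int
    mn: Optional[str] = None  # model number; None matches any
    fr: Optional[str] = None  # firmware revision; None matches any
    quirks: int = 0


def quirks_for(table, vid: int, mn: str, fr: str) -> int:
    """OR together the quirks of every entry matching the identify data."""
    result = 0
    for entry in table:
        if entry.vid != vid:
            continue
        if entry.mn is not None and entry.mn != mn.strip():
            continue
        if entry.fr is not None and entry.fr != fr.strip():
            continue
        result |= entry.quirks
    return result


# Made-up table entry; the ID and strings are placeholders.
TABLE = [
    QuirkEntry(vid=0x1234, mn="EXAMPLE MODEL", fr="FW1", quirks=QUIRK_NO_APST),
]
```

Leaving mn or fr as None gives a wildcard, so a single entry can cover
every firmware revision of a model or every model of a vendor.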

nvme: Pass pointers, not dma addresses, to nvme_get/set_features() · 2b50a40b

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

Any user I can imagine that needs a buffer at all will want to pass
a pointer directly.  There are currently no callers that use
buffers, so this change is painless, and it will make it much easier
to start using features that use buffers (e.g. APST).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 1a6fe74d)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme: Fix nvme_get/set_features() with a NULL result pointer · fa2aaddc

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

nvme_set_features() callers seem to expect that passing NULL as the
result pointer is acceptable.  Teach nvme_set_features() not to try to
write to the NULL address.

For symmetry, make the same change to nvme_get_features(), despite the
fact that all current callers pass a valid result pointer.

I assume that this bug hasn't been reported in practice because
the callers that pass NULL are all in the SCSI translation layer
and no one uses the relevant operations.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 9b47f77a)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme: Modify and export sync command submission for fabrics · e8f85779

Christoph Hellwig authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

NVMe over fabrics will use __nvme_submit_sync_cmd in the
transport and requires a few tweaks to it.  For that we export it
and add a few more parameters:

1. allow passing a queue ID to the block layer

   For the NVMe over Fabrics connect command we need to be able to specify a
   queue ID that we want to send the command on.  Add a qid parameter to
   the relevant functions to enable this behavior.

2. allow submitting at_head commands

   In cases where we want to (re)connect to a controller
   where we have inflight queued commands we want to first
   connect and only then allow the other queued commands to
   be kicked. This will prevent failures in controller resets
   and reconnects.

3. allow passing flags to blk_mq_allocate_request

   For both Fabrics connect and the keep-alive feature in NVMe 1.2.1 we
   want to be able to use reserved requests.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit eb71f435)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme: factor out a add nvme_is_write helper · e37c5cca

Christoph Hellwig authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

Centralize the check if a given NVMe command reads or writes data.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(backported from commit 7a5abb4b)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
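
A tiny Python sketch of what such a centralized check could look like. It
assumes the NVMe opcode convention in which bit 0 marks a host-to-controller
data transfer; is_write() is a hypothetical name and only mirrors the idea of
the helper, not its actual body.

```python
# Opcode values from the NVMe NVM command set.
NVME_CMD_FLUSH = 0x00
NVME_CMD_WRITE = 0x01
NVME_CMD_READ = 0x02


def is_write(opcode: int) -> bool:
    """Assumed rule: bit 0 of the opcode marks a host-to-controller transfer."""
    return bool(opcode & 1)
```

Keeping the rule in one function means callers do not need to repeat the
bit test or enumerate opcodes themselves.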

nvme: return the whole CQE through the request passthrough interface · 0aa7cb3a

Christoph Hellwig authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

Both LightNVM and NVMe over Fabrics need to look at more than just the
status and result field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <m@bjorling.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(backported from commit 1cb3cce5)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

nvme/scsi: Remove power management support · 07e482c7

Andy Lutomirski authored Jan 05, 2018

BugLink: https://bugs.launchpad.net/bugs/1664602

As far as I can tell, there is basically nothing correct about this
code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
power states, which is nonsense, because they're all just indices
into a table that software needs to parse.  It completely ignores
the distinction between operational and non-operational states.
And, until 4.8, if all of the above magically succeeded, it would
dereference a NULL pointer and OOPS.

Since this code appears to be useless, just delete it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 26501db8)
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Acked-By: AceLan Kao <acelan.kao@canonical.com>
Acked-By: Shrirang Bagul <shrirang.bagul@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

bpf: fix branch pruning logic · 68dd63b2

Alexei Starovoitov authored Jan 04, 2018

When the verifier detects that a register contains a runtime constant
and it's compared with another constant, it will prune exploration
of the branch that is guaranteed not to be taken at runtime.
This is all correct, but a malicious program may be constructed
in such a way that it always has a constant comparison and
the other branch is never taken under any conditions.
In this case such a path through the program will not be explored
by the verifier. It won't be taken at run time either, but since
all instructions are JITed, the malicious program may cause JITs
to complain about using reserved fields, etc.

To fix the issue we have to track the instructions explored by
the verifier and sanitize instructions that are dead at run time
with NOPs. We cannot reject such dead code, since llvm generates
it for valid C code, because it doesn't do as much data flow
analysis as the verifier does.

Fixes: 17a52670 ("bpf: verifier (add verifier core)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
(backported from commit c131187d)
CVE-2017-17862
[ saf: Add partial backport of 3df126f3 ("bpf: don't (ab)use
  instructions to store state") to add bpf_insn_aux_data state to
  verifier_env ]
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
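
A minimal Python sketch of the "track explored instructions, then overwrite
the rest with NOPs" idea from this commit. The tuple-based instruction format,
the NOP encoding (a register move onto itself) and the example program are
assumptions made for illustration, not the verifier's data structures.

```python
NOP = ("mov64", "r0", "r0")  # assumed NOP: a register move onto itself


def sanitize_dead_code(insns, seen):
    """Replace every instruction the verifier never explored with a NOP.

    `insns` is a list of instruction tuples and `seen` a parallel list
    of booleans recorded while walking the program.
    """
    if len(insns) != len(seen):
        raise ValueError("insns and seen must have the same length")
    return [insn if was_seen else NOP for insn, was_seen in zip(insns, seen)]


# Hypothetical program: the jump is always taken, so index 2 is dead.
program = [
    ("mov64_imm", "r1", 0),
    ("jeq_imm", "r1", 0, "+1"),     # constant comparison, always true
    ("invalid", "reserved", 0xFF),  # never reached, never verified
    ("exit",),
]
seen = [True, True, False, True]

sanitized = sanitize_dead_code(program, seen)
```

The program keeps its length and jump offsets, but the unverified
instruction never reaches a JIT in its original form.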

bpf: fix incorrect sign extension in check_alu_op() · 04e1c8db

Jann Horn authored Jan 04, 2018

[ Upstream commit 95a762e2 ]

Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16995 for this issue.

v3:
 - add CVE number (Ben Hutchings)

Fixes: 48461135 ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
CVE-2017-16995
[ saf: Backport to 4.4. Include partial backports of 4923ec0b ("bpf:
  simplify verifier register state assignments") and 969bf05e ("bpf:
  direct packet access") to extend reg_state.imm to 64-bit. ]
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

04e1c8db

KVM: Fix stack-out-of-bounds read in write_mmio · 48a03f4f

Wanpeng Li authored Jan 04, 2018

CVE-2017-17741

Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The vmmcall patching path writes the 3-byte opcode 0F 01 C1 (vmcall)
to guest memory. However, the write_mmio tracepoint always prints 8 bytes
through *(u64 *)val, since kvm splits the mmio access into chunks of up to
8 bytes. This leaks 5 bytes from the kernel stack (CVE-2017-17741). This
patch fixes it by accessing only the bytes that we operate on.
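
A minimal user-space sketch of the idea, assuming a little-endian host:
copy only len bytes into a zeroed 64-bit value instead of dereferencing
the buffer as a u64. The function name trace_val_bounded is hypothetical
and this is not the tracepoint code itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build the value to be logged from the len bytes that were actually
 * written. Bytes beyond len are left as zero rather than read from
 * whatever follows the buffer.
 */
static uint64_t trace_val_bounded(const void *val, size_t len)
{
    uint64_t out = 0;

    assert(val != NULL && len <= sizeof(out));
    memcpy(&out, val, len); /* never reads past the len bytes given */
    return out;
}
```

With the three opcode bytes 0F 01 C1 and len 3, this yields 0xc1010f on a
little-endian machine, matching the "After patch" trace line below.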

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(backported from e39d200f)
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

48a03f4f

RDS: null pointer dereference in rds_atomic_free_op · a0a14798

Mohamed Ghannam authored Jan 03, 2018

Set rm->atomic.op_active to 0 when rds_pin_pages() fails
or the user-supplied address is invalid.
This prevents a NULL pointer dereference in rds_atomic_free_op().
Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

CVE-2018-5333
(cherry picked from commit 7d11f77f)
Signed-off-by: Benjamin M Romer <benjamin.romer@canonical.com>
Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Signed-off-by: Benjamin M Romer <benjamin.romer@canonical.com>

a0a14798

ipv6: Do not consider linkdown nexthops during multipath · b9625607

Ido Schimmel authored Dec 15, 2017

BugLink: http://bugs.launchpad.net/bugs/1738219

When the 'ignore_routes_with_linkdown' sysctl is set, we should not
consider linkdown nexthops during route lookup.

While the code correctly verifies that the initially selected route
('match') has a carrier, it does not perform the same check in the
subsequent multipath selection, resulting in potential packet loss.

In case the chosen route does not have a carrier and the sysctl is set,
choose the initially selected route.

Fixes: 35103d11 ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bbfcd776)
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

b9625607

UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent · 577fff51

Ryan Harper authored Dec 11, 2017

BugLink: http://bugs.launchpad.net/bugs/1729145

- decouple emitting a cached_dev CHANGE uevent which includes dev.uuid
  and dev.label from bch_cached_dev_run() which only happens when a
  bcacheX device is bound to the actual backing block device (bcache0 -> vdb)

- update bch_cached_dev_run() to invoke bch_cached_dev_emit_change() as
  needed; no functional code path changes here

- Modify register_bcache to detect a re-registering of a bcache
  cached_dev, and in that case call bcache_cached_dev_emit_change() to
Signed-off-by: Ryan Harper <ryan.harper@canonical.com>
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

577fff51

e1000e: Separate signaling for link check/link up · 2ebc9658

Benjamin Poirier authored Dec 14, 2017

BugLink: http://bugs.launchpad.net/bugs/1730550

Lennart reported the following race condition:

\ e1000_watchdog_task
    \ e1000e_has_link
        \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
            /* link is up */
            mac->get_link_status = false;

                            /* interrupt */
                            \ e1000_msix_other
                                hw->mac.get_link_status = true;

        link_active = !hw->mac.get_link_status
        /* link_active is false, wrongly */

This problem arises because the single flag get_link_status is used to
signal two different states: link status needs checking and link status is
down.

Avoid the problem by using the return value of .check_for_link to signal
the link status to e1000e_has_link().
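
The race and the fix can be modelled in a few lines of user-space C.
This is a simplified sketch: struct mac_info, the phy_link_up and irq
parameters, and the function bodies are assumptions made for the example
and do not reproduce the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the driver state. */
struct mac_info {
    bool get_link_status; /* true: link status needs checking */
};

/* Simulated interrupt handler: requests a new link check. */
static void msix_other(struct mac_info *mac)
{
    mac->get_link_status = true;
}

/* Returns 1 if the link is up, 0 if it is down. */
static int check_for_link(struct mac_info *mac, bool phy_link_up)
{
    assert(mac != NULL);
    if (phy_link_up)
        mac->get_link_status = false; /* check completed */
    return phy_link_up ? 1 : 0;
}

/* Racy variant: infers the link state from the shared flag. */
static bool has_link_racy(struct mac_info *mac, bool phy_link_up, bool irq)
{
    check_for_link(mac, phy_link_up);
    if (irq)
        msix_other(mac); /* interrupt lands between check and read */
    return !mac->get_link_status;
}

/* Fixed variant: uses the return value, so the interrupt cannot skew it. */
static bool has_link(struct mac_info *mac, bool phy_link_up, bool irq)
{
    int ret = check_for_link(mac, phy_link_up);

    if (irq)
        msix_other(mac);
    return ret > 0;
}
```

In the fixed variant the flag still ends up set after the interrupt, so
the next watchdog run checks the link again, but the current result is
no longer misread as "link down".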
Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 19110cfb)
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

2ebc9658

e1000e: Avoid receiver overrun interrupt bursts · 6d6b5fd1

Benjamin Poirier authored Dec 14, 2017

BugLink: http://bugs.launchpad.net/bugs/1730550

When e1000e_poll() is not fast enough to keep up with incoming traffic, the
adapter (when operating in msix mode) raises the Other interrupt to signal
Receiver Overrun.

This is a double problem because 1) at the moment e1000_msix_other()
assumes that it is only called in case of Link Status Change and 2) if the
condition persists, the interrupt is repeatedly raised again in quick
succession.

Ideally we would configure the Other interrupt to not be raised in case of
receiver overrun but this doesn't seem possible on this adapter. Instead,
we handle the first part of the problem by reverting to the practice of
reading ICR in the other interrupt handler, like before commit 16ecba59
("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
0a8047ac ("e1000e: Fix msi-x interrupt automask") which cleared IAME
from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
anymore. We handle the second part of the problem by not re-enabling the
Other interrupt right away when there is overrun. Instead, we wait until
traffic subsides, napi polling mode is exited and interrupts are
re-enabled.
Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Fixes: 16ecba59 ("e1000e: Do not read ICR in Other interrupt")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 4aea7a5c)
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

6d6b5fd1

ath10k: add max_tx_power for QCA6174 WLAN.RM.2.0 firmware · 357bf2a7

Alan Liu authored Dec 05, 2017

BugLink: https://bugs.launchpad.net/bugs/1736317

QCA6174 WLAN.RM.2.0 firmware uses max_tx_power instead of max_reg_power
to set transmission power. The tx power was about -50 dBm; after applying
this change, it became -32 dBm.
Signed-off-by: Alan Liu <alanliu@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
(cherry picked from commit 513527c8)
Signed-off-by: Shrirang Bagul <shrirang.bagul@canonical.com>
Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
Acked-by: Kai Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

357bf2a7

scsi_dh_alua: uninitialized variable in alua_rtpg() · 80a90e5b

Dan Carpenter authored Nov 29, 2017

BugLink: https://bugs.launchpad.net/bugs/1720228

It's possible to use "err" without initializing it.  If it happens to be
2, which is SCSI_DH_RETRY, then that could cause a bug.  Bart Van Assche
pointed out that we should probably re-initialize it for every iteration
through the retry loop.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: James Bottomley <jejb@linux.vnet.ibm.com>
(cherry picked from commit a4bd8520)
Signed-off-by: Dragan Stancevic <dragan.stancevic@canonical.com>
Acked-by: Kleber Souza <kleber.souza@canonical.com>
Acked-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

80a90e5b

UBUNTU: d-i: Add bnxt_en_bpo to nic-modules. · c512ae78

Vinson Lee authored Nov 27, 2017

BugLink: http://bugs.launchpad.net/bugs/1734757
Suggested-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Acked-by: Kamal Mostafa <kamal@canonical.com>
Acked-by: Kleber Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

c512ae78

UBUNTU: SAUCE: use CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y as default · 0895a081

Colin Ian King authored Nov 23, 2017

BugLink: https://bugs.launchpad.net/bugs/1703742

The current configuration is set to always use transparent hugepages
by default. There is plenty of anecdotal evidence that this is
a less than perfect choice and that in some scenarios it leads to
performance issues.

My own investigations with stress-ng stream and malloc tests show that
the current default impacts performance. I ran various test scenarios
on different MADVISE configurations, each result below is based on
the average of 5 runs on an i7-3770 CPU @ 3.4GHz with 8GB memory,
8MB L3 cache, 256K L2 cache, 32K/32K L1 cache.

All the above results are from an average of 5 rounds of tests.

malloc allocation stressor:

     malloc     always    madvise
    size (MB)   ops/sec   ops/sec
         32     1254.43   2422.49
         64     2100.36   4300.28
        128     3768.57   7215.38
        256     7940.73  14893.85
        512    17618.62  26861.29
       1024    32777.17  48029.37

Clearly madvise is more performant.

stream bandwidth/compute stressor:

    stream      always    madvise
                         NOHUGEPAGE
    size (MB)   MB/sec     MB/sec
          1   17713.54   18439.69
          2   12460.34   13015.46
          4   12195.81   12694.51
          8   12085.11   12674.26
         16   12054.09   12649.00
         32   12082.42   12409.65
         64   12262.88   12084.85
        128   12235.25   11788.49
        256   11808.69   11283.69
        512   11970.01   12434.82

For small allocations, always is less performant. Large
allocations can enable the more performant transparent
huge pages with madvise(2) if we disable always as default.

Other stress-ng memory allocation/writing/freeing and madvise
operations showed little significant difference.

I have also experimented with boot testing Ubuntu with kernels
configured with different MADVISE configs and found there is
little noticeable difference in performance, so I believe that
there is little scope for any kitten killer performance regressions
with this change.

This change will by default not use transparent huge pages unless
madvise(2) is used to instruct the kernel to do so on a memory
mapping.  According to the madvise(2) manual, this only takes
effect on private anonymous mappings with MADV_HUGEPAGE.
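
For reference, an application that wants huge pages under the madvise
default could opt in roughly like this. It is a minimal sketch assuming
Linux with glibc; the helper name alloc_thp_hint is made up for the
example, and the madvise(2) call may fail on kernels built without
transparent hugepage support.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of private anonymous memory and ask the kernel to back
 * it with transparent huge pages. Returns the mapping, or NULL if mmap()
 * fails. *advised receives the madvise() result: 0 on success, -1 on
 * failure or when MADV_HUGEPAGE is not defined by the headers.
 */
static void *alloc_thp_hint(size_t len, int *advised)
{
    void *p;

    assert(advised != NULL);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    *advised = madvise(p, len, MADV_HUGEPAGE);
#else
    *advised = -1;
#endif
    return p;
}
```

The mapping stays usable even when the hint is rejected, so callers can
treat a failed madvise() as "no huge pages" rather than as an error.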
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kamal Mostafa <kamal@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

0895a081

UBUNTU: Start new release · c40862fd

Kleber Sacilotto de Souza authored Mar 13, 2018

Ignore: yes
Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>

c40862fd

12 Feb, 2018 2 commits

UBUNTU: Ubuntu-4.4.0-116.140 · 855cff54

Khalid Elmously authored Feb 12, 2018

Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

855cff54

UBUNTU: SAUCE: net: ipv4: fix for a race condition in raw_sendmsg -- fix backport · 0abf64f1

Andy Whitcroft authored Feb 12, 2018

Fix a mis-backport of the upstream commit.

Fixes: 63da13a9 ("net: ipv4: fix for a race condition in raw_sendmsg -- fix backport")
BugLink: http://bugs.launchpad.net/bugs/1748671
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Kamal Mostafa <kamal@canonical.com>
Acked-by: Brad Figg <brad.figg@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

0abf64f1