Commits · 673c790e72822ee433931ea701e4fceef75a0eac · Kirill Smelkov / linux

29 Oct, 2018 12 commits

Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu · 673c790e

Linus Torvalds authored Oct 29, 2018

Pull m68k nommu fix from Greg Ungerer:
 "Only a single change to fix an out of bounds array access when parsing
  boot command line"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68k: fix command-line parsing when passed from u-boot

673c790e

Merge tag 'm68k-for-v4.20-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k · 83e7e5b5

Linus Torvalds authored Oct 29, 2018

Pull m68k updates from Geert Uytterhoeven:
 "Just two small cleanups"

* tag 'm68k-for-v4.20-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k/sun3: Remove is_medusa and m68k_pgtable_cachemode
  m68k/atari: ARAnyM - Remove reference to long-deprecated MODULE_PARM

83e7e5b5

Merge tag 'csky-for-linus-4.20' of https://github.com/c-sky/csky-linux · ac435075

Linus Torvalds authored Oct 29, 2018

Pull C-SKY architecture port from Guo Ren:
 "This contains the Linux port for C-SKY(csky) based on linux-4.19
  Release, which has been through 10 rounds of review on mailing list.

  More information:

    http://en.c-sky.com

  The development repo:

    https://github.com/c-sky/csky-linux

  ABI Documentation:

    https://github.com/c-sky/csky-doc

  Here is the pre-built cross compiler for fast test from our CI:

    https://gitlab.com/c-sky/buildroot/-/jobs/101608095/artifacts/file/output/images/csky_toolchain_qemu_csky_ck807f_4.18_glibc_defconfig_482b221e52908be1c9b2ccb444255e1562bb7025.tar.xz

  We use buildroot as our CI-test enviornment. "LTP, Lmbench ..." will
  be tested for every commit. See here for more details:

    https://gitlab.com/c-sky/buildroot/pipelines

  We'll continouslly improve csky subsystem in future"

Arnd acks, and adds the following notes:
 "I did a thorough review of the ABI, which as usual mainly consists of
  spotting any files that don't use the asm-generic ABI itself, and
  having it changed to it matches exactly what we do on other new
  architectures.

  I also looked at every other patch and commented on maybe half of them
  where I saw something that did not quite seem right. Others have
  reviewed specific patches in greater depth. I'm sure that one could
  fine more of the minor details, but as long as they are not ABI
  relevant, they can be fixed later.

  The only patch that is part of the ABI and that nobody reviewed is the
  signal handling. This is one of the areas I never worked on in much
  detail. I did not see anything wrong with it, but I also don't know
  what the problems with the other architectures are here, and we seem
  to be hitting issues occasionally, and we never managed to generalize
  this enough for new architectures to have a trivial implementation.

  I was originally hoping that we could have the 64-bit time_t
  interfaces ready in time to completely drop the 32-bit ones, but that
  did not happen. We might still remove them in the next merge window
  depending on whether the libc upstream people prefer to keep them or
  not.

  One more general comment: I think this may well be the last new CPU
  architecture we ever add to the kernel. Both nds32 and c-sky are made
  by companies that also work on risc-v, and generally speaking risc-v
  seems to be killing off any of the minor licensable instruction set
  projects, just like ARM has mostly killed off the custom
  vendor-specific instruction sets already.

  If we add another architecture in the future, it may instead be
  something like the LLVM bitcode or WebAssembly, who knows?"

To which Geert Uytterhoeven pipes in about another architecture still in
the pipeline: Kalray MPPA.

* tag 'csky-for-linus-4.20' of https://github.com/c-sky/csky-linux: (24 commits)
  dt-bindings: interrupt-controller: C-SKY APB intc
  irqchip: add C-SKY APB bus interrupt controller
  dt-bindings: interrupt-controller: C-SKY SMP intc
  irqchip: add C-SKY SMP interrupt controller
  MAINTAINERS: Add csky
  dt-bindings: Add vendor prefix for csky
  dt-bindings: csky CPU Bindings
  csky: Misc headers
  csky: SMP support
  csky: Debug and Ptrace GDB
  csky: User access
  csky: Library functions
  csky: ELF and module probe
  csky: Atomic operations
  csky: IRQ handling
  csky: VDSO and rt_sigreturn
  csky: Process management and Signal
  csky: MMU and page table management
  csky: Cache and TLB routines
  csky: System Call
  ...

ac435075

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 9f51ae62

Linus Torvalds authored Oct 28, 2018

Pull networking fixes from David Miller:

 1) GRO overflow entries are not unlinked properly, resulting in list
    poison pointers being dereferenced.

 2) Fix bridge build with ipv6 disabled, from Nikolay Aleksandrov.

 3) Direct packet access and other fixes in BPF from Daniel Borkmann.

 4) gred_change_table_def() gets passed the wrong pointer, a pointer to
    a set of unparsed attributes instead of the attribute itself. From
    Jakub Kicinski.

 5) Allow macsec device to be brought up even if it's lowerdev is down,
    from Sabrina Dubroca.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: diag: document swapped src/dst in udp_dump_one.
  macsec: let the administrator set UP state even if lowerdev is down
  macsec: update operstate when lower device changes
  net: sched: gred: pass the right attribute to gred_change_table_def()
  ptp: drop redundant kasprintf() to create worker name
  net: bridge: remove ipv6 zero address check in mcast queries
  net: Properly unlink GRO packets on overflow.
  bpf: fix wrong helper enablement in cgroup local storage
  bpf: add bpf_jit_limit knob to restrict unpriv allocations
  bpf: make direct packet write unclone more robust
  bpf: fix leaking uninitialized memory on pop/peek helpers
  bpf: fix direct packet write into pop/peek helpers
  bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data
  bpf: fix direct packet access for flow dissector progs
  bpf: disallow direct packet access for unpriv in cg_skb
  bpf: fix test suite to enable all unpriv program types
  bpf, btf: fix a missing check bug in btf_parse
  selftests/bpf: add config fragments BPF_STREAM_PARSER and XDP_SOCKETS
  bpf: devmap: fix wrong interface selection in notifier_call

9f51ae62

net: diag: document swapped src/dst in udp_dump_one. · 747569b0

Lorenzo Colitti authored Oct 29, 2018

Since its inception, udp_dump_one has had a bug where userspace
needs to swap src and dst addresses and ports in order to find
the socket it wants. This is because it passes the socket source
address to __udp[46]_lib_lookup's saddr argument, but those
functions are intended to find local sockets matching received
packets, so saddr is the remote address, not the local address.

This can no longer be fixed for backwards compatibility reasons,
so add a brief comment explaining that this is the case. This
will avoid confusion and help ensure SOCK_DIAG implementations
of new protocols don't have the same problem.

Fixes: a925aa00 ("udp_diag: Implement the get_exact dumping functionality")
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

747569b0

Merge branch 'macsec-fixes' · 3bdf6bac

David S. Miller authored Oct 28, 2018

Sabrina Dubroca says:

====================
macsec: linkstate fixes

This series fixes issues with handling administrative and operstate of
macsec devices.

Radu Rendec proposed another version of the first patch [0] but I'd
rather not follow the behavior of vlan devices, going with macvlan
does instead. Patrick Talbert also reported the same issue to me.

The second patch is a follow-up. The restriction on setting the device
up is a bit unreasonable, and operstate provides the information we
need in this case.

[0] https://patchwork.ozlabs.org/patch/971374/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3bdf6bac

macsec: let the administrator set UP state even if lowerdev is down · 07bddef9

Sabrina Dubroca authored Oct 28, 2018

Currently, the kernel doesn't let the administrator set a macsec device
up unless its lower device is currently up. This is inconsistent, as a
macsec device that is up won't automatically go down when its lower
device goes down.

Now that linkstate propagation works, there's really no reason for this
limitation, so let's remove it.

Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
Reported-by: Radu Rendec <radu.rendec@gmail.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

07bddef9

macsec: update operstate when lower device changes · e6ac0758

Sabrina Dubroca authored Oct 28, 2018

Like all other virtual devices (macvlan, vlan), the operstate of a
macsec device should match the state of its lower device. This is done
by calling netif_stacked_transfer_operstate from its netdevice notifier.

We also need to call netif_stacked_transfer_operstate when a new macsec
device is created, so that its operstate is set properly. This is only
relevant when we try to bring the device up directly when we create it.

Radu Rendec proposed a similar patch, inspired from the 802.1q driver,
that included changing the administrative state of the macsec device,
instead of just the operstate. This version is similar to what the
macvlan driver does, and updates only the operstate.

Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
Reported-by: Radu Rendec <radu.rendec@gmail.com>
Reported-by: Patrick Talbert <ptalbert@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

e6ac0758

net: sched: gred: pass the right attribute to gred_change_table_def() · 38b4f18d

Jakub Kicinski authored Oct 26, 2018

gred_change_table_def() takes a pointer to TCA_GRED_DPS attribute,
and expects it will be able to interpret its contents as
struct tc_gred_sopt.  Pass the correct gred attribute, instead of
TCA_OPTIONS.

This bug meant the table definition could never be changed after
Qdisc was initialized (unless whatever TCA_OPTIONS contained both
passed netlink validation and was a valid struct tc_gred_sopt...).

Old behaviour:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
RTNETLINK answers: Invalid argument

Now:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0

Fixes: f62d6b93 ("[PKT_SCHED]: GRED: Use central VQ change procedure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38b4f18d

ptp: drop redundant kasprintf() to create worker name · 822c5f73

Rasmus Villemoes authored Oct 26, 2018

Building with -Wformat-nonliteral, gcc complains

drivers/ptp/ptp_clock.c: In function ‘ptp_clock_register’:
drivers/ptp/ptp_clock.c:239:26: warning: format not a string literal and no format arguments [-Wformat-nonliteral]
            worker_name : info->name);

kthread_create_worker takes fmt+varargs to set the name of the
worker, and that happens with a vsnprintf() to a stack buffer (that is
then copied into task_comm). So there's no reason not to just pass
"ptp%d", ptp->index to kthread_create_worker() and avoid the
intermediate worker_name variable.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

822c5f73

net: bridge: remove ipv6 zero address check in mcast queries · 0fe5119e

Nikolay Aleksandrov authored Oct 27, 2018

Recently a check was added which prevents marking of routers with zero
source address, but for IPv6 that cannot happen as the relevant RFCs
actually forbid such packets:
RFC 2710 (MLDv1):
"To be valid, the Query message MUST
 come from a link-local IPv6 Source Address, be at least 24 octets
 long, and have a correct MLD checksum."

Same goes for RFC 3810.

And also it can be seen as a requirement in ipv6_mc_check_mld_query()
which is used by the bridge to validate the message before processing
it. Thus any queries with :: source address won't be processed anyway.
So just remove the check for zero IPv6 source address from the query
processing function.

Fixes: 5a2de63f ("bridge: do not add port to router list when receives query with source 0.0.0.0")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fe5119e

Merge tag 'drm-next-2018-10-24' of git://anongit.freedesktop.org/drm/drm · 53b3b6bb

Linus Torvalds authored Oct 28, 2018

Pull drm updates from Dave Airlie:
 "This is going to rebuild more than drm as it adds a new helper to
  list.h for doing bulk updates. Seemed like a reasonable addition to
  me.

  Otherwise the usual merge window stuff lots of i915 and amdgpu, not so
  much nouveau, and piles of everything else.

  Core:
   - Adds a new list.h helper for doing bulk list updates for TTM.
   - Don't leak fb address in smem_start to userspace (comes with EXPORT
     workaround for people using mali out of tree hacks)
   - udmabuf device to turn memfd regions into dma-buf
   - Per-plane blend mode property
   - ref/unref replacements with get/put
   - fbdev conflicting framebuffers code cleaned up
   - host-endian format variants
   - panel orientation quirk for Acer One 10

  bridge:
   - TI SN65DSI86 chip support

  vkms:
   - GEM support.
   - Cursor support

  amdgpu:
   - Merge amdkfd and amdgpu into one module
   - CEC over DP AUX support
   - Picasso APU support + VCN dynamic powergating
   - Raven2 APU support
   - Vega20 enablement + kfd support
   - ACP powergating improvements
   - ABGR/XBGR display support
   - VCN jpeg support
   - xGMI support
   - DC i2c/aux cleanup
   - Ycbcr 4:2:0 support
   - GPUVM improvements
   - Powerplay and powerplay endian fixes
   - Display underflow fixes

  vmwgfx:
   - Move vmwgfx specific TTM code to vmwgfx
   - Split out vmwgfx buffer/resource validation code
   - Atomic operation rework

  bochs:
   - use more helpers
   - format/byteorder improvements

  qxl:
   - use more helpers

  i915:
   - GGTT coherency getparam
   - Turn off resource streamer API
   - More Icelake enablement + DMC firmware
   - Full PPGTT for Ivybridge, Haswell and Valleyview
   - DDB distribution based on resolution
   - Limited range DP display support

  nouveau:
   - CEC over DP AUX support
   - Initial HDMI 2.0 support

  virtio-gpu:
   - vmap support for PRIME objects

  tegra:
   - Initial Tegra194 support
   - DMA/IOMMU integration fixes

  msm:
   - a6xx perf improvements + clock prefix
   - GPU preemption optimisations
   - a6xx devfreq support
   - cursor support

  rockchip:
   - PX30 support
   - rgb output interface support

  mediatek:
   - HDMI output support on mt2701 and mt7623

  rcar-du:
   - Interlaced modes on Gen3
   - LVDS on R8A77980
   - D3 and E3 SoC support

  hisilicon:
   - misc fixes

  mxsfb:
   - runtime pm support

  sun4i:
   - R40 TCON support
   - Allwinner A64 support
   - R40 HDMI support

  omapdrm:
   - Driver rework changing display pipeline ordering to use common code
   - DMM memory barrier and irq fixes
   - Errata workarounds

  exynos:
   - out-bridge support for LVDS bridge driver
   - Samsung 16x16 tiled format support
   - Plane alpha and pixel blend mode support

  tilcdc:
   - suspend/resume update

  mali-dp:
   - misc updates"

* tag 'drm-next-2018-10-24' of git://anongit.freedesktop.org/drm/drm: (1382 commits)
  firmware/dmc/icl: Add missing MODULE_FIRMWARE() for Icelake.
  drm/i915/icl: Fix signal_levels
  drm/i915/icl: Fix DDI/TC port clk_off bits
  drm/i915/icl: create function to identify combophy port
  drm/i915/gen9+: Fix initial readout for Y tiled framebuffers
  drm/i915: Large page offsets for pread/pwrite
  drm/i915/selftests: Disable shrinker across mmap-exhaustion
  drm/i915/dp: Link train Fallback on eDP only if fallback link BW can fit panel's native mode
  drm/i915: Fix intel_dp_mst_best_encoder()
  drm/i915: Skip vcpi allocation for MSTB ports that are gone
  drm/i915: Don't unset intel_connector->mst_port
  drm/i915: Only reset seqno if actually idle
  drm/i915: Use the correct crtc when sanitizing plane mapping
  drm/i915: Restore vblank interrupts earlier
  drm/i915: Check fb stride against plane max stride
  drm/amdgpu/vcn:Fix uninitialized symbol error
  drm: panel-orientation-quirks: Add quirk for Acer One 10 (S1003)
  drm/amd/amdgpu: Fix debugfs error handling
  drm/amdgpu: Update gc_9_0 golden settings.
  drm/amd/powerplay: update PPtable with DC BTC and Tvr SocLimit fields
  ...

53b3b6bb

28 Oct, 2018 7 commits

Merge tag 'vla-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 746bb4ed

Linus Torvalds authored Oct 28, 2018

Pull VLA removal from Kees Cook:
 "Globally warn on VLA use.

  This turns on "-Wvla" globally now that the last few trees with their
  VLA removals have landed (crypto, block, net, and powerpc).

  Arnd mentioned that there may be a couple more VLAs hiding in
  hard-to-find randconfigs, but nothing big has shaken out in the last
  month or so in linux-next.

  We should be basically VLA-free now! Wheee. :)

  Summary:

   - Remove unused fallback for BUILD_BUG_ON (which technically contains
     a VLA)

   - Lift -Wvla to the top-level Makefile"

* tag 'vla-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  Makefile: Globally enable VLA warning
  compiler.h: give up __compiletime_assert_fallback()

746bb4ed

Merge tag 'kbuild-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · ac747c07

Linus Torvalds authored Oct 28, 2018

Pull Kbuild updates from Masahiro Yamada:

 - optimize kallsyms slightly

 - remove check for old CFLAGS usage

 - add some compiler flags unconditionally instead of evaluating
   $(call cc-option,...)

 - fix variable shadowing in host tools

 - refactor scripts/mkmakefile

 - refactor various makefiles

* tag 'kbuild-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  modpost: Create macro to avoid variable shadowing
  ASN.1: Remove unnecessary shadowed local variable
  kbuild: use 'else ifeq' for checksrc to improve readability
  kbuild: remove unneeded link_multi_deps
  kbuild: add -Wno-unused-but-set-variable flag unconditionally
  kbuild: add -Wdeclaration-after-statement flag unconditionally
  kbuild: add -Wno-pointer-sign flag unconditionally
  modpost: remove leftover symbol prefix handling for module device table
  kbuild: simplify command line creation in scripts/mkmakefile
  kbuild: do not pass $(objtree) to scripts/mkmakefile
  kbuild: remove user ID check in scripts/mkmakefile
  kbuild: remove VERSION and PATCHLEVEL from $(objtree)/Makefile
  kbuild: add --include-dir flag only for out-of-tree build
  kbuild: remove dead code in cmd_files calculation in top Makefile
  kbuild: hide most of targets when running config or mixed targets
  kbuild: remove old check for CFLAGS use
  kbuild: prefix Makefile.dtbinst path with $(srctree) unconditionally
  kallsyms: remove left-over Blackfin code
  kallsyms: reduce size a little on 64-bit

ac747c07

Merge tag 'linux-kselftest-4.20-rc1' of... · f8cab69b

Linus Torvalds authored Oct 28, 2018

Merge tag 'linux-kselftest-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest updates from Shuah Khan:
 "This Kselftest update for Linux 4.20-rc1 consists of:

   - Improvements to ftrace test suite from Masami Hiramatsu.

   - Color coded ftrace PASS / FAIL results from Steven Rostedt (VMware)
     to improve readability of reports.

   - watchdog Fixes and enhancement to add gettimeout and get|set
     pretimeout options from Jerry Hoemann.

   - Several fixes to warnings and spelling etc"

* tag 'linux-kselftest-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (40 commits)
  selftests/ftrace: Strip escape sequences for log file
  selftests/ftrace: Use colored output when available
  selftests: fix warning: "_GNU_SOURCE" redefined
  selftests: kvm: Fix -Wformat warnings
  selftests/ftrace: Add color to the PASS / FAIL results
  kvm: selftests: fix spelling mistake "Insufficent" -> "Insufficient"
  selftests: gpio: Fix OUTPUT directory in Makefile
  selftests: gpio: restructure Makefile
  selftests: watchdog: Fix ioctl SET* error paths to take oneshot exit path
  selftests: watchdog: Add gettimeout and get|set pretimeout
  selftests: watchdog: Fix error message.
  selftests: watchdog: fix message when /dev/watchdog open fails
  selftests/ftrace: Add ftrace cpumask testcase
  selftests/ftrace: Add wakeup_rt tracer testcase
  selftests/ftrace: Add wakeup tracer testcase
  selftests/ftrace: Add stacktrace ftrace filter command testcase
  selftests/ftrace: Add trace_pipe testcase
  selftests/ftrace: Add function filter on module testcase
  selftests/ftrace: Add max stack tracer testcase
  selftests/ftrace: Add function profiling stat testcase
  ...

f8cab69b

Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax · dad4f140

Linus Torvalds authored Oct 28, 2018

Pull XArray conversion from Matthew Wilcox:
 "The XArray provides an improved interface to the radix tree data
  structure, providing locking as part of the API, specifying GFP flags
  at allocation time, eliminating preloading, less re-walking the tree,
  more efficient iterations and not exposing RCU-protected pointers to
  its users.

  This patch set

   1. Introduces the XArray implementation

   2. Converts the pagecache to use it

   3. Converts memremap to use it

  The page cache is the most complex and important user of the radix
  tree, so converting it was most important. Converting the memremap
  code removes the only other user of the multiorder code, which allows
  us to remove the radix tree code that supported it.

  I have 40+ followup patches to convert many other users of the radix
  tree over to the XArray, but I'd like to get this part in first. The
  other conversions haven't been in linux-next and aren't suitable for
  applying yet, but you can see them in the xarray-conv branch if you're
  interested"

* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
  radix tree: Remove multiorder support
  radix tree test: Convert multiorder tests to XArray
  radix tree tests: Convert item_delete_rcu to XArray
  radix tree tests: Convert item_kill_tree to XArray
  radix tree tests: Move item_insert_order
  radix tree test suite: Remove multiorder benchmarking
  radix tree test suite: Remove __item_insert
  memremap: Convert to XArray
  xarray: Add range store functionality
  xarray: Move multiorder_check to in-kernel tests
  xarray: Move multiorder_shrink to kernel tests
  xarray: Move multiorder account test in-kernel
  radix tree test suite: Convert iteration test to XArray
  radix tree test suite: Convert tag_tagged_items to XArray
  radix tree: Remove radix_tree_clear_tags
  radix tree: Remove radix_tree_maybe_preload_order
  radix tree: Remove split/join code
  radix tree: Remove radix_tree_update_node_t
  page cache: Finish XArray conversion
  dax: Convert page fault handlers to XArray
  ...

dad4f140

net: Properly unlink GRO packets on overflow. · ece23711

David S. Miller authored Oct 28, 2018

Just like with normal GRO processing, we have to initialize
skb->next to NULL when we unlink overflow packets from the
GRO hash lists.

Fixes: d4546c25 ("net: Convert GRO SKB handling to list_head.")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: David S. Miller <davem@davemloft.net>

ece23711

modpost: Create macro to avoid variable shadowing · c2b1a922

Leonardo Bras authored Oct 24, 2018

Create DEF_FIELD_ADDR_VAR as a more generic version of the DEF_FIELD_ADD
macro, allowing usage of a variable name other than the struct element name.
Also, sets DEF_FIELD_ADDR as a specific usage of DEF_FILD_ADDR_VAR in which
the var name is the same as the struct element name.
Then, makes use of DEF_FIELD_ADDR_VAR to create a variable of another name,
in order to avoid variable shadowing.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>

c2b1a922

ASN.1: Remove unnecessary shadowed local variable · 9e1e8194

Leonardo Bras authored Oct 24, 2018

Remove an unnecessary shadowed local variable (start).
It was used only once, with the same value it was started before
the if block.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>

9e1e8194

27 Oct, 2018 10 commits

HID: we do not randomly make new drivers 'default y' · 69d5b97c

Linus Torvalds authored Oct 27, 2018

.. even when that "default y" is hidden syntactically as a

	default !EXPERT

it's wrong.

The only reason something should be 'default y' is if it used to be
built-in, and it was made configurable, and the 'default y' is just
retaining the status quo.

Altheratively, the hardware for the driver has become _so_ common that
it really makes sense for everybody to build it.  Finally, one possible
reason for 'default y' is because the option is not enabling any new
code at all, but is just enabling other options (the networking people
do this for vendor options, for example, so that you can disable whole
vendors at a time).

Clearly, none of these cases hold for the BigBen Interactive Kids'
gamepad, and HID_BIGBEN_FF should thus most definitely not default
to on for everybody.

Cc: Hanno Zulla <kontakt@hanno.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

69d5b97c

Merge tag 'linux-watchdog-4.20-rc1' of git://www.linux-watchdog.org/linux-watchdog · 5ecf3e11

Linus Torvalds authored Oct 27, 2018

Pull watchdog updates from Wim Van Sebroeck:

 - Add Armada 37xx CPU watchdog

 - w83627hf_wdt: Add Support for NCT6796D, NCT6797D, NCT6798D

 - hpwdt: several improvements

 - renesas_wdt: SPDX identifiers, stop when unregistering, support for
   R7S9210

 - rza_wdt: SPDX identifiers, support longer timeouts

 - core: fix null pointer dereference when releasing cdev

 - iTCO_wdt: Drop option vendorsupport=2

 - sama5d4: fix timeout-sec usage

 - lantiq_wdt: convert to watchdog framework

 - several small fixes

* tag 'linux-watchdog-4.20-rc1' of git://www.linux-watchdog.org/linux-watchdog: (30 commits)
  watchdog: ts4800: release syscon device node in ts4800_wdt_probe()
  watchdog: armada_37xx_wdt: use do_div for u64 division
  documentation: watchdog: add documentation for armada-37xx-wdt
  dt-bindings: watchdog: Document armada-37xx-wdt binding
  watchdog: Add support for Armada 37xx CPU watchdog
  dt-bindings: watchdog: add mpc8xxx-wdt support
  watchdog: mpc8xxx: provide boot status
  MAINTAINERS: Fix file pattern for MEN Z069 watchdog driver
  dt-bindings: watchdog: renesas-wdt: Add support for R7S9210
  watchdog: rza_wdt: Support longer timeouts
  watchdog: hpwdt: Disable PreTimeout when Timeout is smaller
  watchdog: w83627hf_wdt: Support NCT6796D, NCT6797D, NCT6798D
  watchdog: mpc8xxx: use dev_xxxx() instead of pr_xxxx()
  watchdog: lantiq: add get_timeleft callback
  watchdog: lantiq: Convert to watchdog_device
  watchdog: lantiq: update register names to better match spec
  watchdog: sama5d4: fix timeout-sec usage
  watchdog: fix a small number of "watchog" typos in comments
  watchdog: rza_wdt: convert to SPDX identifiers
  watchdog: iTCO_wdt: Remove unused hooks
  ...

5ecf3e11

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · ed3f4e23

Linus Torvalds authored Oct 27, 2018

Pull input updates from Dmitry Torokhov:
 "Just random driver fixups, nothing exiting"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: synaptics - avoid using uninitialized variable when probing
  Input: xen-kbdfront - mark expected switch fall-through
  Input: atmel_mxt_ts - mark expected switch fall-through
  Input: cyapa - mark expected switch fall-throughs
  Input: wm97xx-ts - fix exit path
  Input: of_touchscreen - add support for touchscreen-min-x|y
  Input: Fix DIR-685 touchkeys MAINTAINERS entry
  Input: elants_i2c - use DMA safe i2c when possible
  Input: silead - try firmware reload after unsuccessful resume
  Input: st1232 - set INPUT_PROP_DIRECT property
  Input: xilinx_ps2 - convert to using %pOFn instead of device_node.name
  Input: atmel_mxt_ts - fix multiple <linux/property.h> includes
  Input: sun4i-lradc - convert to using %pOFn instead of device_node.name
  Input: pwm-vibrator - correct pwms in DT binding example

ed3f4e23

Merge tag 'rtc-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · c7b7eefa

Linus Torvalds authored Oct 27, 2018

Pull RTC updates from Alexandre Belloni:
 "This cycle, there were mostly non urgent fixes in drivers. I also
  finally unexported the non managed registration.

  Subsystem:

   - non devm managed registration is now removed from the driver API

   - all the unnecessary rtc_valid_tm() calls have been removed

  Drivers:

   - abx80X: watchdog support

   - cmos: fix non ACPI support

   - sc27xx: fix alarm support

   - Remove a possible sysfs race condition for ab8500, ds1307, ds1685,
     isl1208

   - Fix a possible race condition where an irq handler may be called
     before the rtc_device struct is allocated for mt6397, pl030,
     menelaus, armada38x"

* tag 'rtc-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (54 commits)
  rtc: sc27xx: Always read normal alarm when registering RTC device
  rtc: sc27xx: Add check to see if need to enable the alarm interrupt
  rtc: sc27xx: Remove interrupts disable and clear in probe()
  rtc: sc27xx: Clear SPG value update interrupt status
  rtc: sc27xx: Set wakeup capability before registering rtc device
  rtc: s35390a: Change buf's type to u8 in s35390a_init
  rtc: ds1307: fix ds1339 wakealarm support
  rtc: ds1685: simplify getting .driver_data
  rtc: m41t80: mark expected switch fall-through
  rtc: tegra: Propagate errors from platform_get_irq()
  rtc: cmos: Remove the `use_acpi_alarm' module parameter for !ACPI
  rtc: cmos: Fix non-ACPI undefined reference to `hpet_rtc_interrupt'
  rtc: mv: let the core handle invalid alarms
  rtc: vr41xx: switch to rtc_time64_to_tm/rtc_tm_to_time64
  rtc: ab8500: remove useless check
  rtc: ab8500: let the core handle range
  rtc: ab8500: use rtc_add_group
  rtc: rs5c348: report error when time is invalid
  rtc: rs5c348: remove forward declaration
  rtc: rs5c348: remove useless label
  ...

c7b7eefa

Merge tag 'led-fix-for-4.20-rc1' of... · e5585453

Linus Torvalds authored Oct 27, 2018

Merge tag 'led-fix-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds

Pull LED fix from Jacek Anaszewski.

* tag 'led-fix-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
  leds: gpio: set led_dat->gpiod pointer for OF defined GPIO leds

e5585453

i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array · b59dfdae

Linus Torvalds authored Oct 27, 2018

Commit 9ee3e066 ("HID: i2c-hid: override HID descriptors for certain
devices") added a new dmi_system_id quirk table to override certain HID
report descriptors for some systems that lack them.

But the table wasn't properly terminated, causing the dmi matching to
walk off into la-la-land, and starting to treat random data as dmi
descriptor pointers, causing boot-time oopses if you were at all
unlucky.

Terminate the array.

We really should have some way to just statically check that arrays that
should be terminated by an empty entry actually are so.  But the HID
people really should have caught this themselves, rather than have me
deal with an oops during the merge window.  Tssk, tssk.

Cc: Julian Sax <jsbc@gmx.de>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b59dfdae

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 6788fac8

David S. Miller authored Oct 26, 2018

Daniel Borkmann says:

====================
pull-request: bpf 2018-10-27

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix toctou race in BTF header validation, from Martin and Wenwen.

2) Fix devmap interface comparison in notifier call which was
   neglecting netns, from Taehee.

3) Several fixes in various places, for example, correcting direct
   packet access and helper function availability, from Daniel.

4) Fix BPF kselftest config fragment to include af_xdp and sockmap,
   from Naresh.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6788fac8

Merge branch 'akpm' (patches from Andrew) · 345671ea

Linus Torvalds authored Oct 26, 2018

Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  hugetlbfs: dirty pages as they are added to pagecache
  mm: export add_swap_extent()
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
  mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
  mm/gup: cache dev_pagemap while pinning pages
  Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  mm: return zero_resv_unavail optimization
  mm: zero remaining unavailable struct pages
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
  tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
  tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
  mm/gup_benchmark.c: add additional pinning methods
  mm/gup_benchmark.c: time put_page()
  mm: don't raise MEMCG_OOM event due to failed high-order allocation
  mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
  ...

345671ea

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 49040081

Linus Torvalds authored Oct 26, 2018

Pull networking fixes from David Miller:
 "What better way to start off a weekend than with some networking bug
  fixes:

  1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed
     by David Ahern and Bjørn Mork.

  2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE
     properly in UDP, from Sean Tranchetti.

  3) Remove TCA_OPTIONS from policy validation, it turns out we don't
     consistently use nested attributes for this across all packet
     schedulers. From David Ahern.

  4) Fix SKB corruption in cadence driver, from Tristram Ha.

  5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit.

  6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
  net/neigh: fix NULL deref in pneigh_dump_table()
  net: allow traceroute with a specified interface in a vrf
  bridge: do not add port to router list when receives query with source 0.0.0.0
  net/smc: fix smc_buf_unuse to use the lgr pointer
  ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
  net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
  lan743x: Remove SPI dependency from Microchip group.
  drivers: net: remove <net/busy_poll.h> inclusion when not needed
  net: phy: genphy_10g_driver: Avoid NULL pointer dereference
  r8169: fix broken Wake-on-LAN from S5 (poweroff)
  octeontx2-af: Use GFP_ATOMIC under spin lock
  net: ethernet: cadence: fix socket buffer corruption problem
  net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
  net: sched: Remove TCA_OPTIONS from policy
  ice: Poll for link status change
  ice: Allocate VF interrupts and set queue map
  ice: Introduce ice_dev_onetime_setup
  net: hns3: Fix for warning uninitialized symbol hw_err_lst3
  octeontx2-af: Copy the right amount of memory
  net: udp: fix handling of CHECKSUM_COMPLETE packets
  ...

49040081

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · a45dcff7

Linus Torvalds authored Oct 26, 2018

Pull sparc fixes from David Miller:
 "Some more sparc fixups, mostly aimed at getting the allmodconfig build
  up and clean again"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Rework xchg() definition to avoid warnings.
  sparc64: Export __node_distance.
  sparc64: Make corrupted user stacks more debuggable.

a45dcff7

26 Oct, 2018 11 commits

hugetlbfs: dirty pages as they are added to pagecache · 22146c3c

Mike Kravetz authored Oct 26, 2018

Some test systems were experiencing negative huge page reserve counts and
incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
removing clean pages from hugetlbfs file pagecaches.  When non-hugetlbfs
explicit code removes the pages, the appropriate accounting is not
performed.

This can be recreated as follows:
 fallocate -l 2M /dev/hugepages/foo
 echo 1 > /proc/sys/vm/drop_caches
 fallocate -l 2M /dev/hugepages/foo
 grep -i huge /proc/meminfo
   AnonHugePages:         0 kB
   ShmemHugePages:        0 kB
   HugePages_Total:    2048
   HugePages_Free:     2047
   HugePages_Rsvd:    18446744073709551615
   HugePages_Surp:        0
   Hugepagesize:       2048 kB
   Hugetlb:         4194304 kB
 ls -lsh /dev/hugepages/foo
   4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

To address this issue, dirty pages as they are added to pagecache.  This
can easily be reproduced with fallocate as shown above.  Read faulted
pages will eventually end up being marked dirty.  But there is a window
where they are clean and could be impacted by code such as drop_caches.
So, just dirty them all as they are added to the pagecache.

Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Mihcla Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

22146c3c

mm: export add_swap_extent() · aa8aa8a3

Omar Sandoval authored Oct 26, 2018

Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices.  See 35054394
("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").

However, the swap code has a mechanism for the filesystem to manually add
swap extents using add_swap_extent() from the ->swap_activate() aop.
iomap has done this since 67482129 ("iomap: add a swapfile activation
function").  Btrfs will do the same in a later patch, so export
add_swap_extent().

Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.comSigned-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Sterba <dsterba@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aa8aa8a3

mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS · bc4ae27d

Omar Sandoval authored Oct 26, 2018

The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
through the filesystem, and to make swapoff() call ->swap_deactivate().
For Btrfs, we want the latter but not the former, so split this flag into
two. This makes us always call ->swap_deactivate() if ->swap_activate()
succeeded, not just if it didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.

Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.comSigned-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bc4ae27d

tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE · 91cbacc3

Michael Ellerman authored Oct 26, 2018

Add a test for MAP_FIXED_NOREPLACE, based on some code originally by Jann
Horn.  This would have caught the overlap bug reported by Daniel Micay.

I originally suggested to Michal that we create MAP_FIXED_NOREPLACE, but
instead of writing a selftest I spent my time bike-shedding whether it
should be called MAP_FIXED_SAFE/NOCLOBBER/WEAK/NEW ..  mea culpa.

Link: http://lkml.kernel.org/r/20181013133929.28653-1-mpe@ellerman.id.auSigned-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

91cbacc3

mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() · 7eef5f97

Andrea Arcangeli authored Oct 26, 2018

There should be no cache left by the time we overwrite the old transhuge
pmd with the new one.  It's already too late to flush through the virtual
address because we already copied the page data to the new physical
address.

So flush the cache before the data copy.

Also delete the "end" variable to shutoff a "unused variable" warning on
x86 where flush_cache_range() is a noop.

Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.comSigned-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7eef5f97

mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() · 7066f0f9

Andrea Arcangeli authored Oct 26, 2018

change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
right away.  do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
runs some CPU could still access the page through the TLB.

change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start().  So there's no need of
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() too, because by the time
migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
invalidated in the secondary MMUs.  It has to or if a secondary MMU can
still write to the page, the migrate_page_copy() would lose data.

However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing the
primary MMU pagetables and only implementing ->invalidate_range:

CPU0		CPU1		GPU sharing linux pagetables using
                                only ->invalidate_range
-----------	------------	---------
				GPU secondary MMU writes to the page
				mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
		do_huge_pmd_numa_page()
		CPU TLB flush globally (1)
		CPU cannot write to page
		migrate_misplaced_transhuge_page()
				GPU writes to the page...
		migrate_page_copy()
				...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_stop() noop
->invalidate_range()
				GPU secondary MMU is invalidated
				and cannot write to the page anymore
				(too late)

Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
too late, we also need a mmu_notifier_invalidate_range() before calling
migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
(3) also arrives too late.

This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing the
TLB and without first calling mmu_notifier_invalidate_range().

Even converting the removed mmu_notifier_invalidate_range_only_end() into
a mmu_notifier_invalidate_range_end() would not have been enough to fix
this, because it run after migrate_page_copy().

After the hugepage data copy is done migrate_misplaced_transhuge_page()
can proceed and call set_pmd_at without having to flush the TLB nor any
secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
flush, has to happen before the migrate_page_copy() is called or it would
be a bug in the first place (and it was for drivers using
->invalidate_range()).

KVM is unaffected because it doesn't implement ->invalidate_range().

The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
and all mmu notifiers there before copying the page.

Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.comSigned-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7066f0f9

mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition · d7c33934

Andrea Arcangeli authored Oct 26, 2018

Patch series "migrate_misplaced_transhuge_page race conditions".

Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants, that was apparently left unfixed.

While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.

These race conditions would not cause kernel instability, but they'd
corrupt userland data or leave data non zero after MADV_DONTNEED.

I did only minor testing, and I don't expect to be able to reproduce this
(especially the lack of ->invalidate_range before migrate_page_copy,
requires the latest iommu hardware or infiniband to reproduce).  The last
patch is noop for x86 and it needs further review from maintainers of
archs that implement flush_cache_range() (not in CC yet).

To avoid confusion, it's not the first patch that introduces the bug fixed
in the second patch, even before removing the
pmdp_huge_clear_flush_notify, that _notify suffix was called after
migrate_page_copy already run.

This patch (of 3):

This is a corollary of ced10803 ("thp: fix MADV_DONTNEED vs.  numa
balancing race"), 58ceeb6b ("thp: fix MADV_DONTNEED vs.  MADV_FREE
race") and 5b7abeae ("thp: fix MADV_DONTNEED vs clear soft dirty
race).

When the above three fixes where posted Dave asked
https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
but apparently this was missed.

The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
in a54a407f ("mm: Close races between THP migration and PMD numa
clearing").

The important part of such commit is only the part where the page lock is
not released until the first do_huge_pmd_numa_page() finished disarming
the pagenuma/protnone.

The addition of pmdp_clear_flush() wasn't beneficial to such commit and
there's no commentary about such an addition either.

I guess the pmdp_clear_flush() in such commit was added just in case for
safety, but it ended up introducing the MADV_DONTNEED race condition found
by Aaron.

At that point in time nobody thought of such kind of MADV_DONTNEED race
conditions yet (they were fixed later) so the code may have looked more
robust by adding the pmdp_clear_flush().

This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be zeroed
out.

This also optimizes the code and removes a superfluous TLB flush.

[akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.comSigned-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d7c33934

mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t · 026d1eaf

Clark Williams authored Oct 26, 2018

The static lock quarantine_lock is used in quarantine.c to protect the
quarantine queue datastructures.  It is taken inside quarantine queue
manipulation routines (quarantine_put(), quarantine_reduce() and
quarantine_remove_cache()), with IRQs disabled.  This is not a problem on
a stock kernel but is problematic on an RT kernel where spin locks are
sleeping spinlocks, which can sleep and can not be acquired with disabled
interrupts.

Convert the quarantine_lock to a raw spinlock_t.  The usage of
quarantine_lock is confined to quarantine.c and the work performed while
the lock is held is used for debug purpose.

[bigeasy@linutronix.de: slightly altered the commit message]
Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.deSigned-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

026d1eaf

mm/gup: cache dev_pagemap while pinning pages · df06b37f

Keith Busch authored Oct 26, 2018

Getting pages from ZONE_DEVICE memory needs to check the backing device's
live-ness, which is tracked in the device's dev_pagemap metadata. This
metadata is stored in a radix tree and looking it up adds measurable
software overhead.

This patch avoids repeating this relatively costly operation when
dev_pagemap is used by caching the last dev_pagemap while getting user
pages. The gup_benchmark kernel self test reports this reduces time to
get user pages to as low as 1/3 of the previous time.

Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.comSigned-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

df06b37f

Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" · 9fd61bc9

Masayoshi Mizuma authored Oct 26, 2018

commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") breaks movable_node kernel option because it changed
the memory gap range to reserved memblock.  So, the node is marked as
Normal zone even if the SRAT has Hot pluggable affinity.

    =====================================================================
    kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
    kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
    ...
    kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
    ...
    kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
    kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
    ...
    kernel: Movable zone start for each node
    kernel:  Node 3: 0x00001c0000000000
    kernel: Early memory node ranges
    ...
    =====================================================================

The original issue is fixed by the former patches, so let's revert commit
124049de ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved").

Link: http://lkml.kernel.org/r/20181002143821.5112-4-msys.mizuma@gmail.comSigned-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9fd61bc9

mm: return zero_resv_unavail optimization · ec393a0f

Pavel Tatashin authored Oct 26, 2018

When checking for valid pfns in zero_resv_unavail(), it is not necessary
to verify that pfns within pageblock_nr_pages ranges are valid, only the
first one needs to be checked.  This is because memory for pages are
allocated in contiguous chunks that contain pageblock_nr_pages struct
pages.

Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.comSigned-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ec393a0f