1. 12 Apr, 2018 4 commits
    • lockref: Add lockref_put_not_zero · 450b1f6f
      Andreas Gruenbacher authored
      Put a lockref unless the lockref is dead or its count would become zero.
      This is the same as lockref_put_or_lock except that the lock is never
      left held.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      450b1f6f
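      The semantics described in this commit can be sketched in userspace C.
      This is only an illustration, not the kernel implementation: the type
      `sketch_ref` and the function `sketch_put_not_zero()` are hypothetical
      names, only the count is modelled (no spinlock), and a negative count is
      assumed to mean "dead".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lockref: only the count is modelled.
 * Assumption: a negative count marks the reference as dead. */
struct sketch_ref {
    atomic_int count;
};

/* Drop one reference unless the ref is dead (count < 0) or the drop
 * would take the count to zero. Returns true if the count was
 * decremented. No lock is ever taken or left held. */
static bool sketch_put_not_zero(struct sketch_ref *ref)
{
    int old = atomic_load(&ref->count);

    while (old > 1) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&ref->count, &old, old - 1))
            return true;
    }
    return false;
}
```

      A caller that gets `false` back still owns its reference and has to
      fall back to a slower path of its own choosing.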
    • Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · e241e3f2
      Linus Torvalds authored
      Pull virtio update from Michael Tsirkin:
       "This adds reporting hugepage stats to virtio-balloon"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
        virtio_balloon: export hugetlb page allocation counts
      e241e3f2
    • Merge tag 'iommu-updates-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · e5c37228
      Linus Torvalds authored
      Pull IOMMU updates from Joerg Roedel:
      
       - OF_IOMMU support for the Rockchip iommu driver so that it can use
         generic DT bindings
      
       - rework of locking in the AMD IOMMU interrupt remapping code to make
         it work better in RT kernels
      
       - support for improved iotlb flushing in the AMD IOMMU driver
      
       - support for 52-bit physical and virtual addressing in the ARM-SMMU
      
       - various other small fixes and cleanups
      
      * tag 'iommu-updates-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (53 commits)
        iommu/io-pgtable-arm: Avoid warning with 32-bit phys_addr_t
        iommu/rockchip: Support sharing IOMMU between masters
        iommu/rockchip: Add runtime PM support
        iommu/rockchip: Fix error handling in init
        iommu/rockchip: Use OF_IOMMU to attach devices automatically
        iommu/rockchip: Use IOMMU device for dma mapping operations
        dt-bindings: iommu/rockchip: Add clock property
        iommu/rockchip: Control clocks needed to access the IOMMU
        iommu/rockchip: Fix TLB flush of secondary IOMMUs
        iommu/rockchip: Use iopoll helpers to wait for hardware
        iommu/rockchip: Fix error handling in attach
        iommu/rockchip: Request irqs in rk_iommu_probe()
        iommu/rockchip: Fix error handling in probe
        iommu/rockchip: Prohibit unbind and remove
        iommu/amd: Return proper error code in irq_remapping_alloc()
        iommu/amd: Make amd_iommu_devtable_lock a spin_lock
        iommu/amd: Drop the lock while allocating new irq remap table
        iommu/amd: Factor out setting the remap table for a devid
        iommu/amd: Use `table' instead `irt' as variable name in amd_iommu_update_ga()
        iommu/amd: Remove the special case from alloc_irq_table()
        ...
      e5c37228
    • Merge tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 1fe43114
      Linus Torvalds authored
      Pull more power management updates from Rafael Wysocki:
       "These include one big-ticket item which is the rework of the idle loop
        in order to prevent CPUs from spending too much time in shallow idle
        states. It reduces idle power on some systems by 10% or more and may
        improve performance of workloads in which the idle loop overhead
        matters. This has been in the works for several weeks and it has been
        tested and reviewed quite thoroughly.
      
        Also included are changes that finalize the cpufreq cleanup moving
        frequency table validation from drivers to the core, a few fixes and
        cleanups of cpufreq drivers, a cpuidle documentation update and a PM
        QoS core update to mark the expected switch fall-throughs in it.
      
        Specifics:
      
         - Rework the idle loop in order to prevent CPUs from spending too
           much time in shallow idle states by making it stop the scheduler
           tick before putting the CPU into an idle state only if the idle
           duration predicted by the idle governor is long enough.
      
           That required the code to be reordered to invoke the idle governor
           before stopping the tick, among other things (Rafael Wysocki,
           Frederic Weisbecker, Arnd Bergmann).
      
         - Add the missing description of the residency sysfs attribute to the
           cpuidle documentation (Prashanth Prakash).
      
         - Finalize the cpufreq cleanup moving frequency table validation from
           drivers to the core (Viresh Kumar).
      
         - Fix a clock leak regression in the armada-37xx cpufreq driver
           (Gregory Clement).
      
         - Fix the initialization of the CPU performance data structures for
           shared policies in the CPPC cpufreq driver (Shunyong Yang).
      
         - Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers a
           bit (Viresh Kumar, Rafael Wysocki).
      
         - Mark the expected switch fall-throughs in the PM QoS core (Gustavo
           Silva)"
      
      * tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
        tick-sched: avoid a maybe-uninitialized warning
        cpufreq: Drop cpufreq_table_validate_and_show()
        cpufreq: SCMI: Don't validate the frequency table twice
        cpufreq: CPPC: Initialize shared perf capabilities of CPUs
        cpufreq: armada-37xx: Fix clock leak
        cpufreq: CPPC: Don't set transition_latency
        cpufreq: ti-cpufreq: Use builtin_platform_driver()
        cpufreq: intel_pstate: Do not include debugfs.h
        PM / QoS: mark expected switch fall-throughs
        cpuidle: Add definition of residency to sysfs documentation
        time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
        nohz: Avoid duplication of code related to got_idle_tick
        nohz: Gather tick_sched booleans under a common flag field
        cpuidle: menu: Avoid selecting shallow states with stopped tick
        cpuidle: menu: Refine idle state selection for running tick
        sched: idle: Select idle state before stopping the tick
        time: hrtimer: Introduce hrtimer_next_event_without()
        time: tick-sched: Split tick_nohz_stop_sched_tick()
        cpuidle: Return nohz hint from cpuidle_select()
        jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
        ...
      1fe43114
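      The idle loop rework described above (select the idle state first, then
      stop the tick only when the predicted idle duration is long enough) can
      be sketched as follows. This is a simplified illustration, not the
      kernel code: all names are hypothetical, the 4 ms tick period is an
      assumed value, and the states are assumed to be sorted from shallowest
      to deepest.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a 4 ms scheduler tick (HZ=250). */
#define SKETCH_TICK_NS 4000000ULL

struct sketch_idle_decision {
    int state;       /* index of the chosen idle state */
    bool stop_tick;  /* whether the tick should be stopped */
};

/* Pick the deepest state whose target residency fits the predicted idle
 * duration, then decide about the tick from the same prediction.
 * target_residency_ns[] is assumed to be in increasing order. */
static struct sketch_idle_decision
sketch_idle_select(uint64_t predicted_idle_ns,
                   const uint64_t *target_residency_ns, int nr_states)
{
    struct sketch_idle_decision d = { .state = 0, .stop_tick = false };

    for (int i = 0; i < nr_states; i++) {
        if (target_residency_ns[i] <= predicted_idle_ns)
            d.state = i;
    }

    /* Stopping the tick only pays off for long enough idle periods. */
    d.stop_tick = predicted_idle_ns >= SKETCH_TICK_NS;
    return d;
}
```

      The point of the ordering is that the tick decision uses the governor's
      prediction, so a short predicted idle period keeps the tick running and
      the CPU is woken again soon if the prediction was wrong.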
  2. 11 Apr, 2018 36 commits
    • Merge tag 'ktest-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest · 96973767
      Linus Torvalds authored
      Pull ktest updates from Steven Rostedt:
       "These commits have either been sitting in my INBOX or have been in my
        local tree for some time. I need to push them upstream:
      
         - Separate out config-bisect.pl from ktest.pl.
      
           This allows users to do config bisects without full ktest setup.
      
         - Email on status change.
      
           Allow the user to be emailed on test start, finish, failure, etc.
      
         - Other small fixes and enhancements"
      
      * tag 'ktest-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest: (24 commits)
        ktest: Take submenu into account for grub2 menus
        ktest.pl: Add MAIL_COMMAND option to define how to send email
        ktest.pl: Use run_command to execute sending mail
        ktest.pl: Allow dodie be recursive
        ktest.pl: Kill test if mailer is not supported
        ktest.pl: Add MAIL_PATH option to define where to find the mailer
        ktest.pl: No need to print no mailer is specified when mailto is not
        Ktest: add email options to sample.config
        Ktest: Use dodie for critical falures
        Ktest: Add SigInt handling
        Ktest: Add email support
        ktest.pl: Detect if a config-bisect was interrupted
        ktest.pl: Make finding config-bisect.pl dynamic
        ktest.pl: Have ktest.pl pass -r to config-bisect.pl to reset bisect
        ktest.pl: Use diffconfig if available for failed config bisects
        ktest.pl: Allow for the config-bisect.pl output to display to console
        ktest: Use config-bisect.pl in ktest.pl
        ktest: Add standalone config-bisect.pl program
        ktest: Set do_not_reboot=y for CONFIG_BISECT_TYPE=build
        ktest: Set buildonly=1 for CONFIG_BISECT_TYPE=build
        ...
      96973767
    • Merge tag 'tags/upstream-4.17-rc1' of git://git.infradead.org/linux-ubifs · 77cb51e6
      Linus Torvalds authored
      Pull UBI and UBIFS updates from Richard Weinberger:
       "Minor bug fixes and improvements"
      
      * tag 'tags/upstream-4.17-rc1' of git://git.infradead.org/linux-ubifs:
        ubi: Reject MLC NAND
        ubifs: Remove useless parameter of lpt_heap_replace
        ubifs: Constify struct ubifs_lprops in scan_for_leb_for_idx
        ubifs: remove unnecessary assignment
        ubi: Fix error for write access
        ubi: fastmap: Don't flush fastmap work on detach
        ubifs: Check ubifs_wbuf_sync() return code
      77cb51e6
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · 375479c3
      Linus Torvalds authored
      Pull UML updates from Richard Weinberger:
      
       - a new and faster epoll based IRQ controller and NIC driver
      
       - misc fixes and janitorial updates
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        Fix vector raw inintialization logic
        Migrate vector timers to new timer API
        um: Compile with modern headers
        um: vector: Fix an error handling path in 'vector_parse()'
        um: vector: Fix a memory allocation check
        um: vector: fix missing unlock on error in vector_net_open()
        um: Add missing EXPORT for free_irq_by_fd()
        High Performance UML Vector Network Driver
        Epoll based IRQ controller
        um: Use POSIX ucontext_t instead of struct ucontext
        um: time: Use timespec64 for persistent clock
        um: Restore symbol versions for __memcpy and memcpy
      375479c3
    • Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 45df60cd
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "Here is a very small set of fixes for inclusion in linux-4.17-rc1: Two
        changes for the maintainer file, and one more fix for the newly added
        npcm platform, to enable the level 2 cache controller"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        MAINTAINERS: Update ASPEED entry with details
        MAINTAINERS: Migrate oxnas list to groups.io
        arm: npcm: enable L2 cache in NPCM7xx architecture
      45df60cd
    • Merge tag 'nios2-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 · b82b6813
      Linus Torvalds authored
      Pull nios2 update from Ley Foon Tan:
       "Use read_persistent_clock64() instead of read_persistent_clock()"
      
      * tag 'nios2-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
        nios2: Use read_persistent_clock64() instead of read_persistent_clock()
      b82b6813
    • Merge branch 'akpm' (patches from Andrew) · 8837c70d
      Linus Torvalds authored
      Merge more updates from Andrew Morton:
      
       - almost all of the rest of MM
      
       - kasan updates
      
       - lots of procfs work
      
       - misc things
      
       - lib/ updates
      
       - checkpatch
      
       - rapidio
      
       - ipc/shm updates
      
       - the start of willy's XArray conversion
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (140 commits)
        page cache: use xa_lock
        xarray: add the xa_lock to the radix_tree_root
        fscache: use appropriate radix tree accessors
        export __set_page_dirty
        unicore32: turn flush_dcache_mmap_lock into a no-op
        arm64: turn flush_dcache_mmap_lock into a no-op
        mac80211_hwsim: use DEFINE_IDA
        radix tree: use GFP_ZONEMASK bits of gfp_t for flags
        linux/const.h: refactor _BITUL and _BITULL a bit
        linux/const.h: move UL() macro to include/linux/const.h
        linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
        xen, mm: allow deferred page initialization for xen pv domains
        elf: enforce MAP_FIXED on overlaying elf segments
        fs, elf: drop MAP_FIXED usage from elf_map
        mm: introduce MAP_FIXED_NOREPLACE
        MAINTAINERS: update bouncing aacraid@adaptec.com addresses
        fs/dcache.c: add cond_resched() in shrink_dentry_list()
        include/linux/kfifo.h: fix comment
        ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops
        kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param
        ...
      8837c70d
    • page cache: use xa_lock · b93b0163
      Matthew Wilcox authored
      Remove the address_space ->tree_lock and use the xa_lock newly added to
      the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
      since we don't really care that it's a tree.
      
      [willy@infradead.org: fix nds32, fs/dax.c]
        Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
      Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b93b0163
    • xarray: add the xa_lock to the radix_tree_root · f6bb2a2c
      Matthew Wilcox authored
      This results in no change in structure size on 64-bit machines as it
      fits in the padding between the gfp_t and the void *.  32-bit machines
      will grow the structure from 8 to 12 bytes.  Almost all radix trees are
      protected with (at least) a spinlock, so as they are converted from
      radix trees to xarrays, the data structures will shrink again.
      
      Initialising the spinlock requires a name for the benefit of lockdep, so
      RADIX_TREE_INIT() now needs to know the name of the radix tree it's
      initialising, and so do IDR_INIT() and IDA_INIT().
      
      Also add the xa_lock() and xa_unlock() family of wrappers to make it
      easier to use the lock.  If we could rely on -fplan9-extensions in the
      compiler, we could avoid all of this syntactic sugar, but that wasn't
      added until gcc 4.6.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6bb2a2c
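      The idea of embedding a lock in the root and wrapping it in small
      helpers can be sketched in userspace C. This is a sketch under stated
      assumptions, not the kernel macros: `sketch_root`, `SKETCH_ROOT_INIT`,
      `sketch_xa_lock()` and `sketch_xa_unlock()` are hypothetical names, and
      a C11 `atomic_flag` spin loop stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical root structure with the lock embedded next to the data. */
struct sketch_root {
    atomic_flag xa_lock;  /* toy spinlock standing in for spinlock_t */
    void *rnode;          /* protected payload */
};

#define SKETCH_ROOT_INIT { .xa_lock = ATOMIC_FLAG_INIT, .rnode = NULL }

/* Wrappers so callers never have to spell out the lock member. */
#define sketch_xa_lock(root)                                        \
    do {                                                            \
        while (atomic_flag_test_and_set_explicit(&(root)->xa_lock,  \
                                                 memory_order_acquire)) \
            ; /* spin until the flag was previously clear */        \
    } while (0)

#define sketch_xa_unlock(root) \
    atomic_flag_clear_explicit(&(root)->xa_lock, memory_order_release)
```

      Callers bracket updates with `sketch_xa_lock(&root)` and
      `sketch_xa_unlock(&root)` and never reference the lock member directly.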
    • fscache: use appropriate radix tree accessors · e5a95541
      Matthew Wilcox authored
      Don't open-code accesses to data structure internals.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5a95541
    • export __set_page_dirty · f82b3764
      Matthew Wilcox authored
      XFS currently contains a copy-and-paste of __set_page_dirty().  Export
      it from buffer.c instead.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f82b3764
    • unicore32: turn flush_dcache_mmap_lock into a no-op · d339d705
      Matthew Wilcox authored
      Unicore doesn't walk the VMA tree in its flush_dcache_page()
      implementation, so has no need to take the tree_lock.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d339d705
    • arm64: turn flush_dcache_mmap_lock into a no-op · 427c896f
      Matthew Wilcox authored
      ARM64 doesn't walk the VMA tree in its flush_dcache_page()
      implementation, so has no need to take the tree_lock.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      427c896f
    • Matthew Wilcox's avatar
      60a05271
    • radix tree: use GFP_ZONEMASK bits of gfp_t for flags · fa290cda
      Matthew Wilcox authored
      Patch series "XArray", v9.  (First part thereof).
      
      This patchset is, I believe, appropriate for merging for 4.17.  It
      contains the XArray implementation, to eventually replace the radix
      tree, and converts the page cache to use it.
      
      This conversion keeps the radix tree and XArray data structures in sync
      at all times.  That allows us to convert the page cache one function at
      a time and should allow for easier bisection.  Other than renaming some
      elements of the structures, the data structures are fundamentally
      unchanged; a radix tree walk and an XArray walk will touch the same
      number of cachelines.  I have changes planned to the XArray data
      structure, but those will happen in future patches.
      
      Improvements the XArray has over the radix tree:
      
       - The radix tree provides operations like other trees do; 'insert' and
         'delete'. But what most users really want is an automatically
         resizing array, and so it makes more sense to give users an API that
         is like an array -- 'load' and 'store'. We still have an 'insert'
         operation for users that really want that semantic.
      
       - The XArray considers locking as part of its API. This simplifies a
         lot of users who formerly had to manage their own locking just for
         the radix tree. It also improves code generation as we can now tell
         RCU that we're holding a lock and it doesn't need to generate as much
         fencing code. The other advantage is that tree nodes can be moved
         (not yet implemented).
      
       - GFP flags are now parameters to calls which may need to allocate
         memory. The radix tree forced users to decide what the allocation
         flags would be at creation time. It's much clearer to specify them at
         allocation time.
      
       - Memory is not preloaded; we don't tie up dozens of pages on the off
         chance that the slab allocator fails. Instead, we drop the lock,
         allocate a new node and retry the operation. We have to convert all
         the radix tree, IDA and IDR preload users before we can realise this
         benefit, but I have not yet found a user which cannot be converted.
      
       - The XArray provides a cmpxchg operation. The radix tree forces users
         to roll their own (and at least four have).
      
       - Iterators take a 'max' parameter. That simplifies many users and will
         reduce the amount of iteration done.
      
       - Iteration can proceed backwards. We only have one user for this, but
         since it's called as part of the pagefault readahead algorithm, that
         seemed worth mentioning.
      
       - RCU-protected pointers are not exposed as part of the API. There are
         some fun bugs where the page cache forgets to use rcu_dereference()
         in the current codebase.
      
       - Value entries gain an extra bit compared to radix tree exceptional
         entries. That gives us the extra bit we need to put huge page swap
         entries in the page cache.
      
       - Some iterators now take a 'filter' argument instead of having
         separate iterators for tagged/untagged iterations.
      
      The page cache is improved by this:
      
       - Shorter, easier to read code
      
       - More efficient iterations
      
       - Reduction in size of struct address_space
      
       - Fewer walks from the top of the data structure; the XArray API
         encourages staying at the leaf node and conducting operations there.
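
      The array-like 'load' / 'store' / 'cmpxchg' interface described above
      can be illustrated with a toy automatically resizing array. This is
      not the XArray API or its data structure: every name below is
      hypothetical, the storage is a flat `realloc`-grown array rather than
      a tree, and there is no locking or RCU.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy automatically resizing array of pointers (hypothetical type). */
struct sketch_array {
    void **slots;
    size_t nr;
};

/* Load never fails: an index that was never stored reads as NULL. */
static void *sketch_load(const struct sketch_array *a, size_t index)
{
    return index < a->nr ? a->slots[index] : NULL;
}

/* Store grows the backing storage on demand.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int sketch_store(struct sketch_array *a, size_t index, void *entry)
{
    if (index >= a->nr) {
        size_t nr = a->nr ? a->nr : 8;
        void **slots;

        while (nr <= index)
            nr *= 2;
        slots = realloc(a->slots, nr * sizeof(*slots));
        if (!slots)
            return -1;
        for (size_t i = a->nr; i < nr; i++)
            slots[i] = NULL;
        a->slots = slots;
        a->nr = nr;
    }
    a->slots[index] = entry;
    return 0;
}

/* Replace the entry at 'index' only if it currently equals 'old'.
 * Returns 1 if replaced, 0 on mismatch, -1 on allocation failure. */
static int sketch_cmpxchg(struct sketch_array *a, size_t index,
                          void *old, void *entry)
{
    if (sketch_load(a, index) != old)
        return 0;
    return sketch_store(a, index, entry) == 0 ? 1 : -1;
}
```

      The caller never inserts or deletes nodes explicitly; storing at an
      index is enough, which is the usage model the message argues most
      radix tree users actually want.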
      
      This patch (of 8):
      
      None of these bits may be used for slab allocations, so we can use them
      as radix tree flags as long as we mask them off before passing them to
      the slab allocator. Move the IDR flag from the high bits to the
      GFP_ZONEMASK bits.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa290cda
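      The trick of storing private flags in bits that the allocator must
      never see can be sketched as below. The constants are placeholders
      chosen for illustration and are not the kernel's values; the type and
      function names are hypothetical as well.

```c
#include <stdint.h>

typedef uint32_t sketch_gfp_t;

/* Assumed layout: the low four bits select a zone and are therefore
 * meaningless for slab allocations. */
#define SKETCH_ZONEMASK     0x0fu
/* Private flag stored inside the zone bits (placeholder value). */
#define SKETCH_ROOT_IS_IDR  0x04u
/* Placeholder for an ordinary allocation mask outside the zone bits. */
#define SKETCH_GFP_KERNEL   0xc0u

/* Mask handed to the allocator: the private flag bits are stripped. */
static sketch_gfp_t sketch_alloc_mask(sketch_gfp_t root_flags)
{
    return root_flags & ~SKETCH_ZONEMASK;
}

/* Test for the private flag kept in the zone bits. */
static int sketch_root_is_idr(sketch_gfp_t root_flags)
{
    return (root_flags & SKETCH_ROOT_IS_IDR) != 0;
}
```

      As long as every allocation goes through the masking helper, the
      flag bits and the real allocation flags can share one word.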
    • linux/const.h: refactor _BITUL and _BITULL a bit · 21e7bc60
      Masahiro Yamada authored
      Minor cleanups made possible by _UL() and _ULL().
      
      Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21e7bc60
    • linux/const.h: move UL() macro to include/linux/const.h · 2dd8a62c
      Masahiro Yamada authored
      ARM, ARM64 and UniCore32 duplicate the definition of UL():
      
        #define UL(x) _AC(x, UL)
      
      This is not actually arch-specific, so it will be useful to move it to a
      common header.  Currently, we only have the uapi variant for
      linux/const.h, so I am creating include/linux/const.h.
      
      I also added _UL(), _ULL() and ULL() because _AC() is mostly used in
      the form either _AC(..., UL) or _AC(..., ULL).  I expect they will be
      replaced in follow-up cleanups.  The underscore-prefixed ones should
      be used for exported headers.
      
      Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2dd8a62c
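      How a token-pasting helper in the style of _AC() builds typed constants
      and bit masks can be shown with a small standalone sketch. The
      `SKETCH_` names are hypothetical stand-ins so that the example does not
      clash with any real header, and only the C (non-assembly) case is
      modelled.

```c
/* Paste a type suffix onto a numeric literal: SKETCH_AC(1, UL) -> (1UL).
 * The extra level of indirection lets macro arguments expand first. */
#define SKETCH_AC_PASTE(X, Y) (X##Y)
#define SKETCH_AC(X, Y)       SKETCH_AC_PASTE(X, Y)

/* Shorthands for the two most common suffixes. */
#define SKETCH_UL(x)   SKETCH_AC(x, UL)
#define SKETCH_ULL(x)  SKETCH_AC(x, ULL)

/* Single-bit masks built on top of the shorthands. */
#define SKETCH_BITUL(n)   (SKETCH_UL(1) << (n))
#define SKETCH_BITULL(n)  (SKETCH_ULL(1) << (n))
```

      With these definitions `SKETCH_BITULL(40)` is a well-defined 64-bit
      constant even where `unsigned long` is only 32 bits wide.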
    • linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI · 2a6cc8a6
      Masahiro Yamada authored
      Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(),
      BIT() etc", v3.
      
      ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL).  More
      architectures may introduce it in the future.
      
      UL() is arch-agnostic, and useful. So let's move it to
      include/linux/const.h
      
      Currently, <asm/memory.h> must be included to use UL().  It pulls in
      a lot of bloat just to define some bit macros.
      
      I posted V2 one year ago.
      
      The previous posts are:
      https://patchwork.kernel.org/patch/9498273/
      https://patchwork.kernel.org/patch/9498275/
      https://patchwork.kernel.org/patch/9498269/
      https://patchwork.kernel.org/patch/9498271/
      
      At that time, what blocked this series was a comment from
      David Howells:
        You need to be very careful doing this.  Some userspace stuff
        depends on the guard macro names on the kernel header files.
      
      (https://patchwork.kernel.org/patch/9498275/)
      
      Looking at the code closer, I noticed this is not a problem.
      
      See the following line.
      https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40
      
      scripts/headers_install.sh rips off _UAPI prefix from guard macro names.
      
      I ran "make headers_install" and confirmed the result is what I expect.
      
      So, we can prefix the include guard of include/uapi/linux/const.h,
      and add a new include/linux/const.h.
      
      This patch (of 4):
      
      I am going to add include/linux/const.h for the kernel space.
      
      Add _UAPI to the include guard of include/uapi/linux/const.h to
      prepare for that.
      
      Please note that the guard name of the exported header is kept as-is.
      So, this commit has no impact on userspace even if some userspace
      code depends on the guard macro names.

      scripts/headers_install.sh processes exported headers with sed, and
      strips "_UAPI" from guard macro names.
      
        #ifndef _UAPI_LINUX_CONST_H
        #define _UAPI_LINUX_CONST_H
      
      will be turned into
      
        #ifndef _LINUX_CONST_H
        #define _LINUX_CONST_H
      
      Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a6cc8a6
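      The guard rewrite described in this commit message can be mimicked in
      C for illustration. This is not the sed expression used by
      scripts/headers_install.sh; `sketch_strip_uapi()` is a hypothetical
      helper that only handles a guard name that starts with "_UAPI"
      directly after `#ifndef ` or `#define `.

```c
#include <stdio.h>
#include <string.h>

/* Copy 'line' into 'out', dropping a leading "_UAPI" from the macro name
 * when the line is an #ifndef or #define directive. Other lines are
 * copied unchanged. */
static void sketch_strip_uapi(const char *line, char *out, size_t outsz)
{
    const size_t dlen = 8; /* strlen("#ifndef ") == strlen("#define ") */

    if ((strncmp(line, "#ifndef ", dlen) == 0 ||
         strncmp(line, "#define ", dlen) == 0) &&
        strncmp(line + dlen, "_UAPI", 5) == 0)
        snprintf(out, outsz, "%.*s%s", (int)dlen, line, line + dlen + 5);
    else
        snprintf(out, outsz, "%s", line);
}
```

      Applied to the two guard lines quoted above, the helper yields the
      `_LINUX_CONST_H` forms, so the exported header keeps its old guard.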
    • xen, mm: allow deferred page initialization for xen pv domains · 6f84f8d1
      Pavel Tatashin authored
      Juergen Gross noticed that commit f7f99100 ("mm: stop zeroing memory
      during allocation in vmemmap") broke XEN PV domains when deferred struct
      page initialization is enabled.
      
      This is because Xen's PagePinned() flag is erased from struct pages
      when they are initialized later in boot.

      Juergen fixed this problem by disabling deferred pages on Xen PV
      domains.  It is desirable, however, to have this feature available as it
      reduces boot time.  This fix re-enables the feature for PV domains by
      delaying the setting of the PagePinned flag until struct pages for all
      allocated memory are initialized, i.e.  until after free_all_bootmem().

      A new x86_init.hyper op init_after_bootmem() is called to let Xen know
      that the boot allocator is done, and hence struct pages for all the
      allocated memory are now initialized.  If deferred page initialization
      is enabled, the rest of the struct pages are going to be initialized
      later in boot once page_alloc_init_late() is called.
      
      xen_after_bootmem() walks page table's pages and marks them pinned.
      
      Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Tested-by: Juergen Gross <jgross@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f84f8d1
    • elf: enforce MAP_FIXED on overlaying elf segments · ad55eac7
      Michal Hocko authored
      Anshuman has reported that with "fs, elf: drop MAP_FIXED usage from
      elf_map" applied, some ELF binaries in his environment fail to start
      with
      
       [   23.423642] 9148 (sed): Uhuuh, elf segment at 0000000010030000 requested but the memory is mapped already
       [   23.423706] requested [10030000, 10040000] mapped [10030000, 10040000] 100073 anon
      
      The reason is that the above binary has overlapping elf segments:
      
        LOAD           0x0000000000000000 0x0000000010000000 0x0000000010000000
                       0x0000000000013a8c 0x0000000000013a8c  R E    10000
        LOAD           0x000000000001fd40 0x000000001002fd40 0x000000001002fd40
                       0x00000000000002c0 0x00000000000005e8  RW     10000
        LOAD           0x0000000000020328 0x0000000010030328 0x0000000010030328
                       0x0000000000000384 0x00000000000094a0  RW     10000
      
      That binary has two RW LOAD segments, the first crosses a page border
      into the second
      
        0x1002fd40 (LOAD2-vaddr) + 0x5e8 (LOAD2-memlen) == 0x10030328 (LOAD3-vaddr)
      
      Handle this situation by enforcing MAP_FIXED when we establish a
      temporary brk VMA to handle overlapping segments.  All other mappings
      will still use MAP_FIXED_NOREPLACE.
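
      The overlap arithmetic above can be checked with a small userspace
      sketch.  This is not the kernel's elf_map() logic; the helper names
      (page_start, page_align, segments_share_page) are made up for
      illustration, and the 0x10000 page size is taken from the alignment
      column of the program headers quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round addr down to a multiple of page_size (must be a power of two). */
static uint64_t page_start(uint64_t addr, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return addr & ~(page_size - 1);
}

/* Round addr up to a multiple of page_size. */
static uint64_t page_align(uint64_t addr, uint64_t page_size)
{
	return page_start(addr + page_size - 1, page_size);
}

/*
 * True if the page-granular mapping of segment A, which covers
 * [a_vaddr, a_vaddr + a_memsz), extends into the page where
 * segment B starts.
 */
static bool segments_share_page(uint64_t a_vaddr, uint64_t a_memsz,
				uint64_t b_vaddr, uint64_t page_size)
{
	uint64_t a_end = page_align(a_vaddr + a_memsz, page_size);
	uint64_t b_start = page_start(b_vaddr, page_size);

	return a_end > b_start;
}
```

      With the values from the report, LOAD2 (0x1002fd40 + 0x5e8) and LOAD3
      (0x10030328) both need the page at 0x10030000, which is the range
      named in the "mapped already" message.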
      
      Link: http://lkml.kernel.org/r/20180213100440.GM3443@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mark Brown <broonie@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad55eac7
    • Michal Hocko's avatar
      fs, elf: drop MAP_FIXED usage from elf_map · 4ed28639
      Michal Hocko authored
      Both load_elf_interp and load_elf_binary rely on elf_map to map segments
      at a controlled address and they use MAP_FIXED to enforce that.  This is,
      however, a dangerous thing prone to silent data corruption, which can
      even be exploitable.
      
      Let's take CVE-2017-1000253 as an example.  At the time (before commit
      eab09532: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
      ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
      the stack top on 32b (legacy) memory layout (only 1GB away).  Therefore
      we could end up mapping over the existing stack with some luck.
      
      The issue has been fixed since then (a87938b2: "fs/binfmt_elf.c: fix
      bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved much
      further from the stack (eab09532 and later by c715b72c: "mm:
      revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
      stack consumption early during execve fully stopped by da029c11
      ("exec: Limit arg stack to at most 75% of _STK_LIM").  So we should be
      safe and any attack should be impractical.  On the other hand, this is
      just too subtle an assumption, so it can break quite easily and be hard
      to spot.
      
      I believe that the MAP_FIXED usage in load_elf_binary (et al.) is still
      fundamentally dangerous.  Moreover it shouldn't be even needed.  We are
      at the early process stage and so there shouldn't be unrelated mappings
      (except for stack and loader) existing so mmap for a given address should
      succeed even without MAP_FIXED.  Something is terribly wrong if this is
      not the case and we should rather fail than silently corrupt the
      underlying mapping.
      
      Address this issue by changing MAP_FIXED to the newly added
      MAP_FIXED_NOREPLACE.  This will mean that mmap will fail if there is an
      existing mapping clashing with the requested one without clobbering it.
      
      [mhocko@suse.com: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      [avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
        Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
      Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ed28639
    • Michal Hocko's avatar
      mm: introduce MAP_FIXED_NOREPLACE · a4ff8e86
      Michal Hocko authored
      Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
      
      This has started as a follow up discussion [3][4] resulting in the
      runtime failure caused by hardening patch [5] which removes MAP_FIXED
      from the elf loader because MAP_FIXED is inherently dangerous as it
      might silently clobber an existing underlying mapping (e.g.  stack).
      The reason for the failure is that some architectures enforce an
      alignment for the given address hint without MAP_FIXED used (e.g.  for
      shared or file backed mappings).
      
      One way around this would be excluding those archs which do alignment
      tricks from the hardening [6].  The patch is really trivial but it has
      been objected, rightfully so, that this screams for a more generic
      solution.  We basically want a non-destructive MAP_FIXED.
      
      The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
      address but unlike MAP_FIXED it fails with EEXIST if the given range
      conflicts with an existing one.  The flag is introduced as a completely
      new one rather than a MAP_FIXED extension because of the backward
      compatibility.  We really want a never-clobber semantic even on older
      kernels which do not recognize the flag.  Unfortunately mmap sucks
      wrt flags evaluation because we do not EINVAL on unknown flags.  On
      those kernels we would simply use the traditional hint based semantic so
      the caller can still get a different address (which sucks) but at least
      not silently corrupt an existing mapping.  I do not see a good way
      around that.  Except we won't expose the new semantic to the
      userspace at all.
      
      It seems there are users who would like to have something like that.
      Jemalloc has been mentioned by Michael Ellerman [7]
      
      Florian Weimer has mentioned the following:
      : glibc ld.so currently maps DSOs without hints.  This means that the kernel
      : will map right next to each other, and the offsets between them are completely
      : predictable.  We would like to change that and supply a random address in a
      : window of the address space.  If there is a conflict, we do not want the
      : kernel to pick a non-random address. Instead, we would try again with a
      : random address.
      
      John Hubbard has mentioned CUDA example
      : a) Searches /proc/<pid>/maps for a "suitable" region of available
      : VA space.  "Suitable" generally means it has to have a base address
      : within a certain limited range (a particular device model might
      : have odd limitations, for example), it has to be large enough, and
      : alignment has to be large enough (again, various devices may have
      : constraints that lead us to do this).
      :
      : This is of course subject to races with other threads in the process.
      :
      : Let's say it finds a region starting at va.
      :
      : b) Next it does:
      :     p = mmap(va, ...)
      :
      : *without* setting MAP_FIXED, of course (so va is just a hint), to
      : attempt to safely reserve that region. If p != va, then in most cases,
      : this is a failure (almost certainly due to another thread getting a
      : mapping from that region before we did), and so this layer now has to
      : call munmap(), before returning a "failure: retry" to upper layers.
      :
      :     IMPROVEMENT: --> if instead, we could call this:
      :
      :             p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
      :
      :         , then we could skip the munmap() call upon failure. This
      :         is a small thing, but it is useful here. (Thanks to Piotr
      :         Jaroszynski and Mark Hairgrove for helping me get that detail
      :         exactly right, btw.)
      :
      : c) After that, CUDA suballocates from p, via:
      :
      :      q = mmap(sub_region_start, ... MAP_FIXED ...)
      :
      : Interestingly enough, "freeing" is also done via MAP_FIXED, and
      : setting PROT_NONE to the subregion. Anyway, I just included (c) for
      : general interest.
      
      Atomic address range probing in the multithreaded programs in general
      sounds like an interesting thing to me.
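
      For reference, step (b) of the CUDA description (the hint-based probe
      that exists today, without the new flag) can be sketched in userspace
      roughly as follows.  The function name reserve_at_hint is
      hypothetical, and PROT_NONE plus MAP_NORESERVE are my assumption of
      what a pure address-space reservation would use; the quoted text does
      not say which protection or flags CUDA passes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Try to reserve [va, va + len) using va purely as a hint (no MAP_FIXED).
 * Returns va on success, or NULL if mmap() failed or the kernel placed
 * the mapping somewhere else.
 */
static void *reserve_at_hint(void *va, size_t len)
{
	void *p;

	assert(va != NULL && len > 0);

	p = mmap(va, len, PROT_NONE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	if (p != va) {
		/* The range was (partly) taken: undo the stray mapping. */
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

      The munmap() in the mismatch branch is the extra call that a
      non-clobbering fixed mapping would make unnecessary on kernels that
      understand it.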
      
      The second patch simply replaces MAP_FIXED use in elf loader by
      MAP_FIXED_NOREPLACE.  I believe other places which rely on MAP_FIXED
      should follow.  Actually real MAP_FIXED usages should be documented
      properly and they should be more of an exception.
      
      [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
      [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
      [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
      [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
      [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
      [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
      
      This patch (of 2):
      
      MAP_FIXED is used quite often to enforce mapping at the particular range.
      The main problem of this flag is, however, that it is inherently dangerous
      because it unmaps existing mappings covered by the requested range.  This
      can cause silent memory corruption, in some cases even with serious
      security implications.  While the current semantic might be really
      desirable in many cases, there are others which would want to enforce the
      given range but rather see a failure than a silent memory corruption on a
      clashing range.  Please note that there is no guarantee that a given range
      is obeyed by the mmap even when it is free - e.g.  arch specific code is
      allowed to apply an alignment.
      
      Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
      behavior.  It has the same semantic as MAP_FIXED wrt.  the given address
      request with a single exception that it fails with EEXIST if the requested
      address is already covered by an existing mapping.  We still do rely on
      get_unmapped_area to handle all the arch specific MAP_FIXED treatment and
      check for a conflicting vma after it returns.
      
      The flag is introduced as a completely new one rather than a MAP_FIXED
      extension because of the backward compatibility.  We really want a
      never-clobber semantic even on older kernels which do not recognize the
      flag.  Unfortunately mmap sucks wrt.  flags evaluation because we do not
      EINVAL on unknown flags.  On those kernels we would simply use the
      traditional hint based semantic so the caller can still get a different
      address (which sucks) but at least not silently corrupt an existing
      mapping.  I do not see a good way around that.
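
      From the userspace side, a caller that wants the never-clobber
      behavior on both new and old kernels might look roughly like the
      sketch below.  The helper name map_exactly_at is hypothetical, and
      the fallback value 0x100000 for MAP_FIXED_NOREPLACE is the one used
      by x86 and the generic ABI; it is only an assumption for libc headers
      that predate the flag, and some architectures use a different value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
/* Assumed value for old headers (x86 / generic ABI); may differ per arch. */
#define MAP_FIXED_NOREPLACE 0x100000
#endif

/*
 * Map len bytes of anonymous memory exactly at addr without ever
 * replacing an existing mapping.  Returns addr on success, NULL on
 * failure with errno set.
 */
static void *map_exactly_at(void *addr, size_t len)
{
	void *p;

	assert(addr != NULL && len > 0);

	p = mmap(addr, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
	if (p == MAP_FAILED)
		return NULL;	/* EEXIST on a clash if the kernel knows the flag */

	if (p != addr) {
		/* Older kernel: the flag was ignored and addr was only a hint. */
		munmap(p, len);
		errno = EEXIST;
		return NULL;
	}
	return p;
}
```

      The p != addr check is what covers the older kernels described above:
      the unknown flag is silently dropped there, so the only signal is a
      returned address that differs from the requested one.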
      
      [mpe@ellerman.id.au: fix whitespace]
      [fail on clashing range with EEXIST as per Florian Weimer]
      [set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
      Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Evans <jasone@google.com>
      Cc: David Goldblatt <davidtgoldblatt@gmail.com>
      Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4ff8e86
    • Joe Perches's avatar
      MAINTAINERS: update bouncing aacraid@adaptec.com addresses · 721d8b41
      Joe Perches authored
      Adaptec is now part of Microsemi.
      
      Commit 2a81ffdd ("MAINTAINERS: Update email address for aacraid")
      updated only one of the driver maintainer addresses.
      
      Update the other two sections as the aacraid@adaptec.com address
      bounces.
      
      Link: http://lkml.kernel.org/r/1522103936.12357.27.camel@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Dave Carroll <david.carroll@microsemi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      721d8b41
    • Nikolay Borisov's avatar
      fs/dcache.c: add cond_resched() in shrink_dentry_list() · 32785c05
      Nikolay Borisov authored
      As previously reported (https://patchwork.kernel.org/patch/8642031/)
      it's possible to call shrink_dentry_list with a large number of dentries
      (> 10000).  This, in turn, could trigger the softlockup detector and
      possibly trigger a panic.  In addition to the unmount path being
      vulnerable to this scenario, at SuSE we've observed a similar situation
      happening during process exit on processes that touch a lot of dentries.
      Here is an excerpt from a crash dump.  The number after each colon is
      the number of dentries on the list passed to shrink_dentry_list:
      
      PID 99760: 10722
      PID 107530: 215
      PID 108809: 24134
      PID 108877: 21331
      PID 141708: 16487
      
      So we want to kill between 15k-25k dentries without yielding.
      
      And one possible call stack looks like:
      
      4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
      5 [ffff8839ece41db0] evict at ffffffff811c3026
      6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
      7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
      8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
      9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
      10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
      11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
      12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
      13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909
      
      While some of the callers of shrink_dentry_list do use cond_resched,
      this is not sufficient to prevent softlockups.  So just move
      cond_resched into shrink_dentry_list from its callers.
      
      David said: I've found hundreds of occurrences of warnings that we emit
      when need_resched stays set for a prolonged period of time with the
      stack trace that is included in the change log.
      
      Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32785c05
    • Valentin Vidic's avatar
      include/linux/kfifo.h: fix comment · de99626c
      Valentin Vidic authored
      Clean up unusual formatting in the note about locking.
      
      Link: http://lkml.kernel.org/r/20180324002630.13046-1-Valentin.Vidic@CARNet.hr
      Signed-off-by: Valentin Vidic <Valentin.Vidic@CARNet.hr>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Sean Young <sean@mess.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de99626c
    • Andrew Morton's avatar
      ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops · a61fc2cb
      Andrew Morton authored
      This was added by the recent "ipc/shm.c: add split function to
      shm_vm_ops", but it is not necessary.
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a61fc2cb
    • Waiman Long's avatar
      kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param · 24704f36
      Waiman Long authored
      Kdoc comments are added to the do_proc_dointvec_minmax_conv_param and
      do_proc_douintvec_minmax_conv_param structures that are used internally
      for range checking.
      
      The error codes returned by proc_dointvec_minmax() and
      proc_douintvec_minmax() are also documented.
      
      Link: http://lkml.kernel.org/r/1519926220-7453-3-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24704f36
    • Waiman Long's avatar
      fs/proc/proc_sysctl.c: fix typo in sysctl_check_table_array() · 64a11f3d
      Waiman Long authored
      Patch series "ipc: Clamp *mni to the real IPCMNI limit", v3.
      
      The sysctl parameters msgmni, shmmni and semmni have an inherent limit
      of IPC_MNI (32k).  However, users may not be aware of that because they
      can write a value much higher than that without getting any error or
      notification.  Reading the parameters back will show the newly written
      values which are not real.
      
      Enforcing the limit by failing sysctl parameter write, however, can
      break existing user applications.  To address this dilemma, a new flags
      field is introduced into the ctl_table.  The value CTL_FLAGS_CLAMP_RANGE
      can be added to any ctl_table entries to enable a looser range clamping
      without returning any error.  For example,
      
        .flags = CTL_FLAGS_CLAMP_RANGE,
      
      This flags value is now used for the range checking of shmmni, msgmni
      and semmni without breaking existing applications.  If any out of range
      value is written to those sysctl parameters, the following warning will
      be printed instead.
      
        Kernel parameter "shmmni" was set out of range [0, 32768], clamped to 32768.
      
      Reading the values back will show 32768 instead of some fake values.
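
      The clamping behavior described above amounts to something like the
      following userspace sketch.  It is not the kernel implementation:
      struct range_param and store_clamped() are hypothetical stand-ins for
      a ctl_table entry and its write handler, and only the warning text is
      taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a ctl_table entry with a valid range. */
struct range_param {
	const char *name;
	int min;
	int max;
	int value;
};

/*
 * Store val into p, clamping it into [min, max] instead of failing.
 * Returns true if the written value had to be clamped.
 */
static bool store_clamped(struct range_param *p, long val)
{
	long clamped = val;

	assert(p->min <= p->max);

	if (clamped < p->min)
		clamped = p->min;
	if (clamped > p->max)
		clamped = p->max;
	p->value = (int)clamped;

	if (clamped != val) {
		fprintf(stderr,
			"Kernel parameter \"%s\" was set out of range [%d, %d], clamped to %d.\n",
			p->name, p->min, p->max, p->value);
		return true;
	}
	return false;
}
```

      The write never fails, so existing applications keep working, while a
      later read returns the value that is really in effect.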
      
      This patch (of 6):
      
      Fix a typo.
      
      Link: http://lkml.kernel.org/r/1519926220-7453-2-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64a11f3d
    • Davidlohr Bueso's avatar
      ipc/msg: introduce msgctl(MSG_STAT_ANY) · 23c8cec8
      Davidlohr Bueso authored
      There is a permission discrepancy when consulting msq ipc object
      metadata between /proc/sysvipc/msg (0444) and the MSG_STAT msgctl
      command.  The latter does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCES is returned via syscall but the
      info is displayed anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the msq metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      oversight - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (e.g.
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new MSG_STAT_ANY command such that the msq ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reported-by: Robert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23c8cec8
    • Davidlohr Bueso's avatar
      ipc/sem: introduce semctl(SEM_STAT_ANY) · a280d6dc
      Davidlohr Bueso authored
      There is a permission discrepancy when consulting sem ipc object
      metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
      command.  The latter does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCES is returned via syscall but the
      info is displayed anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the sma metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      oversight - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (e.g.
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new SEM_STAT_ANY command such that the sem ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reported-by: Robert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a280d6dc
    • Davidlohr Bueso's avatar
      ipc/shm: introduce shmctl(SHM_STAT_ANY) · c21a6970
      Davidlohr Bueso authored
      Patch series "sysvipc: introduce STAT_ANY commands", v2.
      
      The following patches add the discussed (see [1]) new command for shm
      as well as for sems and msq as they are subject to the same
      discrepancies for ipc object permission checks between the syscall and
      via procfs.  These new commands are justified in that (1) we are stuck
      with this semantics as changing syscall and procfs can break userland;
      and (2) some users can benefit from performance (for large amounts of
      shm segments, for example) from not having to parse the procfs
      interface.
      
      Once merged, I will submit the necessary manpage updates.  But I'm thinking
      something like:
      
      : diff --git a/man2/shmctl.2 b/man2/shmctl.2
      : index 7bb503999941..bb00bbe21a57 100644
      : --- a/man2/shmctl.2
      : +++ b/man2/shmctl.2
      : @@ -41,6 +41,7 @@
      :  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
      :  .\"	attaches to a segment that has already been marked for deletion.
      :  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
      : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
      :  .\"
      :  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
      :  .SH NAME
      : @@ -242,6 +243,18 @@ However, the
      :  argument is not a segment identifier, but instead an index into
      :  the kernel's internal array that maintains information about
      :  all shared memory segments on the system.
      : +.TP
      : +.BR SHM_STAT_ANY " (Linux-specific)"
      : +Return a
      : +.I shmid_ds
      : +structure as for
      : +.BR SHM_STAT .
      : +However, the
      : +.I shm_perm.mode
      : +is not checked for read access for
      : +.IR shmid ,
      : +resembing the behaviour of
      : +/proc/sysvipc/shm.
      :  .PP
      :  The caller can prevent or allow swapping of a shared
      :  memory segment with the following \fIcmd\fP values:
      : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
      :  kernel's internal array recording information about all
      :  shared memory segments.
      :  (This information can be used with repeated
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operations to obtain information about all shared memory segments
      :  on the system.)
      :  A successful
      : @@ -328,7 +341,7 @@ isn't accessible.
      :  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
      :  is not a valid command.
      :  Or: for a
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operation, the index value specified in
      :  .I shmid
      :  referred to an array slot that is currently unused.
      
      This patch (of 3):
      
      There is a permission discrepancy when consulting shm ipc object metadata
      between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
      latter does permission checks for the object vs S_IRUGO.  As such there can
      be cases where EACCES is returned via syscall but the info is displayed
      anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the shm metadata), this behavior goes way back and showing all
      the objects regardless of the permissions was most likely an oversight - so
      we are stuck with it.  Furthermore, modifying either the syscall or the
      procfs file can cause userspace programs to break (e.g. ipcs).  Some
      applications require getting the procfs info (without root privileges) and
      can be rather slow in comparison with a syscall -- up to 500x in some
      reported cases.
      
      This patch introduces a new SHM_STAT_ANY command such that the shm ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
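
      As an illustration of the intended use, a tool in the spirit of ipcs
      could walk the kernel's shm array roughly like this.  This is a
      sketch under assumptions: count_shm_segments is a hypothetical
      helper, and the fallback definition of SHM_STAT_ANY as 15 is the
      value from the kernel uapi header, assumed here only for libc
      headers that do not define it yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_STAT_ANY
#define SHM_STAT_ANY 15	/* from the kernel uapi header; assumed for old libc */
#endif

/*
 * Count the shm segments that can be queried with SHM_STAT_ANY.
 * Returns -1 if SHM_INFO itself is rejected.
 */
static int count_shm_segments(void)
{
	struct shm_info info;
	struct shmid_ds ds;
	int max_idx, i, count = 0;

	/* SHM_INFO returns the index of the highest used array entry. */
	max_idx = shmctl(0, SHM_INFO, (struct shmid_ds *)&info);
	if (max_idx < 0)
		return -1;

	for (i = 0; i <= max_idx; i++) {
		/*
		 * Unused slots fail with EINVAL, and so does the whole
		 * command on kernels that do not know SHM_STAT_ANY, so
		 * only successful calls are counted.
		 */
		if (shmctl(i, SHM_STAT_ANY, &ds) >= 0)
			count++;
	}
	return count;
}
```

      Unlike with SHM_STAT, segments whose mode denies read access to the
      caller are included in the count, subject to any LSM policy.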
      
      [1] https://lkml.org/lkml/2017/12/19/220
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Robert Kettler <robert.kettler@outlook.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c21a6970
    • Chris Wilson's avatar
      kernel/params.c: downgrade warning for unsafe parameters · edc41b3c
      Chris Wilson authored
      As using an unsafe module parameter is, by its very definition, an
      expected user action, emitting a warning is overkill.  Nothing has yet
      gone wrong, and we add a taint flag for any future oops should something
      actually go wrong.  So instead of having a user controllable pr_warn,
      downgrade it to a pr_notice for "a normal, but significant condition".
      
      We make use of unsafe kernel parameters in igt
      (https://cgit.freedesktop.org/drm/igt-gpu-tools/) (we have not yet
      succeeded in removing all such debugging options), which generates a
      warning and taints the kernel.  The warning is unhelpful as we then need
      to filter it out again as we check that the tests themselves do not
      provoke any kernel warnings.
      
      Link: http://lkml.kernel.org/r/20180226151919.9674-1-chris@chris-wilson.co.uk
      Fixes: 91f9d330 ("module: make it possible to have unsafe, tainting module params")
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Jani Nikula <jani.nikula@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Petri Latvala <petri.latvala@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edc41b3c
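The effect of moving from pr_warn to pr_notice on a warning-sensitive test run can be sketched in user-space C. The numeric severities below match the kernel's log levels (warning is 4, notice is 5), but `message_fails_test` is a hypothetical harness policy invented for illustration; the commit message does not show how igt actually filters messages.

```c
#include <stdbool.h>

/* Kernel log severities: a lower number means a more severe message. */
enum loglevel {
    LOGLEVEL_ERR     = 3,
    LOGLEVEL_WARNING = 4,  /* level used by pr_warn() */
    LOGLEVEL_NOTICE  = 5,  /* level used by pr_notice() */
    LOGLEVEL_INFO    = 6
};

/*
 * Hypothetical harness policy (an assumption, not igt's real code):
 * a message at warning severity or worse marks the test as failed.
 */
bool message_fails_test(int level)
{
    return level <= LOGLEVEL_WARNING;
}
```

Under this assumed policy, the same message stops tripping the check once it is emitted at notice level, so no special-case filter is needed for it.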
    • Randy Dunlap's avatar
      kernel/sysctl.c: fix sizeof argument to match variable name · 2d87b309
      Randy Dunlap authored
      Fix sizeof argument to be the same as the data variable name.  Probably
      a copy/paste error.
      
      Mostly harmless since both variables are unsigned int.
      
      Fixes kernel bugzilla #197371:
        Possible access to unintended variable in "kernel/sysctl.c" line 1339
      https://bugzilla.kernel.org/show_bug.cgi?id=197371
      
      Link: http://lkml.kernel.org/r/e0d0531f-361e-ef5f-8499-32743ba907e1@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Petru Mihancea <petrum@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d87b309
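The kind of slip described in this commit can be illustrated with a small user-space sketch. Everything here is hypothetical: `struct table_entry` is a simplified stand-in rather than the kernel's `struct ctl_table`, and `max_depth` and `max_width` are invented names, since the commit message does not name the variables involved.

```c
#include <stddef.h>

/* Hypothetical stand-in for a sysctl table entry. */
struct table_entry {
    const char *name;
    void       *data;    /* variable exposed by this entry */
    size_t      maxlen;  /* size of that variable in bytes */
};

/* Hypothetical tunables; both are unsigned int, as in the commit message. */
unsigned int max_depth;
unsigned int max_width;

/* Copy/paste slip: data points at max_depth, but sizeof names max_width. */
const struct table_entry mismatched = { "max_depth", &max_depth, sizeof(max_width) };

/* Corrected form: sizeof names the same variable that data points at. */
const struct table_entry matched = { "max_depth", &max_depth, sizeof(max_depth) };

/* Deriving every field from a single argument rules the mismatch out. */
#define TABLE_ENTRY(var) { #var, &(var), sizeof(var) }
const struct table_entry derived = TABLE_ENTRY(max_depth);
```

Because both variables have the same type, `mismatched.maxlen` and `matched.maxlen` are equal, which is why the commit calls the bug mostly harmless. The mismatch would only bite if one of the types changed later.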
    • Ioan Nicu's avatar
      rapidio: use a reference count for struct mport_dma_req · bbd876ad
      Ioan Nicu authored
      Once the dma request is passed to the DMA engine, the DMA subsystem
      would hold a pointer to this structure and could call the completion
      callback after do_dma_request() has timed out.
      
      The current code deals with this by putting timed out SYNC requests to a
      pending list and freeing them later, when the mport cdev device is
      released.  This still does not guarantee that the DMA subsystem is
      really done with those transfers, so in theory
      dma_xfer_callback/dma_req_free could be called after
      mport_cdev_release_dma and could potentially access already freed
      memory.
      
      This patch simplifies the current handling by using a kref in the mport
      dma request structure, so that it gets freed only when nobody uses it
      anymore.
      
      This also simplifies the code a bit, as FAF transfers are now handled in
      the same way as SYNC and ASYNC transfers.  There is no need anymore for
      the pending list and for the dma workqueue which was used in case of FAF
      transfers, so we remove them both.
      
      Link: http://lkml.kernel.org/r/20180405203342.GA16191@nokia.com
      Signed-off-by: Ioan Nicu <ioan.nicu.ext@nokia.com>
      Acked-by: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Barry Wood <barry.wood@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Frank Kunz <frank.kunz@nokia.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nokia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbd876ad
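The lifetime rule this commit relies on can be sketched in user-space C. This is not the driver's code: the kernel patch uses a kref (kref_init, kref_get, kref_put), whereas the sketch below substitutes a C11 atomic counter, and `struct dma_req`, `req_alloc`, `req_get`, `req_put`, and `released_count` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical request object shared by a submitter and a DMA engine. */
struct dma_req {
    atomic_int refs;   /* number of parties still using the request */
    int status;
};

/* Counts releases so the example can observe when the object is freed. */
int released_count;

/* Allocate a request; the caller owns the initial reference. */
struct dma_req *req_alloc(void)
{
    struct dma_req *req = calloc(1, sizeof(*req));
    if (req)
        atomic_init(&req->refs, 1);
    return req;
}

/* Take an extra reference, e.g. before handing the request to the engine. */
void req_get(struct dma_req *req)
{
    int old = atomic_fetch_add(&req->refs, 1);
    assert(old > 0);   /* taking a reference on a dead object is a bug */
}

/* Drop a reference; frees the request and returns 1 if it was the last. */
int req_put(struct dma_req *req)
{
    if (atomic_fetch_sub(&req->refs, 1) == 1) {
        released_count++;
        free(req);
        return 1;
    }
    return 0;
}
```

In this sketch a submitter that times out simply calls `req_put` and returns. The request stays alive until the completion callback drops the engine's reference, so neither side can touch freed memory and no pending list is needed.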
    • Vasyl Gomonovych's avatar
      drivers/rapidio/rio-scan.c: fix typo in comment · b94bb1f6
      Vasyl Gomonovych authored
      Fix typos in the words 'receiver', 'specified', and 'during'.
      
      Link: http://lkml.kernel.org/r/20180321211035.8904-1-gomonovych@gmail.com
      Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Alexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b94bb1f6
    • Kees Cook's avatar
      exec: pin stack limit during exec · c31dbb14
      Kees Cook authored
      Since the stack rlimit is used in multiple places during exec and it can
      be changed via other threads (via setrlimit()) or processes (via
      prlimit()), the assumption that the value doesn't change cannot be made.
      This leads to races with mm layout selection and argument size
      calculations.  This changes the exec path to use the rlimit stored in
      bprm instead of in current.  Before starting the thread, the bprm stack
      rlimit is stored back to current.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
      Fixes: 64701dee ("exec: Use sane stack rlimit under secureexec")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Reported-by: Brad Spengler <spender@grsecurity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Greg KH <greg@kroah.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c31dbb14
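The snapshot-and-restore pattern described in this commit can be sketched in user-space C. All names are hypothetical: `shared_stack_limit` stands in for the task's stack rlimit that another thread or process may change, `struct exec_ctx` stands in for bprm, and the quarter-of-the-stack argument rule is a simplification used only to have something to compute.

```c
/*
 * Hypothetical stand-in for the stack rlimit held in the current task.
 * Another thread or process may change it at any time.
 */
unsigned long shared_stack_limit = 8UL * 1024 * 1024;

/* Hypothetical stand-in for the per-exec context (bprm in the commit). */
struct exec_ctx {
    unsigned long stack_limit;   /* value pinned for the whole exec */
};

/* Read the shared limit exactly once, at the start of exec. */
void exec_begin(struct exec_ctx *ctx)
{
    ctx->stack_limit = shared_stack_limit;
}

/* Simplified rule: argument strings may use a quarter of the stack limit. */
unsigned long arg_size_limit(const struct exec_ctx *ctx)
{
    return ctx->stack_limit / 4;
}

/* Illustrative layout decision that must agree with arg_size_limit(). */
int stack_is_unlimited(const struct exec_ctx *ctx)
{
    return ctx->stack_limit == ~0UL;
}

/* Before the new program starts, store the pinned value back. */
void exec_finalize(const struct exec_ctx *ctx)
{
    shared_stack_limit = ctx->stack_limit;
}
```

Every decision reads `ctx->stack_limit`, so a change to `shared_stack_limit` between `exec_begin` and `exec_finalize` cannot make the layout and the argument-size calculation disagree.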
    • Kees Cook's avatar
      exec: introduce finalize_exec() before start_thread() · b8383831
      Kees Cook authored
      Provide a final callback into fs/exec.c before start_thread() takes
      over, to handle any last-minute changes, like the coming restoration of
      the stack limit.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8383831