1. 07 Apr, 2018 1 commit
  2. 06 Apr, 2018 39 commits
    • Linus Torvalds
      Merge tag 'selinux-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 9eda2d2d
      Linus Torvalds authored
      Pull SELinux updates from Paul Moore:
       "A bigger than usual pull request for SELinux, 13 patches (lucky!)
        along with a scary looking diffstat.
      
        Although if you look a bit closer, excluding the usual minor
        tweaks/fixes, there are really only two significant changes in this
        pull request: the addition of proper SELinux access controls for SCTP
        and the encapsulation of a lot of internal SELinux state.
      
        The SCTP changes are the result of a multi-month effort (maybe even a
        year or longer?) between the SELinux folks and the SCTP folks to add
        proper SELinux controls. Special thanks go to Richard for seeing
        this through and keeping the effort moving forward.
      
        The state encapsulation work is a bit of janitorial work that came out
        of some early work on SELinux namespacing. The question of namespacing
        is still an open one, but I believe there is some real value in the
        encapsulation work so we've split that out and are now sending that up
        to you"
      
      * tag 'selinux-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: wrap AVC state
        selinux: wrap selinuxfs state
        selinux: fix handling of uninitialized selinux state in get_bools/classes
        selinux: Update SELinux SCTP documentation
        selinux: Fix ltp test connect-syscall failure
        selinux: rename the {is,set}_enforcing() functions
        selinux: wrap global selinux state
        selinux: fix typo in selinux_netlbl_sctp_sk_clone declaration
        selinux: Add SCTP support
        sctp: Add LSM hooks
        sctp: Add ip option support
        security: Add support for SCTP security hooks
        netlabel: If PF_INET6, check sk_buff ip header version
      9eda2d2d
    • Linus Torvalds
      Merge tag 'audit-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit · 6ad11bdd
      Linus Torvalds authored
      Pull audit updates from Paul Moore:
       "We didn't have anything to send for v4.16, but we're back with a
        little more than usual for v4.17.
      
        Eleven patches in total, most fall into the small fix category, but
        there are four non-trivial changes worth calling out:
      
         - the audit entry filter is being removed after having been
           deprecated for quite a while (years of no one really using it
           because it turns out to be not very practical)
      
         - created our own version of "__mutex_owner()" because the locking
           folks were upset we were using theirs
      
         - improved our handling of kernel command line parameters to make
           them more forgiving
      
         - we fixed auditing of symlink operations
      
        Everything passes the audit-testsuite and as of a few minutes ago it
        merges well with your tree"
      
      * tag 'audit-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
        audit: add refused symlink to audit_names
        audit: remove path param from link denied function
        audit: link denied should not directly generate PATH record
        audit: make ANOM_LINK obey audit_enabled and audit_dummy_context
        audit: do not panic on invalid boot parameter
        audit: track the owner of the command mutex ourselves
        audit: return on memory error to avoid null pointer dereference
        audit: bail before bug check if audit disabled
        audit: deprecate the AUDIT_FILTER_ENTRY filter
        audit: session ID should not set arch quick field pointer
        audit: update bugtracker and source URIs
      6ad11bdd
    • Linus Torvalds
      Merge tag 'pstore-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 69824bcc
      Linus Torvalds authored
      Pull pstore updates from Kees Cook:
       "This cycle was almost entirely improvements to the pstore compression
        options, noted below:
      
         - Add lz4hc and 842 to pstore compression options (Geliang Tang)
      
         - Refactor to use crypto compression API (Geliang Tang)
      
         - Fix up Kconfig dependencies for compression (Arnd Bergmann)
      
         - Allow for run-time compression selection
      
         - Remove stack VLA usage"
      
      * tag 'pstore-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        pstore: fix crypto dependencies
        pstore: Use crypto compress API
        pstore/ram: Do not use stack VLA for parity workspace
        pstore: Select compression at runtime
        pstore: Avoid size casts for 842 compression
        pstore: Add lz4hc and 842 compression support
      69824bcc
    • Linus Torvalds
      Merge branch 'akpm' (patches from Andrew) · 3b54765c
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - a few misc things
      
       - ocfs2 updates
      
       - the v9fs maintainers have been missing for a long time. I've taken
         over v9fs patch slinging.
      
       - most of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
        mm,oom_reaper: check for MMF_OOM_SKIP before complaining
        mm/ksm: fix interaction with THP
        mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
        headers: untangle kmemleak.h from mm.h
        include/linux/mmdebug.h: make VM_WARN* non-rvals
        mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
        mm: change return type to vm_fault_t
        mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
        mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
        kernel/fork.c: detect early free of a live mm
        mm: make counting of list_lru_one::nr_items lockless
        mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
        block_invalidatepage(): only release page if the full page was invalidated
        mm: kernel-doc: add missing parameter descriptions
        mm/swap.c: remove @cold parameter description for release_pages()
        mm/nommu: remove description of alloc_vm_area
        zram: drop max_zpage_size and use zs_huge_class_size()
        zsmalloc: introduce zs_huge_class_size()
        mm: fix races between swapoff and flush dcache
        fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
        ...
      3b54765c
    • Linus Torvalds
      Merge tag 'mtd/for-4.17' of git://git.infradead.org/linux-mtd · 3fd14cdc
      Linus Torvalds authored
      Pull MTD updates from Boris Brezillon:
       "MTD Core:
         - Remove support for asynchronous erase (not implemented by any of
           the existing drivers anyway)
         - Remove Cyrille from the list of SPI NOR and MTD maintainers
         - Fix kernel doc headers
         - Allow users to define the partitions parsers they want to test
           through a DT property (compatible of the partitions subnode)
         - Remove the bfin-async-flash driver (the only architecture using it
           has been removed)
         - Fix pagetest test
         - Add extra checks in mtd_erase()
         - Simplify the MTD partition creation logic and get rid of
           mtd_add_device_partitions()
      
        MTD Drivers:
         - Add endianness information to the physmap DT binding
         - Add Eon EN29LV400A IDs to JEDEC probe logic
         - Use %*ph where appropriate
      
        SPI NOR Drivers:
         - Make fsl-quadspi assign different names to MTD devices connected to
           the same QSPI controller
         - Remove an unneeded driver.bus assignment in the fsl-quadspi driver
      
        NAND Core:
         - Prepare arrival of the SPI NAND subsystem by implementing a generic
           (interface-agnostic) layer to ease manipulation of NAND devices
         - Move onenand code base to the drivers/mtd/nand/ dir
         - Rework timing mode selection
         - Provide a generic way for NAND chip drivers to flag a specific
           GET/SET FEATURE operation as supported/unsupported
         - Stop embedding ONFI/JEDEC param page in nand_chip
      
        NAND Drivers:
         - Rework/cleanup of the mxc driver
         - Various cleanups in the vf610 driver
         - Migrate the fsmc and vf610 to ->exec_op()
         - Get rid of the pxa driver (replaced by marvell_nand)
         - Support ->setup_data_interface() in the GPMI driver
         - Fix probe error path in several drivers
         - Remove support for unused hw_syndrome mode in sunxi_nand
         - Various minor improvements"
      
      * tag 'mtd/for-4.17' of git://git.infradead.org/linux-mtd: (89 commits)
        dt-bindings: fsl-quadspi: Add the example of two SPI NOR
        mtd: fsl-quadspi: Distinguish the mtd device names
        mtd: nand: Fix some function description mismatches in core.c
        mtd: fsl-quadspi: Remove unneeded driver.bus assignment
        mtd: rawnand: marvell: Rename ->ecc_clk into ->core_clk
        mtd: rawnand: s3c2410: enhance the probe function error path
        mtd: rawnand: tango: fix probe function error path
        mtd: rawnand: sh_flctl: fix the probe function error path
        mtd: rawnand: omap2: fix the probe function error path
        mtd: rawnand: mxc: fix probe function error path
        mtd: rawnand: denali: fix probe function error path
        mtd: rawnand: davinci: fix probe function error path
        mtd: rawnand: cafe: fix probe function error path
        mtd: rawnand: brcmnand: fix probe function error path
        mtd: rawnand: sunxi: Stop supporting ECC_HW_SYNDROME mode
        mtd: rawnand: marvell: Fix clock resource by adding a register clock
        mtd: ftl: Use DIV_ROUND_UP()
        mtd: Fix some function description mismatches in mtdcore.c
        mtd: physmap_of: update struct map_info's swap as per map requirement
        dt-bindings: mtd-physmap: Add endianness supports
        ...
      3fd14cdc
    • Linus Torvalds
      Merge tag 'for-4.17/dm-changes' of... · 83c7c18b
      Linus Torvalds authored
      Merge tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper updates from Mike Snitzer:
      
       - DM core passthrough ioctl fix to retain reference to DM table, and
         that table's block devices, while issuing the ioctl to one of those
         block devices.
      
       - DM core passthrough ioctl fix to _not_ override the fmode_t used to
         issue the ioctl. Overriding by using the fmode_t that the block
         device was originally open with during DM table load is a liability.
      
       - Add DM core support for secure erase forwarding and update the DM
         linear and DM striped targets to support it.
      
       - A DM core 4.16 stable fix to allow abnormal IO (e.g. discard, write
         same, write zeroes) for targets that make use of the non-splitting IO
         variant (as is done for multipath or thinp when layered directly on
         NVMe).
      
       - Allow DM targets to return a payload in response to a DM message that
         they are sent. This is useful for DM targets that would like to
         provide statistics data in response to DM messages.
      
       - Update DM bufio to support non-power-of-2 block sizes. Numerous other
         related changes prepare the DM bufio code for this support.
      
       - Fix DM crypt to use a bounded amount of memory across the entire
         system. This is to avoid OOM that can otherwise occur in response to
         certain pathological IO workloads (e.g. discarding a large DM crypt
         device).
      
       - Add a 'check_at_most_once' feature to the DM verity target to allow
         verity to be used on mobile devices that have very limited resources.
      
       - Fix the DM integrity target to fail early if a keyed algorithm (e.g.
         HMAC) is to be used but the key isn't set.
      
       - Add non-power-of-2 support to the DM unstripe target.
      
       - Eliminate the use of a Variable Length Array in the DM stripe target.
      
       - Update the DM log-writes target to record metadata (REQ_META flag).
      
       - DM raid fixes for its nosync status and some variable range issues.
      
      * tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (28 commits)
        dm: remove fmode_t argument from .prepare_ioctl hook
        dm: hold DM table for duration of ioctl rather than use blkdev_get
        dm raid: fix parse_raid_params() variable range issue
        dm verity: make verity_for_io_block static
        dm verity: add 'check_at_most_once' option to only validate hashes once
        dm bufio: don't embed a bio in the dm_buffer structure
        dm bufio: support non-power-of-two block sizes
        dm bufio: use slab cache for dm_buffer structure allocations
        dm bufio: reorder fields in dm_buffer structure
        dm bufio: relax alignment constraint on slab cache
        dm bufio: remove code that merges slab caches
        dm bufio: get rid of slab cache name allocations
        dm bufio: move dm-bufio.h to include/linux/
        dm bufio: delete outdated comment
        dm: add support for secure erase forwarding
        dm: backfill abnormal IO support to non-splitting IO submission
        dm raid: fix nosync status
        dm mpath: use DM_MAPIO_SUBMITTED instead of magic number 0 in process_queued_bios()
        dm stripe: get rid of a Variable Length Array (VLA)
        dm log writes: record metadata flag for better flags record
        ...
      83c7c18b
    • Linus Torvalds
      Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 9022ca6b
      Linus Torvalds authored
      Pull misc vfs updates from Al Viro:
       "Assorted stuff, including Christoph's I_DIRTY patches"
      
      * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        fs: move I_DIRTY_INODE to fs.h
        ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
        ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
        gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
        fs: fold open_check_o_direct into do_dentry_open
        vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
        vfs: make sure struct filename->iname is word-aligned
        get rid of pointless includes of fs_struct.h
        [poll] annotate SAA6588_CMD_POLL users
      9022ca6b
    • Tetsuo Handa
      mm,oom_reaper: check for MMF_OOM_SKIP before complaining · 97b1255c
      Tetsuo Handa authored
      I got "oom_reaper: unable to reap pid:" messages when the victim thread
      was blocked inside free_pgtables() (which occurred after returning from
      unmap_vmas() and setting MMF_OOM_SKIP).  We don't need to complain when
      exit_mmap() already set MMF_OOM_SKIP.
      
        Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: unable to reap pid:7558 (a.out)
        a.out           D13272  7558   6931 0x00100084
        Call Trace:
         schedule+0x2d/0x80
         rwsem_down_write_failed+0x2bb/0x440
         call_rwsem_down_write_failed+0x13/0x20
         down_write+0x49/0x60
         unlink_file_vma+0x28/0x50
         free_pgtables+0x36/0x100
         exit_mmap+0xbb/0x180
         mmput+0x50/0x110
         copy_process.part.41+0xb61/0x1fe0
         _do_fork+0xe6/0x560
         do_syscall_64+0x74/0x230
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97b1255c
    • Claudio Imbrenda
      mm/ksm: fix interaction with THP · 77da2ba0
      Claudio Imbrenda authored
      This patch fixes a corner case for KSM.  When two pages belong or
      belonged to the same transparent hugepage, and they should be merged,
      KSM fails to split the page, and therefore no merging happens.
      
      This bug can be reproduced by (a minimal C sketch follows the list):
      * making sure ksm is running (disabling ksmtuned if necessary)
      * enabling transparent hugepages
      * allocating a THP-aligned 1-THP-sized buffer
        e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
      * filling it with the same values
        e.g. memset(p, 42, 1<<21)
      * performing madvise to make it mergeable
        e.g. madvise(p, 1<<21, MADV_MERGEABLE)
      * waiting for KSM to perform a few scans
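      A minimal user-space sketch of these steps (illustrative only; the
      exact sleep time and the lack of error handling are assumptions):

        #define _GNU_SOURCE
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>

        int main(void)
        {
                void *p;
                size_t len = 1UL << 21;                 /* one THP on amd64 */

                /* THP-aligned, 1-THP-sized buffer */
                if (posix_memalign(&p, len, len))
                        return 1;
                memset(p, 42, len);                     /* identical contents */
                madvise(p, len, MADV_MERGEABLE);        /* hand the range to KSM */
                sleep(300);                             /* let KSM do a few scans */
                return 0;
        }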
      
      The expected outcome is that all the pages get merged (1 shared and
      the rest sharing); the actual outcome is that no pages get merged (1
      unshared and the rest volatile)
      
      The reason for this behaviour is that we increase the reference count
      once for both pages we want to merge, but if they belong to the same
      hugepage (or compound page), the reference counter used in both cases is
      the one of the head of the compound page.  This means that
      split_huge_page will find a value of the reference counter too high and
      will fail.
      
      This patch solves this problem by testing if the two pages to merge
      belong to the same hugepage when attempting to merge them.  If so, the
      hugepage is split safely.  This means that the hugepage is not split if
      not necessary.
      
      Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77da2ba0
    • Stefan Agner
      mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t · 644d87dc
      Stefan Agner authored
      This fixes a warning shown when phys_addr_t is 32-bit int when compiling
      with clang:
      
        mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
              to 'phys_addr_t' (aka 'unsigned int') changes value from
              18446744073709551615 to 4294967295 [-Wconstant-conversion]
                                        r->base : ULLONG_MAX;
                                                  ^~~~~~~~~~
        ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
        #define ULLONG_MAX      (~0ULL)
                                 ^~~~~
      
      Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      644d87dc
    • Randy Dunlap
      headers: untangle kmemleak.h from mm.h · 514c6032
      Randy Dunlap authored
      Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
      reason.  It looks like it's only a convenience, so remove kmemleak.h
      from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
      don't already #include it.  Also remove <linux/kmemleak.h> from source
      files that do not use it.
      
      This is tested on i386 allmodconfig and x86_64 allmodconfig.  It would
      be good to run it through the 0day bot for other $ARCHes.  I have
      neither the horsepower nor the storage space for the other $ARCHes.
      
      Update: This patch has been extensively build-tested by both the 0day
      bot & kisskb/ozlabs build farms.  Both of them reported 2 build failures
      for which patches are included here (in v2).
      
      [ slab.h is the second most used header file after module.h; kernel.h is
        right there with slab.h. There could be some minor error in the
        counting due to some #includes having comments after them and I didn't
        combine all of those. ]
      
      [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
      Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
      Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>	[2 build failures]
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>	[2 build failures]
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Wei Yongjun <weiyongjun1@huawei.com>
      Cc: Luis R. Rodriguez <mcgrof@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      514c6032
    • Michal Hocko
      include/linux/mmdebug.h: make VM_WARN* non-rvals · 91241681
      Michal Hocko authored
      At present the construct
      
      	if (VM_WARN(...))
      
      will compile OK with CONFIG_DEBUG_VM=y and will fail with
      CONFIG_DEBUG_VM=n.  The reason is that VM_{WARN,BUG}* have always been
      special wrt.  {WARN/BUG}* and never generate any code when DEBUG_VM is
      disabled.  So we cannot really use it in conditionals.
      
      We considered changing things so that this construct works in both cases
      but that might cause unwanted code generation with CONFIG_DEBUG_VM=n.
      It is safer and simpler to make the build fail in both cases.
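      One way to get that behaviour in both configurations is sketched below
      (illustrative; it assumes the DEBUG_VM=y variants simply gain a (void)
      cast while the DEBUG_VM=n variants keep expanding to
      BUILD_BUG_ON_INVALID()):

        /* statement-only macros, never usable as an rvalue */
        #ifdef CONFIG_DEBUG_VM
        #define VM_WARN_ON(cond)                (void)WARN_ON(cond)
        #define VM_WARN(cond, format...)        (void)WARN(cond, format)
        #else
        #define VM_WARN_ON(cond)                BUILD_BUG_ON_INVALID(cond)
        #define VM_WARN(cond, format...)        BUILD_BUG_ON_INVALID(cond)
        #endif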
      
      [akpm@linux-foundation.org: changelog]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91241681
    • Mike Kravetz
      mm/page_isolation.c: make start_isolate_page_range() fail if already isolated · 2c7452a0
      Mike Kravetz authored
      start_isolate_page_range() is used to set the migrate type of a set of
      pageblocks to MIGRATE_ISOLATE while attempting to start a migration
      operation.  It assumes that only one thread is calling it for the
      specified range.  This routine is used by CMA, memory hotplug and
      gigantic huge pages.  Each of these users synchronize access to the
      range within their subsystem.  However, two subsystems (CMA and gigantic
      huge pages for example) could attempt operations on the same range.  If
      this happens, one thread may 'undo' the work another thread is doing.
      This can result in pageblocks being incorrectly left marked as
      MIGRATE_ISOLATE and therefore not available for page allocation.
      
      What is ideally needed is a way to synchronize access to a set of
      pageblocks that are undergoing isolation and migration.  The only thing
      we know about these pageblocks is that they are all in the same zone.  A
      per-node mutex is too coarse as we want to allow multiple operations on
      different ranges within the same zone concurrently.  Instead, we will
      use the migration type of the pageblocks themselves as a form of
      synchronization.
      
      start_isolate_page_range sets the migration type on a set of pageblocks,
      going in order from the one associated with the smallest pfn to
      the largest pfn.  The zone lock is acquired to check and set the
      migration type.  When going through the list of pageblocks check if
      MIGRATE_ISOLATE is already set.  If so, this indicates another thread is
      working on this pageblock.  We know exactly which pageblocks we set, so
      clean up by undoing those and return -EBUSY.
      
      This allows start_isolate_page_range to serve as a synchronization
      mechanism and will allow for more general use of callers making use of
      these interfaces.  Update comments in alloc_contig_range to reflect this
      new functionality.
      
      Each CPU holds the associated zone lock to modify or examine the
      migration type of a pageblock.  And, it will only examine/update a
      single pageblock per lock acquire/release cycle.
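      A sketch of the resulting undo-on-collision logic (illustrative only;
      helper names follow mm/page_isolation.c and arguments are elided):

        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                page = __first_valid_page(pfn, pageblock_nr_pages);
                /* fails if the pageblock is already MIGRATE_ISOLATE */
                if (page && set_migratetype_isolate(page, ...)) {
                        undo_pfn = pfn;
                        goto undo;
                }
        }
        return 0;
        undo:
        /* clear only the pageblocks this caller managed to isolate */
        for (pfn = start_pfn; pfn < undo_pfn; pfn += pageblock_nr_pages)
                unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
        return -EBUSY;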
      
      Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c7452a0
    • Souptick Joarder
      mm: change return type to vm_fault_t · 1c8f4220
      Souptick Joarder authored
      The plan for these patches is to introduce the typedef, initially just
      as documentation ("These functions should return a VM_FAULT_ status").
      We'll trickle the patches to individual drivers/filesystems in through
      the maintainers, as far as possible.  Then we'll change the typedef to
      an unsigned int and break the compilation of any unconverted
      drivers/filesystems.
      
      vmf_insert_page(), vmf_insert_mixed() and vmf_insert_pfn() are three
      newly added functions.  The various drivers/filesystems where return
      value of fault(), huge_fault(), page_mkwrite() and pfn_mkwrite() get
      converted, will need them.  These functions will return correct
      VM_FAULT_ code based on err value.
      
      We've had bugs before where drivers returned -EFOO.  And we have this
      silly inefficiency where vm_insert_xxx() returns an errno which (afaict)
      every driver then converts into a VM_FAULT code.  In many cases drivers
      failed to return the correct VM_FAULT code value when vm_insert_xxx()
      failed.  We have identified and cleaned up all those existing bugs and
      silly inefficiencies in drivers/filesystems by adding these three new
      inline wrappers.  As mentioned above, we will trickle those patches to
      individual drivers/filesystems in through maintainers after these three
      wrapper functions are merged.
      
      Eventually we can convert vm_insert_xxx() into vmf_insert_xxx() and
      remove these inline wrappers, but these are a good intermediate step.
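      A sketch of the wrapper idea (illustrative only; the exact errno to
      VM_FAULT_ mapping used by the kernel may differ):

        static inline vm_fault_t vmf_insert_page(struct vm_area_struct *vma,
                                                 unsigned long addr,
                                                 struct page *page)
        {
                int err = vm_insert_page(vma, addr, page);

                if (err == -ENOMEM)
                        return VM_FAULT_OOM;
                if (err < 0 && err != -EBUSY)
                        return VM_FAULT_SIGBUS;
                return VM_FAULT_NOPAGE;
        }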
      
      Link: http://lkml.kernel.org/r/20180310162351.GA7422@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c8f4220
    • David Rientjes
      mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes · d46078b2
      David Rientjes authored
      Since the 2.6 kernel, the oom killer has slightly biased away from
      CAP_SYS_ADMIN processes by discounting some of its memory usage in
      comparison to other processes.
      
      This has always been implicit and nothing exactly relies on the
      behavior.
      
      Gaurav notices that __task_cred() can dereference a potentially freed
      pointer if the task under consideration is exiting because a reference
      to the task_struct is not held.
      
      Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.
      
      If any CAP_SYS_ADMIN process would like to be biased against, it is
      always allowed to adjust /proc/pid/oom_score_adj.
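      For example, a process that wants to recreate roughly the old discount
      for itself could do something like this (illustrative; -30 out of 1000
      approximates the old 3% bonus):

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/self/oom_score_adj", "w");

                if (!f)
                        return 1;
                fprintf(f, "-30\n");    /* make the oom killer avoid us a bit */
                fclose(f);
                return 0;
        }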
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d46078b2
    • David Rientjes
      mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory · 5ecd9d40
      David Rientjes authored
      Kswapd will not wakeup if per-zone watermarks are not failing or if too
      many previous attempts at background reclaim have failed.
      
      This can be true if there is a lot of free memory available.  For high-
      order allocations, kswapd is responsible for waking up kcompactd for
      background compaction.  If the zone is not below its watermarks or
      reclaim has recently failed (lots of free memory, nothing left to
      reclaim), kcompactd does not get woken up.
      
      When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
      woken up even if kswapd will not reclaim.  This allows high-order
      allocations, such as thp, to still trigger background compaction even
      when the zone has an abundance of free memory.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ecd9d40
    • Mark Rutland
      kernel/fork.c: detect early free of a live mm · 3eda69c9
      Mark Rutland authored
      KASAN splats indicate that in some cases we free a live mm, then
      continue to access it, with potentially disastrous results.  This is
      likely due to a mismatched mmdrop() somewhere in the kernel, but so far
      the culprit remains elusive.
      
      Let's have __mmdrop() verify that the mm isn't live for the current
      task, similar to the existing check for init_mm.  This way, we can catch
      this class of issue earlier, and without requiring KASAN.
      
      Currently, idle_task_exit() leaves active_mm stale after it switches to
      init_mm.  This isn't harmful, but will trigger the new assertions, so we
      must adjust idle_task_exit() to update active_mm.
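      One way the check could look (a sketch only; whether the kernel uses a
      WARN or BUG variant for the new assertions is an assumption here):

        void __mmdrop(struct mm_struct *mm)
        {
                BUG_ON(mm == &init_mm);                 /* existing check */
                WARN_ON_ONCE(mm == current->mm);        /* new: mm still live? */
                WARN_ON_ONCE(mm == current->active_mm);
                /* ... actual teardown follows ... */
        }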
      
      Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3eda69c9
    • Kirill Tkhai
      mm: make counting of list_lru_one::nr_items lockless · 0c7c1bed
      Kirill Tkhai authored
      During the reclaiming slab of a memcg, shrink_slab iterates over all
      registered shrinkers in the system, and tries to count and consume
      objects related to the cgroup.  In case of memory pressure, this behaves
      badly: I observe high system time and time spent in list_lru_count_one()
      for many processes on RHEL7 kernel.
      
      This patch makes list_lru_node::memcg_lrus RCU protected, which allows
      skipping the spinlock in list_lru_count_one().
      
      Shakeel Butt with the patch observes significant perf graph change.  He
      says:
      
      ========================================================================
      Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
      VM and recording the trace with 'perf record -g -a'.
      
      The trace without the patch:
      
      +  34.19%     fb.sh  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
      +  30.77%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_lock
      +   3.53%     fb.sh  [kernel.kallsyms]  [k] list_lru_count_one
      +   2.26%     fb.sh  [kernel.kallsyms]  [k] super_cache_count
      +   1.68%     fb.sh  [kernel.kallsyms]  [k] shrink_slab
      +   0.59%     fb.sh  [kernel.kallsyms]  [k] down_read_trylock
      +   0.48%     fb.sh  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
      +   0.38%     fb.sh  [kernel.kallsyms]  [k] shrink_node_memcg
      +   0.32%     fb.sh  [kernel.kallsyms]  [k] queue_work_on
      +   0.26%     fb.sh  [kernel.kallsyms]  [k] count_shadow_nodes
      
      With the patch:
      
      +   0.16%     swapper  [kernel.kallsyms]    [k] default_idle
      +   0.13%     oom_reaper  [kernel.kallsyms]    [k] mutex_spin_on_owner
      +   0.05%     perf  [kernel.kallsyms]    [k] copy_user_generic_string
      +   0.05%     init.real  [kernel.kallsyms]    [k] wait_consider_task
      +   0.05%     kworker/0:0  [kernel.kallsyms]    [k] finish_task_switch
      +   0.04%     kworker/2:1  [kernel.kallsyms]    [k] finish_task_switch
      +   0.04%     kworker/3:1  [kernel.kallsyms]    [k] finish_task_switch
      +   0.04%     kworker/1:0  [kernel.kallsyms]    [k] finish_task_switch
      +   0.03%     binary  [kernel.kallsyms]    [k] copy_page
      ========================================================================
      
      Thanks Shakeel for the testing.
      
      [ktkhai@virtuozzo.com: v2]
        Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c7c1bed
    • Colin Ian King
      mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static · f5c754d6
      Colin Ian King authored
      The bool enable_vma_readahead and swap_vma_readahead() are local to the
      source and do not need to be in global scope, so make them static.
      
      Cleans up sparse warnings:
      
        mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
        mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5c754d6
    • Jeff Moyer
      block_invalidatepage(): only release page if the full page was invalidated · 3172485f
      Jeff Moyer authored
      Prior to commit d47992f8 ("mm: change invalidatepage prototype to
      accept length"), an offset of 0 meant that the full page was being
      invalidated.  After that commit, we need to instead check the length.
      
      Jan said:
      :
      : The only possible issue is that try_to_release_page() was called more
      : often than necessary.  Otherwise the issue is harmless but still it's good
      : to have this fixed.
      
      Link: http://lkml.kernel.org/r/x49fu5rtnzs.fsf@segfault.boston.devel.redhat.com
      Fixes: d47992f8 ("mm: change invalidatepage prototype to accept length")
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3172485f
    • Mike Rapoport
      mm/swap.c: remove @cold parameter description for release_pages() · 002843de
      Mike Rapoport authored
      The 'cold' parameter was removed from release_pages function by commit
      c6f92f9f ("mm: remove cold parameter for release_pages").
      
      Update the description to match the code.
      
      Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      002843de
    • Mike Rapoport
      mm/nommu: remove description of alloc_vm_area · e48e3c59
      Mike Rapoport authored
      The alloc_vm_area() in nommu is a stub, but its description states it
      allocates kernel address space.  Remove the description to make the code
      and the documentation agree.
      
      Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e48e3c59
    • Sergey Senozhatsky
      zram: drop max_zpage_size and use zs_huge_class_size() · 60f5921a
      Sergey Senozhatsky authored
      Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class
      watermark instead, which makes more sense.
      
      TEST
      - I used a 1G zram device, LZO compression back-end, original
        data set size was 444MB. Looking at the zsmalloc class stats, the
        test ended up being pretty fair.
      
      BASE ZRAM/ZSMALLOC
      =====================
      zram mm_stat
      
      498978816 191482495 199831552        0 199831552    15634        0
      
      zsmalloc classes
      
       class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
      ...
         151  2448           0            0          1240       1240        744                3        0
         168  2720           0            0          4200       4200       2800                2        0
         190  3072           0            0         10100      10100       7575                3        0
         202  3264           0            0           380        380        304                4        0
         254  4096           0            0         10620      10620      10620                1        0
      
       Total                 7           46        106982     106187      48787                         0
      
      PATCHED ZRAM/ZSMALLOC
      =====================
      
      zram mm_stat
      
      498978816 182579184 194248704        0 194248704    15628        0
      
      zsmalloc classes
      
       class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
      ...
         151  2448           0            0          1240       1240        744                3        0
         168  2720           0            0          4200       4200       2800                2        0
         190  3072           0            0         10100      10100       7575                3        0
         202  3264           0            0          7180       7180       5744                4        0
         254  4096           0            0          3820       3820       3820                1        0
      
       Total                 8           45        106959     106193      47424                         0
      
      As we can see, we reduced the number of objects stored in class-4096,
      because a huge number of objects which we previously forcibly stored in
      class-4096 are now stored in the non-huge class-3264.  This results in lower
      memory consumption:
      
      - zsmalloc now uses 47424 physical pages, which is less than 48787 pages
        zsmalloc used before.
      
      - objects that we store in class-3264 share zspages.  That's why overall
        the number of pages that both class-4096 and class-3264 consumed went
        down from 10924 to 9564.
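      On the consumer side this boils down to roughly the following (a sketch
      of how zram can use the new zsmalloc helper; the zram field name is an
      assumption):

        /* at init: ask zsmalloc for its huge-class watermark */
        huge_class_size = zs_huge_class_size(zram->mem_pool);

        /* while storing a compressed page */
        if (comp_len >= huge_class_size)
                comp_len = PAGE_SIZE;   /* store as an incompressible, huge object */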
      
      [sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
        Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20180306070639.7389-3-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60f5921a
    • Sergey Senozhatsky
      zsmalloc: introduce zs_huge_class_size() · 010b495e
      Sergey Senozhatsky authored
      Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.
      
      ZRAM's max_zpage_size is a bad thing.  It forces zsmalloc to store
      normal objects as huge ones, which results in bigger zsmalloc memory
      usage.  Drop it and use the actual zsmalloc huge-class value when deciding if
      the object is huge or not.
      
      This patch (of 2):
      
      Not every object can share its zspage with other objects, e.g.  when
      the object is as big as a zspage or nearly as big as a zspage.  For such
      objects zsmalloc has a so called huge class - every object which belongs
      to huge class consumes the entire zspage (which consists of a physical
      page).  On x86_64, PAGE_SHIFT 12 box, the first non-huge class size is
      3264, so starting down from size 3264, objects can share page(-s) and
      thus minimize memory wastage.
      
      ZRAM, however, has its own statically defined watermark for huge
      objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
      object larger than this watermark (3072) as a PAGE_SIZE object, in other
      words, to a huge class, while zsmalloc can keep some of those objects in
      non-huge classes.  This results in increased memory consumption.
      
      zsmalloc knows better if the object is huge or not.  Introduce
      zs_huge_class_size() function which tells if the given object can be
      stored in one of the non-huge classes or not.  This will let us drop
      ZRAM's huge object watermark and fully rely on zsmalloc when we decide
      if the object is huge.
      
      [sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
        Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      010b495e
    • Huang Ying
      mm: fix races between swapoff and flush dcache · cb9f753a
      Huang Ying authored
      Thanks to commit 4b3ef9da ("mm/swap: split swap cache into 64MB
      trunks"), after swapoff the address_space associated with the swap
      device will be freed.  So page_mapping() users which may touch the
      address_space need some kind of mechanism to prevent the address_space
      from being freed during accessing.
      
      The dcache flushing functions (flush_dcache_page(), etc) in architecture
      specific code may access the address_space of swap device for anonymous
      pages in swap cache via page_mapping() function.  But in some cases
      there are no mechanisms to prevent the swap device from being swapoff,
      for example,
      
        CPU1					CPU2
        __get_user_pages()			swapoff()
          flush_dcache_page()
            mapping = page_mapping()
              ...				  exit_swap_address_space()
              ...				    kvfree(spaces)
              mapping_mapped(mapping)
      
      The address space may be accessed after being freed.
      
      But according to cachetlb.txt and Russell King, flush_dcache_page() only
      cares about file cache pages; for anonymous pages, flush_anon_page() should be
      used.  The implementation of flush_dcache_page() in all architectures
      follows this too.  They will check whether page_mapping() is NULL and
      whether mapping_mapped() is true to determine whether to flush the
      dcache immediately.  And they will use interval tree (mapping->i_mmap)
      to find all user space mappings.  While mapping_mapped() and
      mapping->i_mmap aren't used by anonymous pages in swap cache at all.
      
      So, to fix the race between swapoff and dcache flushing,
      page_mapping_file() is added to return the address_space for file cache
      pages and NULL otherwise.  All page_mapping() calls in the dcache
      flushing functions are replaced with page_mapping_file().
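      Based on the description above, the new helper is essentially (a
      sketch, not necessarily the exact kernel code):

        struct address_space *page_mapping_file(struct page *page)
        {
                /* anonymous pages in swap cache have no file mapping */
                if (unlikely(PageSwapCache(page)))
                        return NULL;
                return page_mapping(page);
        }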
      
      [akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
      Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb9f753a
    • Nikolay Borisov
      fs/direct-io.c: minor cleanups in do_blockdev_direct_IO · 1c0ff0f1
      Nikolay Borisov authored
      We already get the block counts and calculate the end block at the
      beginning of the function.  Let's use the local variables for
      consistency and readability.  No functional changes
      
      [akpm@linux-foundation.org: constify the locals to prevent future slipups]
      Link: http://lkml.kernel.org/r/1519638870-17756-1-git-send-email-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c0ff0f1
    • Guenter Roeck
      include/linux/mm.h: provide consistent declaration for num_poisoned_pages · 5844a486
      Guenter Roeck authored
      clang reports the following compile warning.
      
        In file included from mm/vmscan.c:56:
        ./include/linux/swapops.h:327:22: warning:
      	section attribute is specified on redeclared variable [-Wsection]
        extern atomic_long_t num_poisoned_pages __read_mostly;
                             ^
        ./include/linux/mm.h:2585:22: note: previous declaration is here
        extern atomic_long_t num_poisoned_pages;
                           ^
      
      Let's use __read_mostly everywhere.
      
      Link: http://lkml.kernel.org/r/1519686565-8224-1-git-send-email-linux@roeck-us.net
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5844a486
    • Dan Williams
      device-dax: implement ->pagesize() for smaps to report MMUPageSize · c1d53b92
      Dan Williams authored
      Given that device-dax is making similar page mapping size guarantees as
      hugetlbfs, emit the size in smaps and any other kernel path that
      requests the mapping size of a vma.
      
      Link: http://lkml.kernel.org/r/151996255287.27922.18397777516059080245.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1d53b92
    • Dan Williams
      mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct · 05ea8860
      Dan Williams authored
      When device-dax is operating in huge-page mode we want it to behave like
      hugetlbfs and report the MMU page mapping size that is being enforced by
      the vma.
      
      Similar to commit 31383c68 "mm, hugetlbfs: introduce ->split() to
      vm_operations_struct" it would be messy to teach vma_mmu_pagesize()
      about device-dax page mapping sizes in the same (hstate) way that
      hugetlbfs communicates this attribute.  Instead, these patches introduce
      a new ->pagesize() vm operation.
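      Roughly, the new hook and its generic use look like this (a sketch based
      on the description above; details may differ from the final code):

        /* in struct vm_operations_struct */
        unsigned long (*pagesize)(struct vm_area_struct *area);

        /* generic lookup */
        unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
        {
                if (vma->vm_ops && vma->vm_ops->pagesize)
                        return vma->vm_ops->pagesize(vma);
                return PAGE_SIZE;
        }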
      
      Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05ea8860
    • Dan Williams
      mm, powerpc: use vma_kernel_pagesize() in vma_mmu_pagesize() · 09135cc5
      Dan Williams authored
      Patch series "mm, smaps: MMUPageSize for device-dax", v3.
      
      Similar to commit 31383c68 ("mm, hugetlbfs: introduce ->split() to
      vm_operations_struct") here is another occasion where we want
      special-case hugetlbfs/hstate enabling to also apply to device-dax.
      
      This prompts the question what other hstate conversions we might do
      beyond ->split() and ->pagesize(), but this appears to be the last of
      the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.
      
      This patch (of 3):
      
      The current powerpc definition of vma_mmu_pagesize() open codes looking
      up the page size via hstate.  It is identical to the generic
      vma_kernel_pagesize() implementation.
      
      Now, vma_kernel_pagesize() is growing support for determining the page
      size of Device-DAX vmas in addition to the existing Hugetlbfs page size
      determination.
      
      Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
      would automatically benefit from any new vma-type support that is added
      to vma_kernel_pagesize().  However, the powerpc vma_mmu_pagesize() is
      prevented from calling vma_kernel_pagesize() due to a circular header
      dependency that requires vma_mmu_pagesize() to be defined before
      including <linux/hugetlb.h>.
      
      Break this circular dependency by defining the default vma_mmu_pagesize()
      as a __weak symbol to be overridden by the powerpc version.
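      A sketch of what that looks like (illustrative):

        /* generic default; an architecture can provide a strong override */
        unsigned long __weak vma_mmu_pagesize(struct vm_area_struct *vma)
        {
                return vma_kernel_pagesize(vma);
        }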
      
      Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Jane Chu <jane.chu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09135cc5
    • Mario Leinweber
      mm/gup.c: fix coding style issues. · 2923117b
      Mario Leinweber authored
      - Fixed style error: 8 spaces -> 1 tab.
      - Fixed style warning: Corrected misleading indentation.
      
      Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
      Signed-off-by: Mario Leinweber <marioleinweber@web.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2923117b
    • Aaron Lu
      mm/free_pcppages_bulk: prefetch buddy while not holding lock · 97334162
      Aaron Lu authored
      When a page is freed back to the global pool, its buddy is checked to
      see whether a merge is possible.  This requires accessing the buddy's
      page structure, and that access can take a long time if the cacheline
      is cold.
      
      This patch adds a prefetch of the to-be-freed page's buddy outside of
      zone->lock, in the hope that accessing the buddy's page structure later
      under zone->lock will be faster.  Since we *always* do buddy merging
      and check an order-0 page's buddy to try to merge it when it goes into
      the main allocator, the cacheline will always come in, i.e.  the
      prefetched data will never be unused.
      
      Normally, the number of prefetches will be pcp->batch (default 31, with
      an upper limit of (PAGE_SHIFT * 8) = 96 on x86_64), but when the pcp's
      pages are drained completely it will be pcp->count, which is bounded by
      pcp->high.  Although pcp->high defaults to 186 (pcp->batch = 31 * 6),
      it can be changed by the user through
      /proc/sys/vm/percpu_pagelist_fraction and has no software upper limit,
      so it could be large, e.g.  several thousand.  For this reason, only
      the buddies of the first pcp->batch pages are prefetched, to avoid
      excessive prefetching.
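      
      As a rough sketch of the mechanism (illustrative helper name; see
      mm/page_alloc.c for the real code), this runs while pages are picked
      off the pcp list, before zone->lock is taken:
      
          static inline void prefetch_buddy_sketch(struct page *page)
          {
                  unsigned long pfn = page_to_pfn(page);
                  unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0); /* pfn ^ 1 at order 0 */
                  struct page *buddy = page + (buddy_pfn - pfn);
      
                  prefetch(buddy);  /* warm the buddy's 'struct page' cacheline */
          }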
      
      In the meantime, there are two concerns:
      
       1. the prefetch could potentially evict existing cachelines, especially
          for L1D cache since it is not huge
      
       2. there is some additional instruction overhead, namely calculating
          buddy pfn twice
      
      For 1, it's hard to say; the microbenchmark here shows a good result,
      but the actual benefit of this patch will be workload/CPU dependent.
      
      For 2, since the calculation is an XOR on two local variables, the
      cycles spent are expected in many cases to be offset by the reduced
      memory latency later.  This is especially true on NUMA machines where
      multiple CPUs contend on zone->lock, and the most time-consuming part
      under zone->lock is waiting for the 'struct page' cachelines of the
      to-be-freed pages and their buddies.
      
      Test with will-it-scale/page_fault1 full load:
      
        kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
        v4.16-rc2+  9034215        7971818       13667135       15677465
        patch2/3    9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
        this patch 10180856 +6.8%  8506369 +2.3% 14756865 +4.9% 17325324 +3.9%
      
      Note: the improvement percentages for this patch are measured against patch2/3.
      
      (Changelog stolen from Dave Hansen and Mel Gorman's comments at
      http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)
      
      [aaron.lu@intel.com: use helper function, avoid disordering pages]
        Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
        Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
      [aaron.lu@intel.com: v4]
        Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
        Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
      Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Suggested-by: Ying Huang <ying.huang@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97334162
    • Aaron Lu's avatar
      mm/free_pcppages_bulk: do not hold lock when picking pages to free · 0a5f4e5b
      Aaron Lu authored
      When freeing a batch of pages from the per-CPU pages (PCP) back to the
      buddy allocator, zone->lock is held while pages are chosen from the
      PCP's migratetype lists.  There is actually no need to do this 'choose'
      part under the lock: these are PCP pages, the only CPU that can touch
      them is us, and IRQs are also disabled.
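      
      Sketched with illustrative names (the real free_pcppages_bulk()
      round-robins over all migratetype lists and handles isolated
      pageblocks), the reordering looks like:
      
          static void free_pcp_batch_sketch(struct zone *zone,
                                            struct per_cpu_pages *pcp, int count)
          {
                  struct page *page, *tmp;
                  LIST_HEAD(head);
      
                  /* Step 1: no zone->lock needed.  IRQs are already off, so
                   * this CPU's pcp lists cannot change while we pick pages. */
                  while (count-- && !list_empty(&pcp->lists[MIGRATE_UNMOVABLE])) {
                          page = list_first_entry(&pcp->lists[MIGRATE_UNMOVABLE],
                                                  struct page, lru);
                          list_move_tail(&page->lru, &head);
                  }
      
                  /* Step 2: only the hand-back to the buddy free lists runs
                   * under zone->lock. */
                  spin_lock(&zone->lock);
                  list_for_each_entry_safe(page, tmp, &head, lru)
                          __free_one_page(page, page_to_pfn(page), zone, 0,
                                          get_pcppage_migratetype(page));
                  spin_unlock(&zone->lock);
          }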
      
      Moving this part outside the lock reduces the lock hold time and
      improves performance.  Test with will-it-scale/page_fault1 at full
      load:
      
        kernel      Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
        v4.16-rc2+  9034215        7971818       13667135       15677465
        this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
      
      The test starts $nr_cpu processes, each of which repeatedly does the
      following for 5 minutes:
      
       - mmap 128M of anonymous space
      
       - write to that space
      
       - munmap it
      
      The score is the aggregated iteration count.
      
      https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
      
      Link: http://lkml.kernel.org/r/20180301062845.26038-3-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a5f4e5b
    • Aaron Lu's avatar
      mm/free_pcppages_bulk: update pcp->count inside · 77ba9062
      Aaron Lu authored
      Matthew Wilcox found that all callers of free_pcppages_bulk() currently
      update pcp->count immediately afterwards, so it's natural to do it
      inside free_pcppages_bulk() itself.
      
      No functionality or performance change is expected from this patch.
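      
      Illustratively (variable names may differ from the actual call sites),
      callers change from:
      
          free_pcppages_bulk(zone, to_drain, pcp);
          pcp->count -= to_drain;
      
      to a bare free_pcppages_bulk(zone, to_drain, pcp), with the single
      'pcp->count -= count' adjustment done inside free_pcppages_bulk() once
      it has taken the pages off the pcp lists.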
      
      Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77ba9062
    • David Rientjes's avatar
      mm, compaction: drain pcps for zone when kcompactd fails · bc3106b2
      David Rientjes authored
      It's possible for free pages to become stranded on per-cpu pagesets
      (pcps) that, if drained, could be merged with buddy pages on the zone's
      free area to form large order pages, including up to MAX_ORDER.
      
      Consider a verbose example using the tools/vm/page-types tool at the
      beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
      a slab page).  Pages on pcps do not have any page flags set.
      
        109954  1       _______S________________________________________________________
        109955  2       __________B_____________________________________________________
        109957  1       ________________________________________________________________
        109958  1       __________B_____________________________________________________
        109959  7       ________________________________________________________________
        109960  1       __________B_____________________________________________________
        109961  9       ________________________________________________________________
        10996a  1       __________B_____________________________________________________
        10996b  3       ________________________________________________________________
        10996e  1       __________B_____________________________________________________
        10996f  1       ________________________________________________________________
        ...
        109f8c  1       __________B_____________________________________________________
        109f8d  2       ________________________________________________________________
        109f8f  2       __________B_____________________________________________________
        109f91  f       ________________________________________________________________
        109fa0  1       __________B_____________________________________________________
        109fa1  7       ________________________________________________________________
        109fa8  1       __________B_____________________________________________________
        109fa9  1       ________________________________________________________________
        109faa  1       __________B_____________________________________________________
        109fab  1       _______S________________________________________________________
      
      The compaction migration scanner is attempting to defragment this memory
      since it is at the beginning of the zone.  It has done so quite well,
      all movable pages have been migrated.  From pfn [0x109955, 0x109fab),
      there are only buddy pages and pages without flags set.
      
      These pages may be stranded on pcps; if freed back to the zone's free
      area, this memory could be coalesced.  It is possible that some of
      these pages are not on pcps and that something has called alloc_pages()
      and is using the memory directly, but in those cases we rely on the
      absence of __GFP_MOVABLE to allocate from MIGRATE_UNMOVABLE pageblocks,
      to try to keep these MIGRATE_MOVABLE pageblocks as free as possible.
      
      These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
      allow for three transparent hugepages to be dynamically allocated.
      Running the numbers for all such spans on the system, it was found that
      there were over 400 such spans of only buddy pages and pages without
      flags set at the time this /proc/kpageflags sample was collected.
      Without this support, there were _no_ order-9 or order-10 pages free.
      
      When kcompactd fails to defragment memory such that a cc.order page can
      be allocated, drain all pcps for the zone back to the buddy allocator so
      this stranding cannot occur.  Compaction for that order will
      subsequently be deferred, which acts as a ratelimit on this drain.
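      
      Schematically (illustrative condition, not the verbatim hunk), the
      kcompactd failure path becomes:
      
          if (kcompactd_failed_at_order) {        /* hypothetical flag for the sketch */
                  /* Flush the per-cpu pagesets so stranded pages can merge
                   * into the buddy free lists ... */
                  drain_all_pages(zone);
                  /* ... and defer further compaction at this order, which
                   * also ratelimits how often this drain can happen. */
                  defer_compaction(zone, cc.order);
          }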
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc3106b2
    • Howard McLauchlan's avatar
      mm: make should_failslab always available for fault injection · 4f6923fb
      Howard McLauchlan authored
      should_failslab() is a convenient function to hook into for directed
      error injection into kmalloc().  However, it is only available if a
      config flag is set.
      
      The following BCC script, for example, fails kmalloc() calls after a
      btrfs umount:
      
          from bcc import BPF
      
          prog = r"""
          BPF_HASH(flag);
      
          #include <linux/mm.h>
      
          int kprobe__btrfs_close_devices(void *ctx) {
                  u64 key = 1;
                  flag.update(&key, &key);
                  return 0;
          }
      
          int kprobe__should_failslab(struct pt_regs *ctx) {
                  u64 key = 1;
                  u64 *res;
                  res = flag.lookup(&key);
                  if (res != 0) {
                      bpf_override_return(ctx, -ENOMEM);
                  }
                  return 0;
          }
          """
          b = BPF(text=prog)
      
          while 1:
              b.kprobe_poll()
      
      This patch refactors the should_failslab implementation so that the
      function is always available for error injection, independent of flags.
      
      This change would be similar in nature to commit f5490d3ec921 ("block:
      Add should_fail_bio() for bpf error injection").
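      
      The shape of the refactor, sketched (details may differ from the actual
      mm/ changes): keep a real, non-inline should_failslab() symbol in every
      configuration so error injection can always attach to it, and confine
      the CONFIG_FAILSLAB-specific logic to a helper:
      
          /* __should_failslab() holds the CONFIG_FAILSLAB logic and falls
           * back to a stub returning false when that option is off. */
          noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
          {
                  if (__should_failslab(s, gfpflags))
                          return -ENOMEM;
                  return 0;
          }
          ALLOW_ERROR_INJECTION(should_failslab, ERRNO);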
      
      Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
      Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f6923fb
    • Dou Liyang's avatar
      mm/page_poison.c: make early_page_poison_param() __init · 14298d36
      Dou Liyang authored
      An early_param() handler is only called during kernel initialization,
      so Linux marks such functions with the __init macro to let their memory
      be reclaimed after boot.
      
      However, early_page_poison_param() was missed.  Mark it __init as well.
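      
      For reference, the pattern looks like this (illustrative parameter and
      handler names, not the actual page_poison code):
      
          #include <linux/init.h>
          #include <linux/string.h>
      
          static bool example_enabled;
      
          /* __init: the handler runs once during boot, so its text is freed
           * together with the rest of the .init sections afterwards. */
          static int __init early_example_param(char *buf)
          {
                  if (buf && !strcmp(buf, "on"))
                          example_enabled = true;
                  return 0;
          }
          early_param("example_param", early_example_param);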
      
      Link: http://lkml.kernel.org/r/20180117034757.27024-1-douly.fnst@cn.fujitsu.com
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14298d36
    • Dou Liyang's avatar
      mm/page_owner.c: make early_page_owner_param() __init · 1173194e
      Dou Liyang authored
      An early_param() handler is only called during kernel initialization,
      so Linux marks such functions with the __init macro to let their memory
      be reclaimed after boot.
      
      However, early_page_owner_param() was missed.  Mark it __init as well.
      
      Link: http://lkml.kernel.org/r/20180117034736.26963-1-douly.fnst@cn.fujitsu.com
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1173194e