1. 16 Nov, 2017 40 commits
    • Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 487e2c9f
      Linus Torvalds authored
      Pull AFS updates from David Howells:
       "kAFS filesystem driver overhaul.
      
        The major points of the overhaul are:
      
         (1) Preliminary groundwork is laid for supporting network-namespacing
             of kAFS. The remainder of the namespacing work requires some way
             to pass namespace information to submounts triggered by an
             automount. This requires something like the mount overhaul that's
             in progress.
      
         (2) sockaddr_rxrpc is used in preference to in_addr for holding
             addresses internally, and support is added for talking to the YFS VL
             server. With this, kAFS can do everything over IPv6 as well as
             IPv4 if it's talking to servers that support it.
      
         (3) Callback handling is overhauled to be generally passive rather
             than active. 'Callbacks' are promises by the server to tell us
             about data and metadata changes. Callbacks are now checked when
             we next touch an inode rather than actively going and looking for
             it where possible.
      
         (4) File access permit caching is overhauled to store the caching
             information per-inode rather than per-directory, shared over
             subordinate files. Whilst older AFS servers only allow ACLs on
             directories (shared to the files in that directory), newer AFS
             servers break that restriction.
      
             To improve memory usage and to make it easier to do mass-key
             removal, permit combinations are cached and shared.
      
         (5) Cell database management is overhauled to allow lighter locks to
             be used and to make cell records autonomous state machines that
             look after getting their own DNS records and cleaning themselves
             up, in particular preventing races in acquiring and relinquishing
             the fscache token for the cell.
      
         (6) Volume caching is overhauled. The afs_vlocation record is got rid
             of to simplify things and the superblock is now keyed on the cell
             and the numeric volume ID only. The volume record is tied to a
             superblock and normal superblock management is used to mediate
             the lifetime of the volume fscache token.
      
         (7) File server record caching is overhauled to make server records
             independent of cells and volumes. A server can be in multiple
             cells (in such a case, the administrator must make sure that the
             VL services for all cells correctly reflect the volumes shared
             between those cells).
      
             Server records are now indexed using the UUID of the server
             rather than the address since a server can have multiple
             addresses.
      
         (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
             similar), VOFFLINE and VNOVOL indications and to handle rotation
             both of servers and addresses of those servers. The rotation will
             also wait and retry if the server says it is busy.
      
         (9) Data writeback is overhauled. Each inode no longer stores a list
             of modified sections tagged with the key that authorised it in
             favour of noting the modified region of a page in page->private
             and storing a list of keys that made modifications in the inode.
      
             This simplifies things and allows other keys to be used to
             actually write to the server if a key that made a modification
             becomes useless.
      
        (10) Writable mmap() is implemented. This allows a kernel to be built
             entirely on AFS.
      
        Note that pre-AFS-3.4 servers are no longer supported, though this can
        be added back if necessary (AFS-3.4 was released in 1998)"
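Point (3) above describes checking a callback promise lazily, when the inode is next touched, instead of actively looking for changes. A minimal userspace sketch of that idea follows. The struct, its fields, and the function names are hypothetical illustrations and are not taken from the kAFS sources.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-inode callback state; not the kAFS data structure. */
struct demo_inode {
    time_t cb_expires_at;   /* when the server's promise lapses */
    bool   cb_broken;       /* set when the server reports a change */
};

/* The server broke its promise: just record it, do not refetch here. */
void demo_break_callback(struct demo_inode *inode)
{
    inode->cb_broken = true;
}

/*
 * Called when the inode is next touched. The cached status may be used
 * only if the promise is both unbroken and unexpired.
 */
bool demo_needs_revalidation(const struct demo_inode *inode, time_t now)
{
    return inode->cb_broken || now >= inode->cb_expires_at;
}
```

In this sketch the break notification is cheap because it only sets a flag; the cost of refetching status is paid by whichever access touches the inode next.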
      
      * tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
        afs: Protect call->state changes against signals
        afs: Trace page dirty/clean
        afs: Implement shared-writeable mmap
        afs: Get rid of the afs_writeback record
        afs: Introduce a file-private data record
        afs: Use a dynamic port if 7001 is in use
        afs: Fix directory read/modify race
        afs: Trace the sending of pages
        afs: Trace the initiation and completion of client calls
        afs: Fix documentation on # vs % prefix in mount source specification
        afs: Fix total-length calculation for multiple-page send
        afs: Only progress call state at end of Tx phase from rxrpc callback
        afs: Make use of the YFS service upgrade to fully support IPv6
        afs: Overhaul volume and server record caching and fileserver rotation
        afs: Move server rotation code into its own file
        afs: Add an address list concept
        afs: Overhaul cell database management
        afs: Overhaul permit caching
        afs: Overhaul the callback handling
        afs: Rename struct afs_call server member to cm_server
        ...
    • Merge tag 'pinctrl-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · b630a23a
      Linus Torvalds authored
      Pull pin control updates from Linus Walleij:
       "This is the bulk of pin control changes for the v4.15 kernel cycle:
      
        Core:
      
         - The pin control Kconfig entry PINCTRL is now turned into a
           menuconfig option. This obviously has the implication of making the
           subsystem menu visible in menuconfig. This is happening because of
           two things:
      
            (a) Intel have started to deploy and depend on pin controllers in
                a way that is affecting users directly. This happens on the
                highly integrated laptop chipsets named after geographical
                places: baytrail, broxton, cannonlake, cedarfork, cherryview,
                denverton, geminilake, lewisburg, merrifield, sunrisepoint...
                It started a while back and now it is ever more evident that
                this is crucial infrastructure for x86 laptops and not an
                embedded obscurity anymore. Users need to be aware.
      
            (b) Pin control expanders on I2C and SPI that are arch-agnostic.
                Currently Semtech SX150X and Microchip MCP28x08 but more are
                expected. Users will have to be able to configure these in
                directly for their set-up.
      
         - Just go and select GPIOLIB now that we made sure that GPIOLIB is a
           very vanilla subsystem. Do not depend on it, if we need it, select
           it.
      
         - Exposing the pin control subsystem in menuconfig uncovered a bunch
           of obscure bugs that are now hopefully fixed, all more or less
           pertaining to Blackfin.
      
         - Unified namespace for cross-calls between pin control and GPIO.
      
         - New support for clock skew/delay generic DT bindings and generic
           pin config options for this.
      
         - Minor documentation improvements.
      
        Various:
      
         - The Renesas SH-PFC pin controller has evolved a lot. It seems
           Renesas are churning out new SoCs by the minute.
      
         - A bunch of non-critical fixes for the Rockchip driver.
      
         - Improve the use of library functions instead of open coding.
      
         - Support the MCP28018 variant in the MCP28x08 driver.
      
         - Static constifying"
      
      * tag 'pinctrl-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (91 commits)
        pinctrl: gemini: Fix missing pad descriptions
        pinctrl: Add some depends on HAS_IOMEM
        pinctrl: samsung/s3c24xx: add CONFIG_OF dependency
        pinctrl: gemini: Fix GMAC groups
        pinctrl: qcom: spmi-gpio: Add pmi8994 gpio support
        pinctrl: ti-iodelay: remove redundant unused variable dev
        pinctrl: max77620: Use common error handling code in max77620_pinconf_set()
        pinctrl: gemini: Implement clock skew/delay config
        pinctrl: gemini: Use generic DT parser
        pinctrl: Add skew-delay pin config and bindings
        pinctrl: armada-37xx: Add edge both type gpio irq support
        pinctrl: uniphier: remove eMMC hardware reset pin-mux
        pinctrl: rockchip: Add iomux-route switching support for rk3288
        pinctrl: intel: Add Intel Cedar Fork PCH pin controller support
        pinctrl: intel: Make offset to interrupt status register configurable
        pinctrl: sunxi: Enforce the strict mode by default
        pinctrl: sunxi: Disable strict mode for old pinctrl drivers
        pinctrl: sunxi: Introduce the strict flag
        pinctrl: sh-pfc: Save/restore registers for PSCI system suspend
        pinctrl: sh-pfc: r8a7796: Use generic IOCTRL register description
        ...
    • Merge tag 'backlight-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight · 9c7a867e
      Linus Torvalds authored
      Pull backlight updates from Lee Jones:
      
         - handle 32bit overflow in pwm_bl
      
         - remove redundant code/checks in tps65217_bl and ili922x
      
      * tag 'backlight-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
        backlight: ili922x: Remove redundant variable len
        backlight: tps65217_bl: Remove unnecessary default brightness check
        backlight: pwm_bl: Fix overflow condition
    • Merge tag 'mfd-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · d3092e4e
      Linus Torvalds authored
      Pull MFD updates from Lee Jones:
       "New drivers:
         - Add support for Cherry Trail Dollar Cove TI PMIC
         - Add support for Spreadtrum SC27xx series PMICs
      
        New device support:
         - Add regulator support to axp20x
      
        New functionality:
         - Add DT support; aspeed-scu sc27xx-pmic
         - Add power saving support; rts5249
      
        Fix-ups:
         - DT clean-up/rework; tps65217, max77693, iproc-cdru, iproc-mhb, tps65218
         - Staticise/constify; stw481x
         - Use new succinct IRQ API; fsl-imx25-tsadc
         - Kconfig fix-ups; MFD_TPS65218
         - Identify SPI method; lpc_ich
         - Use managed resources (devm_*) calls; ssbi
         - Remove unused/obsolete code/documentation; mc13xxx
      
        Bug fixes:
         - Fix typo in MAINTAINERS
         - Fix error handling; mxs-lradc
         - Clean-up IRQs on .remove; fsl-imx25-tsadc"
      
      * tag 'mfd-next-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (21 commits)
        dt-bindings: mfd: mc13xxx: Remove obsolete property
        mfd: axp20x: Add axp20x-regulator cell for AXP813
        mfd: Add Spreadtrum SC27xx series PMICs driver
        dt-bindings: mfd: Add Spreadtrum SC27xx PMIC documentation
        mfd: ssbi: Use devm_of_platform_populate()
        mfd: fsl-imx25: Clean up irq settings during removal
        mfd: mxs-lradc: Fix error handling in mxs_lradc_probe()
        mfd: lpc_ich: Avoton/Rangeley uses SPI_BYT method
        mfd: tps65218: Introduce dependency on CONFIG_OF
        mfd: tps65218: Correct the config description
        MAINTAINERS: Fix Dialog search term for watchdog binding file
        mfd: fsl-imx25: Set irq handler and data in one go
        mfd: rts5249: Add support for RTS5250S power saving
        ACPI / PMIC: Add opregion driver for Intel Dollar Cove TI PMIC
        mfd: Add support for Cherry Trail Dollar Cove TI PMIC
        syscon: dt-bindings: Add binding document for iProc MHB block
        syscon: dt-bindings: Add binding doc for Broadcom iProc CDRU
        mfd: max77693: Add muic of_compatible in mfd_cell
        mfd: stw481x: Make three arrays static const, reduces object code size
        mfd: tps65217: Introduce dependency on CONFIG_OF
        ...
    • Merge tag 'char-misc-4.15-rc1' of... · 2bf16b7a
      Linus Torvalds authored
      Merge tag 'char-misc-4.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
      
      Pull char/misc updates from Greg KH:
       "Here is the big set of char/misc and other driver subsystem patches
        for 4.15-rc1.
      
        There are small changes all over here, hyperv driver updates, pcmcia
        driver updates, w1 driver updates, vme driver updates, nvmem driver
        updates, and lots of other little one-off driver updates as well. The
        shortlog has the full details.
      
        All of these have been in linux-next for quite a while with no
        reported issues"
      
      * tag 'char-misc-4.15-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (90 commits)
        VME: Return -EBUSY when DMA list in use
        w1: keep balance of mutex locks and refcnts
        MAINTAINERS: Update VME subsystem tree.
        nvmem: sunxi-sid: add support for A64/H5's SID controller
        nvmem: imx-ocotp: Update module description
        nvmem: imx-ocotp: Enable i.MX7D OTP write support
        nvmem: imx-ocotp: Add i.MX7D timing write clock setup support
        nvmem: imx-ocotp: Move i.MX6 write clock setup to dedicated function
        nvmem: imx-ocotp: Add support for banked OTP addressing
        nvmem: imx-ocotp: Pass parameters via a struct
        nvmem: imx-ocotp: Restrict OTP write to IMX6 processors
        nvmem: uniphier: add UniPhier eFuse driver
        dt-bindings: nvmem: add description for UniPhier eFuse
        nvmem: set nvmem->owner to nvmem->dev->driver->owner if unset
        nvmem: qfprom: fix different address space warnings of sparse
        nvmem: mtk-efuse: fix different address space warnings of sparse
        nvmem: mtk-efuse: use stack for nvmem_config instead of malloc'ing it
        nvmem: imx-iim: use stack for nvmem_config instead of malloc'ing it
        thunderbolt: tb: fix use after free in tb_activate_pcie_devices
        MAINTAINERS: Add git tree for Thunderbolt development
        ...
    • Merge tag 'driver-core-4.15-rc1' of... · b9743042
      Linus Torvalds authored
      Merge tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core updates from Greg KH:
       "Here is the set of driver core / debugfs patches for 4.15-rc1.
      
        Not many here, mostly all are debugfs fixes to resolve some
        long-reported problems with files going away with references to them
        in userspace. There's also some SPDX cleanups for the debugfs code, as
        well as a few other minor driver core changes for issues reported by
        people.
      
        All of these have been in linux-next for a week or more with no
        reported issues"
      
      * tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        driver core: Fix device link deferred probe
        debugfs: Remove redundant license text
        debugfs: add SPDX identifiers to all debugfs files
        debugfs: defer debugfs_fsdata allocation to first usage
        debugfs: call debugfs_real_fops() only after debugfs_file_get()
        debugfs: purge obsolete SRCU based removal protection
        IB/hfi1: convert to debugfs_file_get() and -put()
        debugfs: convert to debugfs_file_get() and -put()
        debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
        debugfs: implement per-file removal protection
        debugfs: add support for more elaborate ->d_fsdata
        driver core: Move device_links_purge() after bus_remove_device()
        arch_topology: Fix section miss match warning due to free_raw_capacity()
        driver-core: pr_err() strings should end with newlines
    • Merge tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux · e60e1ee6
      Linus Torvalds authored
      Pull drm updates from Dave Airlie:
       "This is the main drm pull request for v4.15.
      
        Core:
         - Atomic object lifetime fixes
         - Atomic iterator improvements
         - Sparse/smatch fixes
         - Legacy kms ioctls to be interruptible
         - EDID override improvements
         - fb/gem helper cleanups
         - Simple outreachy patches
         - Documentation improvements
         - Fix dma-buf rcu races
         - DRM mode object leasing for improving VR use cases.
         - vgaarb improvements for non-x86 platforms.
      
        New driver:
         - tve200: Faraday Technology TVE200 block.
      
           This "TV Encoder" encodes a ITU-T BT.656 stream and can be found in
           the StorLink SL3516 (later Cortina Systems CS3516) as well as the
           Grain Media GM8180.
      
        New bridges:
         - SiI9234 support
      
        New panels:
         - S6E63J0X03, OTM8009A, Seiko 43WVF1G, 7" rpi touch panel, Toshiba
           LT089AC19000, Innolux AT043TN24
      
        i915:
         - Remove Coffeelake from alpha support
         - Cannonlake workarounds
         - Infoframe refactoring for DisplayPort
         - VBT updates
         - DisplayPort vswing/emph/buffer translation refactoring
         - CCS fixes
         - Restore GPU clock boost on missed vblanks
         - Scatter list updates for userptr allocations
         - Gen9+ transition watermarks
         - Display IPC (Isochronous Priority Control)
         - Private PAT management
         - GVT: improved error handling and pci config sanitizing
         - Execlist refactoring
         - Transparent Huge Page support
         - User defined priorities support
         - HuC/GuC firmware refactoring
         - DP MST fixes
         - eDP power sequencing fixes
         - Use RCU instead of stop_machine
         - PSR state tracking support
         - Eviction fixes
         - BDW DP aux channel timeout fixes
         - LSPCON fixes
         - Cannonlake PLL fixes
      
        amdgpu:
         - Per VM BO support
         - Powerplay cleanups
         - CI powerplay support
         - PASID mgr for kfd
         - SR-IOV fixes
         - initial GPU reset for vega10
         - Prime mmap support
         - TTM updates
         - Clock query interface for Raven
         - Fence to handle ioctl
         - UVD encode ring support on Polaris
         - Transparent huge page DMA support
         - Compute LRU pipe tweaks
         - BO flag to allow buffers to opt out of implicit sync
         - CTX priority setting API
         - VRAM lost infrastructure plumbing
      
        qxl:
         - fix flicker since atomic rework
      
        amdkfd:
         - Further improvements from internal AMD tree
         - Usermode events
         - Drop radeon support
      
        nouveau:
         - Pascal temperature sensor support
         - Improved BAR2 handling
         - MMU rework to support Pascal MMU
      
        exynos:
         - Improved HDMI/mixer support
         - HDMI audio interface support
      
        tegra:
         - Prep work for tegra186
         - Cleanup/fixes
      
        msm:
         - Preemption support for a5xx
         - Display fixes for 8x96 (snapdragon 820)
         - Async cursor plane fixes
         - FW loading rework
         - GPU debugging improvements
      
        vc4:
         - Prep for DSI panels
         - fix T-format tiling scanout
         - New madvise ioctl
      
        Rockchip:
         - LVDS support
      
        omapdrm:
         - omap4 HDMI CEC support
      
        etnaviv:
         - GPU performance counters groundwork
      
        sun4i:
         - refactor driver load + TCON backend
         - HDMI improvements
         - A31 support
         - Misc fixes
      
        udl:
         - Probe/EDID read fixes.
      
        tilcdc:
         - Misc fixes.
      
        pl111:
         - Support more variants
      
        adv7511:
         - Improve EDID handling.
         - HDMI CEC support
      
        sii8620:
         - Add remote control support"
      
      * tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux: (1480 commits)
        drm/rockchip: analogix_dp: Use mutex rather than spinlock
        drm/mode_object: fix documentation for object lookups.
        drm/i915: Reorder context-close to avoid calling i915_vma_close() under RCU
        drm/i915: Move init_clock_gating() back to where it was
        drm/i915: Prune the reservation shared fence array
        drm/i915: Idle the GPU before shinking everything
        drm/i915: Lock llist_del_first() vs llist_del_all()
        drm/i915: Calculate ironlake intermediate watermarks correctly, v2.
        drm/i915: Disable lazy PPGTT page table optimization for vGPU
        drm/i915/execlists: Remove the priority "optimisation"
        drm/i915: Filter out spurious execlists context-switch interrupts
        drm/amdgpu: use irq-safe lock for kiq->ring_lock
        drm/amdgpu: bypass lru touch for KIQ ring submission
        drm/amdgpu: Potential uninitialized variable in amdgpu_vm_update_directories()
        drm/amdgpu: potential uninitialized variable in amdgpu_vce_ring_parse_cs()
        drm/amd/powerplay: initialize a variable before using it
        drm/amd/powerplay: suppress KASAN out of bounds warning in vega10_populate_all_memory_levels
        drm/amd/amdgpu: fix evicted VRAM bo adjudgement condition
        drm/vblank: Tune drm_crtc_accurate_vblank_count() WARN down to a debug
        drm/rockchip: add CONFIG_OF dependency for lvds
        ...
    • Merge tag 'media/v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 5d352e69
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
      
       - Documentation for digital TV (both kAPI and uAPI) is now in sync
         with the implementation (except for legacy/deprecated ioctls). This
         is a major step, as there was always a gap there
      
       - New sensor driver: imx274
      
       - New cec driver: cec-gpio
      
       - New platform drivers for Rockchip RGA and Tegra CEC
      
       - New RC driver: tango-ir
      
       - Several cleanups at atomisp driver
      
       - Core improvements for RC, CEC, V4L2 async probing support and DVB
      
       - Lots of drivers cleanup, fixes and improvements.
      
      * tag 'media/v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (332 commits)
        dvb_frontend: don't use-after-free the frontend struct
        media: dib0700: fix invalid dvb_detach argument
        media: v4l2-ctrls: Don't validate BITMASK twice
        media: s5p-mfc: fix lockdep warning
        media: dvb-core: always call invoke_release() in fe_free()
        media: usb: dvb-usb-v2: dvb_usb_core: remove redundant code in dvb_usb_fe_sleep
        media: au0828: make const array addr_list static
        media: cx88: make const arrays default_addr_list and pvr2000_addr_list static
        media: drxd: make const array fastIncrDecLUT static
        media: usb: fix spelling mistake: "synchronuously" -> "synchronously"
        media: ddbridge: fix build warnings
        media: av7110: avoid 2038 overflow in debug print
        media: Don't do DMA on stack for firmware upload in the AS102 driver
        media: v4l: async: fix unregister for implicitly registered sub-device notifiers
        media: v4l: async: fix return of unitialized variable ret
        media: imx274: fix missing return assignment from call to imx274_mode_regs
        media: camss-vfe: always initialize reg at vfe_set_xbar_cfg()
        media: atomisp: make function calls cleaner
        media: atomisp: get rid of storage_class.h
        media: atomisp: get rid of wrong stddef.h include
        ...
    • Merge tag 'leaks-4.15-rc1' of git://github.com/tcharding/linux · 93ea0eb7
      Linus Torvalds authored
      Pull leaking_addresses script updates from Tobin Harding:
       "Here are development patches for the leaking_addresses.pl script.
      
        Changes include:
      
         - add summary reporting to the script
      
         - add 'SigIgn' to false positives
      
         - add a file read timeout so the script doesn't block indefinitely
      
         - add infrastructure to enable multi-arch support and add support for ppc
      
         - add some exclude files/paths suggested by various people
      
         - code clean up and refactoring
      
         - overhaul command line options"
      
      * tag 'leaks-4.15-rc1' of git://github.com/tcharding/linux:
        leaking_addresses: add SigIgn to false positives
        leaking_addresses: add timeout on file read
        leaking_addresses: add support for ppc64
        leaking_addresses: add summary reporting options
        leaking_addresses: add to exclude files/paths list
        leaking_addresses: fix comment string typo
        leaking_addresses: remove command line options
        leaking_addresses: remove dead/unused code
        leaking_addresses: use tabs instead of spaces
    • Merge branch 'akpm' (patches from Andrew) · 7c225c69
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - a few misc bits
      
       - ocfs2 updates
      
       - almost all of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
        memory hotplug: fix comments when adding section
        mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
        mm: simplify nodemask printing
        mm,oom_reaper: remove pointless kthread_run() error check
        mm/page_ext.c: check if page_ext is not prepared
        writeback: remove unused function parameter
        mm: do not rely on preempt_count in print_vma_addr
        mm, sparse: do not swamp log with huge vmemmap allocation failures
        mm/hmm: remove redundant variable align_end
        mm/list_lru.c: mark expected switch fall-through
        mm/shmem.c: mark expected switch fall-through
        mm/page_alloc.c: broken deferred calculation
        mm: don't warn about allocations which stall for too long
        fs: fuse: account fuse_inode slab memory as reclaimable
        mm, page_alloc: fix potential false positive in __zone_watermark_ok
        mm: mlock: remove lru_add_drain_all()
        mm, sysctl: make NUMA stats configurable
        shmem: convert shmem_init_inodecache() to void
        Unify migrate_pages and move_pages access checks
        mm, pagevec: rename pagevec drained field
        ...
    • mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP · 0cd842f9
      Oscar Salvador authored
      free_area_init_node() calls alloc_node_mem_map(), but this function does
      nothing unless we have CONFIG_FLAT_NODE_MEM_MAP.
      
      As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" within
      alloc_node_mem_map() out of the function, and define an empty
      alloc_node_mem_map() { } when CONFIG_FLAT_NODE_MEM_MAP is not present.
      
      This also moves the printk that lies within the "#ifdef
      CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to
      alloc_node_mem_map(), getting rid of the "#ifdef
      CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node().
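The cleanup described above follows a common pattern: put the conditional around the whole function and provide an empty stub for the disabled case, so that the caller contains no #ifdef. A minimal userspace sketch of the pattern follows. The DEMO_FLAT_NODE_MEM_MAP macro, the demo_ function names, and the counter are hypothetical stand-ins, not the kernel code.

```c
#include <stdio.h>

/* Counts how many times the real allocation path ran (demo only). */
int demo_mem_maps_allocated;

#ifdef DEMO_FLAT_NODE_MEM_MAP
/* Real body, compiled only when built with -DDEMO_FLAT_NODE_MEM_MAP. */
void demo_alloc_node_mem_map(int nid)
{
    printf("node %d: allocating mem_map\n", nid);
    demo_mem_maps_allocated++;
}
#else
/* Option not set: an empty stub, so callers need no #ifdef of their own. */
void demo_alloc_node_mem_map(int nid)
{
    (void)nid;
}
#endif

/* The caller is identical in both configurations. */
void demo_free_area_init_node(int nid)
{
    demo_alloc_node_mem_map(nid);
}
```

Built without the macro, calling demo_free_area_init_node() does nothing and the counter stays at zero; built with it, the message is printed and the counter increments.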
      
      [akpm@linux-foundation.org: clean up the printk while we're there]
      Link: http://lkml.kernel.org/r/20171114111935.GA11758@techadventures.net
      Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: simplify nodemask printing · 0205f755
      Michal Hocko authored
      alloc_warn() and dump_header() have to explicitly handle NULL nodemask
      which forces both paths to use pr_cont.  We can do better.  printk
      already handles NULL pointers properly so all we need is to teach
      nodemask_pr_args to handle NULL nodemask carefully.  This allows
      simplification of both alloc_warn() and dump_header() and gets rid of
      pr_cont altogether.
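The idea above is to handle the NULL case once, inside the formatting helper, so that each caller can emit a complete line in a single call. A minimal userspace analogue follows. It uses snprintf rather than printk, and the demo_ function names and the message text are hypothetical; it does not reproduce the kernel's nodemask_pr_args macro.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a node bitmask as hex, or "(null)" when no mask was given. */
int demo_format_nodemask(char *buf, size_t len, const unsigned long *mask)
{
    if (!mask)
        return snprintf(buf, len, "(null)");
    return snprintf(buf, len, "%lx", *mask);
}

/* The caller builds one complete line and has no NULL branch of its own. */
int demo_format_warning(char *buf, size_t len, const unsigned long *mask)
{
    char nodes[32];

    demo_format_nodemask(nodes, sizeof(nodes), mask);
    return snprintf(buf, len, "demo warning, nodemask=%s", nodes);
}
```

Because the helper absorbs the NULL check, the caller never needs to split its output into a first part and a conditional continuation.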
      
      This patch was motivated by a patch from Joe Perches:
      
        http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com
      
      [akpm@linux-foundation.org: fix tile warning, per Arnd]
      Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joe Perches <joe@perches.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,oom_reaper: remove pointless kthread_run() error check · c50842c8
      Tetsuo Handa authored
      Since oom_init() is called before userspace processes start, memory
      allocation failure for creating the OOM reaper kernel thread will let
      the OOM killer call panic() rather than wake up the OOM reaper.
      
      Link: http://lkml.kernel.org/r/1510137800-4602-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_ext.c: check if page_ext is not prepared · e492080e
      Jaewon Kim authored
      online_page_ext() and page_ext_init() allocate page_ext for each
      section, but they do not allocate if the first PFN is !pfn_present(pfn)
      or !pfn_valid(pfn).  Then section->page_ext remains as NULL.
      lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled.  For a
      valid PFN, __set_page_owner will try to get page_ext through
      lookup_page_ext.  Without CONFIG_DEBUG_VM, lookup_page_ext will treat the
      NULL pointer as the value 0.  This results in an invalid address access.
      
      The panic below occurred when PFN 0x100000 was not valid but PFN
      0x13FC00 was being used for page_ext.  section->page_ext was NULL, so
      get_entry returned the invalid page_ext address 0x1DFA000 for PFN
      0x13FC00.
      
      To avoid this panic, the CONFIG_DEBUG_VM condition should be removed so
      that page_ext is checked at all times.
      
        Unable to handle kernel paging request at virtual address 01dfa014
        ------------[ cut here ]------------
        Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
        Internal error: Oops: 96000045 [#1] PREEMPT SMP
        Modules linked in:
        PC is at __set_page_owner+0x48/0x78
        LR is at __set_page_owner+0x44/0x78
          __set_page_owner+0x48/0x78
          get_page_from_freelist+0x880/0x8e8
          __alloc_pages_nodemask+0x14c/0xc48
          __do_page_cache_readahead+0xdc/0x264
          filemap_fault+0x2ac/0x550
          ext4_filemap_fault+0x3c/0x58
          __do_fault+0x80/0x120
          handle_mm_fault+0x704/0xbb0
          do_page_fault+0x2e8/0x394
          do_mem_abort+0x88/0x124
      
      Pre-4.7 kernels also need commit f86e4271 ("mm: check the return
      value of lookup_page_ext for all call sites").
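
      A minimal userspace sketch of the idea, not the actual patch: the
      lookup tests the per-section pointer unconditionally and callers
      tolerate a NULL result.  The struct layouts and the *_model names are
      hypothetical stand-ins, and the indexing is simplified compared to the
      kernel's PFN-based arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct page_ext { unsigned long flags; };
struct mem_section_model { struct page_ext *page_ext; };

/*
 * Simplified model of the lookup: the NULL test is always compiled in,
 * instead of only when a debug option is enabled.
 */
static struct page_ext *lookup_page_ext_model(const struct mem_section_model *section,
					      unsigned long index)
{
	assert(section != NULL);
	if (!section->page_ext)
		return NULL;	/* page_ext was never allocated for this section */
	return &section->page_ext[index];
}

/* Caller pattern: bail out quietly when no page_ext is available. */
static int set_owner_model(const struct mem_section_model *section,
			   unsigned long index)
{
	struct page_ext *ext = lookup_page_ext_model(section, index);

	if (!ext)
		return -1;
	ext->flags |= 1UL;
	return 0;
}
```

      With this shape, a section whose page_ext was never allocated makes
      the caller return early rather than dereference a bogus address.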
      
      Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
      Fixes: eefa864b ("mm/page_ext: resurrect struct page extending code for debugging")
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: <stable@vger.kernel.org>	[depends on f86e4271, see above]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e492080e
    • Wang Long's avatar
      writeback: remove unused function parameter · 2bce774e
      Wang Long authored
      The parameter `struct bdi_writeback *wb` is not used in the function
      body.  Remove it.
      
      Link: http://lkml.kernel.org/r/1509685485-15278-1-git-send-email-wanglong19@meituan.com
      Signed-off-by: Wang Long <wanglong19@meituan.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bce774e
    • Michal Hocko's avatar
      mm: do not rely on preempt_count in print_vma_addr · 0a7f682d
      Michal Hocko authored
      The preempt count check on print_vma_addr has been added by commit
      e8bff74a ("x86: fix "BUG: sleeping function called from invalid
      context" in print_vma_addr()") and it relied on the elevated preempt
      count from preempt_conditional_sti because preempt_count check doesn't
      work on non preemptive kernels by default.
      
      The code has evolved though and commit d99e1bd1 ("x86/entry/traps:
      Refactor preemption and interrupt flag handling") has replaced
      preempt_conditional_sti by an explicit preempt_disable which is noop on
      !PREEMPT so the check in print_vma_addr is broken.
      
      Fix the issue by using trylock on mmap_sem rather than checking the
      preempt count.  The allocation we are relying on has to be GFP_NOWAIT as
      well.  There is a chance that we won't dump the vma state if the lock is
      contended or memory is short, but this is an acceptable outcome and much
      less fragile than the broken preemption check or tricks around it.
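
      The trylock pattern can be sketched in userspace as follows.  This is
      only an illustration under stated assumptions: map_lock is a
      hypothetical stand-in for mmap_sem built on a C11 atomic_flag, and
      print_addr_model() is not the kernel function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for mmap_sem: the flag is set while the lock is held. */
static atomic_flag map_lock = ATOMIC_FLAG_INIT;

static bool map_trylock(void)
{
	/* test_and_set returns the previous value: false means we took the lock. */
	return !atomic_flag_test_and_set(&map_lock);
}

static void map_unlock(void)
{
	atomic_flag_clear(&map_lock);
}

/*
 * Best-effort dump: returns the number of characters formatted, or 0 when
 * the lock is contended.  It never sleeps waiting for the lock.
 */
static int print_addr_model(char *buf, size_t len, unsigned long ip)
{
	int n;

	if (!map_trylock())
		return 0;	/* contended: skip the dump instead of blocking */
	n = snprintf(buf, len, " in region[%lx]", ip);
	map_unlock();
	return n;
}
```

      The caller loses the extra diagnostic line under contention, which is
      the trade-off the changelog accepts.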
      
      Link: http://lkml.kernel.org/r/20171106134031.g6dbelg55mrbyc6i@dhcp22.suse.cz
      Fixes: d99e1bd1 ("x86/entry/traps: Refactor preemption and interrupt flag handling")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Yang Shi <yang.s@alibaba-inc.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a7f682d
    • Michal Hocko's avatar
      mm, sparse: do not swamp log with huge vmemmap allocation failures · fcdaf842
      Michal Hocko authored
      While doing memory hotplug tests under heavy memory pressure we have
      noticed too many page allocation failures when allocating a vmemmap
      memmap backed by huge pages:
      
        kworker/u3072:1: page allocation failure: order:9, mode:0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO)
        [...]
        Call Trace:
          dump_trace+0x59/0x310
          show_stack_log_lvl+0xea/0x170
          show_stack+0x21/0x40
          dump_stack+0x5c/0x7c
          warn_alloc_failed+0xe2/0x150
          __alloc_pages_nodemask+0x3ed/0xb20
          alloc_pages_current+0x7f/0x100
          vmemmap_alloc_block+0x79/0xb6
          __vmemmap_alloc_block_buf+0x136/0x145
          vmemmap_populate+0xd2/0x2b9
          sparse_mem_map_populate+0x23/0x30
          sparse_add_one_section+0x68/0x18e
          __add_pages+0x10a/0x1d0
          arch_add_memory+0x4a/0xc0
          add_memory_resource+0x89/0x160
          add_memory+0x6d/0xd0
          acpi_memory_device_add+0x181/0x251
          acpi_bus_attach+0xfd/0x19b
          acpi_bus_scan+0x59/0x69
          acpi_device_hotplug+0xd2/0x41f
          acpi_hotplug_work_fn+0x1a/0x23
          process_one_work+0x14e/0x410
          worker_thread+0x116/0x490
          kthread+0xbd/0xe0
          ret_from_fork+0x3f/0x70
      
      and we do see many of those because essentially every allocation fails
      for each memory section.  This is an excessive way to tell the user that
      there is nothing to really worry about because we do have a fallback
      mechanism to use base pages.  The only downside might be a performance
      degradation due to TLB pressure.
      
      This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
      explicitly once on the first allocation failure.  This will reduce the
      noise in the kernel log considerably, while we still have an indication
      that performance might be impacted.
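
      The "stay quiet, warn once" behaviour can be sketched like this.  It
      is a userspace model, not the patch itself: try_alloc_quiet() is a
      hypothetical allocator that never logs on failure, and
      warnings_emitted exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int warnings_emitted;	/* counts messages, for illustration only */

/* Hypothetical allocator that stays silent on failure. */
static void *try_alloc_quiet(size_t size, bool simulate_failure)
{
	return simulate_failure ? NULL : malloc(size);
}

static void *alloc_block_model(size_t size, bool simulate_failure)
{
	static bool warned;	/* persists across calls */
	void *p = try_alloc_quiet(size, simulate_failure);

	if (!p && !warned) {
		fprintf(stderr, "huge allocation failed, falling back to base pages\n");
		warnings_emitted++;
		warned = true;
	}
	return p;	/* NULL tells the caller to take the fallback path */
}
```

      However many allocations fail afterwards, only the first one produces
      a message.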
      
      [mhocko@kernel.org: forgot to git add the follow up fix]
        Link: http://lkml.kernel.org/r/20171107090635.c27thtse2lchjgvb@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20171106092228.31098-1-mhocko@kernel.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcdaf842
    • Colin Ian King's avatar
      mm/hmm: remove redundant variable align_end · fec11bc0
      Colin Ian King authored
      Variable align_end is assigned a value but it is never read, so the
      variable is redundant and can be removed.  Cleans up the clang warning:
      Value stored to 'align_end' is never read
      
      Link: http://lkml.kernel.org/r/20171017143837.23207-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fec11bc0
    • Gustavo A. R. Silva's avatar
      mm/list_lru.c: mark expected switch fall-through · 5b568acc
      Gustavo A. R. Silva authored
      In preparation for enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      
      Link: http://lkml.kernel.org/r/20171020190754.GA24332@embeddedor.com
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b568acc
    • Gustavo A. R. Silva's avatar
      mm/shmem.c: mark expected switch fall-through · c8402871
      Gustavo A. R. Silva authored
      In preparation for enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      
      Link: http://lkml.kernel.org/r/20171020191058.GA24427@embeddedor.com
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8402871
    • Pavel Tatashin's avatar
      mm/page_alloc.c: broken deferred calculation · d135e575
      Pavel Tatashin authored
      In reset_deferred_meminit() we determine the number of pages that must
      not be deferred.  We initialize pages for at least 2G of memory, but also
      pages for reserved memory in this node.
      
      The reserved memory is determined by memblock_reserved_memory_within(),
      which operates over physical addresses and returns a size in bytes.
      However, reset_deferred_meminit() assumes that this function operates
      with pfns and returns a page count.
      
      The result is that in the best case the machine boots slower than
      expected due to initializing more pages than needed in a single thread,
      and in the worst case it panics because fewer pages than needed are
      initialized early.
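
      The unit mismatch can be illustrated with a small model.  Everything
      here is an assumption made for the sketch: a single reserved region,
      4 KiB pages, and the helper names reserved_bytes_within() and
      reserved_pages_within(), which are not the kernel functions.

```c
#include <assert.h>

#define PAGE_SHIFT_MODEL 12	/* assumes 4 KiB pages */

/* One reserved region, described in bytes (hypothetical layout). */
struct region { unsigned long long base, size; };

/* Works on physical addresses and returns the overlap in bytes. */
static unsigned long long reserved_bytes_within(const struct region *r,
						unsigned long long start,
						unsigned long long end)
{
	unsigned long long r_end = r->base + r->size;
	unsigned long long lo = r->base > start ? r->base : start;
	unsigned long long hi = r_end < end ? r_end : end;

	return hi > lo ? hi - lo : 0;
}

/* Wrapper for callers that think in PFNs: convert on the way in and out. */
static unsigned long reserved_pages_within(const struct region *r,
					   unsigned long start_pfn,
					   unsigned long end_pfn)
{
	unsigned long long bytes;

	assert(start_pfn <= end_pfn);
	bytes = reserved_bytes_within(r,
			(unsigned long long)start_pfn << PAGE_SHIFT_MODEL,
			(unsigned long long)end_pfn << PAGE_SHIFT_MODEL);
	return (unsigned long)(bytes >> PAGE_SHIFT_MODEL);	/* rounds down */
}
```

      Passing PFNs straight into the byte-based helper and reading its
      result as a page count gives a wrong figure in both directions, which
      is the class of bug described above.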
      
      Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
      Fixes: 864b9a39 ("mm: consider memblock reservations for deferred memory initialization sizing")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d135e575
    • Tetsuo Handa's avatar
      mm: don't warn about allocations which stall for too long · 400e2249
      Tetsuo Handa authored
      Commit 63f53dea ("mm: warn about allocations which stall for too
      long") was a great step toward reducing the possibility of silent
      hang-ups caused by memory allocation stalls.  But this commit reverts
      it, because the current implementation of printk() makes it possible to
      trigger OOM lockups and/or soft lockups when many threads concurrently
      call warn_alloc() (in order to warn about memory allocation stalls), and
      because the synchronous warning approach makes it difficult to obtain
      useful information.
      
      The current printk() implementation flushes all pending logs using the
      context of the thread which called console_unlock().  printk() should be
      able to flush all pending logs eventually unless somebody continues
      appending to the printk() buffer.
      
      Since warn_alloc() started appending to the printk() buffer while
      waiting for oom_kill_process() to make forward progress, at a time when
      oom_kill_process() is processing pending logs, it became possible for
      warn_alloc() to force oom_kill_process() to loop inside printk().  As a
      result, warn_alloc() significantly increased the possibility of
      preventing oom_kill_process() from making forward progress.
      
      ---------- Pseudo code start ----------
      Before warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          }
          goto retry;
      
      After warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          } else if (waited_for_10seconds()) {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      Although waited_for_10seconds() becomes true once per 10 seconds, an
      unbounded number of threads can call waited_for_10seconds() at the same
      time.  Also, since threads doing waited_for_10seconds() keep running an
      almost-busy loop, the thread doing print_one_log() gets little CPU
      time.  Therefore, this situation can be simplified to
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          } else {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      when printk() is called faster than print_one_log() can process a log.
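
      The simplified loop above can be turned into a deterministic toy
      model.  The tick-based timing and the function name ticks_to_drain()
      are assumptions made for illustration; real printk() behaviour depends
      on scheduling and console speed.

```c
#include <assert.h>

/*
 * Each tick the lock holder prints one pending log while the waiting
 * threads together append appended_per_tick new ones.  Returns the tick at
 * which the backlog is empty, or -1 if it is still non-empty after
 * max_ticks.
 */
static long ticks_to_drain(long pending, long appended_per_tick, long max_ticks)
{
	assert(pending >= 0 && appended_per_tick >= 0 && max_ticks >= 0);

	for (long t = 0; t < max_ticks; t++) {
		if (pending == 0)
			return t;
		pending--;			/* print_one_log() */
		pending += appended_per_tick;	/* waiters appending */
	}
	return pending == 0 ? max_ticks : -1;
}
```

      In this model, as soon as the waiters append at least one log per
      printed log, the backlog never empties and the lock holder never gets
      to send SIGKILL.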
      
      One possible mitigation would be to introduce a new lock in order to
      make sure that no other series of printk() (either oom_kill_process() or
      warn_alloc()) can append to the printk() buffer when one series of
      printk() (either oom_kill_process() or warn_alloc()) is already in
      progress.
      
      Such serialization would also help in obtaining kernel messages in
      readable form.
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            mutex_lock(&oom_printk_lock);
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_printk_lock);
            mutex_unlock(&oom_lock);
          } else {
            if (mutex_trylock(&oom_printk_lock)) {
              atomic_inc(&printk_pending_logs);
              mutex_unlock(&oom_printk_lock);
            }
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      But this commit does not go in that direction, for we don't want to
      introduce a new lock dependency, and we are unlikely to be able to obtain
      useful information even if we serialized oom_kill_process() and
      warn_alloc().
      
      The synchronous approach is prone to unexpected results (e.g.  too late
      [1], too frequent [2], overlooked [3]).  As far as I know, warn_alloc()
      never helped with providing information other than "something is going
      wrong".  I want to consider an asynchronous approach which can obtain
      information during stalls with possibly relevant threads (e.g.  the owner
      of oom_lock and kswapd-like threads) and serve as a trigger for actions
      (e.g.  turn on/off tracepoints, ask the libvirt daemon to take a memory
      dump of a stalling KVM guest for diagnostic purposes).
      
      This commit temporarily loses the ability to report e.g.  an OOM lockup
      caused by being unable to invoke the OOM killer for a !__GFP_FS
      allocation request.  But an asynchronous approach will be able to detect
      such a situation and emit a warning.  Thus, let's remove warn_alloc().
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
      [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
      [3] commit db73ee0d ("mm, vmscan: do not loop on too_many_isolated for ever")
      
      Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      400e2249
    • Johannes Weiner's avatar
      fs: fuse: account fuse_inode slab memory as reclaimable · df206988
      Johannes Weiner authored
      Fuse inodes are currently included in the unreclaimable slab counts -
      SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
      per-cgroup memory.stat.  But they are reclaimable just like other
      filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.
      
      Mark the slab cache reclaimable.
      
      Link: http://lkml.kernel.org/r/20171102202727.12539-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df206988
    • Vlastimil Babka's avatar
      mm, page_alloc: fix potential false positive in __zone_watermark_ok · b050e376
      Vlastimil Babka authored
      Since commit 97a16fc8 ("mm, page_alloc: only enforce watermarks for
      order-0 allocations"), __zone_watermark_ok() check for high-order
      allocations will shortcut per-migratetype free list checks for
      ALLOC_HARDER allocations, and return true as long as there's a free page
      of any migratetype.  The intention is that ALLOC_HARDER can allocate
      from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
      
      However, as a side effect, the watermark check will then also return
      true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
      CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list.  Since the
      allocation cannot actually obtain isolated pages, and might not be able
      to obtain CMA pages, this can result in a false positive.
      
      The condition should be rare and perhaps the outcome is not a fatal one.
      Still, it's better if the watermark check is correct.  There also
      shouldn't be a performance tradeoff here.
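
      One way such a check could look, as a simplified model rather than the
      actual kernel code: the migratetype enum, the free_count array and the
      alloc_cma flag are all hypothetical names introduced for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical migratetype list indices for the model. */
enum mt_model {
	MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE,
	MT_HIGHATOMIC, MT_CMA, MT_ISOLATE, MT_COUNT
};

/*
 * free_count[mt] is the number of free pages of the requested order on
 * each list.  Returns true only if the request could really take a page.
 */
static bool high_order_available(const unsigned int free_count[MT_COUNT],
				 bool alloc_harder, bool alloc_cma)
{
	assert(free_count != NULL);

	/* Ordinary lists are usable by every request. */
	for (int mt = MT_UNMOVABLE; mt <= MT_RECLAIMABLE; mt++)
		if (free_count[mt])
			return true;

	/* CMA pages count only when the request may use them. */
	if (alloc_cma && free_count[MT_CMA])
		return true;

	/* The high-atomic reserve counts only for "harder" requests. */
	if (alloc_harder && free_count[MT_HIGHATOMIC])
		return true;

	return false;	/* MT_ISOLATE pages are never counted */
}
```

      The point of the model is that a "harder" request is granted the
      high-atomic reserve explicitly, instead of being satisfied by a free
      page on any list at all.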
      
      Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
      Fixes: 97a16fc8 ("mm, page_alloc: only enforce watermarks for order-0 allocations")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b050e376
    • Shakeel Butt's avatar
      mm: mlock: remove lru_add_drain_all() · 72b03fcd
      Shakeel Butt authored
      lru_add_drain_all() is not required by mlock() and it will drain
      everything that has been cached at the time mlock is called.  And that
      is not really related to the memory which will be faulted in (and
      cached) and mlocked by the syscall itself.
      
      If anything lru_add_drain_all() should be called _after_ pages have been
      mlocked and faulted in but even that is not strictly needed because
      those pages would get to the appropriate LRUs lazily during the reclaim
      path.  Moreover follow_page_pte (gup) will drain the local pcp LRU
      cache.
      
      On larger machines the overhead of lru_add_drain_all() in mlock() can be
      significant when mlocking data already in memory.  We have observed high
      latency in mlock() due to lru_add_drain_all() when the users were
      mlocking in-memory tmpfs files.
      
      [mhocko@suse.com: changelog fix]
      Link: http://lkml.kernel.org/r/20171019222507.2894-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b03fcd
    • Kemi Wang's avatar
      mm, sysctl: make NUMA stats configurable · 4518085e
      Kemi Wang authored
      This is the second step, which introduces a tunable interface that makes
      NUMA stats configurable for optimizing zone_statistics(), as suggested
      by Dave Hansen and Ying Huang.
      
      =========================================================================
      
      When page allocation performance becomes a bottleneck and you can
      tolerate some possible tool breakage and decreased numa counter
      precision, you can do:
      
      	echo 0 > /proc/sys/vm/numa_stat
      
      In this case, numa counter update is ignored.  We can see about
      *4.8%*(185->176) drop of cpu cycles per single page allocation and
      reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
      drop of cpu cycles per single page allocation and reclaim on Jesper's
      page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
      (88 threads, 126G memory).
      
      Benchmark link provided by Jesper D Brouer (increase loop times to
      10000000):
      
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      =========================================================================
      
      When page allocation performance is not a bottleneck and you want all
      tooling to work, you can do:
      
      	echo 1 > /proc/sys/vm/numa_stat
      
      This is the system default setting.
      
      Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
      for comments to help improve the original patch.
      
      [keescook@chromium.org: make sure mutex is a global static]
        Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
      Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Suggested-by: Ying Huang <ying.huang@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4518085e
    • weiping zhang's avatar
      shmem: convert shmem_init_inodecache() to void · 9a8ec03e
      weiping zhang authored
      shmem_inode_cachep was created with SLAB_PANIC flag and
      shmem_init_inodecache() never returns non-zero, so convert this
      function to return void.
      
      Link: http://lkml.kernel.org/r/20170909124542.GA35224@bogon.didichuxing.com
      Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a8ec03e
    • Otto Ebeling's avatar
      Unify migrate_pages and move_pages access checks · 31367466
      Otto Ebeling authored
      Commit 197e7e52 ("Sanitize 'move_pages()' permission checks") fixed
      a security issue I reported in the move_pages syscall, and made it so
      that you can't act on set-uid processes unless you have the
      CAP_SYS_PTRACE capability.
      
      Unify the access check logic of migrate_pages to match the new behavior
      of move_pages.  We discussed this a bit in the security@ list and
      thought it'd be good for consistency even though there's no evident
      security impact.  The NUMA node access checks are left intact and
      require CAP_SYS_NICE as before.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1710011830320.6333@lakka.kapsi.fi
      Signed-off-by: Otto Ebeling <otto.ebeling@iki.fi>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31367466
    • Mel Gorman's avatar
      mm, pagevec: rename pagevec drained field · 7f0b5fb9
      Mel Gorman authored
      According to Vlastimil Babka, the drained field in pagevec is
      potentially misleading because it might be interpreted as draining this
      pagevec instead of the percpu lru pagevecs.  Rename the field for
      clarity.
      
      Link: http://lkml.kernel.org/r/20171019093346.ylahzdpzmoriyf4v@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f0b5fb9
    • Vlastimil Babka's avatar
      mm, page_alloc: simplify list handling in rmqueue_bulk() · 0fac3ba5
      Vlastimil Babka authored
      rmqueue_bulk() fills an empty pcplist with pages from the free list.  It
      tries to preserve increasing order by pfn to the caller, because it
      leads to better performance with some I/O controllers, as explained in
      commit e084b2d9 ("page-allocator: preserve PFN ordering when
      __GFP_COLD is set").
      
      To preserve the order, it's sufficient to add pages to the tail of the
      list as they are retrieved.  The current code instead adds to the head
      of the list, but then updates the list head pointer to the last added
      page, in each step.  This does result in the same order, but is
      needlessly confusing and potentially wasteful, with no apparent benefit.
      This patch simplifies the code and adjusts comment accordingly.
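
      The equivalence of the two insertion strategies can be checked with a
      small userspace list model.  The node type and the helpers below are
      hypothetical; they only mimic a circular doubly linked list and are
      not the kernel's list API.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *prev, *next; unsigned long pfn; };

static void list_init(struct node *head) { head->prev = head->next = head; }

/* Link entry between two adjacent nodes. */
static void list_insert(struct node *entry, struct node *prev, struct node *next)
{
	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
}

static void add_after(struct node *entry, struct node *pos) { list_insert(entry, pos, pos->next); }
static void add_tail(struct node *entry, struct node *head) { list_insert(entry, head->prev, head); }

/* Old style: insert after a moving position that tracks the last page added. */
static void fill_moving_head(struct node *head, struct node *pages, size_t n)
{
	struct node *pos = head;

	for (size_t i = 0; i < n; i++) {
		add_after(&pages[i], pos);
		pos = &pages[i];
	}
}

/* New style: simply append each page at the tail. */
static void fill_tail(struct node *head, struct node *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		add_tail(&pages[i], head);
}

/* Returns 1 if pfns strictly increase from the first entry to the last. */
static int is_ascending(const struct node *head)
{
	const struct node *p = head->next;

	for (; p != head && p->next != head; p = p->next)
		if (p->pfn >= p->next->pfn)
			return 0;
	return 1;
}
```

      Starting from an empty list, both fill functions leave the pages in
      the order they were retrieved; the tail version just says so directly.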
      
      Link: http://lkml.kernel.org/r/f6505442-98a9-12e4-b2cd-0fa83874c159@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fac3ba5
    • Mel Gorman's avatar
      mm: remove __GFP_COLD · 453f85d4
      Mel Gorman authored
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Judging from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no significant difference which is not surprising
      given that the __GFP_COLD branches are a minuscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      453f85d4
    • Mel Gorman's avatar
      mm: remove cold parameter from free_hot_cold_page* · 2d4894b5
      Mel Gorman authored
      Most callers of free_hot_cold_page claim the pages being released
      are cache hot.  The exception is the page reclaim paths where it is
      likely that enough pages will be freed in the near future that the
      per-cpu lists are going to be recycled and the cache hotness information
      is lost.  As no one really cares about the hotness of pages being
      released to the allocator, just ditch the parameter.
      
      The APIs are renamed to indicate that it's no longer about hot/cold
      pages.  It should also be less confusing as there are subtle differences
      between them.  __free_pages drops a reference and frees a page when the
      refcount reaches zero.  free_hot_cold_page handled pages whose refcount
      was already zero which is non-obvious from the name.  free_unref_page
      should be more obvious.
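
      The difference between the two entry points can be modelled as below.
      This is a sketch under stated assumptions: struct page_model and both
      *_model functions are hypothetical and only capture the refcount
      contract described above.

```c
#include <assert.h>
#include <stdbool.h>

struct page_model { int refcount; bool freed; };

/* Expects a page whose reference count has already dropped to zero. */
static void free_unref_page_model(struct page_model *page)
{
	assert(page->refcount == 0);
	page->freed = true;
}

/* Drops one reference and frees the page only when it was the last one. */
static void free_pages_model(struct page_model *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0)
		free_unref_page_model(page);
}
```

      Calling the "unref" variant on a page that still has references trips
      the assertion, which is the contract the new name is meant to convey.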
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      [mgorman@techsingularity.net: add pages to head, not tail]
        Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • mm: remove cold parameter for release_pages · c6f92f9f
      Mel Gorman authored
      All callers of release_pages claim the pages being released are cache
      hot.  As no one cares about the hotness of pages being released to the
      allocator, just ditch the parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, pagevec: remove cold parameter for pagevecs · 86679820
      Mel Gorman authored
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: only drain per-cpu pagevecs once per pagevec usage · d9ed0d08
      Mel Gorman authored
      When a pagevec is initialised on the stack, it is generally used
      multiple times over a range of pages, looking up entries and then
      releasing them.  On each pagevec_release, the per-cpu deferred LRU
      pagevecs are drained on the grounds the page being released may be on
      those queues and the pages may be cache hot.  In many cases only the
      first drain is necessary as it's unlikely that the range of pages being
      walked is racing against LRU addition.  Even if there is such a race,
      the impact is marginal, whereas constantly redraining the LRU pagevecs
      has a cost.
      
      This patch ensures that pagevec is only drained once in a given
      lifecycle without increasing the cache footprint of the pagevec
      structure.  Only sparsetruncate tiny is shown here, as large files have
      many exceptional entries and call pagecache_release less frequently.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                              batchshadow-v1r1          onedrain-v1r1
      Min          Time      141.00 (   0.00%)      141.00 (   0.00%)
      1st-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      146.00 (   0.00%)      145.00 (   0.68%)
      Max-99%      Time      198.00 (   0.00%)      194.00 (   2.02%)
      Max          Time      254.00 (   0.00%)      208.00 (  18.11%)
      Amean        Time      145.12 (   0.00%)      144.30 (   0.56%)
      Stddev       Time       12.74 (   0.00%)        9.62 (  24.49%)
      Coeff        Time        8.78 (   0.00%)        6.67 (  24.06%)
      Best99%Amean Time      144.29 (   0.00%)      143.82 (   0.32%)
      Best95%Amean Time      142.68 (   0.00%)      142.31 (   0.26%)
      Best90%Amean Time      142.52 (   0.00%)      142.19 (   0.24%)
      Best75%Amean Time      142.26 (   0.00%)      141.98 (   0.20%)
      Best50%Amean Time      141.90 (   0.00%)      141.71 (   0.13%)
      Best25%Amean Time      141.80 (   0.00%)      141.43 (   0.26%)
      
      The impact on bonnie is marginal and within the noise because a
      significant percentage of the file being truncated has been reclaimed
      and consists of shadow entries which reduce the hotness of the
      pagevec_release path.
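
      The drain-once behaviour described in this changelog can be modelled
      with a single flag in the structure.  This is a minimal user-space
      sketch under assumed names: toy_pagevec, its drained field and
      toy_lru_add_drain are hypothetical and only count how often a drain
      would happen.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how many times the (simulated) per-cpu LRU pagevecs are drained. */
static int toy_lru_drains;

static void toy_lru_add_drain(void)
{
    toy_lru_drains++;
}

/* Hypothetical stand-in for struct pagevec. */
struct toy_pagevec {
    unsigned char nr;   /* number of entries currently held */
    bool drained;       /* assumed flag: set after the first drain */
};

static void toy_pagevec_init(struct toy_pagevec *pvec)
{
    pvec->nr = 0;
    pvec->drained = false;
}

/* Release the held entries; drain only on the first release after init. */
static void toy_pagevec_release(struct toy_pagevec *pvec)
{
    if (!pvec->drained) {
        toy_lru_add_drain();
        pvec->drained = true;
    }
    pvec->nr = 0;
}
```

      Re-initialising the structure starts a new lifecycle, so the next
      release drains again.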
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, truncate: remove all exceptional entries from pagevec under one lock · f2187599
      Mel Gorman authored
      During truncate each entry in a pagevec is checked to see if it is an
      exceptional entry and if so, the shadow entry is cleaned up.  This is
      potentially expensive as each of multiple entries for a mapping locks
      and unlocks the tree lock.  This patch batches the operation so that
      exceptional entries removed from a pagevec acquire the mapping tree
      lock only once.
      The corner case where this is more expensive is where there is only one
      exceptional entry but this is unlikely due to temporal locality and how
      it affects LRU ordering.  Note that for truncations of small files
      created recently, this patch should show no gain because it only batches
      the handling of exceptional entries.
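
      The batching can be illustrated with a toy model that counts lock
      acquisitions.  The names below (toy_entry, toy_tree_lock,
      toy_clear_exceptional_batched) are hypothetical; the sketch only shows
      the shape of "scan first, then lock once" and is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts simulated acquisitions of the mapping tree lock. */
static int toy_lock_acquisitions;

static void toy_tree_lock(void)   { toy_lock_acquisitions++; }
static void toy_tree_unlock(void) { }

/* Hypothetical pagevec slot: either a normal page or a shadow entry. */
struct toy_entry {
    bool exceptional;
    bool cleared;
};

/* Clears every exceptional entry while taking the lock at most once. */
static size_t toy_clear_exceptional_batched(struct toy_entry *e, size_t n)
{
    size_t first = 0, cleared = 0;

    /* Skip the lock entirely when there is nothing to clean up. */
    while (first < n && !e[first].exceptional)
        first++;
    if (first == n)
        return 0;

    toy_tree_lock();
    for (size_t i = first; i < n; i++) {
        if (e[i].exceptional) {
            e[i].cleared = true;
            cleared++;
        }
    }
    toy_tree_unlock();
    return cleared;
}
```

      With a single exceptional entry the batched form does the same locking
      as the per-entry form plus the initial scan, which is the corner case
      mentioned above.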
      
      sparsetruncate (large)
                                    4.14.0-rc4             4.14.0-rc4
                               pickhelper-v1r1       batchshadow-v1r1
      Min          Time       38.00 (   0.00%)       27.00 (  28.95%)
      1st-qrtle    Time       40.00 (   0.00%)       28.00 (  30.00%)
      2nd-qrtle    Time       44.00 (   0.00%)       41.00 (   6.82%)
      3rd-qrtle    Time      146.00 (   0.00%)      147.00 (  -0.68%)
      Max-90%      Time      153.00 (   0.00%)      153.00 (   0.00%)
      Max-95%      Time      155.00 (   0.00%)      156.00 (  -0.65%)
      Max-99%      Time      181.00 (   0.00%)      171.00 (   5.52%)
      Amean        Time       93.04 (   0.00%)       88.43 (   4.96%)
      Best99%Amean Time       92.08 (   0.00%)       86.13 (   6.46%)
      Best95%Amean Time       89.19 (   0.00%)       83.13 (   6.80%)
      Best90%Amean Time       85.60 (   0.00%)       79.15 (   7.53%)
      Best75%Amean Time       72.95 (   0.00%)       65.09 (  10.78%)
      Best50%Amean Time       39.86 (   0.00%)       28.20 (  29.25%)
      Best25%Amean Time       39.44 (   0.00%)       27.70 (  29.77%)
      
      bonnie
                                            4.14.0-rc4             4.14.0-rc4
                                       pickhelper-v1r1       batchshadow-v1r1
      Hmean     SeqCreate ops         71.92 (   0.00%)       76.78 (   6.76%)
      Hmean     SeqCreate read        42.42 (   0.00%)       45.01 (   6.10%)
      Hmean     SeqCreate del      26519.88 (   0.00%)    27191.87 (   2.53%)
      Hmean     RandCreate ops        71.92 (   0.00%)       76.95 (   7.00%)
      Hmean     RandCreate read       44.44 (   0.00%)       49.23 (  10.78%)
      Hmean     RandCreate del     24948.62 (   0.00%)    24764.97 (  -0.74%)
      
      Truncation of a large number of files shows a substantial gain with 99%
      of files being truncated 6.46% faster.  bonnie shows a modest gain of
      2.53% on SeqCreate del.
      
      [jack@suse.cz: fix truncate_exceptional_pvec_entries()]
        Link: http://lkml.kernel.org/r/20171108164226.26788-1-jack@suse.cz
      Link: http://lkml.kernel.org/r/20171018075952.10627-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Mel Gorman authored
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
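
      The idea of choosing the callback once per mapping, rather than
      checking the mapping for every page, can be sketched as follows.  All
      names are hypothetical toy versions (toy_mapping, toy_lookup_update,
      toy_delete_entry); the sketch assumes only what the changelog states,
      namely that dax and shmem mappings do not want the update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_node { int updates; };
typedef void (*toy_update_node_t)(struct toy_node *node);

/* Hypothetical mapping: only the two properties that matter here. */
struct toy_mapping {
    bool is_dax;
    bool is_shmem;
};

/* Stand-in for workingset_update_node(): just counts invocations. */
static void toy_workingset_update_node(struct toy_node *node)
{
    node->updates++;
}

/* Lookup helper: decide once per mapping whether a callback is needed. */
static toy_update_node_t toy_lookup_update(const struct toy_mapping *mapping)
{
    if (mapping->is_dax || mapping->is_shmem)
        return NULL;
    return toy_workingset_update_node;
}

/* Per-page work: no mapping checks here, only an optional callback. */
static void toy_delete_entry(struct toy_node *node, toy_update_node_t update)
{
    if (update)
        update(node);
}
```

      A caller that already knows the mapping is neither dax nor shmem, as
      truncation does, could pass toy_workingset_update_node directly and
      skip the lookup.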
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: enable/disable IRQs once when freeing a list of pages · 9cca35d4
      Mel Gorman authored
      Patch series "Follow-up for speed up page cache truncation", v2.
      
      This series is a follow-on to Jan Kara's "Speed up page cache
      truncation" series.  We both ended up looking at the same problem but
      saw different issues based on the same data.  This series builds upon
      his work.
      
      A variety of workloads were compared on four separate machines and each
      machine showed gains, albeit at different levels.  Minimally, some of the
      differences are due to NUMA where truncating data from a remote node is
      slower than a local node.  The workloads checked were
      
      o sparse truncate microbenchmark, tiny
      o sparse truncate microbenchmark, large
      o reaim-io disk workfile
      o dbench4 (modified by mmtests to produce more stable results)
      o filebench varmail configuration for small memory size
      o bonnie, directory operations, working set size 2*RAM
      
      reaim-io, dbench and filebench all showed minor gains.  Truncation does
      not dominate those workloads but they were tested to ensure there were
      no regressions elsewhere.  They will not be reported further.
      
      The sparse truncate microbench was written by Jan.  It creates a number
      of files and then times how long it takes to truncate each one.  The
      "tiny" configuration creates a number of files that easily fit in
      memory and times how long it takes to truncate files with page cache.
      The large configuration uses enough files to have data that is twice the
      size of memory and so timings there include truncating page cache and
      working set shadow entries in the radix tree.
      
      Patches 1-4 are the most relevant parts of this series.  Patches 5-8 are
      optional as they are deleting code that is essentially useless but has a
      negligible performance impact.
      
      The changelogs have more information on performance but, for the bonnie
      delete operations alone, the main comparison is
      
      bonnie
                                            4.14.0-rc5             4.14.0-rc5             4.14.0-rc5
                                                jan-v2                vanilla                 mel-v2
      Hmean     SeqCreate ops         76.20 (   0.00%)       75.80 (  -0.53%)       76.80 (   0.79%)
      Hmean     SeqCreate read        85.00 (   0.00%)       85.00 (   0.00%)       85.00 (   0.00%)
      Hmean     SeqCreate del      13752.31 (   0.00%)    12090.23 ( -12.09%)    15304.84 (  11.29%)
      Hmean     RandCreate ops        76.00 (   0.00%)       75.60 (  -0.53%)       77.00 (   1.32%)
      Hmean     RandCreate read       96.80 (   0.00%)       96.80 (   0.00%)       97.00 (   0.21%)
      Hmean     RandCreate del     13233.75 (   0.00%)    11525.35 ( -12.91%)    14446.61 (   9.16%)
      
      Jan's series is the baseline; the vanilla kernel is 12% slower, whereas
      this series on top gains another 11%.  This is from a different
      machine than the data in the changelogs but the detailed data was not
      collected as there was no substantial change in v2.
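
      The bracketed percentages in these tables are consistent with a simple
      relative difference against the first (baseline) column, with the sign
      arranged so that positive always means better.  A small sketch, with
      hypothetical function names, that reproduces a few of the figures:

```c
#include <assert.h>

/* Gain for a higher-is-better metric, such as the bonnie Hmean rows. */
static double gain_higher_is_better(double base, double value)
{
    return (value - base) / base * 100.0;
}

/* Gain for a lower-is-better metric, such as the sparsetruncate Time rows. */
static double gain_lower_is_better(double base, double value)
{
    return (base - value) / base * 100.0;
}
```

      For SeqCreate del, gain_higher_is_better(13752.31, 15304.84) is about
      11.29 and gain_higher_is_better(13752.31, 12090.23) is about -12.09,
      matching the mel-v2 and vanilla columns.  For a time of 149.00 reduced
      to 141.00, gain_lower_is_better gives about 5.37.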
      
      This patch (of 8):
      
      Freeing a list of pages currently enables/disables IRQs for each page
      freed.  This patch splits freeing a list of pages into two operations --
      preparing the pages for freeing and the actual freeing.  This is a
      tradeoff - we're taking two passes of the list to free in exchange for
      avoiding multiple enable/disable of IRQs.
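
      The tradeoff can be shown with a toy model that counts IRQ
      disable/enable pairs.  Everything here is a hypothetical user-space
      stand-in (toy_irq_save, toy_free_list_per_page,
      toy_free_list_batched); it illustrates the two-pass shape only and
      does not reproduce the allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Counts simulated IRQ disable/enable pairs. */
static int toy_irq_toggles;

static void toy_irq_save(void)    { toy_irq_toggles++; }
static void toy_irq_restore(void) { }

struct toy_page {
    int prepared;
    int freed;
};

/* Old shape: every page pays for its own IRQ disable/enable pair. */
static void toy_free_list_per_page(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pages[i].prepared = 1;
        toy_irq_save();
        pages[i].freed = 1;
        toy_irq_restore();
    }
}

/* New shape: pass 1 prepares with IRQs enabled, pass 2 frees every page
 * inside a single IRQ-disabled section. */
static void toy_free_list_batched(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].prepared = 1;

    toy_irq_save();
    for (size_t i = 0; i < n; i++)
        pages[i].freed = 1;
    toy_irq_restore();
}
```

      For a list of eight pages the per-page form toggles IRQs eight times
      and the batched form once, at the price of walking the list twice.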
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                 janbatch-v1r1            oneirq-v1r1
      Min          Time      149.00 (   0.00%)      141.00 (   5.37%)
      1st-qrtle    Time      150.00 (   0.00%)      142.00 (   5.33%)
      2nd-qrtle    Time      151.00 (   0.00%)      142.00 (   5.96%)
      3rd-qrtle    Time      151.00 (   0.00%)      143.00 (   5.30%)
      Max-90%      Time      153.00 (   0.00%)      144.00 (   5.88%)
      Max-95%      Time      155.00 (   0.00%)      147.00 (   5.16%)
      Max-99%      Time      201.00 (   0.00%)      195.00 (   2.99%)
      Max          Time      236.00 (   0.00%)      230.00 (   2.54%)
      Amean        Time      152.65 (   0.00%)      144.37 (   5.43%)
      Stddev       Time        9.78 (   0.00%)       10.44 (  -6.72%)
      Coeff        Time        6.41 (   0.00%)        7.23 ( -12.84%)
      Best99%Amean Time      152.07 (   0.00%)      143.72 (   5.50%)
      Best95%Amean Time      150.75 (   0.00%)      142.37 (   5.56%)
      Best90%Amean Time      150.59 (   0.00%)      142.19 (   5.58%)
      Best75%Amean Time      150.36 (   0.00%)      141.92 (   5.61%)
      Best50%Amean Time      150.04 (   0.00%)      141.69 (   5.56%)
      Best25%Amean Time      149.85 (   0.00%)      141.38 (   5.65%)
      
      With a tiny number of files, each file truncated has resident page cache
      and it shows that time to truncate improves by roughly 5-6%, with some
      minor jitter.
      
                                            4.14.0-rc4             4.14.0-rc4
                                         janbatch-v1r1            oneirq-v1r1
      Hmean     SeqCreate ops         65.27 (   0.00%)       81.86 (  25.43%)
      Hmean     SeqCreate read        39.48 (   0.00%)       47.44 (  20.16%)
      Hmean     SeqCreate del      24963.95 (   0.00%)    26319.99 (   5.43%)
      Hmean     RandCreate ops        65.47 (   0.00%)       82.01 (  25.26%)
      Hmean     RandCreate read       42.04 (   0.00%)       51.75 (  23.09%)
      Hmean     RandCreate del     23377.66 (   0.00%)    23764.79 (   1.66%)
      
      As expected, there is a small gain for the delete operation.
      
      [mgorman@techsingularity.net: use page_private and set_page_private helpers]
        Link: http://lkml.kernel.org/r/20171018101547.mjycw7zreb66jzpa@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: batch radix tree operations when truncating pages · aa65c29c
      Jan Kara authored
      Currently we remove pages from the radix tree one by one.  To speed up
      page cache truncation, lock several pages at once and free them in one
      go.  This allows us to batch radix tree operations in a more efficient
      way and also save round-trips on mapping->tree_lock.  As a result we
      gain about 20% speed improvement in page cache truncation.
      
      Data from a simple benchmark timing 10000 truncates of 1024 pages (on
      ext4 on ramdisk but the filesystem is barely visible in the profiles).
      The range shows the 1st and 95th percentiles of the measured times:
      
        4.14-rc2	4.14-rc2 + batched truncation
        248-256	209-219
        249-258	209-217
        248-255	211-239
        248-255	209-217
        247-256	210-218
      
      [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
        Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
      [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
      Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>