1. 18 Aug, 2018 12 commits
    • Merge tag 'rproc-v4.19' of git://github.com/andersson/remoteproc · c54fc865
      Linus Torvalds authored
      Pull remoteproc updates from Bjorn Andersson:
       "This adds support for pre-start and post-shutdown hooks for remoteproc
        subdevices, refactors the Qualcomm Hexagon support to allow reuse
        between several drivers, makes authentication in the MDT file loader
        optional, migrates a few format strings to use %pK and migrates the
        Davinci driver to use the reset framework"
      
      * tag 'rproc-v4.19' of git://github.com/andersson/remoteproc:
        remoteproc/davinci: use the reset framework
        remoteproc/davinci: Mark error recovery as disabled
        remoteproc: st_slim: replace "%p" with "%pK"
        remoteproc: replace "%p" with "%pK"
        remoteproc: qcom: fix Q6V5_WCSS dependencies
        remoteproc: Reset table_ptr in rproc_start() failure paths
        remoteproc: qcom: q6v5-pil: fix modem hang on SDM845 after axis2 clk unvote
        remoteproc: qcom q6v5: fix modular build
        remoteproc: Introduce prepare and unprepare for subdevices
        remoteproc: rename subdev probe and remove functions
        remoteproc: Make client initialize ops in rproc_subdev
        remoteproc: Make start and stop in subdev optional
        remoteproc: Rename subdev functions to start/stop
        remoteproc: qcom: Introduce Hexagon V5 based WCSS driver
        remoteproc: qcom: q6v5-pil: Use common q6v5 helpers
        remoteproc: qcom: adsp: Use common q6v5 helpers
        remoteproc: q6v5: Extract common resource handling
        remoteproc: qcom: mdt_loader: Make the firmware authentication optional
    • Merge tag 'linux-watchdog-4.19-rc1' of git://www.linux-watchdog.org/linux-watchdog · 6eaac34f
      Linus Torvalds authored
      Pull watchdog updates from Wim Van Sebroeck:
      
       - add MEN 16z069 IP-Core driver
      
       - renesas-wdt: add support for the R8A77990 wdt
      
       - stm32_iwdg: Add stm32mp1 support and pclk feature
      
       - sp805_wdt, orion_wdt, sprd_wdt: several improvements
      
       - imx2_wdt, stmp3xxx: switch to SPDX identifier
      
      * tag 'linux-watchdog-4.19-rc1' of git://www.linux-watchdog.org/linux-watchdog:
        watchdog: fix dependencies of menz69_wdt.o
        watchdog: sp805: Add clock-frequency property
        watchdog: add driver for the MEN 16z069 IP-Core
        watchdog: sprd_wdt: Remove redundant dev_err call in sprd_wdt_probe()
        watchdog: stmp3xxx: Switch to SPDX identifier
        watchdog: imx2_wdt: Switch to SPDX identifier
        watchdog: sp805: set WDOG_HW_RUNNING when appropriate
        watchdog: sp805: add 'timeout-sec' DT property support
        dt-bindings: watchdog: Add optional 'timeout-sec' property for sp805
        dt-bindings: watchdog: Consolidate SP805 binding docs
        watchdog: orion_wdt: Mark watchdog as active when running at probe
        watchdog: stm32: add pclk feature for stm32mp1
        dt-bindings: watchdog: add stm32mp1 support
        dt-bindings: watchdog: renesas-wdt: Add support for the R8A77990 wdt
    • Merge tag 'dmaengine-4.19-rc1' of git://git.infradead.org/users/vkoul/slave-dma · 13bf2cf9
      Linus Torvalds authored
      Pull DMAengine updates from Vinod Koul:
       "This round brings a couple of framework changes, a new driver and the
        usual driver updates:
      
         - new managed helper for dmaengine framework registration
      
         - split dmaengine pause capability to pause and resume and allow
           drivers to report that individually
      
         - update dma_request_chan_by_mask() to handle deferred probing
      
         - move imx-sdma to use virt-dma
      
         - new driver for Actions Semi Owl family S900 controller
      
         - minor updates to intel, renesas, mv_xor, pl330 etc"
      
      * tag 'dmaengine-4.19-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (46 commits)
        dmaengine: Add Actions Semi Owl family S900 DMA driver
        dt-bindings: dmaengine: Add binding for Actions Semi Owl SoCs
        dmaengine: sh: rcar-dmac: Should not stop the DMAC by rcar_dmac_sync_tcr()
        dmaengine: mic_x100_dma: use the new helper to simplify the code
        dmaengine: add a new helper dmaenginem_async_device_register
        dmaengine: imx-sdma: add memcpy interface
        dmaengine: imx-sdma: add SDMA_BD_MAX_CNT to replace '0xffff'
        dmaengine: dma_request_chan_by_mask() to handle deferred probing
        dmaengine: pl330: fix irq race with terminate_all
        dmaengine: Revert "dmaengine: mv_xor_v2: enable COMPILE_TEST"
        dmaengine: mv_xor_v2: use {lower,upper}_32_bits to configure HW descriptor address
        dmaengine: mv_xor_v2: enable COMPILE_TEST
        dmaengine: mv_xor_v2: move unmap to before callback
        dmaengine: mv_xor_v2: convert callback to helper function
        dmaengine: mv_xor_v2: kill the tasklets upon exit
        dmaengine: mv_xor_v2: explicitly freeup irq
        dmaengine: sh: rcar-dmac: Add dma_pause operation
        dmaengine: sh: rcar-dmac: add a new function to clear CHCR.DE with barrier
        dmaengine: idma64: Support dmaengine_terminate_sync()
        dmaengine: hsu: Support dmaengine_terminate_sync()
        ...
    • Merge tag 'mmc-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · bbd60bff
      Linus Torvalds authored
      Pull MMC updates from Ulf Hansson:
       "Updates for MMC for v4.19.
      
        MMC core:
         - Add some fine-grained hooks to further support HS400 tuning
         - Improve error path for bus width setting for HS400es
         - Use a common method when checking R1 status
      
        MMC host:
         - renesas_sdhi: Add r8a77990 support
         - renesas_sdhi: Add eMMC HS400 mode support
         - tmio/renesas_sdhi: Improve tuning/clock management
         - tmio: Add eMMC HS400 mode support
         - sunxi: Add support for 3.3V eMMC DDR mode
         - mmci: Initial support to manage variant specific callbacks
         - sdhci: Don't try 3.3V I/O voltage if not supported
         - sdhci-pci-dwc-mshc: Add driver to support Synopsys dwc mshc SDHCI PCI
         - sdhci-of-dwcmshc: Add driver to support Synopsys DWC MSHC SDHCI
         - sdhci-msm: Add support for new version sdcc V5
         - sdhci-pci-o2micro: Add support for O2 eMMC HS200 mode
         - sdhci-pci-o2micro: Add support for O2 hardware tuning
         - sdhci-pci-o2micro: Add MSI interrupt support for O2 SD host
         - sdhci-pci: Add support for Intel ICP
         - sdhci-tegra: Prevent ACMD23 and HS200 mode on Tegra 3
         - sdhci-tegra: Fix eMMC DDR52 mode
         - sdhci-tegra: Improve clock management
         - dw_mmc-rockchip: Document compatible string for px30
         - sdhci-esdhc-imx: Add support for 3.3V eMMC DDR mode
         - sdhci-of-esdhc: Set proper DMA mask for ls104x chips
         - sdhci-of-esdhc: Improve clock management
         - sdhci-of-arasan: Add a quirk to manage unstable clocks
         - dw_mmc-exynos: Address potential external abort during system resume
         - pxamci: Add support for common MMC DT bindings
         - pxamci: Several cleanups and improvements
         - pxamci: Merge immutable branch for pxa to switch to DMA slave maps"
      
      * tag 'mmc-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (56 commits)
        mmc: core: improve reasonableness of bus width setting for HS400es
        mmc: tmio: remove unneeded variable in tmio_mmc_start_command()
        mmc: renesas_sdhi: Fix sampling clock position selecting
        mmc: tmio: Fix tuning flow
        mmc: sunxi: remove output of virtual base address
        dt-bindings: mmc: rockchip-dw-mshc: add description for px30
        mmc: renesas_sdhi: Add r8a77990 support
        mmc: sunxi: allow 3.3V DDR when DDR is available
        mmc: mmci: Add and implement a ->dma_setup() callback for qcom dml
        mmc: mmci: Initial support to manage variant specific callbacks
        mmc: tegra: Force correct divider calculation on DDR50/52
        mmc: sdhci: Add MSI interrupt support for O2 SD host
        mmc: sdhci: Add support for O2 hardware tuning
        mmc: sdhci: Export sdhci tuning function symbol
        mmc: sdhci: Change O2 Host HS200 mode clock frequency to 200MHz
        mmc: sdhci: Add support for O2 eMMC HS200 mode
        mmc: tegra: Add and use tegra_sdhci_get_max_clock()
        mmc: sdhci-esdhc-imx: fix indent
        mmc: sdhci-esdhc-imx: disable clocks before changing frequency
        mmc: tegra: prevent ACMD23 on Tegra 3
        ...
    • pcmcia: remove long deprecated pcmcia_request_exclusive_irq() function · 30779715
      Linus Torvalds authored
      This function was created as a deprecated fallback case back in 2010 by
      commit eb14120f ("pcmcia: re-work pcmcia_request_irq()") for legacy
      cases.
      
      Actual in-kernel users haven't been around for a long while.  The last
      in-kernel user was apparently removed four years ago by commit
      5f5316fc ("am2150: Update nmclan_cs.c to use update PCMCIA API").
      
      Just remove it entirely.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • deprecate the '__deprecated' attribute warnings entirely and for good · 771c0353
      Linus Torvalds authored
      We haven't had lots of deprecation warnings lately, but the rdma use of
      it made them flare up again.
      
      They are not useful.  They annoy everybody, and nobody ever does
      anything about them, because it's always "somebody else's problem".  And
      when people start thinking that warnings are normal, they stop looking
      at them, and the real warnings that mean something go unnoticed.
      
      If you want to get rid of a function, just get rid of it.  Convert every
      user to the new world order.
      
      And if you can't do that, then don't annoy everybody else with your
      marking that says "I couldn't be bothered to fix this, so I'll just spam
      everybody else's build logs with warnings about my laziness".
      
      Make a kernelnewbies wiki page about things that could be cleaned up,
      write a blog post about it, or talk to people on the mailing lists.  But
      don't add warnings to the kernel build about cleanup that you think
      should happen but you aren't doing yourself.
      
      Don't.  Just don't.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'driver-core-4.19-rc1' of... · a18d783f
      Linus Torvalds authored
      Merge tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core updates from Greg KH:
       "Here are all of the driver core and related patches for 4.19-rc1.
      
        Nothing huge here, just a number of small cleanups and the ability to
        now stop the deferred probing after init happens.
      
        All of these have been in linux-next for a while with only a merge
        issue reported"
      
      * tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
        base: core: Remove WARN_ON from link dependencies check
        drivers/base: stop new probing during shutdown
        drivers: core: Remove glue dirs from sysfs earlier
        driver core: remove unnecessary function extern declare
        sysfs.h: fix non-kernel-doc comment
        PM / Domains: Stop deferring probe at the end of initcall
        iommu: Remove IOMMU_OF_DECLARE
        iommu: Stop deferring probe at end of initcalls
        pinctrl: Support stopping deferred probe after initcalls
        dt-bindings: pinctrl: add a 'pinctrl-use-default' property
        driver core: allow stopping deferred probe after init
        driver core: add a debugfs entry to show deferred devices
        sysfs: Fix internal_create_group() for named group updates
        base: fix order of OF initialization
        linux/device.h: fix kernel-doc notation warning
        Documentation: update firmware loader fallback reference
        kobject: Replace strncpy with memcpy
        drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
        kernfs: Replace strncpy with memcpy
        device: Add #define dev_fmt similar to #define pr_fmt
        ...
    • Merge tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · d5acba26
      Linus Torvalds authored
      Pull char/misc driver updates from Greg KH:
       "Here is the big set of char/misc drivers for 4.19-rc1
      
        There is a lot here, much more than normal, seems like everyone is
        writing new driver subsystems these days... Anyway, major things here
        are:
      
         - new FSI driver subsystem, yet-another-powerpc low-level hardware
           bus
      
         - gnss, finally an in-kernel GPS subsystem to try to tame all of the
           crazy out-of-tree drivers that have been floating around for years,
           combined with some really hacky userspace implementations. This is
           only for GNSS receivers, but you have to start somewhere, and this
           is great to see.
      
        Other than that, there are new slimbus drivers, new coresight drivers,
        new fpga drivers, and loads of DT bindings for all of these and
        existing drivers.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (255 commits)
        android: binder: Rate-limit debug and userspace triggered err msgs
        fsi: sbefifo: Bump max command length
        fsi: scom: Fix NULL dereference
        misc: mic: SCIF Fix scif_get_new_port() error handling
        misc: cxl: changed asterisk position
        genwqe: card_base: Use true and false for boolean values
        misc: eeprom: assignment outside the if statement
        uio: potential double frees if __uio_register_device() fails
        eeprom: idt_89hpesx: clean up an error pointer vs NULL inconsistency
        misc: ti-st: Fix memory leak in the error path of probe()
        android: binder: Show extra_buffers_size in trace
        firmware: vpd: Fix section enabled flag on vpd_section_destroy
        platform: goldfish: Retire pdev_bus
        goldfish: Use dedicated macros instead of manual bit shifting
        goldfish: Add missing includes to goldfish.h
        mux: adgs1408: new driver for Analog Devices ADGS1408/1409 mux
        dt-bindings: mux: add adi,adgs1408
        Drivers: hv: vmbus: Cleanup synic memory free path
        Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
        Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()
        ...
    • Merge tag 'staging-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 2475c515
      Linus Torvalds authored
      Pull staging and IIO updates from Greg KH:
       "Here are the big staging/iio patches for 4.19-rc1.
      
        Lots of churn here, with tons of cleanups happening in staging
        drivers, a removal of an old crypto driver that no one was using
        (skein), and the addition of some new IIO drivers. Also added was a
        "gasket" driver from Google that needs loads of work and the erofs
        filesystem.
      
        Even with adding all of the new drivers and a new filesystem, we are
        only adding about 1000 lines overall to the kernel linecount, which
        shows just how much cleanup happened, and how big the unused crypto
        driver was.
      
        All of these have been in the linux-next tree for a while now with no
        reported issues"
      
      * tag 'staging-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (903 commits)
        staging:rtl8192u: Remove unused macro definitions - Style
        staging:rtl8192u: Add spaces around '+' operator - Style
        staging:rtl8192u: Remove stale comment - Style
        staging: rtl8188eu: remove unused mp_custom_oid.h
        staging: fbtft: Add spaces around / - Style
        staging: fbtft: Erases some repetitive usage of function name - Style
        staging: fbtft: Adjust some empty-line problems - Style
        staging: fbtft: Removes one nesting level to help readability - Style
        staging: fbtft: Changes gamma table to define.
        staging: fbtft: A bit more information on dev_err.
        staging: fbtft: Fixes some alignment issues - Style
        staging: fbtft: Puts macro arguments in parenthesis to avoid precedence issues - Style
        staging: rtl8188eu: remove unused array dB_Invert_Table
        staging: rtl8188eu: remove whitespace, add missing blank line
        staging: rtl8188eu: use is_multicast_ether_addr in rtw_sta_mgt.c
        staging: rtl8188eu: remove whitespace - style
        staging: rtl8188eu: cleanup block comment - style
        staging: rtl8188eu: use is_multicast_ether_addr in rtl8188eu_xmit.c
        staging: rtl8188eu: use is_multicast_ether_addr in recv_linux.c
        staging: rtlwifi: refactor rtl_get_tcb_desc
        ...
    • Merge tag 'tty-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 336722eb
      Linus Torvalds authored
      Pull tty/serial driver updates from Greg KH:
       "Here is the big tty and serial driver pull request for 4.19-rc1.
      
        It's not all that big, just a number of small serial driver updates
        and fixes, along with some better vt handling for unicode characters
        for those using braille terminals.
      
        All of these patches have been in linux-next for a long time with no
        reported issues"
      
      * tag 'tty-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (73 commits)
        tty: serial: 8250: Revert NXP SC16C2552 workaround
        serial: 8250_exar: Read INT0 from slave device, too
        tty: rocket: Fix possible buffer overwrite on register_PCI
        serial: 8250_dw: Add ACPI support for uart on Broadcom SoC
        serial: 8250_dw: always set baud rate in dw8250_set_termios
        dt-bindings: serial: Add binding for uartlite
        tty: serial: uartlite: Add support for suspend and resume
        tty: serial: uartlite: Add clock adaptation
        tty: serial: uartlite: Add structure for private data
        serial: sh-sci: Improve support for separate TEI and DRI interrupts
        serial: sh-sci: Remove SCIx_RZ_SCIFA_REGTYPE
        serial: sh-sci: Allow for compressed SCIF address
        serial: sh-sci: Improve interrupts description
        serial: 8250: Use cached port name directly in messages
        serial: 8250_exar: Drop unused variable in pci_xr17v35x_setup()
        vt: drop unused struct vt_struct
        vt: avoid a VLA in the unicode screen scroll function
        vt: add /dev/vcsu* to devices.txt
        vt: coherence validation code for the unicode screen buffer
        vt: selection: take screen contents from uniscr if available
        ...
    • Merge tag 'usb-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 5695d5d1
      Linus Torvalds authored
      Pull USB/PHY updates from Greg KH:
       "Here is the big USB and phy driver patch set for 4.19-rc1.
      
        Nothing huge but there was a lot of work that happened this
        development cycle:
      
         - lots of type-c work, with drivers graduating out of staging, and
           displayport support being added.
      
         - new PHY drivers
      
         - the normal collection of gadget driver updates and fixes
      
         - code churn to work on the urb handling path, using irqsave()
           everywhere in anticipation of making this codepath a lot simpler in
           the future.
      
         - usbserial driver fixes and reworks
      
         - other misc changes
      
        All of these have been in linux-next with no reported issues for a
        while"
      
      * tag 'usb-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (159 commits)
        USB: serial: pl2303: add a new device id for ATEN
        usb: renesas_usbhs: Kconfig: convert to SPDX identifiers
        usb: dwc3: gadget: Check MaxPacketSize from descriptor
        usb: dwc2: Turn on uframe_sched on "stm32f4x9_fsotg" platforms
        usb: dwc2: Turn on uframe_sched on "amlogic" platforms
        usb: dwc2: Turn on uframe_sched on "his" platforms
        usb: dwc2: Turn on uframe_sched on "bcm" platforms
        usb: dwc2: gadget: ISOC's starting flow improvement
        usb: dwc2: Make dwc2_readl/writel functions endianness-agnostic.
        usb: dwc3: core: Enable AutoRetry feature in the controller
        usb: dwc3: Set default mode for dwc_usb31
        usb: gadget: udc: renesas_usb3: Add register of usb role switch
        usb: dwc2: replace ioread32/iowrite32_rep with dwc2_readl/writel_rep
        usb: dwc2: Modify dwc2_readl/writel functions prototype
        usb: dwc3: pci: Intel Merrifield can be host
        usb: dwc3: pci: Supply device properties via driver data
        arm64: dts: dwc3: description of incr burst type
        usb: dwc3: Enable undefined length INCR burst type
        usb: dwc3: add global soc bus configuration reg0
        usb: dwc3: Describe 'wakeup_work' field of struct dwc3_pci
        ...
    • Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux · 1f7a4c73
      Linus Torvalds authored
      Pull 9p updates from Dominique Martinet:
       "This contains mostly fixes (6 to be backported to stable) and a few
        changes, here is the breakdown:
      
         - rework how fids are attributed by replacing some custom tracking in
           a list by an idr
      
         - for packet-based transports (virtio/rdma) validate that the packet
           length matches what the header says
      
         - a few race condition fixes found by syzkaller
      
         - missing argument check when NULL device is passed in sys_mount
      
         - a few virtio fixes
      
         - some spelling and style fixes"
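
The packet-length check mentioned in the list above can be sketched in user-space C. This is only an illustration, not the kernel's code: it assumes the standard 9P framing in which every message starts with a 4-byte little-endian size field that counts the whole message, followed by a 1-byte type and a 2-byte tag. The names sketch_le32() and sketch_pdu_len_ok() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 9P header layout: size[4] type[1] tag[2]. */
#define SKETCH_9P_HDR_LEN 7u

/* Decode a 32-bit little-endian value from a byte buffer. */
static uint32_t sketch_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical check for a packet-based transport: the number of bytes
 * actually received must equal the size announced in the header.
 */
static bool sketch_pdu_len_ok(const uint8_t *buf, size_t received)
{
    if (received < SKETCH_9P_HDR_LEN)
        return false;           /* too short to hold a header at all */
    return sketch_le32(buf) == received;
}
```

A caller would run such a check on each received packet before parsing the body, and drop the packet when the header and the transport disagree about the length.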
      
      * tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
        net/9p/trans_virtio.c: add null terminal for mount tag
        9p/virtio: fix off-by-one error in sg list bounds check
        9p: fix whitespace issues
        9p: fix multiple NULL-pointer-dereferences
        fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
        9p: validate PDU length
        net/9p/trans_fd.c: fix race by holding the lock
        net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
        net/9p/virtio: Fix hard lockup in req_done
        net/9p/trans_virtio.c: fix some spell mistakes in comments
        9p/net: Fix zero-copy path in the 9p virtio transport
        9p: Embed wait_queue_head into p9_req_t
        9p: Replace the fidlist with an IDR
        9p: Change p9_fid_create calling convention
        9p: Fix comment on smp_wmb
        net/9p/client.c: version pointer uninitialized
        fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
        net/9p: fix error path of p9_virtio_probe
        9p/net/protocol.c: return -ENOMEM when kmalloc() failed
        net/9p/client.c: add missing '\n' at the end of p9_debug()
        ...
  2. 17 Aug, 2018 28 commits
    • Merge branch 'akpm' (patches from Andrew) · 6ada4e28
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - a few misc things
      
       - a few Y2038 fixes
      
       - ntfs fixes
      
       - arch/sh tweaks
      
       - ocfs2 updates
      
       - most of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
        mm/hmm.c: remove unused variables align_start and align_end
        fs/userfaultfd.c: remove redundant pointer uwq
        mm, vmacache: hash addresses based on pmd
        mm/list_lru: introduce list_lru_shrink_walk_irq()
        mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
        mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
        mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
        mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
        mm/sparse: delete old sparse_init and enable new one
        mm/sparse: add new sparse_init_nid() and sparse_init()
        mm/sparse: move buffer init/fini to the common place
        mm/sparse: use the new sparse buffer functions in non-vmemmap
        mm/sparse: abstract sparse buffer allocations
        mm/hugetlb.c: don't zero 1GiB bootmem pages
        mm, page_alloc: double zone's batchsize
        mm/oom_kill.c: document oom_lock
        mm/hugetlb: remove gigantic page support for HIGHMEM
        mm, oom: remove sleep from under oom_lock
        kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
        mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
        ...
    • mm/hmm.c: remove unused variables align_start and align_end · 1e926419
      Colin Ian King authored
      Variables align_start and align_end are being assigned but are never
      used hence they are redundant and can be removed.
      
      Cleans up clang warnings:
        warning: variable 'align_start' set but not used [-Wunused-but-set-variable]
        warning: variable 'align_size' set but not used [-Wunused-but-set-variable]
      
      Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/userfaultfd.c: remove redundant pointer uwq · 5241d472
      Colin Ian King authored
      Pointer uwq is being assigned but is never used hence it is redundant
      and can be removed.
      
      Cleans up clang warning:
        warning: variable 'uwq' set but not used [-Wunused-but-set-variable]
      
      Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmacache: hash addresses based on pmd · ddbf369c
      David Rientjes authored
      When perf profiling a wide variety of different workloads, it was found
      that vmacache_find() had higher than expected cost: up to 0.08% of cpu
      utilization in some cases.  This was found to rival other core VM
      functions such as alloc_pages_vma() with thp enabled and default
      mempolicy, and the conditionals in __get_vma_policy().
      
      VMACACHE_HASH() determines in which of the four per-task_struct slots a
      vma is cached for a particular address.  This currently depends on the
      pfn, so pfn 5212 occupies a different vmacache slot than its
      neighboring pfn 5213.
      
      vmacache_find() iterates through all four of current's vmacache slots
      when looking up an address.  Hashing based on pfn, an address has
      ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
      about 25%, *if* the vma is cached.
      
      This patch hashes an address by its pmd instead of pte to optimize for
      workloads with good spatial locality.  This results in a higher
      probability of vmas being cached in the first slot that is checked:
      normally ~70% on the same workloads instead of 25%.
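
The difference between the two hashing schemes can be sketched in user-space C. This is an illustration rather than the kernel's macro: the shift values (12 for a 4 KiB page, 21 for a 2 MiB pmd) are assumptions that match common x86-64 configurations, and the slot_by_pfn() and slot_by_pmd() names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_VMACACHE_SIZE 4u                     /* four slots, as in the text */
#define SKETCH_VMACACHE_MASK (SKETCH_VMACACHE_SIZE - 1u)
#define SKETCH_PAGE_SHIFT    12u                    /* assumed 4 KiB pages */
#define SKETCH_PMD_SHIFT     21u                    /* assumed 2 MiB pmd span */

/* Old scheme: the slot depends on the page frame number. */
static unsigned int slot_by_pfn(uint64_t addr)
{
    return (unsigned int)((addr >> SKETCH_PAGE_SHIFT) & SKETCH_VMACACHE_MASK);
}

/* New scheme: the slot depends on the pmd-sized region holding the address. */
static unsigned int slot_by_pmd(uint64_t addr)
{
    return (unsigned int)((addr >> SKETCH_PMD_SHIFT) & SKETCH_VMACACHE_MASK);
}
```

Under these assumptions the addresses of pfn 5212 and pfn 5213 hash to slots 0 and 1 with the pfn scheme, but both land in slot 2 with the pmd scheme, which is the spatial-locality effect the commit message describes.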
      
      [rientjes@google.com: various updates]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/list_lru: introduce list_lru_shrink_walk_irq() · 6b51e881
      Sebastian Andrzej Siewior authored
      Provide list_lru_shrink_walk_irq() and let it behave like
      list_lru_walk_one() except that it locks the spinlock with
      spin_lock_irq().  This is used by scan_shadow_nodes() because its lock
      nests within the i_pages lock, which is acquired with IRQs disabled.
      This change allows the use of proper locking primitives instead of a
      hand-crafted local_irq_disable() plus spin_lock().
      
      There is no EXPORT_SYMBOL provided because the current user is in-kernel
      only.
      
      Add list_lru_shrink_walk_irq() which acquires the spinlock with the
      proper locking primitives.
      
      Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one() · 6e018968
      Sebastian Andrzej Siewior authored
      __list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
      the first two arguments.  Those two are only used to retrieve struct
      list_lru_node.  Since this is already done by the caller of the function
      for the locking, we can pass struct list_lru_node* directly and avoid
      the dance around it.
      
      Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/list_lru.c: move locking from __list_lru_walk_one() to its caller · 6cfe57a9
      Sebastian Andrzej Siewior authored
      Move the locking inside __list_lru_walk_one() to its caller.  This is a
      preparation step in order to introduce list_lru_walk_one_irq() which
      does spin_lock_irq() instead of spin_lock() for the locking.
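
The refactoring pattern described here, lifting the lock out of the inner helper so that callers can choose the locking flavour, can be sketched with a toy user-space model. Nothing below is kernel code: the toy_* names are hypothetical, the "lock" is a plain flag, and the IRQ state is simulated by a boolean.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_lock { bool held; };
static bool toy_irqs_off;                  /* simulated interrupt state */

static void toy_spin_lock(struct toy_lock *l)   { assert(!l->held); l->held = true; }
static void toy_spin_unlock(struct toy_lock *l) { assert(l->held);  l->held = false; }

/* IRQ-disabling flavour: turn "interrupts" off around the plain lock. */
static void toy_spin_lock_irq(struct toy_lock *l)   { toy_irqs_off = true; toy_spin_lock(l); }
static void toy_spin_unlock_irq(struct toy_lock *l) { toy_spin_unlock(l); toy_irqs_off = false; }

struct toy_node { struct toy_lock lock; long nr_items; };

/* Inner walk: no locking of its own, the caller must already hold the lock. */
static long toy_walk_locked(struct toy_node *n, long nr_to_walk)
{
    assert(n->lock.held);
    long walked = nr_to_walk < n->nr_items ? nr_to_walk : n->nr_items;
    n->nr_items -= walked;
    return walked;
}

/* Caller using the plain lock. */
static long toy_walk_one(struct toy_node *n, long nr_to_walk)
{
    toy_spin_lock(&n->lock);
    long walked = toy_walk_locked(n, nr_to_walk);
    toy_spin_unlock(&n->lock);
    return walked;
}

/* Caller using the IRQ-disabling lock; the inner walk is shared unchanged. */
static long toy_walk_one_irq(struct toy_node *n, long nr_to_walk)
{
    toy_spin_lock_irq(&n->lock);
    long walked = toy_walk_locked(n, nr_to_walk);
    toy_spin_unlock_irq(&n->lock);
    return walked;
}
```

Because the inner function only asserts that the lock is held, adding a second locking variant needs a new thin wrapper and no changes to the walk itself.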
      
      Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cfe57a9
    • Sebastian Andrzej Siewior's avatar
      mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node() · 87a5ffc1
      Sebastian Andrzej Siewior authored
      Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user".
      
      This series removes the local_irq_disable() around
      list_lru_shrink_walk() (as used by mm/workingset) by adding
      list_lru_shrink_walk_irq().
      
      Vladimir Davydov preferred this over the `irq' argument which I added to
      struct list_lru.
      
      The initial post (of this series) received a Reviewed-by tag by Vladimir
      Davydov which I added to each patch of the series.  The series applies
      on top of akpm's tree which has Kirill's shrink_slab series and does not
      clash with it (akpm asked me to wait a week or so and repost it then).
      
      I tested the code paths by triggering the OOM-killer via memory
      overcommit and lockdep did not complain (nor did I see any warnings).
      
      This patch (of 4):
      
      list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
      memcg_idx parameter.  The same can be achieved by calling
      list_lru_walk_one() and passing NULL as the memcg argument, which then
      gets converted into -1.  This is a preparation step for lifting the
      spin_lock() call to the caller of __list_lru_walk_one().  Invoke
      list_lru_walk_one() instead of __list_lru_walk_one() when possible.
      
      Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87a5ffc1
    • Huang Ying's avatar
      mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP · 14fef284
      Huang Ying authored
      CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable
      to optimize swapping for THP (Transparent Huge Page) without basic
      swapping support.
      
      In the original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y,
      split_swap_cluster() will not be built because it is in swapfile.c, but
      it will be called in huge_memory.c.  This doesn't trigger a build error
      in practice because the call site is enclosed by PageSwapCache(), which
      is defined to be constant 0 when CONFIG_SWAP=n.  But this is fragile and
      should be fixed.
      
      The comments are fixed too to reflect the latest progress.
      
      Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com
      Fixes: 38d8b4e6 ("mm, THP, swap: delay splitting THP during swap out")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14fef284
    • Pavel Tatashin's avatar
      mm/sparse: delete old sparse_init and enable new one · 2a3cb8ba
      Pavel Tatashin authored
      Rename new_sparse_init() to sparse_init(), which enables it.  Delete the
      old sparse_init() and all the code that became obsolete with it.
      
      [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
        Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a3cb8ba
    • Pavel Tatashin's avatar
      mm/sparse: add new sparse_init_nid() and sparse_init() · 85c77f79
      Pavel Tatashin authored
      sparse_init() needs to temporarily allocate two large buffers: usemap_map
      and map_map.  Baoquan He has identified that these buffers are so large
      that Linux is not bootable on small memory machines, such as a kdump boot.
      The buffers are especially large when CONFIG_X86_5LEVEL is set, as they
      are scaled to the maximum physical memory size.
      
      Baoquan provided a fix which reduces the sizes of these buffers, but it
      is much better to get rid of them entirely.
      
      Add a new way to initialize sparse memory: sparse_init_nid(), which only
      operates within one memory node, and thus allocates memory either in one
      large contiguous block or section by section.  This eliminates the need
      for temporary buffers.
      
      To simplify bisecting and review, temporarily name the new sparse_init()
      new_sparse_init(); the new interface will be enabled and the old code
      removed in the next patch.
      
      Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85c77f79
    • Pavel Tatashin's avatar
      mm/sparse: move buffer init/fini to the common place · afda57bc
      Pavel Tatashin authored
      Now that both variants of sparse memory use the same buffers to populate
      memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
      common place.
      
      Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afda57bc
    • Pavel Tatashin's avatar
      mm/sparse: use the new sparse buffer functions in non-vmemmap · e131c06b
      Pavel Tatashin authored
      Non-vmemmap sparse also allocates a large contiguous chunk of memory and,
      if that fails, falls back to smaller allocations.  Use the same functions
      to allocate the buffer as vmemmap sparse.
      
      Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e131c06b
    • Pavel Tatashin's avatar
      mm/sparse: abstract sparse buffer allocations · 35fd1eb1
      Pavel Tatashin authored
      Patch series "sparse_init rewrite", v6.
      
      In sparse_init() we allocate two large buffers to temporarily hold usemap
      and memmap for the whole machine.  However, we can avoid doing that if
      we change sparse_init() to operate on a per-node basis instead of doing
      it on the whole machine beforehand.
      
      As shown by Baoquan
        http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com
      
      The buffers are large enough to prevent the machine from booting on small
      memory systems.
      
      Another benefit of these changes is that they also obsolete
      CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.
      
      This patch (of 5):
      
      When struct pages are allocated for sparse-vmemmap VA layout, we first try
      to allocate one large buffer, and then, if that fails, allocate struct
      pages for each section as we go.
      
      The code that allocates the buffer uses global variables and is spread
      across several call sites.
      
      Clean up the code by introducing three functions to handle the global
      buffer:
      
      sparse_buffer_init()	initialize the buffer
      sparse_buffer_fini()	free the remaining part of the buffer
      sparse_buffer_alloc()	allocate from the buffer, or return NULL if it is empty
      
      Define these functions in sparse.c instead of sparse-vmemmap.c because
      later we will use them for non-vmemmap sparse allocations as well.
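The three-function interface above can be sketched in userspace C. This is a model under stated assumptions, not the code in sparse.c: the `_model` names are hypothetical, malloc() stands in for the boot-time allocator, and the parameter lists are chosen for illustration. It shows the pattern the message describes: grab one large buffer up front, carve aligned pieces out of it, and return NULL once it is exhausted so that the caller can fall back to a per-section allocation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace model only: malloc() stands in for the boot-time allocator. */
static char *buf_base; /* start of the big buffer, NULL if there is none */
static char *buf_cur;  /* next free byte inside the buffer */
static char *buf_end;  /* one past the last byte of the buffer */

/* Try to grab one large buffer up front; failing to get it is not fatal. */
static void sparse_buffer_init_model(size_t size)
{
    buf_base = malloc(size);
    buf_cur = buf_base;
    buf_end = buf_base ? buf_base + size : NULL;
}

/*
 * Carve `size` bytes, aligned to `align` (a power of two), out of the buffer.
 * Returns NULL when there is no buffer or not enough room left, in which
 * case the caller is expected to fall back to a separate allocation.
 */
static void *sparse_buffer_alloc_model(size_t size, size_t align)
{
    uintptr_t start;

    if (!buf_cur)
        return NULL;
    start = ((uintptr_t)buf_cur + align - 1) & ~((uintptr_t)align - 1);
    if (start > (uintptr_t)buf_end || size > (uintptr_t)buf_end - start)
        return NULL;
    buf_cur = (char *)start + size;
    return (void *)start;
}

/*
 * Tear the buffer down and report how many bytes were never handed out.
 * Unlike the kernel version, which gives back only the unused tail, this
 * model frees the whole block, so earlier pointers become invalid.
 */
static size_t sparse_buffer_fini_model(void)
{
    size_t unused = buf_cur ? (size_t)(buf_end - buf_cur) : 0;

    free(buf_base);
    buf_base = buf_cur = buf_end = NULL;
    return unused;
}
```

A caller would invoke the init function once per node, call the alloc function per section and fall back to its own allocation on NULL, then call the fini function when the node is done.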
      
      [akpm@linux-foundation.org: use PTR_ALIGN()]
      [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
      Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35fd1eb1
    • Cannon Matthews's avatar
      mm/hugetlb.c: don't zero 1GiB bootmem pages · 330d6e48
      Cannon Matthews authored
      When using 1GiB pages during early boot, use the new
      memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it.
      Zeroing out hundreds or thousands of GiB in a single core memset() call
      is very slow, and can make early boot last upwards of 20-30 minutes on
      multi TiB machines.
      
      The memory does not need to be zero'd as the hugetlb pages are always
      zero'd on page fault.
      
      Tested: Booted with ~3800 1G pages, and it booted successfully in
      roughly the same amount of time as with 0, as opposed to the 25+ minutes
      it would take before.
      
      Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com
      Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      330d6e48
    • Aaron Lu's avatar
      mm, page_alloc: double zone's batchsize · d8a759b5
      Aaron Lu authored
      To improve the page allocator's performance for order-0 pages, each CPU
      has a Per-CPU-Pageset (PCP) per zone.  Whenever an order-0 page is
      needed, the PCP is checked first before asking Buddy for pages.  When
      the PCP is used up, a batch of pages is fetched from Buddy to improve
      performance, and the size of that batch can affect performance.
      
      The zone's batch size was last doubled by commit ba56e91c("mm:
      page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
      then, CPUs have evolved a lot and their cache sizes have also increased.
      
      Dave Hansen is concerned the current batch size doesn't fit well with
      modern hardware and suggested that I do two things: first, use a page
      allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find
      out how performance changes with different batch sizes on various
      machines and then choose a new default batch size; second, see how this
      new batch size works with other workloads.
      
      In the first test, we saw performance gains on high-core-count systems
      and little to no effect on older systems with more modest core counts.
      From this phase's test data, two candidates were chosen: 63 and 127.
      
      In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
      and more will-it-scale sub-tests were run to see how these two
      candidates work with these workloads and to decide on a new default
      according to the results.
      
      Most test results are flat.  will-it-scale/page_fault2 in process mode
      shows a 10%-18% performance increase on 4-socket Skylake and Broadwell.
      vm-scalability/lru-file-mmap-read shows a 17%-47% performance increase
      on 4-socket servers, while on 2-socket servers it caused a 3%-8%
      performance drop.  Further analysis showed that, with a larger
      pcp->batch and thus a larger pcp->high (the relationship
      pcp->high = 6 * pcp->batch is maintained in this patch), zone lock
      contention shifted to LRU add side lock contention, and that caused the
      performance drop.  This performance drop might be mitigated by others'
      work on optimizing the LRU lock.
      
      Another downside of increasing pcp->batch is that, when the PCP is used
      up and a batch of pages must be fetched from Buddy, that fetch can take
      longer than before because the batch is larger.  My understanding is
      that this doesn't affect the slowpath, where direct reclaim and
      compaction dominate.  For the fastpath, throughput is a win (according
      to will-it-scale/page_fault1), but worst-case latency can be larger now.
      
      Overall, I think doubling the batch size from 31 to 63 is relatively safe
      and provides a good performance boost for high-core-count systems.
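The arithmetic behind this message can be checked with a small sketch. The helper names are hypothetical and not taken from the patch; the first encodes the stated pcp->high = 6 * pcp->batch relationship, and the second assumes that the "change" column in the phase-one tables is the relative difference from the batch=31 baseline score, which is consistent with the listed values.

```c
/* Hypothetical helper: pcp->high as implied by the 6 * batch relationship. */
static unsigned long pcp_high_for_batch(unsigned long batch)
{
    return 6 * batch;
}

/* Relative change of a score against the baseline score, in percent. */
static double score_change_pct(double baseline, double score)
{
    return (score - baseline) / baseline * 100.0;
}
```

With the old default of 31 the first helper gives 186, and with the new default of 63 it gives 378. Feeding the Skylake-EX scores for batch 31 and batch 63 (15345900 and 17992886) into the second helper yields roughly 17.25, matching that table row.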
      
      The test results of the two phases are listed below (all tests were done
      with THP disabled).
      
      Phase one(will-it-scale/page_fault1) test results:
      
      Skylake-EX: increased batch size has a good effect on zone->lock
      contention, though LRU contention rises at the same time and limits the
      final performance increase.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   15345900    +0.00%       64%                 8%           72%
       53   17903847   +16.67%       32%                38%           70%
       63   17992886   +17.25%       24%                45%           69%
       73   18022825   +17.44%       10%                61%           71%
      119   18023401   +17.45%        4%                66%           70%
      127   18029012   +17.48%        3%                66%           69%
      137   18036075   +17.53%        4%                66%           70%
      165   18035964   +17.53%        2%                67%           69%
      188   18101105   +17.95%        2%                67%           69%
      223   18130951   +18.15%        2%                67%           69%
      255   18118898   +18.07%        2%                67%           69%
      267   18101559   +17.96%        2%                67%           69%
      299   18160468   +18.34%        2%                68%           70%
      320   18139845   +18.21%        2%                67%           69%
      393   18160869   +18.34%        2%                68%           70%
      424   18170999   +18.41%        2%                68%           70%
      458   18144868   +18.24%        2%                68%           70%
      467   18142366   +18.22%        2%                68%           70%
      498   18154549   +18.30%        1%                68%           69%
      511   18134525   +18.17%        1%                69%           70%
      
      Broadwell-EX: similar pattern as Skylake-EX.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   16703983    +0.00%       67%                 7%           74%
       53   18195393    +8.93%       43%                28%           71%
       63   18288885    +9.49%       38%                33%           71%
       73   18344329    +9.82%       35%                37%           72%
      119   18535529   +10.96%       24%                46%           70%
      127   18513596   +10.83%       23%                48%           71%
      137   18514327   +10.84%       23%                48%           71%
      165   18511840   +10.82%       22%                49%           71%
      188   18593478   +11.31%       17%                53%           70%
      223   18601667   +11.36%       17%                52%           69%
      255   18774825   +12.40%       12%                58%           70%
      267   18754781   +12.28%        9%                60%           69%
      299   18892265   +13.10%        7%                63%           70%
      320   18873812   +12.99%        8%                62%           70%
      393   18891174   +13.09%        6%                64%           70%
      424   18975108   +13.60%        6%                64%           70%
      458   18932364   +13.34%        8%                62%           70%
      467   18960891   +13.51%        5%                65%           70%
      498   18944526   +13.41%        5%                64%           69%
      511   18960839   +13.51%        5%                64%           69%
      
      Skylake-EP: although increased batch reduced zone->lock contention, the
      effect is not as good as on EX: zone->lock contention is still as high as
      20% with a very high batch value, instead of 1% on Skylake-EX or 5% on
      Broadwell-EX.  Also, total_contention actually decreased with a higher
      batch, but that doesn't translate into a performance increase.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   9554867    +0.00%       66%                 3%           69%
       53   9855486    +3.15%       63%                 3%           66%
       63   9980145    +4.45%       62%                 4%           66%
       73   10092774   +5.63%       62%                 5%           67%
      119   10310061   +7.90%       45%                19%           64%
      127   10342019   +8.24%       42%                19%           61%
      137   10358182   +8.41%       42%                21%           63%
      165   10397060   +8.81%       37%                24%           61%
      188   10341808   +8.24%       34%                26%           60%
      223   10349135   +8.31%       31%                27%           58%
      255   10327189   +8.08%       28%                29%           57%
      267   10344204   +8.26%       27%                29%           56%
      299   10325043   +8.06%       25%                30%           55%
      320   10310325   +7.91%       25%                31%           56%
      393   10293274   +7.73%       21%                31%           52%
      424   10311099   +7.91%       21%                32%           53%
      458   10321375   +8.02%       21%                32%           53%
      467   10303881   +7.84%       21%                32%           53%
      498   10332462   +8.14%       20%                33%           53%
      511   10325016   +8.06%       20%                32%           52%
      
      Broadwell-EP: zone->lock and lru lock had an agreement to make sure
      performance doesn't increase and they successfully managed to keep total
      contention at 70%.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   10121178   +0.00%       19%                50%           69%
       53   10142366   +0.21%        6%                63%           69%
       63   10117984   -0.03%       11%                58%           69%
       73   10123330   +0.02%        7%                63%           70%
      119   10108791   -0.12%        2%                67%           69%
      127   10166074   +0.44%        3%                66%           69%
      137   10141574   +0.20%        3%                66%           69%
      165   10154499   +0.33%        2%                68%           70%
      188   10124921   +0.04%        2%                67%           69%
      223   10137399   +0.16%        2%                67%           69%
      255   10143289   +0.22%        0%                68%           68%
      267   10123535   +0.02%        1%                68%           69%
      299   10140952   +0.20%        0%                68%           68%
      320   10163170   +0.41%        0%                68%           68%
      393   10000633   -1.19%        0%                69%           69%
      424   10087998   -0.33%        0%                69%           69%
      458   10187116   +0.65%        0%                69%           69%
      467   10146790   +0.25%        0%                69%           69%
      498   10197958   +0.76%        0%                69%           69%
      511   10152326   +0.31%        0%                69%           69%
      
      Haswell-EP: similar to Broadwell-EP.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   10442205   +0.00%       14%                48%           62%
       53   10442255   +0.00%        5%                57%           62%
       63   10452059   +0.09%        6%                57%           63%
       73   10482349   +0.38%        5%                59%           64%
      119   10454644   +0.12%        3%                60%           63%
      127   10431514   -0.10%        3%                59%           62%
      137   10423785   -0.18%        3%                60%           63%
      165   10481216   +0.37%        2%                61%           63%
      188   10448755   +0.06%        2%                61%           63%
      223   10467144   +0.24%        2%                61%           63%
      255   10480215   +0.36%        2%                61%           63%
      267   10484279   +0.40%        2%                61%           63%
      299   10466450   +0.23%        2%                61%           63%
      320   10452578   +0.10%        2%                61%           63%
      393   10499678   +0.55%        1%                62%           63%
      424   10481454   +0.38%        1%                62%           63%
      458   10473562   +0.30%        1%                62%           63%
      467   10484269   +0.40%        0%                62%           62%
      498   10505599   +0.61%        0%                62%           62%
      511   10483395   +0.39%        0%                62%           62%
      
      Westmere-EP: contention is pretty small so not interesting.  Note that
      too high a batch value could hurt performance.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   4831523   +0.00%        2%                 3%            5%
       53   4834086   +0.05%        2%                 4%            6%
       63   4834262   +0.06%        2%                 3%            5%
       73   4832851.8 +0.03%        2%                 4%            6%
      119   4830534   -0.02%        1%                 3%            4%
      127   4827461   -0.08%        1%                 4%            5%
      137   4827459   -0.08%        1%                 3%            4%
      165   4820534   -0.23%        0%                 4%            4%
      188   4817947   -0.28%        0%                 3%            3%
      223   4809671.0 -0.45%        0%                 3%            3%
      255   4802463   -0.60%        0%                 4%            4%
      267   4801634   -0.62%        0%                 3%            3%
      299   4798047   -0.69%        0%                 3%            3%
      320   4793084   -0.80%        0%                 3%            3%
      393   4785877   -0.94%        0%                 3%            3%
      424   4782911   -1.01%        0%                 3%            3%
      458   4779346   -1.08%        0%                 3%            3%
      467   4780306   -1.06%        0%                 3%            3%
      498   4780589   -1.05%        0%                 3%            3%
      511   4773724   -1.20%        0%                 3%            3%
      
      Skylake-Desktop: similar to Westmere-EP, nothing interesting.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3906608   +0.00%        2%                 3%            5%
       53   3940164   +0.86%        2%                 3%            5%
       63   3937289   +0.79%        2%                 3%            5%
       73   3940201   +0.86%        2%                 3%            5%
      119   3933240   +0.68%        2%                 3%            5%
      127   3930514   +0.61%        2%                 4%            6%
      137   3938639   +0.82%        0%                 3%            3%
      165   3908755   +0.05%        0%                 3%            3%
      188   3905621   -0.03%        0%                 3%            3%
      223   3903015   -0.09%        0%                 4%            4%
      255   3889480   -0.44%        0%                 3%            3%
      267   3891669   -0.38%        0%                 4%            4%
      299   3898728   -0.20%        0%                 4%            4%
      320   3894547   -0.31%        0%                 4%            4%
      393   3875137   -0.81%        0%                 4%            4%
      424   3874521   -0.82%        0%                 3%            3%
      458   3880432   -0.67%        0%                 4%            4%
      467   3888715   -0.46%        0%                 3%            3%
      498   3888633   -0.46%        0%                 4%            4%
      511   3875305   -0.80%        0%                 5%            5%
      
      Haswell-Desktop: zone->lock contention is pretty low, as on the other
      desktops, though lru contention is higher than on the other desktops.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3511158   +0.00%        2%                 5%            7%
       53   3555445   +1.26%        2%                 6%            8%
       63   3561082   +1.42%        2%                 6%            8%
       73   3547218   +1.03%        2%                 6%            8%
      119   3571319   +1.71%        1%                 7%            8%
      127   3549375   +1.09%        0%                 6%            6%
      137   3560233   +1.40%        0%                 6%            6%
      165   3555176   +1.25%        2%                 6%            8%
      188   3551501   +1.15%        0%                 8%            8%
      223   3531462   +0.58%        0%                 7%            7%
      255   3570400   +1.69%        0%                 7%            7%
      267   3532235   +0.60%        1%                 8%            9%
      299   3562326   +1.46%        0%                 6%            6%
      320   3553569   +1.21%        0%                 8%            8%
      393   3539519   +0.81%        0%                 7%            7%
      424   3549271   +1.09%        0%                 8%            8%
      458   3528885   +0.50%        0%                 8%            8%
      467   3526554   +0.44%        0%                 7%            7%
      498   3525302   +0.40%        0%                 9%            9%
      511   3527556   +0.47%        0%                 8%            8%
      
      Sandybridge-Desktop: the 0% contention figures are not exact; they result
      from dropping the fractional part. Since each of the multiple contention
      paths is under 1% here, arithmetic such as summing them can make the final
      deviation as large as 3%.
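The truncation effect described above can be illustrated with a small sketch. The three per-path values (0.9% each) are made up for illustration and are not taken from the measurements; the helper names are hypothetical as well. Values are kept in hundredths of a percent so the arithmetic is exact.

```c
#include <assert.h>

/* What a report that drops the fractional part would print, in whole percent. */
int shown_percent(int hundredths)
{
    assert(hundredths >= 0);
    return hundredths / 100;
}

/* Sum of the per-path values after each one has been truncated. */
int sum_of_shown(const int *paths, int n)
{
    int total = 0;

    for (int i = 0; i < n; i++)
        total += shown_percent(paths[i]);
    return total;
}

/* Exact sum of the per-path values, still in hundredths of a percent. */
int exact_sum(const int *paths, int n)
{
    int total = 0;

    for (int i = 0; i < n; i++)
        total += paths[i];
    return total;
}
```

With three paths at 0.9% each, every path prints as 0% and so does their truncated sum, while the exact sum is 2.7%, which is how a deviation close to 3% can arise.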
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   1744495   +0.00%        0%                 0%            0%
       53   1755341   +0.62%        0%                 0%            0%
       63   1758469   +0.80%        0%                 0%            0%
       73   1759626   +0.87%        0%                 0%            0%
      119   1770417   +1.49%        0%                 0%            0%
      127   1768252   +1.36%        0%                 0%            0%
      137   1767848   +1.34%        0%                 0%            0%
      165   1765088   +1.18%        0%                 0%            0%
      188   1766918   +1.29%        0%                 0%            0%
      223   1767866   +1.34%        0%                 0%            0%
      255   1768074   +1.35%        0%                 0%            0%
      267   1763187   +1.07%        0%                 0%            0%
      299   1765620   +1.21%        0%                 0%            0%
      320   1767603   +1.32%        0%                 0%            0%
      393   1764612   +1.15%        0%                 0%            0%
      424   1758476   +0.80%        0%                 0%            0%
      458   1758593   +0.81%        0%                 0%            0%
      467   1757915   +0.77%        0%                 0%            0%
      498   1753363   +0.51%        0%                 0%            0%
      511   1755548   +0.63%        0%                 0%            0%
      
      Phase two test results:
      Note: all percentage changes are relative to the base (batch=31).
      
      ebizzy.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
      lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
      lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
      lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
      lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
      lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
      lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
      lkp-sb02          98937          98937    +0.0%       99030 +0.1%
      
      kbuild.buildtime (less is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
      lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
      lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
      lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
      lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
      lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
      lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%
      
      netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
      lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
      lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
      lkp-bdw-ep2      365764        365933  +0.0%        365951 +0.1%
      lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
      lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
      lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
      lkp-sb02          29990         30137  +0.5%         30103 +0.4%
      
      oltp.transactions (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
      lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
      lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
      lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
      lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
      lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
      lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%
      
      pigz.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
      lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
      lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
      lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
      lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
      lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
      lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
      lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%
      
      will-it-scale.malloc1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
      lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
      lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
      lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
      lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
      lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
      lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
      lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
      lkp-sb02          589286       589284 -0.0%         588101 -0.2%
      
      will-it-scale.malloc1.threads (higher is better)
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
      lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
      lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
      lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
      lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
      lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
      lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
      lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
      lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%
      
      will-it-scale.malloc2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
      lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
      lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
      lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
      lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
      lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
      lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
      lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
      lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%
      
      will-it-scale.malloc2.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
      lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
      lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
      lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
      lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
      lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
      lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
      lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
      lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%
      
      will-it-scale.page_fault2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
      lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
      lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
      lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
      lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
      lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
      lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
      lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
      lkp-sb02          842616        837844  -0.6%         833921  -1.0%
      
      will-it-scale.page_fault2.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
      lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
      lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
      lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
      lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
      lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
      lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
      lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
      lkp-sb02          750626        748615    -0.3%      746621    -0.5%
      
      will-it-scale.page_fault3.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
      lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
      lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
      lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
      lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
      lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
      lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
      lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
      lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%
      
      will-it-scale.page_fault3.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
      lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
      lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
      lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
      lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
      lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
      lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
      lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
      lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%
      
      will-it-scale.read1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
      lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
      lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
      lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
      lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
      lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
      lkp-skl-d01     10969487      10972650     +0.0%    no data
      lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
      lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%
      
      will-it-scale.read1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
      lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
      lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
      lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
      lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
      lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
      lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
      lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
      lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%
      
      will-it-scale.write1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
      lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
      lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
      lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
      lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
      lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
      lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
      lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
      lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%
      
      will-it-scale.write1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
      lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
      lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
      lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
      lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
      lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
      lkp-skl-d01       8346174       8235695     -1.3%     no data
      lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
      lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%
      
      vm-scalability.anon-r-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
      lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
      lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
      lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
      lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
      lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
      lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
      lkp-sb02          570572        567854     -0.5%     568449     -0.4%
      
      vm-scalability.anon-r-rand-mt.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
      lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
      lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
      lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
      lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
      lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
      lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
      lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
      lkp-sb02          567137        567346     +0.0%     566144     -0.2%
      
      vm-scalability.lru-file-mmap-read.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
      lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
      lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
      lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
      lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
      lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
      lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
      lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
      lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%
      
      vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
      lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
      lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
      lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
      lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
      lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
      lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
      lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
      lkp-sb02          258939        258787  -0.1%        258199  -0.3%
      
      Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8a759b5
    • Michal Hocko's avatar
      mm/oom_kill.c: document oom_lock · a195d3f5
      Michal Hocko authored
      Add comments describing oom_lock's scope.
      Requested-by: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a195d3f5
    • Mike Kravetz's avatar
      mm/hugetlb: remove gigantic page support for HIGHMEM · 40d18ebf
      Mike Kravetz authored
      This reverts ee8f248d ("hugetlb: add phys addr to struct
      huge_bootmem_page").
      
      At one time powerpc used this field and supporting code.  However that
      was removed with commit 79cc38de ("powerpc/mm/hugetlb: Add support
      for reserving gigantic huge pages via kernel command line").
      
      There are no users of this field and supporting code, so remove it.
      
      Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Becky Bruce <beckyb@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40d18ebf
    • Michal Hocko's avatar
      mm, oom: remove sleep from under oom_lock · 9bfe5ded
      Michal Hocko authored
      Tetsuo has pointed out that since 27ae357f ("mm, oom: fix concurrent
      munlock and oom reaper unmap, v3") we have a strong synchronization
      between the oom_killer and victim's exiting because both have to take
      the oom_lock.  Therefore the original heuristic to sleep for a short
      time in out_of_memory doesn't serve the original purpose.
      
      Moreover, Tetsuo has noticed that the short sleep can be more harmful
      than useful.  Hammering the system with many processes can lead to
      starvation: the task holding the oom_lock can block for a long time
      (minutes) and stall any further progress, because the oom_reaper
      depends on the oom_lock as well.
      
      Drop the short sleep from out_of_memory when we hold the lock.  Keep the
      sleep when the trylock fails, to throttle the concurrent OOM paths a bit.
      This should be solved in a more reasonable way (e.g.  sleep proportional
      to the time spent in active reclaim) but that is a much more complex
      thing to achieve.  This is a quick fixup to remove stale code.
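The resulting control flow can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: the lock is modelled with a C11 atomic_flag, the short sleep is replaced by a counter, and every name ending in _sketch is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

atomic_flag oom_lock_sketch = ATOMIC_FLAG_INIT; /* stand-in for oom_lock */
int throttle_sleeps; /* counts the short sleeps taken after a failed trylock */
int oom_kills;       /* counts runs of the (mocked) OOM killer */

/* Returns true if the lock was free and is now held by the caller. */
bool oom_trylock_sketch(void)
{
    return !atomic_flag_test_and_set(&oom_lock_sketch);
}

void oom_unlock_sketch(void)
{
    atomic_flag_clear(&oom_lock_sketch);
}

/* Returns true if this caller ran the OOM killer itself. */
bool alloc_may_oom_sketch(void)
{
    if (!oom_trylock_sketch()) {
        /* Someone else is handling the OOM: throttle briefly, then retry. */
        throttle_sleeps++;
        return false;
    }

    /* Lock held: run the killer, and do not sleep while holding the lock. */
    oom_kills++;
    oom_unlock_sketch();
    return true;
}
```

The point of the sketch is the asymmetry: the only sleep left is on the path that failed to take the lock, so the lock holder never delays the oom_reaper.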
      
      Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bfe5ded
    • Marek Szyprowski's avatar
      kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous() · d834c5ab
      Marek Szyprowski authored
      The CMA memory allocator doesn't support standard gfp flags for memory
      allocation, so there is no point in having a gfp_mask parameter for the
      dma_alloc_from_contiguous() function.  Replace it with a boolean no_warn
      argument, which covers everything the underlying cma_alloc() function
      supports.
      
      This helps avoid the false impression that this function supports
      standard gfp flags and that callers can pass __GFP_ZERO to get a zeroed
      buffer, which has already been an issue: see commit dd65a941 ("arm64:
      dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
      
      Link: http://lkml.kernel.org/r/20180709122020eucas1p21a71b092975cb4a3b9954ffc63f699d1~-sqUFoa-h2939329393eucas1p2Y@eucas1p2.samsung.com
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: Michał Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d834c5ab
    • Marek Szyprowski's avatar
      mm/cma: remove unsupported gfp_mask parameter from cma_alloc() · 65182029
      Marek Szyprowski authored
      cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
      convert the gfp_mask parameter to a boolean no_warn parameter.
      
      This helps avoid the false impression that this function supports
      standard gfp flags and that callers can pass __GFP_ZERO to get a zeroed
      buffer, which has already been an issue: see commit dd65a941 ("arm64:
      dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
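What the interface change means for a caller can be sketched with a userspace mock. Everything here is an assumption made for illustration: the _SKETCH flag values, the malloc-backed allocator and the function names are hypothetical stand-ins, not the kernel's definitions. The allocator accepts only a no_warn boolean, so a caller that wants a zeroed buffer has to clear it itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bits standing in for __GFP_NOWARN and __GFP_ZERO. */
#define GFP_NOWARN_SKETCH 0x1u
#define GFP_ZERO_SKETCH   0x2u

int warnings_printed; /* counts allocation-failure warnings */

/* Mock allocator: the only policy knob it accepts is no_warn. */
void *cma_alloc_sketch(size_t bytes, bool no_warn)
{
    void *p = bytes ? malloc(bytes) : NULL; /* a zero-size request fails */

    if (!p && !no_warn)
        warnings_printed++;
    return p;
}

/* Caller side: translate the one supported flag, handle zeroing locally. */
void *alloc_contiguous_sketch(size_t bytes, unsigned int gfp)
{
    void *p = cma_alloc_sketch(bytes, (gfp & GFP_NOWARN_SKETCH) != 0);

    if (p && (gfp & GFP_ZERO_SKETCH))
        memset(p, 0, bytes); /* the allocator never zeroes for us */
    return p;
}
```

Because the boolean cannot carry a zeroing request, the type signature itself rules out the mistake of expecting the allocator to honour __GFP_ZERO.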
      
      Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michał Nazarewicz <mina86@mina86.com>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65182029
    • Rik van Riel's avatar
      Revert "mm: always flush VMA ranges affected by zap_page_range" · 50c150f2
      Rik van Riel authored
      There was a bug in Linux that could cause madvise (and mprotect?) system
      calls to return to userspace without the TLB having been flushed for all
      the pages involved.
      
      This could happen when multiple threads of a process made simultaneous
      madvise and/or mprotect calls.
      
      This was noticed in the summer of 2017, at which time two solutions
      were created:
      
        56236a59 ("mm: refactor TLB gathering API")
        99baac21 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
      and
        4647706e ("mm: always flush VMA ranges affected by zap_page_range")
      
      We need only one of these solutions, and the former appears to be a
      little more efficient than the latter, so revert the latter.
      
      This reverts 4647706e ("mm: always flush VMA ranges affected by
      zap_page_range")
      
      Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50c150f2
    • Baoquan He's avatar
      mm/sparse: optimize memmap allocation during sparse_init() · c98aff64
      Baoquan He authored
      In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
      are allocated with the size of NR_MEM_SECTIONS.  They are used to store
      each memory section's usemap and mem map if marked as present.  With the
      help of these two arrays, a contiguous memory chunk is allocated for the
      usemap and memmap of the memory sections on one node.  This avoids
      excessive memory fragmentation.  In the diagram below, '1' indicates a
      present memory section and '0' an absent one.  The number 'n' could be
      much smaller than NR_MEM_SECTIONS on most systems.
      
        |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
        -------------------------------------------------------
         0 1 2 3         4 5         i   i+1     n-1   n
      
      If we fail to populate the page tables to map one section's memmap, its
      ->section_mem_map will be cleared finally to indicate that it's not
      present.  After use, these two arrays will be released at the end of
      sparse_init().
      
      In 4-level paging mode, each array costs 4M, which is negligible.  In
      5-level paging, however, they cost 256M each, 512M altogether.  A kdump
      kernel usually reserves very little memory, e.g. 256M.  So even though
      the arrays are only allocated temporarily, this is still not acceptable.
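The 4M and 256M figures can be reproduced with a short calculation. The sketch below assumes the x86_64 values of the time (128M sections, i.e. SECTION_SIZE_BITS of 27, and MAX_PHYSMEM_BITS of 46 or 52 for 4-level and 5-level paging) and 8-byte pointers; the macro and function names are chosen here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 values, restated here for illustration. */
#define SECTION_SIZE_BITS_SKETCH   27u /* 128M per memory section */
#define MAX_PHYSMEM_BITS_4LEVEL    46u
#define MAX_PHYSMEM_BITS_5LEVEL    52u
#define POINTER_BYTES_SKETCH       8u  /* one pointer per section, 64-bit */

/* Number of memory sections covering the whole physical address space. */
uint64_t nr_mem_sections(unsigned int max_physmem_bits)
{
    assert(max_physmem_bits >= SECTION_SIZE_BITS_SKETCH);
    assert(max_physmem_bits < 64);
    return UINT64_C(1) << (max_physmem_bits - SECTION_SIZE_BITS_SKETCH);
}

/* Bytes needed by one temporary pointer array sized by NR_MEM_SECTIONS. */
uint64_t temp_array_bytes(unsigned int max_physmem_bits)
{
    return nr_mem_sections(max_physmem_bits) * POINTER_BYTES_SKETCH;
}
```

Under these assumptions, 4-level paging gives 2^19 sections and 2^22 bytes (4M) per array, while 5-level paging gives 2^25 sections and 2^28 bytes (256M) per array.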
      
      In fact, there's no need to allocate them with the size of
      NR_MEM_SECTIONS.  Since the ->section_mem_map clearing has been deferred
      to the end, the number of present memory sections stays the same
      during sparse_init() until we finally clear the ->section_mem_map of any
      memory section whose usemap or memmap was not correctly handled.
      Thus, whenever the for_each_present_section_nr() loop is run in between,
      the i-th present memory section is always the same one.
      
      So allocate usemap_map and map_map with only 'nr_present_sections'
      entries.  For the i-th present memory section, install its usemap and
      memmap into usemap_map[i] and map_map[i] during allocation.  Then, in
      the last for_each_present_section_nr() loop, which clears the failed
      memory section's ->section_mem_map, fetch the usemap and memmap from the
      usemap_map[] and map_map[] arrays and set them into mem_section[]
      accordingly.
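The indexing idea can be sketched in plain C. This is a simplified userspace model rather than the kernel code: a bool array stands in for the present markers, int values stand in for the usemap/memmap pointers, and all function names are hypothetical. The key property is that both passes walk the present sections in the same order, so index i always refers to the same section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count sections marked present; plays the role of nr_present_sections. */
size_t count_present(const bool *present, size_t nr_sections)
{
    size_t n = 0;

    for (size_t nr = 0; nr < nr_sections; nr++)
        if (present[nr])
            n++;
    return n;
}

/* First pass: store the data of the i-th present section at compact[i]. */
size_t fill_compact(const bool *present, size_t nr_sections,
                    int *compact, size_t nr_present)
{
    size_t i = 0;

    for (size_t nr = 0; nr < nr_sections; nr++) {
        if (!present[nr])
            continue;
        assert(i < nr_present);
        compact[i++] = (int)nr * 100; /* stand-in for a usemap pointer */
    }
    return i;
}

/* Last pass: same walk order, so compact[i] maps back to the same section. */
void install_from_compact(const bool *present, size_t nr_sections,
                          const int *compact, int *per_section)
{
    size_t i = 0;

    for (size_t nr = 0; nr < nr_sections; nr++) {
        if (!present[nr])
            continue;
        per_section[nr] = compact[i++];
    }
}
```

The compact array needs only as many entries as there are present sections, which is the memory saving the commit describes; it relies on the set of present sections not changing between the two passes.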
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oscar Salvador <osalvador@techadventures.net>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c98aff64
    • Baoquan He's avatar
      mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmap · 9258631b
      Baoquan He authored
      The new parameter passes the size of the map data unit into
      alloc_usemap_and_memmap(), in preparation for the next patch.
      
      Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9258631b
    • Baoquan He's avatar
      mm/sparsemem.c: defer the ms->section_mem_map clearing · 07a34a8c
      Baoquan He authored
      In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the
      system allocates one contiguous memory chunk for the mem maps on one
      node and populates the relevant page tables to map the memory sections
      one by one.  If populating fails for a certain mem section, a warning
      is printed and its ->section_mem_map is cleared to cancel the marking
      of being present.  As a result, the number of mem sections marked as
      present can decrease during sparse_init() execution.
      
      Here, just defer the ms->section_mem_map clearing, when populating the
      page tables failed, until the last for_each_present_section_nr() loop.
      This is in preparation for later optimizing the mem map allocation.
      
      [akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
      Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07a34a8c
    • Baoquan He's avatar
      mm/sparse.c: add a static variable nr_present_sections · f2fc10e0
      Baoquan He authored
      Patch series "mm/sparse: Optimize memmap allocation during
      sparse_init()", v6.
      
      In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
      are allocated with the size of NR_MEM_SECTIONS.  They are used to store
      each memory section's usemap and mem map if it is marked as present.  In
      5-level paging mode, this costs 512M of memory, though it is released
      at the end of sparse_init().  Systems with little memory, like a kdump
      kernel which usually has only about 256M, will fail to boot because of
      allocation failure if CONFIG_X86_5LEVEL=y.
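      
      The 512M figure can be reproduced with a short calculation.  The
      sketch below is not kernel code; it assumes the x86_64 values
      SECTION_SIZE_BITS = 27 and MAX_PHYSMEM_BITS = 52 (5-level) or 46
      (4-level), plus 8-byte pointers, none of which are stated in the text
      above.  The macro and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 values, not taken from the commit message. */
#define SECTION_SIZE_BITS	27	/* 128M per memory section */
#define MAX_PHYSMEM_BITS_4LEVEL	46
#define MAX_PHYSMEM_BITS_5LEVEL	52
#define POINTER_SIZE		8	/* bytes per pointer on a 64-bit kernel */

/* NR_MEM_SECTIONS for a given physical address width. */
static uint64_t nr_mem_sections(unsigned int max_physmem_bits)
{
	assert(max_physmem_bits >= SECTION_SIZE_BITS && max_physmem_bits < 64);
	return UINT64_C(1) << (max_physmem_bits - SECTION_SIZE_BITS);
}

/* Bytes consumed by usemap_map plus map_map, each holding nr_entries pointers. */
static uint64_t temp_arrays_bytes(uint64_t nr_entries)
{
	return 2 * nr_entries * POINTER_SIZE;
}
```

      Under these assumptions, temp_arrays_bytes(nr_mem_sections(52)) is
      2 * 2^25 * 8 bytes = 512M, while the 4-level case needs 8M.  Sized by
      present sections instead, two present sections need only 32 bytes.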
      
      In this patchset, optimize the memmap allocation code to only use
      usemap_map and map_map with the size of nr_present_sections.  This makes
      the kdump kernel boot up with a normal crashkernel='' setting when
      CONFIG_X86_5LEVEL=y.
      
      This patch (of 5):
      
      nr_present_sections records how many memory sections are marked as
      present during system boot up, and will be used in a later patch.
      
      Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2fc10e0
    • Kirill Tkhai's avatar
      mm: use special value SHRINKER_REGISTERING instead of list_empty() check · 7e010df5
      Kirill Tkhai authored
      The patch introduces a special value SHRINKER_REGISTERING to use instead
      of list_empty() to distinguish a registering shrinker from an
      unregistered shrinker.  Why do we need that at all?
      
      Shrinker registration is split into two parts.  The first one is
      prealloc_shrinker(), which allocates shrinker memory and reserves an ID
      in shrinker_idr.  This function can fail.  The second is
      register_shrinker_prepared(), and it finalizes the registration.  This
      function actually makes the shrinker available to be used from
      shrink_slab(), and it can't fail.
      
      One shrinker may be based on more than one LRU list.  So, we never
      clear the bit in memcg shrinker maps when (one of) the corresponding
      LRU lists becomes empty, since other LRU lists may not be empty.  See
      the superblock shrinker for example: it is based on two LRU lists,
      s_inode_lru and s_dentry_lru.  We do not want to clear the shrinker bit
      when there are no inodes in s_inode_lru, as s_dentry_lru may contain
      dentries.
      
      Instead, we use a special algorithm to detect shrinkers that have no
      elements in any of their LRU lists; this is done in shrink_slab_memcg().
      See the comment in this function for the details.
      
      Also, in shrink_slab_memcg() we clear the shrinker bit in the map when
      we meet an unregistered shrinker (the bit is set, while there is no
      shrinker in the IDR).  Otherwise, we would have to do that at the moment
      of shrinker unregistration for all memcgs (and this looks worse, since
      iterating over all memcgs may take much time).  This would also impose
      restrictions on shrinker unregistration order for its users: they would
      have to guarantee that there are no new elements after
      unregister_shrinker() (otherwise, a newly added element would set a
      bit).
      
      So, if we meet a set bit in map and no shrinker in IDR when we're
      iterating over the map in shrink_slab_memcg(), this means the
      corresponding shrinker is unregistered, and we must clear the bit.
      
      Another case is shrinker registration.  We want two things there:
      
      1) do_shrink_slab() can be called only for completely registered
         shrinkers;
      
      2) shrinker internal lists may be populated in any order relative to
         register_shrinker_prepared() (take sb as an example).  Both of:
      
        a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
          memcg_set_shrinker_bit();                               [cpu0]
          ...
          register_shrinker_prepared();                           [cpu1]
      
        and
      
        b)register_shrinker_prepared();                           [cpu0]
          ...
          list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
          memcg_set_shrinker_bit();                               [cpu1]
      
         are legitimate.  We don't want to impose a restriction here and
         force people to use only the (b) variant.  We don't want to force
         people to ensure that there are no elements in the LRU lists before
         the shrinker is completely registered.  Internal users of LRU lists
         and the shrinker code are two different subsystems, and they have to
         be self-contained with respect to each other.
      
      In case (a), the bit is set before the shrinker is completely
      registered.  We don't want do_shrink_slab() to be called at this
      moment, so we have to detect such registering shrinkers.
      
      Before this patch, a list_empty() check (shrinker is not linked to the
      list) was used for that.  So, in (a) there could be a bit set, but we
      didn't call do_shrink_slab() unless the shrinker was linked to the
      list.  It was just an indicator; I simply overloaded linking to the list.
      
      This was not the best solution, since it's better not to touch the
      shrinker memory from shrink_slab_memcg() before it's completely
      registered (this also will be useful in the future to make shrink_slab()
      completely lockless).
      
      So, this patch introduces a better way to detect a registering
      shrinker, which avoids dereferencing shrinker memory.  It's just a ~0UL
      value, which we insert into the IDR during ID allocation.  After the
      shrinker is ready to be used, we insert the actual shrinker pointer
      into the IDR, and it becomes available to shrink_slab_memcg().
      
      We can't use NULL instead of this new value for this purpose:
      shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
      and we don't want the function to see NULL and clear the bit, otherwise
      (a) won't work.
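      
      The three slot states can be sketched in user-space C as follows.
      This is a minimal illustration, not the kernel implementation: a plain
      pointer array stands in for the IDR, and struct shrinker_sketch,
      NR_IDS, the *_sketch helpers and enum slot_action are hypothetical
      names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_IDS 4	/* hypothetical size of the stand-in ID table */

struct shrinker_sketch {
	int id;
};

/* All-ones sentinel, mirroring the ~0UL value described above. */
#define SHRINKER_REGISTERING ((struct shrinker_sketch *)UINTPTR_MAX)

enum slot_action {
	SLOT_CLEAR_BIT,	/* NULL: unregistered, clear the memcg map bit */
	SLOT_SKIP,	/* sentinel: registration in progress, keep the bit */
	SLOT_SHRINK,	/* real pointer: safe to call do_shrink_slab() */
};

/* Step 1 (the part that may fail in the kernel): reserve the ID. */
static void prealloc_sketch(struct shrinker_sketch **table, size_t id)
{
	assert(id < NR_IDS);
	table[id] = SHRINKER_REGISTERING;
}

/* Step 2 (cannot fail): publish the real pointer under the reserved ID. */
static void register_prepared_sketch(struct shrinker_sketch **table,
				     struct shrinker_sketch *s)
{
	assert(s->id >= 0 && (size_t)s->id < NR_IDS);
	table[s->id] = s;
}

/* Decide what the map walker does with a slot, without dereferencing it. */
static enum slot_action classify_slot(const struct shrinker_sketch *slot)
{
	if (!slot)
		return SLOT_CLEAR_BIT;
	if (slot == SHRINKER_REGISTERING)
		return SLOT_SKIP;
	return SLOT_SHRINK;
}
```

      In case (a), a bit set before step 2 hits SLOT_SKIP and therefore
      survives until the shrinker is published; with NULL as the marker it
      would have been cleared.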
      
      This is the only thing the patch does: provide a better way to detect
      a registering shrinker.  It changes nothing else.
      
      This also gives better assembly, but that is a minor aspect of the patch:
      
      Before:
        callq  <idr_find>
        mov    %rax,%r15
        test   %rax,%rax
        je     <shrink_slab_memcg+0x1d5>
        mov    0x20(%rax),%rax
        lea    0x20(%r15),%rdx
        cmp    %rax,%rdx
        je     <shrink_slab_memcg+0xbd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  <do_shrink_slab>
      
      After:
        callq  <idr_find>
        mov    %rax,%r15
        lea    -0x1(%rax),%rax
        cmp    $0xfffffffffffffffd,%rax
        ja     <shrink_slab_memcg+0x1cd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  ffffffff810cefd0 <do_shrink_slab>
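      
      The lea/cmp/ja sequence in the "After" listing is a single unsigned
      range check that covers both NULL and the all-ones sentinel.  A small
      sketch of the equivalence, assuming a 64-bit uintptr_t so that
      UINTPTR_MAX - 2 equals the 0xfffffffffffffffd immediate; the function
      names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: skip when the slot is NULL or the all-ones sentinel. */
static int skip_slot_plain(uintptr_t slot)
{
	return slot == 0 || slot == UINTPTR_MAX;
}

/*
 * Folded form matching the lea/cmp/ja sequence: subtracting 1 wraps 0 to
 * UINTPTR_MAX and maps the sentinel to UINTPTR_MAX - 1, so exactly these
 * two values end up above UINTPTR_MAX - 2 in an unsigned comparison.
 */
static int skip_slot_folded(uintptr_t slot)
{
	return (uintptr_t)(slot - 1) > UINTPTR_MAX - 2;
}

/* Helper that checks both forms agree for one value. */
static void check_forms_agree(uintptr_t slot)
{
	assert(skip_slot_plain(slot) == skip_slot_folded(slot));
}
```

      Because the check only compares the value returned by idr_find(), the
      two loads from shrinker memory seen in the "Before" listing are gone.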
      
      [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
        Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
      Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e010df5
    • Kirill Tkhai's avatar
      mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab() · ac7fb3ad
      Kirill Tkhai authored
      In the shrink_slab_memcg() case, we do not zero nid when the shrinker
      is not NUMA-aware.  This is not a real problem, since currently all
      memcg-aware shrinkers are NUMA-aware too (we have two: the super_block
      shrinker and the workingset shrinker), but something may change in the
      future.
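      
      A minimal sketch of the idea named in the subject line: doing the nid
      normalization inside the common do_shrink_slab() path, so that every
      caller gets it.  The struct layouts, the flag bit value and the
      function name below are simplified assumptions for illustration, not
      the kernel definitions.

```c
#include <assert.h>

#define SHRINKER_NUMA_AWARE (1 << 0)	/* flag bit value assumed for this sketch */

/* Hypothetical, reduced stand-ins for the kernel structures. */
struct shrink_control_sketch {
	int nid;		/* node the caller asked to shrink */
};

struct shrinker_sketch {
	unsigned long flags;
};

/*
 * Common entry point: a shrinker that is not NUMA-aware keeps all of its
 * objects under node 0, so nid is forced to 0 here regardless of which
 * caller invoked the function.  Returns the node that would be scanned.
 */
static int do_shrink_slab_sketch(struct shrink_control_sketch *sc,
				 const struct shrinker_sketch *shrinker)
{
	assert(sc->nid >= 0);
	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
		sc->nid = 0;
	return sc->nid;
}
```

      With the check in one place, a future memcg-aware shrinker that is not
      NUMA-aware would be handled without touching its callers.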
      
      Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac7fb3ad