1. 30 Jun, 2021 2 commits
    • Linus Torvalds's avatar
      Merge tag 'fs.mount_setattr.nosymfollow.v5.14' of... · 30d1a556
      Linus Torvalds authored
      Merge tag 'fs.mount_setattr.nosymfollow.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
      
      Pull mount_setattr updates from Christian Brauner:
       "A few releases ago the old mount API gained support for a mount
        options which prevents following symlinks on a given mount. This adds
        support for it in the new mount api through the MOUNT_ATTR_NOSYMFOLLOW
        flag via mount_setattr() and fsmount(). With mount_setattr() that flag
        can even be applied recursively.
      
        There's an additional ack from Ross Zwisler who originally authored
        the nosymfollow patch. As I've already had the patches in my for-next
        I didn't add his ack explicitly"
      
      * tag 'fs.mount_setattr.nosymfollow.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        tests: test MOUNT_ATTR_NOSYMFOLLOW with mount_setattr()
        mount: Support "nosymfollow" in new mount api
      30d1a556
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 65090f30
      Linus Torvalds authored
      Merge misc updates from Andrew Morton:
       "191 patches.
      
        Subsystems affected by this patch series: kthread, ia64, scripts,
        ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
        slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
        mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
        pagealloc, and memory-failure)"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
        mm,hwpoison: make get_hwpoison_page() call get_any_page()
        mm,hwpoison: send SIGBUS with error virutal address
        mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
        mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
        mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
        mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
        docs: remove description of DISCONTIGMEM
        arch, mm: remove stale mentions of DISCONIGMEM
        mm: remove CONFIG_DISCONTIGMEM
        m68k: remove support for DISCONTIGMEM
        arc: remove support for DISCONTIGMEM
        arc: update comment about HIGHMEM implementation
        alpha: remove DISCONTIGMEM and NUMA
        mm/page_alloc: move free_the_page
        mm/page_alloc: fix counting of managed_pages
        mm/page_alloc: improve memmap_pages dbg msg
        mm: drop SECTION_SHIFT in code comments
        mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
        mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
        mm/page_alloc: scale the number of pages that are batch freed
        ...
      65090f30
  2. 29 Jun, 2021 38 commits
    • Linus Torvalds's avatar
      Merge tag 'devprop-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 349a2d52
      Linus Torvalds authored
      Pull device properties framework updates from Rafael Wysocki:
       "These unify device properties access in some pieces of code and make
        related changes.
      
        Specifics:
      
         - Handle device properties with software node API in the ACPI IORT
           table parsing code (Heikki Krogerus).
      
         - Unify of_node access in the common device properties code, constify
           the acpi_dma_supported() argument pointer and fix up CONFIG_ACPI=n
           stubs of some functions related to device properties (Andy
           Shevchenko)"
      
      * tag 'devprop-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        device property: Unify access to of_node
        ACPI: scan: Constify acpi_dma_supported() helper function
        ACPI: property: Constify stubs for CONFIG_ACPI=n case
        ACPI: IORT: Handle device properties with software node API
        device property: Retrieve fwnode from of_node via accessor
      349a2d52
    • Linus Torvalds's avatar
      Merge tag 'pnp-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 72ad9f9d
      Linus Torvalds authored
      Pull PNP updates from Rafael Wysocki:
       "These get rid of unnecessary local variables and function, reduce code
        duplication and clean up message printing.
      
        Specifics:
      
         - Remove unnecessary local variables from isapnp_proc_attach_device()
           (Anupama K Patil).
      
         - Make the callers of pnp_alloc() use kzalloc() directly and drop the
           former (Heiner Kallweit).
      
         - Make two pieces of code use dev_dbg() instead of dev_printk() with
           the KERN_DEBUG message level (Heiner Kallweit).
      
         - Use DEVICE_ATTR_RO() instead of full DEVICE_ATTR() in some places
           in card.c (Zhen Lei).
      
         - Use list_for_each_entry() instead of list_for_each() in
           insert_device() (Zou Wei)"
      
      * tag 'pnp-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PNP: pnpbios: Use list_for_each_entry() instead of list_for_each()
        PNP: use DEVICE_ATTR_RO macro
        PNP: Switch over to dev_dbg()
        PNP: Remove pnp_alloc()
        drivers: pnp: isapnp: proc.c: Remove unnecessary local variables
      72ad9f9d
    • Linus Torvalds's avatar
      Merge tag 'acpi-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 5e692824
      Linus Torvalds authored
      Pull ACPI updates from Rafael Wysocki:
       "These update the ACPICA code in the kernel to the 20210604 upstream
        revision, add preliminary support for the Platform Runtime Mechanism
        (PRM), address issues related to the handling of device dependencies
        in the ACPI device eunmeration code, improve the tracking of ACPI
        power resource states, improve the ACPI support for suspend-to-idle on
        AMD systems, continue the unification of message printing in the ACPI
        code, address assorted issues and clean up the code in a number of
        places.
      
        Specifics:
      
         - Update ACPICA code in the kernel to upstrea revision 20210604
           including the following changes:
      
            - Add defines for the CXL Host Bridge Structureand and add the
              CFMWS structure definition to CEDT (Alison Schofield).
            - iASL: Finish support for the IVRS ACPI table (Bob Moore).
            - iASL: Add support for the SVKL table (Bob Moore).
            - iASL: Add full support for RGRT ACPI table (Bob Moore).
            - iASL: Add support for the BDAT ACPI table (Bob Moore).
            - iASL: add disassembler support for PRMT (Erik Kaneda).
            - Fix memory leak caused by _CID repair function (Erik Kaneda).
            - Add support for PlatformRtMechanism OpRegion (Erik Kaneda).
            - Add PRMT module header to facilitate parsing (Erik Kaneda).
            - Add _PLD panel positions (Fabian Wüthrich).
            - MADT: add Multiprocessor Wakeup Mailbox Structure and the SVKL
              table headers (Kuppuswamy Sathyanarayanan).
            - Use ACPI_FALLTHROUGH (Wei Ming Chen).
      
         - Add preliminary support for the Platform Runtime Mechanism (PRM) to
           allow the AML interpreter to call PRM functions (Erik Kaneda).
      
         - Address some issues related to the handling of device dependencies
           reported by _DEP in the ACPI device enumeration code and clean up
           some related pieces of it (Rafael Wysocki).
      
         - Improve the tracking of states of ACPI power resources (Rafael
           Wysocki).
      
         - Improve ACPI support for suspend-to-idle on AMD systems (Alex
           Deucher, Mario Limonciello, Pratik Vishwakarma).
      
         - Continue the unification and cleanup of message printing in the
           ACPI code (Hanjun Guo, Heiner Kallweit).
      
         - Fix possible buffer overrun issue with the description_show() sysfs
           attribute method (Krzysztof Wilczyński).
      
         - Improve the acpi_mask_gpe kernel command line parameter handling
           and clean up the core ACPI code related to sysfs (Andy Shevchenko,
           Baokun Li, Clayton Casciato).
      
         - Postpone bringing devices in the general ACPI PM domain to D0
           during resume from system-wide suspend until they are really needed
           (Dmitry Torokhov).
      
         - Make the ACPI processor driver fix up C-state latency if not
           ordered (Mario Limonciello).
      
         - Add support for identifying devices depening on the given one that
           are not its direct descendants with the help of _DEP (Daniel
           Scally).
      
         - Extend the checks related to ACPI IRQ overrides on x86 in order to
           avoid false-positives (Hui Wang).
      
         - Add battery DPTF participant for Intel SoCs (Sumeet Pawnikar).
      
         - Rearrange the ACPI fan driver and device power management code to
           use a common list of device IDs (Rafael Wysocki).
      
         - Fix clang CFI violation in the ACPI BGRT table parsing code and
           clean it up (Nathan Chancellor).
      
         - Add GPE-related quirks for some laptops to the EC driver (Chris
           Chiu, Zhang Rui).
      
         - Make the ACPI PPTT table parsing code populate the cache-id value
           if present in the firmware (James Morse).
      
         - Remove redundant clearing of context->ret.pointer from
           acpi_run_osc() (Hans de Goede).
      
         - Add missing acpi_put_table() in acpi_init_fpdt() (Jing Xiangfeng).
      
         - Make ACPI APEI handle ARM Processor Error CPER records like Memory
           Error ones to avoid user space task lockups (Xiaofei Tan).
      
         - Stop warning about disabled ACPI in APEI (Jon Hunter).
      
         - Fix fall-through warning for Clang in the SBSHC driver (Gustavo A.
           R. Silva).
      
         - Add custom DSDT file as Makefile prerequisite (Richard Fitzgerald).
      
         - Initialize local variable to avoid garbage being returned (Colin
           Ian King).
      
         - Simplify assorted pieces of code, address assorted coding style and
           documentation issues and comment typos (Baokun Li, Christophe
           JAILLET, Clayton Casciato, Liu Shixin, Shaokun Zhang, Wei Yongjun,
           Yang Li, Zhen Lei)"
      
      * tag 'acpi-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (97 commits)
        ACPI: PM: postpone bringing devices to D0 unless we need them
        ACPI: tables: Add custom DSDT file as makefile prerequisite
        ACPI: bgrt: Use sysfs_emit
        ACPI: bgrt: Fix CFI violation
        ACPI: EC: trust DSDT GPE for certain HP laptop
        ACPI: scan: Simplify acpi_table_events_fn()
        ACPI: PM: Adjust behavior for field problems on AMD systems
        ACPI: PM: s2idle: Add support for new Microsoft UUID
        ACPI: PM: s2idle: Add support for multiple func mask
        ACPI: PM: s2idle: Refactor common code
        ACPI: PM: s2idle: Use correct revision id
        ACPI: sysfs: Remove tailing return statement in void function
        ACPI: sysfs: Use __ATTR_RO() and __ATTR_RW() macros
        ACPI: sysfs: Sort headers alphabetically
        ACPI: sysfs: Refactor param_get_trace_state() to drop dead code
        ACPI: sysfs: Unify pattern of memory allocations
        ACPI: sysfs: Allow bitmap list to be supplied to acpi_mask_gpe
        ACPI: sysfs: Make sparse happy about address space in use
        ACPI: scan: Fix race related to dropping dependencies
        ACPI: scan: Reorganize acpi_device_add()
        ...
      5e692824
    • Linus Torvalds's avatar
      Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 3563f55c
      Linus Torvalds authored
      Pull power management updates from Rafael Wysocki:
       "These add hybrid processors support to the intel_pstate driver and
        make it work with more processor models when HWP is disabled, make the
        intel_idle driver use special C6 idle state paremeters when package
        C-states are disabled, add cooling support to the tegra30 devfreq
        driver, rework the TEO (timer events oriented) cpuidle governor,
        extend the OPP (operating performance points) framework to use the
        required-opps DT property in more cases, fix some issues and clean up
        a number of assorted pieces of code.
      
        Specifics:
      
         - Make intel_pstate support hybrid processors using abstract
           performance units in the HWP interface (Rafael Wysocki).
      
         - Add Icelake servers and Cometlake support in no-HWP mode to
           intel_pstate (Giovanni Gherdovich).
      
         - Make cpufreq_online() error path be consistent with the CPU device
           removal path in cpufreq (Rafael Wysocki).
      
         - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
           Randy Dunlap, Shaokun Zhang).
      
         - Make intel_idle use special idle state parameters for C6 when
           package C-states are disabled (Chen Yu).
      
         - Rework the TEO (timer events oriented) cpuidle governor to address
           some theoretical shortcomings in it (Rafael Wysocki).
      
         - Drop unneeded semicolon from the TEO governor (Wan Jiabing).
      
         - Modify the runtime PM framework to accept unassigned suspend and
           resume callback pointers (Ulf Hansson).
      
         - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).
      
         - Improve device performance states support in the generic power
           domains (genpd) framework (Ulf Hansson).
      
         - Fix some documentation issues in genpd (Yang Yingliang).
      
         - Make the operating performance points (OPP) framework use the
           required-opps DT property in use cases that are not related to
           genpd (Hsin-Yi Wang).
      
         - Make lazy_link_required_opp_table() use list_del_init instead of
           list_del/INIT_LIST_HEAD (Yang Yingliang).
      
         - Simplify wake IRQs handling in the core system-wide sleep support
           code and clean up some coding style inconsistencies in it (Tian
           Tao, Zhen Lei).
      
         - Add cooling support to the tegra30 devfreq driver and improve its
           DT bindings (Dmitry Osipenko).
      
         - Fix some assorted issues in the devfreq core and drivers (Chanwoo
           Choi, Dong Aisheng, YueHaibing)"
      
      * tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
        PM / devfreq: passive: Fix get_target_freq when not using required-opp
        cpufreq: Make cpufreq_online() call driver->offline() on errors
        opp: Allow required-opps to be used for non genpd use cases
        cpuidle: teo: remove unneeded semicolon in teo_select()
        dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
        dt-bindings: devfreq: tegra30-actmon: Convert to schema
        PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
        PM: runtime: Clarify documentation when callbacks are unassigned
        PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
        PM: runtime: Improve path in rpm_idle() when no callback
        PM: hibernate: remove leading spaces before tabs
        PM: sleep: remove trailing spaces and tabs
        PM: domains: Drop/restore performance state votes for devices at runtime PM
        PM: domains: Return early if perf state is already set for the device
        PM: domains: Split code in dev_pm_genpd_set_performance_state()
        cpuidle: teo: Use kerneldoc documentation in admin-guide
        cpuidle: teo: Rework most recent idle duration values treatment
        cpuidle: teo: Change the main idle state selection logic
        cpuidle: teo: Cosmetic modification of teo_select()
        cpuidle: teo: Cosmetic modifications of teo_update()
        ...
      3563f55c
    • Linus Torvalds's avatar
      Merge tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1dfb0f47
      Linus Torvalds authored
      Pull x86 entry code related updates from Thomas Gleixner:
      
       - Consolidate the macros for .byte ... opcode sequences
      
       - Deduplicate register offset defines in include files
      
       - Simplify the ia32,x32 compat handling of the related syscall tables
         to get rid of #ifdeffery.
      
       - Clear all EFLAGS which are not required for syscall handling
      
       - Consolidate the syscall tables and switch the generation over to the
         generic shell script and remove the CFLAGS tweaks which are not
         longer required.
      
       - Use 'int' type for system call numbers to match the generic code.
      
       - Add more selftests for syscalls
      
      * tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/syscalls: Don't adjust CFLAGS for syscall tables
        x86/syscalls: Remove -Wno-override-init for syscall tables
        x86/uml/syscalls: Remove array index from syscall initializers
        x86/syscalls: Clear 'offset' and 'prefix' in case they are set in env
        x86/entry: Use int everywhere for system call numbers
        x86/entry: Treat out of range and gap system calls the same
        x86/entry/64: Sign-extend system calls on entry to int
        selftests/x86/syscall: Add tests under ptrace to syscall_numbering_64
        selftests/x86/syscall: Simplify message reporting in syscall_numbering
        selftests/x86/syscall: Update and extend syscall_numbering_64
        x86/syscalls: Switch to generic syscallhdr.sh
        x86/syscalls: Use __NR_syscalls instead of __NR_syscall_max
        x86/unistd: Define X32_NR_syscalls only for 64-bit kernel
        x86/syscalls: Stop filling syscall arrays with *_sys_ni_syscall
        x86/syscalls: Switch to generic syscalltbl.sh
        x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*
      1dfb0f47
    • Linus Torvalds's avatar
      Merge tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a22c3f61
      Linus Torvalds authored
      Pull x86 interrupt related updates from Thomas Gleixner:
      
       - Consolidate the VECTOR defines and the usage sites.
      
       - Cleanup GDT/IDT related code and replace open coded ASM with proper
         native helper functions.
      
      * tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/kexec: Set_[gi]dt() -> native_[gi]dt_invalidate() in machine_kexec_*.c
        x86: Add native_[ig]dt_invalidate()
        x86/idt: Remove address argument from idt_invalidate()
        x86/irq: Add and use NR_EXTERNAL_VECTORS and NR_SYSTEM_VECTORS
        x86/irq: Remove unused vectors defines
      a22c3f61
    • Linus Torvalds's avatar
      Merge tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a941a034
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "Time and clocksource/clockevent related updates:
      
        Core changes:
      
         - Infrastructure to support per CPU "broadcast" devices for per CPU
           clockevent devices which stop in deep idle states. This allows us
           to utilize the more efficient architected timer on certain ARM SoCs
           for normal operation instead of permanentely using the slow to
           access SoC specific clockevent device.
      
         - Print the name of the broadcast/wakeup device in /proc/timer_list
      
         - Make the clocksource watchdog more robust against delays between
           reading the current active clocksource and the watchdog
           clocksource. Such delays can be caused by NMIs, SMIs and vCPU
           preemption.
      
           Handle this by reading the watchdog clocksource twice, i.e. before
           and after reading the current active clocksource. In case that the
           two watchdog reads shows an excessive time delta, the read sequence
           is repeated up to 3 times.
      
         - Improve the debug output and add a test module for the watchdog
           mechanism.
      
         - Reimplementation of the venerable time64_to_tm() function with a
           faster and significantly smaller version. Straight from the source,
           i.e. the author of the related research paper contributed this!
      
        Driver changes:
      
         - No new drivers, not even new device tree bindings!
      
         - Fixes, improvements and cleanups and all over the place"
      
      * tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
        time/kunit: Add missing MODULE_LICENSE()
        time: Improve performance of time64_to_tm()
        clockevents: Use list_move() instead of list_del()/list_add()
        clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
        clocksource: Provide kernel module to test clocksource watchdog
        clocksource: Reduce clocksource-skew threshold
        clocksource: Limit number of CPUs checked for clock synchronization
        clocksource: Check per-CPU clock synchronization when marked unstable
        clocksource: Retry clock read if long delays detected
        clockevents: Add missing parameter documentation
        clocksource/drivers/timer-ti-dm: Drop unnecessary restore
        clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
        clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer
        clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static
        arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
        clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes
        clocksource/drivers/ingenic: Rename unreasonable array names
        clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
        clocksource/drivers/mediatek: Ack and disable interrupts on suspend
        clocksource/drivers/samsung_pwm: Constify source IO memory
        ...
      a941a034
    • Linus Torvalds's avatar
      Merge tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 21edf509
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "Updates for the interrupt subsystem:
      
        Core changes:
      
         - Cleanup and simplification of common code to invoke the low level
           interrupt flow handlers when this invocation requires irqdomain
           resolution. Add the necessary core infrastructure.
      
         - Provide a proper interface for modular PMU drivers to set the
           interrupt affinity.
      
         - Add a request flag which allows to exclude interrupts from spurious
           interrupt detection. Useful especially for IPI handlers which
           always return IRQ_HANDLED which turns the spurious interrupt
           detection into a pointless waste of CPU cycles.
      
        Driver changes:
      
         - Bulk convert interrupt chip drivers to the new irqdomain low level
           flow handler invocation mechanism.
      
         - Add device tree bindings for the Renesas R-Car M3-W+ SoC
      
         - Enable modular build of the Qualcomm PDC driver
      
         - The usual small fixes and improvements"
      
      * tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
        dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties
        irqchip: gic-pm: Remove redundant error log of clock bulk
        irqchip/sun4i: Remove unnecessary oom message
        irqchip/irq-imx-gpcv2: Remove unnecessary oom message
        irqchip/imgpdc: Remove unnecessary oom message
        irqchip/gic-v3-its: Remove unnecessary oom message
        irqchip/gic-v2m: Remove unnecessary oom message
        irqchip/exynos-combiner: Remove unnecessary oom message
        irqchip: Bulk conversion to generic_handle_domain_irq()
        genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()
        genirq: Add generic_handle_domain_irq() helper
        irqchip/nvic: Convert from handle_IRQ() to handle_domain_irq()
        irqdesc: Fix __handle_domain_irq() comment
        genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co
        irqdomain: Introduce irq_resolve_mapping()
        irqdomain: Protect the linear revmap with RCU
        irqdomain: Cache irq_data instead of a virq number in the revmap
        irqdomain: Use struct_size() helper when allocating irqdomain
        irqdomain: Make normal and nomap irqdomains exclusive
        powerpc: Move the use of irq_domain_add_nomap() behind a config option
        ...
      21edf509
    • Linus Torvalds's avatar
      Merge tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 62180152
      Linus Torvalds authored
      Pull CPU hotplug fix from Thomas Gleixner:
       "A fix for the CPU hotplug and cpusets interaction:
      
        cpusets delegate the hotplug work to a workqueue to prevent a lock
        order inversion vs. the CPU hotplug lock. The work is not flushed
        before the hotplug operation returns which creates user visible
        inconsistent state. Prevent this by flushing the work after dropping
        CPU hotplug lock and before releasing the outer mutex which serializes
        the CPU hotplug related sysfs interface operations"
      
      * tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        cpu/hotplug: Cure the cpusets trainwreck
      62180152
    • Linus Torvalds's avatar
      Merge tag 'smp-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 371fb854
      Linus Torvalds authored
      Pull CPU hotplug cleanup from Thomas Gleixner:
       "A simple cleanup for the CPU hotplug code to avoid per_cpu_ptr()
        reevaluation"
      
      * tag 'smp-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        cpu/hotplug: Simplify access to percpu cpuhp_state
      371fb854
    • Linus Torvalds's avatar
      Merge tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux · e563592c
      Linus Torvalds authored
      Pull printk updates from Petr Mladek:
      
       - Add %pt[RT]s modifier to vsprintf(). It overrides ISO 8601 separator
         by using ' ' (space). It produces "YYYY-mm-dd HH:MM:SS" instead of
         "YYYY-mm-ddTHH:MM:SS".
      
       - Correctly parse long row of numbers by sscanf() when using the field
         width. Add extensive sscanf() selftest.
      
       - Generalize re-entrant CPU lock that has already been used to
         serialize dump_stack() output. It is part of the ongoing printk
         rework. It will allow to remove the obsoleted printk_safe buffers and
         introduce atomic consoles.
      
       - Some code clean up and sparse warning fixes.
      
      * tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
        printk: fix cpu lock ordering
        lib/dump_stack: move cpu lock to printk.c
        printk: Remove trailing semicolon in macros
        random32: Fix implicit truncation warning in prandom_seed_state()
        lib: test_scanf: Remove pointless use of type_min() with unsigned types
        selftests: lib: Add wrapper script for test_scanf
        lib: test_scanf: Add tests for sscanf number conversion
        lib: vsprintf: Fix handling of number field widths in vsscanf
        lib: vsprintf: scanf: Negative number must have field width > 1
        usb: host: xhci-tegra: Switch to use %ptTs
        nilfs2: Switch to use %ptTs
        kdb: Switch to use %ptTs
        lib/vsprintf: Allow to override ISO 8601 date and time separator
      e563592c
    • Linus Torvalds's avatar
      Merge tag 'hyperv-next-signed-20210629' of... · b694011a
      Linus Torvalds authored
      Merge tag 'hyperv-next-signed-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
      
      Pull hyperv updates from Wei Liu:
       "Just a few minor enhancement patches and bug fixes"
      
      * tag 'hyperv-next-signed-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
        PCI: hv: Add check for hyperv_initialized in init_hv_pci_drv()
        Drivers: hv: Move Hyper-V extended capability check to arch neutral code
        drivers: hv: Fix missing error code in vmbus_connect()
        x86/hyperv: fix logical processor creation
        hv_utils: Fix passing zero to 'PTR_ERR' warning
        scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
        Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
        hv_balloon: Remove redundant assignment to region_start
      b694011a
    • Naoya Horiguchi's avatar
      mm,hwpoison: make get_hwpoison_page() call get_any_page() · 0ed950d1
      Naoya Horiguchi authored
      __get_hwpoison_page() could fail to grab refcount by some race condition,
      so it's helpful if we can handle it by retrying.  We already have retry
      logic, so make get_hwpoison_page() call get_any_page() when called from
      memory_failure().
      
      As a result, get_hwpoison_page() can return negative values (i.e.  error
      code), so some callers are also changed to handle error cases.
      soft_offline_page() does nothing for -EBUSY because that's enough and
      users in userspace can easily handle it.  unpoison_memory() is also
      unchanged because it's broken and need thorough fixes (will be done
      later).
      
      Link: https://lkml.kernel.org/r/20210603233632.2964832-3-nao.horiguchi@gmail.comSigned-off-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ed950d1
    • Naoya Horiguchi's avatar
      mm,hwpoison: send SIGBUS with error virutal address · a3f5d80e
      Naoya Horiguchi authored
      Now an action required MCE in already hwpoisoned address surely sends a
      SIGBUS to current process, but the SIGBUS doesn't convey error virtual
      address.  That's not optimal for hwpoison-aware applications.
      
      To fix the issue, make memory_failure() call kill_accessing_process(),
      that does pagetable walk to find the error virtual address.  It could find
      multiple virtual addresses for the same error page, and it seems hard to
      tell which virtual address is correct one.  But that's rare and sending
      incorrect virtual address could be better than no address.  So let's
      report the first found virtual address for now.
      
      [naoya.horiguchi@nec.com: fix walk_page_range() return]
        Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp
      
      Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.comSigned-off-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jue Wang <juew@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3f5d80e
    • Mel Gorman's avatar
      mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes · 203c06ee
      Mel Gorman authored
      Dave Hansen reported the following about Feng Tang's tests on a machine
      with persistent memory onlined as a DRAM-like device.
      
        Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
        ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
        use" mode and being managed via the buddy just like the normal RAM.
      
        The PMEM zones are big ones:
      
              present  65011712 = 248 G
              high       134595 = 525 M
      
        The PMEM nodes, of course, don't have any CPUs in them.
      
        With your series, the pcp->high value per-cpu is 69584 pages or about
        270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
        worst-case memory in the pcps per zone, or roughly 10% of the size of
        the zone.
      
      This should not cause a problem as such although it could trigger reclaim
      due to pages being stored on per-cpu lists for CPUs remote to a node.  It
      is not possible to treat cpuless nodes exactly the same as normal nodes
      but the worst-case scenario can be mitigated by splitting pcp->high across
      all online CPUs for cpuless memory nodes.
      
      Link: https://lkml.kernel.org/r/20210616110743.GK30378@techsingularity.netSuggested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Tang, Feng" <feng.tang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      203c06ee
    • Mel Gorman's avatar
      mm/page_alloc: allow high-order pages to be stored on the per-cpu lists · 44042b44
      Mel Gorman authored
      The per-cpu page allocator (PCP) only stores order-0 pages.  This means
      that all THP and "cheap" high-order allocations including SLUB contends on
      the zone->lock.  This patch extends the PCP allocator to store THP and
      "cheap" high-order pages.  Note that struct per_cpu_pages increases in
      size to 256 bytes (4 cache lines) on x86-64.
      
      Note that this is not necessarily a universal performance win because of
      how it is implemented.  High-order pages can cause pcp->high to be
      exceeded prematurely for lower-orders so for example, a large number of
      THP pages being freed could release order-0 pages from the PCP lists.
      Hence, much depends on the allocation/free pattern as observed by a single
      CPU to determine if caching helps or hurts a particular workload.
      
      That said, basic performance testing passed.  The following is a netperf
      UDP_STREAM test which hits the relevant patches as some of the network
      allocations are high-order.
      
      netperf-udp
                                       5.13.0-rc2             5.13.0-rc2
                                 mm-pcpburst-v3r4   mm-pcphighorder-v1r7
      Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
      Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
      Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
      Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
      Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
      Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
      Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
      Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
      Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*
      
      Functionally, a patch like this is necessary to make bulk allocation of
      high-order pages work with similar performance to order-0 bulk
      allocations.  The bulk allocator is not updated in this series as it would
      have to be determined by bulk allocation users how they want to track the
      order of pages allocated with the bulk allocator.
      
      Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      44042b44
    • Mike Rapoport's avatar
      mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM · 43b02ba9
      Mike Rapoport authored
      After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
      configuration option is equivalent to FLATMEM.
      
      Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43b02ba9
    • Mike Rapoport's avatar
      mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA · a9ee6cf5
      Mike Rapoport authored
      After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
      configuration options are equivalent.
      
      Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
      
      Done with
      
      	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
      		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
      	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
      		$(git grep -wl NEED_MULTIPLE_NODES)
      
      with manual tweaks afterwards.
      
      [rppt@linux.ibm.com: fix arm boot crash]
        Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9ee6cf5
    • Mike Rapoport's avatar
      docs: remove description of DISCONTIGMEM · 48d9f335
      Mike Rapoport authored
      Remove description of DISCONTIGMEM from the "Memory Models" document and
      update VM sysctl description so that it won't mention DISCONIGMEM.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-8-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48d9f335
    • Mike Rapoport's avatar
      arch, mm: remove stale mentions of DISCONIGMEM · d3c251ab
      Mike Rapoport authored
      There are several places that mention DISCONIGMEM in comments or have
      stale code guarded by CONFIG_DISCONTIGMEM.
      
      Remove the dead code and update the comments.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-7-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3c251ab
    • Mike Rapoport's avatar
      mm: remove CONFIG_DISCONTIGMEM · bb1c50d3
      Mike Rapoport authored
      There are no architectures that support DISCONTIGMEM left.
      
      Remove the configuration option and the dead code it was guarding in the
      generic memory management code.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb1c50d3
    • Mike Rapoport's avatar
      m68k: remove support for DISCONTIGMEM · 5ab06e10
      Mike Rapoport authored
      DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
      in v5.11.
      
      Remove the support for DISCONTIGMEM entirely.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-5-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ab06e10
    • Mike Rapoport's avatar
      arc: remove support for DISCONTIGMEM · 8b793b44
      Mike Rapoport authored
      DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
      in v5.11.
      
      Remove the support for DISCONTIGMEM entirely.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-4-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b793b44
    • Mike Rapoport's avatar
      arc: update comment about HIGHMEM implementation · e7793e53
      Mike Rapoport authored
      Arc does not use DISCONTIGMEM to implement high memory, update the comment
      describing how high memory works to reflect this.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-3-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7793e53
    • Mike Rapoport's avatar
      alpha: remove DISCONTIGMEM and NUMA · fdb7d9b7
      Mike Rapoport authored
      Patch series "Remove DISCONTIGMEM memory model", v3.
      
      SPARSEMEM memory model was supposed to entirely replace DISCONTIGMEM a
      (long) while ago.  The last architectures that used DISCONTIGMEM were
      updated to use other memory models in v5.11 and it is about the time to
      entirely remove DISCONTIGMEM from the kernel.
      
      This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory
      model selection in mm/Kconfig and replaces usage of redundant
      CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA
      and CONFIG_FLATMEM respectively.
      
      I've also removed NUMA support on alpha that was BROKEN for more than 15
      years.
      
      There were also minor updates all over arch/ to remove mentions of
      DISCONTIGMEM in comments and #ifdefs.
      
      This patch (of 9):
      
      NUMA is marked broken on alpha for more than 15 years and DISCONTIGMEM was
      replaced with SPARSEMEM in v5.11.
      
      Remove both NUMA and DISCONTIGMEM support from alpha.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210608091316.3622-2-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fdb7d9b7
    • Mel Gorman's avatar
      mm/page_alloc: move free_the_page · 21d02f8f
      Mel Gorman authored
      Patch series "Allow high order pages to be stored on PCP", v2.
      
      The per-cpu page allocator (PCP) only handles order-0 pages.  With the
      series "Use local_lock for pcp protection and reduce stat overhead" and
      "Calculate pcp->high based on zone sizes and active CPUs", it's now
      feasible to store high-order pages on PCP lists.
      
      This small series allows PCP to store "cheap" orders where cheap is
      determined by PAGE_ALLOC_COSTLY_ORDER and THP-sized allocations.
      
      This patch (of 2):
      
      In the next page, free_compount_page is going to use the common helper
      free_the_page.  This patch moves the definition to ease review.  No
      functional change.
      
      Link: https://lkml.kernel.org/r/20210603142220.10851-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210603142220.10851-2-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21d02f8f
    • Liu Shixin's avatar
      mm/page_alloc: fix counting of managed_pages · f7ec1044
      Liu Shixin authored
      commit f6366156 ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if
      the zone is empty") clears out zone->lowmem_reserve[] if zone is empty.
      But when zone is not empty and sysctl_lowmem_reserve_ratio[i] is set to
      zero, zone_managed_pages(zone) is not counted in the managed_pages either.
      This is inconsistent with the description of lowmem_reserve, so fix it.
      
      Link: https://lkml.kernel.org/r/20210527125707.3760259-1-liushixin2@huawei.com
      Fixes: f6366156 ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty")
      Signed-off-by: default avatarLiu Shixin <liushixin2@huawei.com>
      Reported-by: default avataryangerkun <yangerkun@huawei.com>
      Reviewed-by: default avatarBaoquan He <bhe@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7ec1044
    • Dong Aisheng's avatar
    • Dong Aisheng's avatar
      mm: drop SECTION_SHIFT in code comments · 777c00f5
      Dong Aisheng authored
      Actually SECTIONS_SHIFT is used in the kernel code, so the code comments
      is strictly incorrect.  And since commit bbeae5b0 ("mm: move page
      flags layout to separate header"), SECTIONS_SHIFT definition has been
      moved to include/linux/page-flags-layout.h, since code itself looks quite
      straighforward, instead of moving the code comment into the new place as
      well, we just simply remove it.
      
      This also fixed a checkpatch complain derived from the original code:
      WARNING: please, no space before tabs
      + * SECTIONS_SHIFT    ^I^I#bits space required to store a section #$
      
      Link: https://lkml.kernel.org/r/20210531091908.1738465-2-aisheng.dong@nxp.comSigned-off-by: default avatarDong Aisheng <aisheng.dong@nxp.com>
      Suggested-by: default avatarYu Zhao <yuzhao@google.com>
      Reviewed-by: default avatarYu Zhao <yuzhao@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      777c00f5
    • Mel Gorman's avatar
      mm/page_alloc: introduce vm.percpu_pagelist_high_fraction · 74f44822
      Mel Gorman authored
      This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
      similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
      both pcp->batch and pcp->high with the higher pcp->high potentially
      reducing zone->lock contention.  However, the higher pcp->batch value also
      potentially increased allocation latency while the PCP was refilled.  This
      sysctl only adjusts pcp->high so that zone->lock contention is potentially
      reduced but allocation latency during a PCP refill remains the same.
      
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=8
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  35071
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=64
                    high:  4383
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=0
                    high:  649
                    batch: 63
      
      [mgorman@techsingularity.net: fix documentation]
        Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74f44822
    • Mel Gorman's avatar
      mm/page_alloc: limit the number of pages on PCP lists when reclaim is active · c49c2c47
      Mel Gorman authored
      When kswapd is active then direct reclaim is potentially active.  In
      either case, it is possible that a zone would be balanced if pages were
      not trapped on PCP lists.  Instead of draining remote pages, simply limit
      the size of the PCP lists while kswapd is active.
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c49c2c47
    • Mel Gorman's avatar
      mm/page_alloc: scale the number of pages that are batch freed · 3b12e7e9
      Mel Gorman authored
      When a task is freeing a large number of order-0 pages, it may acquire the
      zone->lock multiple times freeing pages in batches.  This may
      unnecessarily contend on the zone lock when freeing very large number of
      pages.  This patch adapts the size of the batch based on the recent
      pattern to scale the batch size for subsequent frees.
      
      As the machines I used were not large enough to test this are not large
      enough to illustrate a problem, a debugging patch shows patterns like the
      following (slightly editted for clarity)
      
      Baseline vanilla kernel
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
      
      With patches
        time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b12e7e9
    • Mel Gorman's avatar
      mm/page_alloc: adjust pcp->high after CPU hotplug events · 04f8cfea
      Mel Gorman authored
      The PCP high watermark is based on the number of online CPUs so the
      watermarks must be adjusted during CPU hotplug.  At the time of
      hot-remove, the number of online CPUs is already adjusted but during
      hot-add, a delta needs to be applied to update PCP to the correct value.
      After this patch is applied, the high watermarks are adjusted correctly.
      
        # grep high: /proc/zoneinfo  | tail -1
                    high:  649
        # echo 0 > /sys/devices/system/cpu/cpu4/online
        # grep high: /proc/zoneinfo  | tail -1
                    high:  664
        # echo 1 > /sys/devices/system/cpu/cpu4/online
        # grep high: /proc/zoneinfo  | tail -1
                    high:  649
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-4-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04f8cfea
    • Mel Gorman's avatar
      mm/page_alloc: disassociate the pcp->high from pcp->batch · b92ca18e
      Mel Gorman authored
      The pcp high watermark is based on the batch size but there is no
      relationship between them other than it is convenient to use early in
      boot.
      
      This patch takes the first step and bases pcp->high on the zone low
      watermark split across the number of CPUs local to a zone while the batch
      size remains the same to avoid increasing allocation latencies.  The
      intent behind the default pcp->high is "set the number of PCP pages such
      that if they are all full that background reclaim is not started
      prematurely".
      
      Note that in this patch the pcp->high values are adjusted after memory
      hotplug events, min_free_kbytes adjustments and watermark scale factor
      adjustments but not CPU hotplug events which is handled later in the
      series.
      
      On a test KVM instance;
      
      Before grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  378
                    batch: 63
      
      After grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
      
      [mgorman@techsingularity.net:  fix __setup_per_zone_wmarks for parallel memory
      hotplug]
        Link: https://lkml.kernel.org/r/20210528105925.GN30378@techsingularity.net
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-3-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b92ca18e
    • Mel Gorman's avatar
      mm/page_alloc: delete vm.percpu_pagelist_fraction · bbbecb35
      Mel Gorman authored
      Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.
      
      The per-cpu page allocator (PCP) is meant to reduce contention on the zone
      lock but the sizing of batch and high is archaic and neither takes the
      zone size into account or the number of CPUs local to a zone.  With larger
      zones and more CPUs per node, the contention is getting worse.
      Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
      and high values means that the sysctl can reduce zone lock contention but
      also increase allocation latencies.
      
      This series disassociates pcp->high from pcp->batch and then scales
      pcp->high based on the size of the local zone with limited impact to
      reclaim and accounting for active CPUs but leaves pcp->batch static.  It
      also adapts the number of pages that can be on the pcp list based on
      recent freeing patterns.
      
      The motivation is partially to adjust to larger memory sizes but is also
      driven by the fact that large batches of page freeing via release_pages()
      often shows zone contention as a major part of the problem.  Another is a
      bug report based on an older kernel where a multi-terabyte process can
      takes several minutes to exit.  A workaround was to use
      vm.percpu_pagelist_fraction to increase the pcp->high value but testing
      indicated that a production workload could not use the same values because
      of an increase in allocation latencies.  Unfortunately, I cannot reproduce
      this test case myself as the multi-terabyte machines are in active use but
      it should alleviate the problem.
      
      The series aims to address both and partially acts as a pre-requisite.
      pcp only works with order-0 which is useless for SLUB (when using high
      orders) and THP (unconditionally).  To store high-order pages on PCP, the
      pcp->high values need to be increased first.
      
      This patch (of 6):
      
      The vm.percpu_pagelist_fraction is used to increase the batch and high
      limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
      is to reduce zone lock acquisition when allocating/freeing pages but it
      has a problem.  While it can decrease contention, it can also increase
      latency on the allocation side due to unreasonably large batch sizes.
      This leads to games where an administrator adjusts
      percpu_pagelist_fraction on the fly to work around contention and
      allocation latency problems.
      
      This series aims to alleviate the problems with zone lock contention while
      avoiding the allocation-side latency problems.  For the purposes of
      review, it's easier to remove this sysctl now and reintroduce a similar
      sysctl later in the series that deals only with pcp->high.
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbbecb35
    • Minchan Kim's avatar
      mm: page_alloc: dump migrate-failed pages only at -EBUSY · 151e084a
      Minchan Kim authored
      alloc_contig_dump_pages() aims for helping debugging page migration
      failure by elevated page refcount compared to expected_count.  (for the
      detail, please look at migrate_page_move_mapping)
      
      However, -ENOMEM is just the case that system is under memory pressure
      state, not relevant with page refcount at all.  Thus, the dumping page
      list is not helpful for the debugging point of view.
      
      Link: https://lkml.kernel.org/r/YKa2Wyo9xqIErpfa@google.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      151e084a
    • Mel Gorman's avatar
      mm/page_alloc: update PGFREE outside the zone lock in __free_pages_ok · 90249993
      Mel Gorman authored
      VM events do not need explicit protection by disabling IRQs so update the
      counter with IRQs enabled in __free_pages_ok.
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-10-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90249993
    • Mel Gorman's avatar
      mm/page_alloc: avoid conflating IRQs disabled with zone->lock · df1acc85
      Mel Gorman authored
      Historically when freeing pages, free_one_page() assumed that callers had
      IRQs disabled and the zone->lock could be acquired with spin_lock().  This
      confuses the scope of what local_lock_irq is protecting and what
      zone->lock is protecting in free_unref_page_list in particular.
      
      This patch uses spin_lock_irqsave() for the zone->lock in free_one_page()
      instead of relying on callers to have disabled IRQs.
      free_unref_page_commit() is changed to only deal with PCP pages protected
      by the local lock.  free_unref_page_list() then first frees isolated pages
      to the buddy lists with free_one_page() and frees the rest of the pages to
      the PCP via free_unref_page_commit().  The end result is that
      free_one_page() is no longer depending on side-effects of local_lock to be
      correct.
      
      Note that this may incur a performance penalty while memory hot-remove is
      running but that is not a common operation.
      
      [lkp@intel.com: Ensure CMA pages get addded to correct pcp list]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-9-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df1acc85