1. 07 Nov, 2015 2 commits
    • mm, page_alloc: remove unnecessary parameter from zone_watermark_ok_safe · e2b19197
      Mel Gorman authored
      Overall, the intent of this series is to remove the zonelist cache which
      was introduced to avoid high overhead in the page allocator.  Once this is
      done, it is necessary to reduce the cost of watermark checks.
      
      The series starts with minor micro-optimisations.
      
      Next it notes that GFP flags that affect watermark checks are abused.
      The absence of __GFP_WAIT historically identified callers that could not
      sleep and could access reserves.  This was later abused to identify
      callers that simply prefer to avoid sleeping and have other options.  A
      patch distinguishes between atomic callers, high-priority callers and
      those that simply wish to avoid sleep.
      
      The zonelist cache has been around for a long time but it is of dubious
      merit with a lot of complexity and some issues that are explained.  The
      most important issue is that a failed THP allocation can cause a zone to
      be treated as "full".  This potentially causes unnecessary stalls, reclaim
      activity or remote fallbacks.  The issues could be fixed but it's not
      worth it.  The series places a small number of other micro-optimisations
      on top before examining GFP flags watermarks.
      
      High-order watermarks enforcement can cause high-order allocations to fail
      even though pages are free.  The watermark checks both protect high-order
      atomic allocations and make kswapd aware of high-order pages, but both
      goals can be handled much better using migrate types.  This series
      uses page grouping by mobility to reserve pageblocks for high-order
      allocations with the size of the reservation depending on demand.  kswapd
      awareness is maintained by examining the free lists.  By patch 12 in this
      series, there are no high-order watermark checks while preserving the
      properties that motivated the introduction of the watermark checks.
      
      This patch (of 10):
      
      No user of zone_watermark_ok_safe() specifies alloc_flags.  This patch
      removes the unnecessary parameter.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/oom_kill.c: introduce is_sysrq_oom helper · db2a0dd7
      Yaowei Bai authored
      Introduce is_sysrq_oom helper function indicating oom kill triggered
      by sysrq to improve readability.
      
      No functional changes.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Nov, 2015 38 commits
    • Merge tag 'backlight-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight · 5bc23a0c
      Linus Torvalds authored
      Pull backlight updates from Lee Jones:
       "New Device Support:
         - None
      
        New Functionality:
         - None
      
        Core Frameworks:
         - Reject legacy PWM request for device defined in DT
      
        Fix-ups:
         - Remove unnecessary MODULE_ALIAS(); adp8860_bl, adp8870_bl
         - Simplify code: pm8941-wled
         - Supply default-brightness logic; pm8941-wled
      
        Bug Fixes:
         - Clean up OF node; 88pm860x_bl
         - Ensure struct is zeroed; lp855x_bl"
      
      * tag 'backlight-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
        backlight: pm8941-wled: Add default-brightness property
        backlight: pm8941-wled: Fix ptr_ret.cocci warnings
        backlight: pwm: Reject legacy PWM request for device defined in DT
        backlight: 88pm860x_bl: Add missing of_node_put
        backlight: adp8870: Remove unnecessary MODULE_ALIAS()
        backlight: adp8860: Remove unnecessary MODULE_ALIAS()
        backlight: lp855x: Make sure props struct is zeroed
    • mfd: avoid newly introduced compiler warning · 4dcee4d8
      Linus Torvalds authored
      Commit b158b69a ("mfd: rtsx: Simplify function return logic")
      removed the use of the 'err' variable, but left the variable itself
      around, resulting in gcc quite reasonably warning:
      
          drivers/mfd/rtsx_pcr.c: In function ‘rtsx_pci_set_pull_ctl’:
          drivers/mfd/rtsx_pcr.c:565:6: warning: unused variable ‘err’ [-Wunused-variable]
            int err;
                ^
      
      Get rid of the unused variable, and avoid the new warning.
      
      Cc: Javier Martinez Canillas <javier@osg.samsung.com>
      Cc: Lee Jones <lee.jones@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'mfd-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · bc914532
      Linus Torvalds authored
      Pull MFD updates from Lee Jones:
       "New Device Support:
         - Add support for 88pm860; 88pm80x
         - Add support for 24c08 EEPROM; at24
         - Add support for Broxton Whiskey Cove; intel*
         - Add support for RTS522A; rts5227
         - Add support for I2C devices; intel_quark_i2c_gpio
      
        New Functionality:
         - Add microphone support; arizona
         - Add general purpose switch support; arizona
         - Add fuel-gauge support; da9150-core
         - Add shutdown support; sec-core
         - Add charger support; tps65217
         - Add flexible serial communication unit support; atmel-flexcom
         - Add power button support; axp20x
         - Add led-flash support; rt5033
      
        Core Frameworks:
         - Supply a generic macro for defining Regmap IRQs
         - Rework ACPI child device matching
      
        Fix-ups:
         - Use Regmap to access registers; tps6105x
         - Use DEFINE_RES_IRQ_NAMED() macro; da9150
         - Re-arrange device registration order; intel_quark_i2c_gpio
         - Allow OF matching; cros_ec_i2c, atmel-hlcdc, hi6421-pmic, max8997, sm501
         - Handle deferred probe; twl6040
         - Improve accuracy of headphone detect; arizona
         - Unnecessary MODULE_ALIAS() removal; bcm590xx, rt5033
         - Remove unused code; htc-i2cpld, arizona, pcf50633-irq, sec-core
         - Simplify code; kempld, rts5209, da903x, lm3533, da9052, arizona
         - Remove #iffery; arizona
         - DT binding adaptions; many
      
        Bug Fixes:
         - Fix possible NULL pointer dereference; wm831x, tps6105x
         - Fix 64bit bug; intel_soc_pmic_bxtwc
         - Fix signedness issue; arizona"
      
      * tag 'mfd-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
        bindings: mfd: s2mps11: Add documentation for s2mps15 PMIC
        mfd: sec-core: Remove unused s2mpu02-rtc and s2mpu02-clk children
        extcon: arizona: Add extcon specific device tree binding document
        MAINTAINERS: Add binding docs for Cirrus Logic/Wolfson Arizona devices
        mfd: arizona: Remove bindings covered in new subsystem specific docs
        mfd: rt5033: Add RT5033 Flash led sub device
        mfd: lpss: Add Intel Broxton PCI IDs
        mfd: lpss: Add Broxton ACPI IDs
        mfd: arizona: Signedness bug in arizona_runtime_suspend()
        mfd: axp20x: Add a cell for the power button part of the, axp288 PMICs
        mfd: dt-bindings: Document pulled down WRSTBI pin on S2MPS1X
        mfd: sec-core: Disable buck voltage reset on watchdog falling edge
        mfd: sec-core: Dump PMIC revision to find out the HW
        mfd: arizona: Use correct type ID for device tree config
        mfd: arizona: Remove use of codec build config #ifdefs
        mfd: arizona: Simplify adding subdevices
        mfd: arizona: Downgrade type mismatch messages to dev_warn
        mfd: arizona: Factor out checking of jack detection state
        mfd: arizona: Factor out DCVDD isolation control
        mfd: Make TPS6105X select REGMAP_I2C
        ...
    • x86: don't make DEBUG_WX default to 'y' even with DEBUG_RODATA · 54727e6e
      Linus Torvalds authored
      It turns out that we still have issues with the EFI memory map that ends
      up polluting our kernel page tables with writable executable pages.
      
      That will get sorted out, but in the meantime let's not make the scary
      complaint about them be on by default.  The code is useful for
      developers, but not ready for end user testing yet.
      Acked-by: Borislav Petkov <bp@alien8.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'platform-drivers-x86-v4.4-1' of... · d1e41ff1
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v4.4-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver update from Darren Hart:
       "Various toshiba hotkey and keyboard related fixes and a new WMI
        driver.  Several intel_scu_ipc cleanups and a locking fix.  A
        spattering of small single fixes across various platforms.
      
        I was asked to pick up an OLPC cleanup as the driver appeared
        unmaintained and it seemed similar to what is maintained in
        platform/drivers/x86.  I have included the patch and an update to the
        MAINTAINERS file.
      
        toshiba_acpi:
         - Initialize hotkey_event_type variable
         - Remove unneeded u32 variables from *setup_keyboard
         - Add 0x prefix to available_kbd_modes_show function
         - Change default Hotkey enabling value
         - Unify hotkey enabling functions
      
        toshiba-wmi:
         - Toshiba WMI Hotkey Driver
      
        intel_scu_ipc:
         - Protect dev member assignment on ->remove()
         - Switch to use module_pci_driver() macro
         - Convert to use struct device *
         - Propagate pointer to struct intel_scu_ipc_dev
         - Fix error path by turning to devm_* / pcim_*
      
        acer-wmi:
         - remove threeg and interface sysfs interfaces
      
        OLPC:
         - Use %*ph specifier instead of passing direct values
      
        MAINTAINERS:
         - Add drivers/platform/olpc to drivers/platform/x86
      
        sony-laptop:
         - Fix handling sony_nc_hotkeys_decode result
      
        intel_mid_powerbtn:
         - Remove misuse of IRQF_NO_SUSPEND flag
      
        compal-laptop:
         - Add charge control limit
      
        asus-wmi:
         - restore kbd led level after resume"
      
      * tag 'platform-drivers-x86-v4.4-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
        toshiba_acpi: Initialize hotkey_event_type variable
        intel_scu_ipc: Protect dev member assignment on ->remove()
        intel_scu_ipc: Switch to use module_pci_driver() macro
        intel_scu_ipc: Convert to use struct device *
        intel_scu_ipc: Propagate pointer to struct intel_scu_ipc_dev
        intel_scu_ipc: Fix error path by turning to devm_* / pcim_*
        acer-wmi: remove threeg and interface sysfs interfaces
        OLPC: Use %*ph specifier instead of passing direct values
        MAINTAINERS: Add drivers/platform/olpc to drivers/platform/x86
        platform/x86: Toshiba WMI Hotkey Driver
        sony-laptop: Fix handling sony_nc_hotkeys_decode result
        intel_mid_powerbtn: Remove misuse of IRQF_NO_SUSPEND flag
        compal-laptop: Add charge control limit
        asus-wmi: restore kbd led level after resume
        toshiba_acpi: Remove unneeded u32 variables from *setup_keyboard
        toshiba_acpi: Add 0x prefix to available_kbd_modes_show function
        toshiba_acpi: Change default Hotkey enabling value
        toshiba_acpi: Unify hotkey enabling functions
    • Merge tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 2f4bf528
      Linus Torvalds authored
      Pull powerpc updates from Michael Ellerman:
      
       - Kconfig: remove BE-only platforms from LE kernel build from Boqun
         Feng
       - Refresh ps3_defconfig from Geoff Levand
       - Emit GNU & SysV hashes for the vdso from Michael Ellerman
       - Define an enum for the bolted SLB indexes from Anshuman Khandual
       - Use a local to avoid multiple calls to get_slb_shadow() from Michael
         Ellerman
       - Add gettimeofday() benchmark from Michael Neuling
       - Avoid link stack corruption in __get_datapage() from Michael Neuling
       - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
         K.V
       - Add ppc64le_defconfig from Michael Ellerman
       - pseries: extract of_helpers module from Andy Shevchenko
       - Correct string length in pseries_of_derive_parent() from Nathan
         Fontenot
       - Free the MSI bitmap if it was slab allocated from Denis Kirjanov
       - Shorten irq_chip name for the SIU from Christophe Leroy
       - Wait 1s for secondaries to enter OPAL during kexec from Samuel
         Mendoza-Jonas
       - Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
       - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
         Colin Ian King
       - Disable hugepd for 64K page size, from Aneesh Kumar K.V
       - Differentiate between hugetlb and THP during page walk from Aneesh
         Kumar K.V
       - Make PCI non-optional for pseries from Michael Ellerman
       - Individual System V IPC system calls from Sam Bobroff
       - Add selftest of unmuxed IPC calls from Michael Ellerman
       - discard .exit.data at runtime from Stephen Rothwell
       - Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
         Gortmaker
       - Use of_get_next_parent to simplify code from Christophe Jaillet
       - Paginate some xmon output from Sam Bobroff
       - Add some more elements to the xmon PACA dump from Michael Ellerman
       - Allow the tm-syscall selftest to build with old headers from Michael
         Ellerman
       - Run EBB selftests only on POWER8 from Denis Kirjanov
       - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
         Ellerman
       - Avoid reference to potentially freed memory in prom.c from Christophe
         Jaillet
       - Quieten boot wrapper output with run_cmd from Geoff Levand
       - EEH fixes and cleanups from Gavin Shan
       - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
       - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
         Ellerman
       - Fix section mismatch warning in msi_bitmap_alloc() from Denis
         Kirjanov
       - Fix ps3-lpm white space from Rudhresh Kumar J
       - Fix ps3-vuart null dereference from Colin King
       - nvram: Add missing kfree in error path from Christophe Jaillet
       - nvram: Fix function name in some error messages, from Christophe
         Jaillet
       - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
         Koskinen
       - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
       - cxl: Free virtual PHB when removing from Andrew Donnellan
       - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
         Michael Ellerman
       - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
         with O= from Michael Ellerman
       - Freescale updates from Scott: Highlights include 64-bit book3e
         kexec/kdump support, a rework of the qoriq clock driver, device tree
         changes including qoriq fman nodes, support for a new 85xx board, and
         some fixes.
       - MPC5xxx updates from Anatolij: Highlights include a driver for
         MPC512x LocalPlus Bus FIFO with its device tree binding
         documentation, mpc512x device tree updates and some minor fixes.
      
      * tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
        powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
        powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
        powerpc/pseries: Correct string length in pseries_of_derive_parent()
        powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
        powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
        powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
        powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
        powerpc: handle error case in cpm_muram_alloc()
        powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
        powerpc/book3e-64: Enable kexec
        powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
        powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
        powerpc/book3e-64/kexec: Enable SMP release
        powerpc/book3e-64/kexec: create an identity TLB mapping
        powerpc/book3e-64: Don't limit paca to 256 MiB
        powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
        powerpc/book3e: support CONFIG_RELOCATABLE
        powerpc/booke64: Fix args to copy_and_flush
        powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
        powerpc/e6500: kexec: Handle hardware threads
        ...
    • Merge branch 'akpm' (patches from Andrew) · 2e3078af
      Linus Torvalds authored
      Merge patch-bomb from Andrew Morton:
      
       - inotify tweaks
      
       - some ocfs2 updates (many more are awaiting review)
      
       - various misc bits
      
       - kernel/watchdog.c updates
      
       - Some of mm.  I have a huge number of MM patches this time and quite a
         lot of it is quite difficult and much will be held over to next time.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
        selftests: vm: add tests for lock on fault
        mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
        mm: introduce VM_LOCKONFAULT
        mm: mlock: add new mlock system call
        mm: mlock: refactor mlock, munlock, and munlockall code
        kasan: always taint kernel on report
        mm, slub, kasan: enable user tracking by default with KASAN=y
        kasan: use IS_ALIGNED in memory_is_poisoned_8()
        kasan: Fix a type conversion error
        lib: test_kasan: add some testcases
        kasan: update reference to kasan prototype repo
        kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
        kasan: various fixes in documentation
        kasan: update log messages
        kasan: accurately determine the type of the bad access
        kasan: update reported bug types for kernel memory accesses
        kasan: update reported bug types for not user nor kernel memory accesses
        mm/kasan: prevent deadlock in kasan reporting
        mm/kasan: don't use kasan shadow pointer in generic functions
        mm/kasan: MODULE_VADDR is not available on all archs
        ...
    • vfs: clear remainder of 'full_fds_bits' in dup_fd() · ea5c58e7
      Eric Biggers authored
      This fixes a bug from commit f3f86e33 ("vfs: Fix pathological
      performance case for __alloc_fd()").
      
      v2: refactor to share fd bitmap copying code
      Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests: vm: add tests for lock on fault · b3b0d09c
      Eric B Munson authored
      Test the mmap() flag, and the mlockall() flag.  These tests ensure that
      pages are not faulted in until they are accessed, that the pages are
      unevictable once faulted in, and that VMA splitting and merging works with
      the new VM flag.  The second test ensures that mlock limits are respected.
      Note that the limit test needs to be run as a normal user.
      
      Also add tests to use the new mlock2 family of system calls.
      
      [treding@nvidia.com: Fix mlock2-tests for 32-bit architectures]
      [treding@nvidia.com: ensure the mlock2 syscall number can be found]
      [treding@nvidia.com: use the right arguments for main()]
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage · b0f205c2
      Eric B Munson authored
      The previous patch introduced a flag that specified pages in a VMA should
      be placed on the unevictable LRU, but they should not be made present when
      the area is created.  This patch adds the ability to set this state via
      the new mlock system calls.
      
      We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
      MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
      MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
      When used with MCL_CURRENT, all current mappings will be marked with
      VM_LOCKED | VM_LOCKONFAULT.  When used with MCL_FUTURE, the mm->def_flags
      will be marked with VM_LOCKED | VM_LOCKONFAULT.  When used with both
      MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
      marked with VM_LOCKED | VM_LOCKONFAULT.
      
      Prior to this patch, mlockall() will unconditionally clear the
      mm->def_flags any time it is called without MCL_FUTURE.  This behavior is
      maintained after adding MCL_ONFAULT.  If a call to mlockall(MCL_FUTURE) is
      followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
      new VMAs will be unlocked.  This remains true with or without MCL_ONFAULT
      in either mlockall() invocation.
      
      munlock() will unconditionally clear both VMA flags.  munlockall()
      unconditionally clears both VMA flags on all VMAs and in the mm->def_flags
      field.
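
      The mlockall() flag combinations described above can be summarised in a
      small user-space model.  This is only a sketch of the rules as stated in
      this message, not the kernel implementation: the SK_* constants, struct
      sk_mlockall_result and sk_mlockall() are hypothetical names, and the bit
      values are made up for illustration.

```c
#include <assert.h>

/* Illustrative bit values only; the real constants live in kernel headers. */
enum {
    SK_MCL_CURRENT = 1 << 0,
    SK_MCL_FUTURE  = 1 << 1,
    SK_MCL_ONFAULT = 1 << 2,
};

enum {
    SK_VM_LOCKED      = 1 << 0,
    SK_VM_LOCKONFAULT = 1 << 1,
};

struct sk_mlockall_result {
    unsigned int current_vma_flags; /* flags applied to every existing mapping */
    unsigned int def_flags;         /* value left in the modelled mm->def_flags */
};

/* Model of the effect of mlockall(flags) as described in the commit message. */
static struct sk_mlockall_result sk_mlockall(unsigned int flags)
{
    struct sk_mlockall_result r = { 0, 0 };
    unsigned int lock = SK_VM_LOCKED;

    /* Only the three modelled flags are accepted. */
    assert((flags & ~(SK_MCL_CURRENT | SK_MCL_FUTURE | SK_MCL_ONFAULT)) == 0);
    /* MCL_ONFAULT is a modifier, so one of the other two flags must be set. */
    assert(flags & (SK_MCL_CURRENT | SK_MCL_FUTURE));

    if (flags & SK_MCL_ONFAULT)
        lock |= SK_VM_LOCKONFAULT;

    if (flags & SK_MCL_CURRENT)
        r.current_vma_flags = lock;

    /* def_flags stays cleared whenever MCL_FUTURE is absent. */
    if (flags & SK_MCL_FUTURE)
        r.def_flags = lock;

    return r;
}
```

      In this model, sk_mlockall(SK_MCL_CURRENT | SK_MCL_ONFAULT) marks existing
      mappings with both flags and leaves def_flags at zero, which mirrors the
      rule that def_flags is cleared whenever MCL_FUTURE is not passed.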
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce VM_LOCKONFAULT · de60f5f1
      Eric B Munson authored
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be used
      this can incur a high penalty for locking.
      
      For the example of a large file, this is the usage pattern for a large
      statistical language model (probably applies to other statistical or
      graphical models as well).  For the security example, any application
      transacting in data that cannot be swapped out (credit card data, medical
      records, etc).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control if the VMA should be populated and in the case of
      VM_LOCKONFAULT, it will not be set.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mlock: add new mlock system call · a8ca5d0e
      Eric B Munson authored
      With the refactored mlock code, introduce a new system call for mlock.
      The new call will allow the user to specify what lock states are being
      added.  mlock2 is trivial at the moment, but a follow on patch will add a
      new mlock state making it useful.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mlock: refactor mlock, munlock, and munlockall code · 1aab92ec
      Eric B Munson authored
      mlock() allows a user to control page out of program memory, but this
      comes at the cost of faulting in the entire mapping when it is allocated.
      For large mappings where the entire area is not necessary this is not
      ideal.  Instead of forcing all locked pages to be present when they are
      allocated, this set creates a middle ground.  Pages are marked to be
      placed on the unevictable LRU (locked) when they are first used, but they
      are not faulted in by the mlock call.
      
      This series introduces a new mlock() system call that takes a flags
      argument along with the start address and size.  This flags argument gives
      the caller the ability to request memory be locked in the traditional way,
      or to be locked after the page is faulted in.  A new MCL flag is added to
      mirror the lock on fault behavior from mlock() in mlockall().
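
      From user space, a lock-on-fault request might look like the sketch
      below.  It assumes the flags-taking call is reached through the raw
      mlock2 syscall, since a libc wrapper may not be available, and that
      MLOCK_ONFAULT has the value 0x01 from the Linux uapi headers when libc
      does not define it.  lock_on_fault() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MLOCK_ONFAULT
#define MLOCK_ONFAULT 0x01 /* assumed fallback: value from the Linux uapi headers */
#endif

/*
 * Ask the kernel to lock [addr, addr + len) without pre-faulting it: pages
 * become unevictable only once they are first touched.
 * Returns 0 on success, or -1 with errno set on failure.
 */
static int lock_on_fault(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
#ifdef SYS_mlock2
    return (int)syscall(SYS_mlock2, addr, len, MLOCK_ONFAULT);
#else
    /* Headers too old to know the syscall number: report "not supported". */
    errno = ENOSYS;
    return -1;
#endif
}
```

      A caller would typically mmap() the region first and then pass the
      mapping's address and length to lock_on_fault().  The call can still fail,
      for example with ENOMEM or EPERM when RLIMIT_MEMLOCK is exceeded, or with
      ENOSYS on kernels without the syscall, so the return value needs checking.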
      
      There are two main use cases that this set covers.  The first is the
      security focussed mlock case.  A buffer is needed that cannot be written
      to swap.  The maximum size is known, but on average the memory used is
      significantly less than this maximum.  With lock on fault, the buffer is
      guaranteed to never be paged out without consuming the maximum size every
      time such a buffer is created.
      
      The second use case is focussed on performance.  Portions of a large file
      are needed and we want to keep the used portions in memory once accessed.
      This is the case for large graphical models where the path through the
      graph is not known until run time.  The entire graph is unlikely to be
      used in a given invocation, but once a node has been used it needs to stay
      resident for further processing.  Given these constraints we have a number
      of options.  We can potentially waste a large amount of memory by mlocking
      the entire region (this can also cause a significant stall at startup as
      the entire file is read in).  We can mlock every page as we access them
      without tracking if the page is already resident but this introduces large
      overhead for each access.  The third option is mapping the entire region
      with PROT_NONE and using a signal handler for SIGSEGV to
      mprotect(PROT_READ) and mlock() the needed page.  Doing this page at a
      time adds a significant performance penalty.  Batching can be used to
      mitigate this overhead, but in order to safely avoid trying to mprotect
      pages outside of the mapping, the boundaries of each mapping to be used in
      this way must be tracked and available to the signal handler.  This is
      precisely what the mm system in the kernel should already be doing.
      
      For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if
      mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, so when the VMA is
      created not when the pages are faulted in.  For mlockall(MCL_ONFAULT) the
      user is charged as if MCL_FUTURE was used.  This decision was made to keep
      the accounting checks out of the page fault path.
      
      To illustrate the benefit of this set I wrote a test program that mmaps a
      5 GB file filled with random data and then makes 15,000,000 accesses to
      random addresses in that mapping.  The test program was run 20 times for
      each setup.  Results are reported for two program portions, setup and
      execution.  The setup phase is calling mmap and optionally mlock on the
      entire region.  For most experiments this is trivial, but it highlights
      the cost of faulting in the entire region.  Results are averages across
      the 20 runs in milliseconds.
      
      mmap with mlock(MLOCK_LOCKED) on entire range:
      Setup avg:      8228.666
      Processing avg: 8274.257
      
      mmap with mlock(MLOCK_LOCKED) before each access:
      Setup avg:      0.113
      Processing avg: 90993.552
      
      mmap with PROT_NONE and signal handler and batch size of 1 page:
      With the default value in max_map_count, this gets ENOMEM as I attempt
      to change the permissions, after upping the sysctl significantly I get:
      Setup avg:      0.058
      Processing avg: 69488.073
      
      mmap with PROT_NONE and signal handler and batch size of 8 pages:
      Setup avg:      0.068
      Processing avg: 38204.116
      
      mmap with PROT_NONE and signal handler and batch size of 16 pages:
      Setup avg:      0.044
      Processing avg: 29671.180
      
      mmap with mlock(MLOCK_ONFAULT) on entire range:
      Setup avg:      0.189
      Processing avg: 17904.899
      
      The signal handler in the batch cases faulted in memory in two steps to
      avoid having to know the start and end of the faulting mapping.  The first
      step covers the page that caused the fault as we know that it will be
      possible to lock.  The second step speculatively tries to mlock and
      mprotect the batch size - 1 pages that follow.  There may be a clever way
      to avoid this without having the program track each mapping to be covered
      by this handler in a globally accessible structure, but I could not find
      it.  It should be noted that with a large enough batch size this two step
      fault handler can still cause the program to crash if it reaches far
      beyond the end of the mapping.
      
      These results show that if the developer knows that a majority of the
      mapping will be used, it is better to try and fault it in at once,
      otherwise mlock(MLOCK_ONFAULT) is significantly faster.
      
      The performance cost of these patches is minimal on the two benchmarks I
      have tested (stream and kernbench).  The following are the average values
      across 20 runs of stream and 10 runs of kernbench after a warmup run whose
      results were discarded.
      
      Avg throughput in MB/s from stream using 1000000 element arrays
      Test     4.2-rc1      4.2-rc1+lock-on-fault
      Copy:    10,566.5     10,421
      Scale:   10,685       10,503.5
      Add:     12,044.1     11,814.2
      Triad:   12,064.8     11,846.3
      
      Kernbench optimal load
                       4.2-rc1  4.2-rc1+lock-on-fault
      Elapsed Time     78.453   78.991
      User Time        64.2395  65.2355
      System Time      9.7335   9.7085
      Context Switches 22211.5  22412.1
      Sleeps           14965.3  14956.1
      
      This patch (of 6):
      
      Extending the mlock system call is very difficult because it currently
      does not take a flags argument.  A later patch in this set will extend
      mlock to support a middle ground between pages that are locked and faulted
      in immediately and unlocked pages.  To pave the way for the new system
      call, the code needs some reorganization so that all the actual entry
      point handles is checking input and translating to VMA flags.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1aab92ec
    • Andrey Ryabinin's avatar
      kasan: always taint kernel on report · eb06f43f
      Andrey Ryabinin authored
      Currently we already taint the kernel in some cases.  E.g. if we hit some
      bug in slub memory we call object_err() which will taint the kernel with
      the TAINT_BAD_PAGE flag.  But for other kinds of bugs the kernel is left
      untainted.
      
      Always taint with TAINT_BAD_PAGE if kasan found some bug.  This is useful
      for automated testing.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb06f43f
    • Andrey Ryabinin's avatar
      mm, slub, kasan: enable user tracking by default with KASAN=y · 89d3c87e
      Andrey Ryabinin authored
      It's recommended to have slub's user tracking enabled with CONFIG_KASAN,
      because:
      
      a) User tracking disables slab merging which improves
          detecting out-of-bounds accesses.
      b) User tracking metadata acts as redzone which also improves
          detecting out-of-bounds accesses.
      c) User tracking provides additional information about object.
          This information helps to understand bugs.
      
      Currently it is not enabled by default.  Besides recompiling the kernel
      with KASAN and reinstalling it, the user also has to change the boot
      cmdline, which is not very handy.
      
      Enable slub user tracking by default with KASAN=y, since there is no good
      reason to not do this.
      
      [akpm@linux-foundation.org: little fixes, per David]
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89d3c87e
    • Xishi Qiu's avatar
      kasan: use IS_ALIGNED in memory_is_poisoned_8() · 10f70262
      Xishi Qiu authored
      Use IS_ALIGNED() to determine whether the shadow spans two bytes.  It
      generates less code and is more readable.  Also add some comments in the
      shadow check functions.
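
      The alignment test can be illustrated outside the kernel.  This sketch
      assumes one shadow byte per 8 bytes of memory and uses a local macro
      (IS_ALIGNED_TO) and helper names that are hypothetical stand-ins, not the
      kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SHADOW_SCALE_SIZE 8UL   /* assumed: 8 bytes of memory per shadow byte */

/* Local stand-in for an alignment check on a power-of-two boundary. */
#define IS_ALIGNED_TO(x, a) ((((uintptr_t)(x)) & ((uintptr_t)(a) - 1)) == 0)

/* Index of the shadow byte that describes the given address. */
uintptr_t shadow_index(uintptr_t addr)
{
    return addr / SHADOW_SCALE_SIZE;
}

/*
 * An 8-byte access is described by a single shadow byte only when it starts
 * on an 8-byte boundary; otherwise its first and last bytes fall under two
 * different shadow bytes.
 */
bool access8_spans_two_shadow_bytes(uintptr_t addr)
{
    return !IS_ALIGNED_TO(addr, SHADOW_SCALE_SIZE);
}
```

      For example, an 8-byte access at 0x1000 stays within one shadow byte,
      while the same access at 0x1001 touches two.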
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10f70262
    • Wang Long's avatar
      kasan: Fix a type conversion error · e0d57714
      Wang Long authored
      The current KASAN code cannot find the following out-of-bounds bug:
      
              char *ptr;
              ptr = kmalloc(8, GFP_KERNEL);
              memset(ptr+7, 0, 2);
      
      The cause of the problem is a type conversion error in the
      memory_is_poisoned_n() function.  This patch fixes that.
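
      The commit message does not show the faulty expression, so the following
      is only a sketch of how a signed/unsigned conversion can hide this bug.
      It assumes shadow bytes are signed 8-bit values where a negative value
      marks a redzone, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SHADOW_MASK 7UL   /* assumed: 8 bytes of memory per shadow byte */

/*
 * Problematic shape: the signed shadow byte is converted to unsigned long,
 * so a negative poison value becomes a huge number and the offset of the
 * last accessed byte never reaches it.
 */
bool last_byte_poisoned_unsigned(unsigned long last_byte, int8_t last_shadow)
{
    return (last_byte & SHADOW_MASK) >= (unsigned long)last_shadow;
}

/* Corrected shape: compare both sides as signed values. */
bool last_byte_poisoned_signed(unsigned long last_byte, int8_t last_shadow)
{
    return (long)(last_byte & SHADOW_MASK) >= last_shadow;
}
```

      With the example above, the second byte written by memset() is the first
      byte of the next 8-byte granule.  If that granule's shadow byte is a
      negative redzone marker, only the signed comparison reports it.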
      Signed-off-by: Wang Long <long.wanglong@huawei.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0d57714
    • Wang Long's avatar
      lib: test_kasan: add some testcases · f523e737
      Wang Long authored
      Add some out-of-bounds testcases to the test_kasan module.
      Signed-off-by: Wang Long <long.wanglong@huawei.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f523e737
    • Andrey Konovalov's avatar
      kasan: update reference to kasan prototype repo · 5d0926ef
      Andrey Konovalov authored
      Update the reference to the kasan prototype repository on github, since it
      was renamed.
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d0926ef
    • Andrey Konovalov's avatar
      kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile · c63f06dd
      Andrey Konovalov authored
      Move KASAN_SANITIZE in arch/x86/boot/Makefile above the comment
      related to SVGA_MODE, since the comment refers to 'the next line'.
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c63f06dd
    • Andrey Konovalov's avatar
      kasan: various fixes in documentation · 0295fd5d
      Andrey Konovalov authored
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0295fd5d
    • Andrey Konovalov's avatar
      kasan: update log messages · 25add7ec
      Andrey Konovalov authored
      We decided to use KASAN as the short name of the tool and
      KernelAddressSanitizer as the full one.  Update log messages according to
      that.
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25add7ec
    • Andrey Konovalov's avatar
      kasan: accurately determine the type of the bad access · cdf6a273
      Andrey Konovalov authored
      Make KASAN accurately determine the type of the bad access.  If the shadow
      byte value is in the [0, KASAN_SHADOW_SCALE_SIZE) range we can look at
      the next shadow byte to determine the type of the access.
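
      A rough sketch of that rule, assuming signed shadow bytes where values
      in [0, 8) mean that the granule is fully or partially accessible and
      negative values encode the kind of poisoning.  The helper name and the
      scale constant are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SHADOW_SCALE_SIZE 8   /* assumed: 8 bytes of memory per shadow byte */

/*
 * Pick the shadow byte that should decide the reported bug type.  A value in
 * [0, SHADOW_SCALE_SIZE) only says how many bytes of the granule are
 * accessible, so the kind of poisoning has to be read from the next shadow
 * byte.  The caller must guarantee that shadow[1] is readable.
 */
int8_t shadow_byte_for_bug_type(const int8_t *shadow)
{
    assert(shadow != 0);
    if (*shadow >= 0 && *shadow < SHADOW_SCALE_SIZE)
        return shadow[1];
    return *shadow;
}
```

      For a partially accessible granule followed by a redzone, this returns
      the redzone marker rather than the byte count.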
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdf6a273
    • Andrey Konovalov's avatar
      kasan: update reported bug types for kernel memory accesses · 0952d87f
      Andrey Konovalov authored
      Update the names of the bad access types to better reflect the type of
      the access that happened and make these error types "literals" that can
      be used for classification and deduplication in scripts.
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0952d87f
    • Andrey Konovalov's avatar
      kasan: update reported bug types for not user nor kernel memory accesses · e9121076
      Andrey Konovalov authored
      Each access with address lower than
      kasan_shadow_to_mem(KASAN_SHADOW_START) is reported as user-memory-access.
      This is not always true: the accessed address might not be in user space.
      Fix this by reporting such accesses as null-ptr-derefs or
      wild-memory-accesses.
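
      One way to picture the new classification is the sketch below.  It is
      meant only for addresses that are already known to lie below the memory
      covered by the shadow region.  The two thresholds are passed in as
      parameters because the real limits are architecture specific, and the
      function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Classify an access to an address that has no shadow memory behind it.
 * page_size and task_size are caller-supplied assumptions: addresses in the
 * first page are treated as NULL pointer dereferences, addresses below
 * task_size as user memory, and everything else as a wild access.
 */
const char *bad_access_type(uintptr_t addr, uintptr_t page_size,
                            uintptr_t task_size)
{
    assert(page_size <= task_size);
    if (addr < page_size)
        return "null-ptr-deref";
    if (addr < task_size)
        return "user-memory-access";
    return "wild-memory-access";
}
```

      Returning fixed strings keeps the error types usable as literals for
      classification and deduplication in scripts.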
      
      There's another reason for this change.  For userspace ASan we have a
      bunch of systems that analyze error types for the purpose of
      classification and deduplication.  Sooner or later we will write them for
      KASAN as well.  Then clearly and explicitly stated error types will bring
      value.
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9121076
    • Aneesh Kumar K.V's avatar
      mm/kasan: prevent deadlock in kasan reporting · fc5aeeaf
      Aneesh Kumar K.V authored
      When we end up calling kasan_report in real mode, our shadow mapping for
      the spinlock variable will show poisoned.  This will result in us calling
      kasan_report_error with lock_report spin lock held.  To prevent this,
      disable kasan reporting when we are printing an error w.r.t. kasan.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc5aeeaf
    • Aneesh Kumar K.V's avatar
      mm/kasan: don't use kasan shadow pointer in generic functions · f2377d4e
      Aneesh Kumar K.V authored
      We can't use generic functions like print_hex_dump to access the kasan
      shadow region.  This requires us to set up another kasan shadow region for
      the address passed (kasan shadow address).  Some architectures won't be
      able to do that.  Hence make a copy of the shadow region row and pass that
      to generic functions.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2377d4e
    • Aneesh Kumar K.V's avatar
      mm/kasan: rename kasan_enabled() to kasan_report_enabled() · 0ba8663c
      Aneesh Kumar K.V authored
      The function only disables/enables reporting.  In a later patch we will
      be adding a kasan early enable/disable.  Rename kasan_enabled to properly
      reflect its function.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ba8663c
    • Tetsuo Handa's avatar
      mm: remove refresh_cpu_vm_stats() definition for !SMP kernel · 5ba97bf9
      Tetsuo Handa authored
      refresh_cpu_vm_stats(int cpu) has not been referenced by !SMP kernels
      since Linux 3.12.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ba97bf9
    • Hugh Dickins's avatar
      Documentation/filesystems/proc.txt: a little tidying · a5be3563
      Hugh Dickins authored
      There's an odd line about "Locked" at the head of the description of
      /proc/meminfo: it seems to have strayed from /proc/PID/smaps, so lead it
      back there.  Move "Swap" and "SwapPss" descriptions down above it, to
      match the order in the file (though "PageSize"s still undescribed).
      
      The example of "Locked: 374 kB" (the same as Pss, neither Rss nor Size) is
      so unlikely as to be misleading: just make it 0, this is /bin/bash text;
      which would be "dw" (disabled write) not "de" (do not expand).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5be3563
    • Hugh Dickins's avatar
      tmpfs: avoid a little creat and stat slowdown · d0424c42
      Hugh Dickins authored
      LKP reports that v4.2 commit afa2db2f ("tmpfs: truncate prealloc
      blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
      benchmark.
      
      creat-clo does just what you'd expect from the name, and creat's O_TRUNC
      on 0-length file does indeed get into more overhead now shmem_setattr()
      tests "0 <= 0" instead of "0 < 0".
      
      I'm not sure how much we care, but I think it would not be too VW-like to
      add in a check for whether any pages (or swap) are allocated: if none are
      allocated, there's none to remove from the radix_tree.  At first I thought
      that check would be good enough for the unmaps too, but no: we should not
      skip the unlikely case of unmapping pages beyond the new EOF, which were
      COWed from holes which have now been reclaimed, leaving none.
      
      This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere, and
      running a debug config before and after: I hope those account for the
      lesser speedup.
      
      And probably someone has a benchmark where a thousand threads keep on
      stat'ing the same file repeatedly: forestall that report by adjusting v4.3
      commit 44a30220 ("shmem: recalculate file inode when fstat") not to
      take the spinlock in shmem_getattr() when there's no work to do.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Ying Huang <ying.huang@linux.intel.com>
      Tested-by: Ying Huang <ying.huang@linux.intel.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0424c42
    • David Rientjes's avatar
      mm, oom: add comment for why oom_adj exists · b72bdfa7
      David Rientjes authored
      /proc/pid/oom_adj exists solely to avoid breaking existing userspace
      binaries that write to the tunable.
      
      Add a comment in the only possible location within the kernel tree to
      describe the situation and motivation for keeping it around.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b72bdfa7
    • Michal Hocko's avatar
      memcg: fix thresholds for 32b architectures. · c12176d3
      Michal Hocko authored
      Commit 424cdc14 ("memcg: convert threshold to bytes") has fixed a
      regression introduced by 3e32cb2e ("mm: memcontrol: lockless page
      counters") where thresholds were silently converted to use page units
      rather than bytes when interpreting the user input.
      
      The fix is not complete, though, as properly pointed out by Ben Hutchings
      during stable backport review.  The page count is converted to bytes but
      unsigned long is used to hold the value which would be obviously not
      sufficient for 32b systems with more than 4G thresholds.  The same applies
      to usage as taken from mem_cgroup_usage which might overflow.
      
      Let's remove this bytes vs. pages internal tracking difference and
      handle thresholds in page units internally.  Change mem_cgroup_usage() to
      return the value in page units and revert 424cdc14 because this should
      be sufficient for the consistent handling.  mem_cgroup_read_u64, as the
      only user of mem_cgroup_usage outside of the threshold handling code, is
      converted to give the proper result in bytes.  It is doing that already
      for page_counter output so this is more consistent as well.
      
      The value presented to the userspace is still in bytes units.
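
      The overflow can be reproduced in a few lines.  This sketch uses
      uint32_t to stand in for unsigned long on a 32b system and assumes 4 KiB
      pages.  The helper names are hypothetical and the code is not taken from
      memcontrol.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED 12   /* assumed: 4 KiB pages */

/* Byte value held in a 32-bit type: silently wraps at 4 GiB. */
uint32_t pages_to_bytes_32(uint32_t pages)
{
    return pages << PAGE_SHIFT_ASSUMED;
}

/* Byte value held in a 64-bit type, as reported to userspace. */
uint64_t pages_to_bytes_64(uint32_t pages)
{
    return (uint64_t)pages << PAGE_SHIFT_ASSUMED;
}

/* Comparing in page units avoids the conversion and therefore the wrap. */
bool threshold_reached_pages(uint32_t usage_pages, uint32_t threshold_pages)
{
    return usage_pages >= threshold_pages;
}

/* Comparing wrapped byte values can give the wrong answer above 4 GiB. */
bool threshold_reached_bytes32(uint32_t usage_pages, uint32_t threshold_pages)
{
    return pages_to_bytes_32(usage_pages) >= pages_to_bytes_32(threshold_pages);
}
```

      With a 5 GiB threshold (1310720 pages) and 2 GiB of usage (524288
      pages), the 32-bit byte comparison wrongly reports the threshold as
      reached because 5 GiB wraps to 1 GiB, while the page-unit comparison
      does not.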
      
      Fixes: 424cdc14 ("memcg: convert threshold to bytes")
      Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Ben Hutchings <ben@decadent.org.uk>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      From: Michal Hocko <mhocko@kernel.org>
      Subject: memcg-fix-thresholds-for-32b-architectures-fix
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      From: Andrew Morton <akpm@linux-foundation.org>
      Subject: memcg-fix-thresholds-for-32b-architectures-fix-fix
      
      don't attempt to inline mem_cgroup_usage()
      
      The compiler ignores the inline anyway.  And __always_inlining it adds 600
      bytes of goop to the .o file.
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c12176d3
    • Johannes Weiner's avatar
      mm: page_counter: let page_counter_try_charge() return bool · 6071ca52
      Johannes Weiner authored
      page_counter_try_charge() currently returns 0 on success and -ENOMEM on
      failure, which is surprising behavior given the function name.
      
      Make it follow the expected pattern of try_stuff() functions that return a
      boolean true to indicate success, or false for failure.
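
      The calling convention can be illustrated with a toy counter.  This is
      not the kernel's page_counter, which is hierarchical and uses atomics.
      The struct and function here are hypothetical and single threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a counter with a hard limit, in page units. */
struct toy_counter {
    unsigned long usage;
    unsigned long limit;
};

/*
 * Try to charge nr_pages.  Returns true and updates usage on success, or
 * false and leaves the counter untouched when the limit would be exceeded.
 */
bool toy_try_charge(struct toy_counter *c, unsigned long nr_pages)
{
    assert(c->usage <= c->limit);
    if (nr_pages > c->limit - c->usage)
        return false;
    c->usage += nr_pages;
    return true;
}
```

      A call site then reads naturally as "if (!toy_try_charge(&c, n))" for
      the failure path, instead of testing for a nonzero error code.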
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6071ca52
    • Johannes Weiner's avatar
      mm: memcontrol: eliminate root memory.current · f5fc3c5d
      Johannes Weiner authored
      memory.current on the root level doesn't add anything that wouldn't be
      more accurate and detailed using system statistics.  It already doesn't
      include slabs, and it'll be a pain to keep in sync when further memory
      types are accounted in the memory controller.  Remove it.
      
      Note that this applies to the new unified hierarchy interface only.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5fc3c5d
    • Dave Hansen's avatar
      mm, hugetlbfs: optimize when NUMA=n · e0ec90ee
      Dave Hansen authored
      My recent patch "mm, hugetlb: use memory policy when available" added some
      bloat to hugetlb.o.  This patch aims to get some of the bloat back,
      especially when NUMA is not in play.
      
      It does this with an implicit #ifdef and marking some things static that
      should have been static in my first patch.  It also makes the warnings
      only VM_WARN_ON()s.  They were responsible for a pretty big chunk of the
      bloat.
      
      Doing this gets our NUMA=n text size back to a wee bit _below_ where we
      started before the original patch.
      
      It also shaves a bit of space off the NUMA=y case, but not much.
      Enforcing the mempolicy definitely takes some text and it's hard to avoid.
      
      size(1) output:
      
         text	   data	    bss	    dec	    hex	filename
        30745	   3433	   2492	  36670	   8f3e	hugetlb.o.nonuma.baseline
        31305	   3755	   2492	  37552	   92b0	hugetlb.o.nonuma.patch1
        30713	   3433	   2492	  36638	   8f1e	hugetlb.o.nonuma.patch2 (this patch)
        25235	    473	  41276	  66984	  105a8	hugetlb.o.numa.baseline
        25715	    475	  41276	  67466	  1078a	hugetlb.o.numa.patch1
        25491	    473	  41276	  67240	  106a8	hugetlb.o.numa.patch2 (this patch)
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0ec90ee
    • Dave Hansen's avatar
      mm, hugetlb: use memory policy when available · 099730d6
      Dave Hansen authored
      I have a hugetlbfs user which is never explicitly allocating huge pages
      with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
      the pages be allocated from the buddy allocator at fault time.
      
      This works, but they noticed that mbind() was not doing them any good and
      the pages were being allocated without respect for the policy they
      specified.
      
      The code in question is this:
      
      > struct page *alloc_huge_page(struct vm_area_struct *vma,
      ...
      >         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
      >         if (!page) {
      >                 page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
      
      dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
      But, it only grabs _existing_ huge pages from the huge page pool.  If the
      pool is empty, we fall back to alloc_buddy_huge_page() which obviously
      can't do anything with the VMA's policy because it isn't even passed the
      VMA.
      
      Almost everybody preallocates huge pages.  That's probably why nobody has
      ever noticed this.  Looking back at the git history, I don't think this
      _ever_ worked from when alloc_buddy_huge_page() was introduced in
      7893d1d5, 8 years ago.
      
      The fix is to pass vma/addr down in to the places where we actually call
      in to the buddy allocator.  It's fairly straightforward plumbing.  This
      has been lightly tested.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      099730d6