1. 11 Sep, 2013 40 commits
    • kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user() · 3ddc5b46
      Mathieu Desnoyers authored
      I found the following pattern, which leads to interesting findings:
      
        grep -r "ret.*|=.*__put_user" *
        grep -r "ret.*|=.*__get_user" *
        grep -r "ret.*|=.*__copy" *
      
      The __put_user() calls in compat_ioctl.c, ptrace compat, and signal
      compat appear in compat code, so we can probably expect the kernel
      addresses not to be reachable in the lower 32-bit range; I think they
      might not be exploitable.
      
      For the "__get_user" cases, I don't think those are exploitable: the
      worst that can happen is that the kernel will copy kernel memory into
      in-kernel buffers, and will fail immediately afterward.
      
      The alpha csum_partial_copy_from_user() seems to be missing the
      access_ok() check entirely.  The fix is inspired by x86.  This could
      lead to an information leak on alpha.  I also noticed that many
      architectures map csum_partial_copy_from_user() to
      csum_partial_copy_generic(), but I wonder whether the latter performs
      the access checks on every architecture.
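      
      As a rough sketch of the kind of guard this adds (modelled on the x86
      behaviour of failing with -EFAULT and zeroing the destination; the
      names below are the function's parameters, and this is illustrative,
      not the actual alpha diff):
      
        /* at the top of csum_partial_copy_from_user(), before any copying */
        if (!access_ok(VERIFY_READ, src, len)) {
                *errp = -EFAULT;
                memset(dst, 0, len);
                return sum;
        }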
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ddc5b46
    • drivers/firmware/google/gsmi.c: replace strict_strtoul() with kstrtoul() · 20d0e570
      Jingoo Han authored
      The use of strict_strtoul() is discouraged because it is obsolete;
      kstrtoul() should be used instead.
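      
      The conversion follows the usual pattern (a generic illustration of the
      interface change, not the exact gsmi.c hunk; the variable names are
      placeholders):
      
        unsigned long val;
        int ret;
        
        /* obsolete interface */
        if (strict_strtoul(buf, 0, &val))
                return -EINVAL;
        
        /* preferred replacement: returns 0 on success or a negative errno */
        ret = kstrtoul(buf, 0, &val);
        if (ret < 0)
                return ret;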
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Cc: Tom Gundersen <teg@jklm.no>
      Cc: Mike Waychison <mikew@google.com>
      Acked-by: Mike Waychison <mikew@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20d0e570
    • platform: convert apple-gmux driver to dev_pm_ops from legacy pm_ops · 8aa6c216
      Shuah Khan authored
      Convert drivers/platform/x86/apple-gmux to use dev_pm_ops instead of
      legacy pm_ops.  This patch depends on pnp driver bus ops change to invoke
      pnp_driver dev_pm_ops.
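      
      In outline, the conversion replaces the pnp-specific suspend/resume
      callbacks with struct device based ones wired up through dev_pm_ops (a
      sketch of the pattern only; the function and variable names below are
      illustrative, not the literal diff):
      
        static int gmux_suspend(struct device *dev)
        {
                /* device-specific suspend work */
                return 0;
        }
        
        static int gmux_resume(struct device *dev)
        {
                /* device-specific resume work */
                return 0;
        }
        
        static const struct dev_pm_ops gmux_dev_pm_ops = {
                .suspend = gmux_suspend,
                .resume  = gmux_resume,
        };
        
        static struct pnp_driver gmux_pnp_driver = {
                .name   = "apple-gmux",
                .driver = { .pm = &gmux_dev_pm_ops },
                /* .probe, .remove, .id_table as before */
        };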
      Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
      Cc: Matthew Garrett <matthew.garrett@nebula.com>
      Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com>
      Cc: Ashley Lai <ashley@ashleylai.com>
      Cc: Rajiv Andrade <mail@srajiv.net>
      Cc: Marcel Selhorst <tpmdd@selhorst.net>
      Cc: Sirrix AG <tpmdd@sirrix.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Peter Hüwe <PeterHuewe@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8aa6c216
    • tpm: convert tpm_tis driver to use dev_pm_ops from legacy pm_ops · a2fa3fb0
      Shuah Khan authored
      Convert drivers/char/tpm/tpm_tis.c to use dev_pm_ops instead of legacy
      pm_ops.  This patch depends on pnp driver bus ops change to invoke
      pnp_driver dev_pm_ops.
      Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
      Cc: Matthew Garrett <matthew.garrett@nebula.com>
      Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com>
      Cc: Ashley Lai <ashley@ashleylai.com>
      Cc: Rajiv Andrade <mail@srajiv.net>
      Cc: Marcel Selhorst <tpmdd@selhorst.net>
      Cc: Sirrix AG <tpmdd@sirrix.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Peter Hüwe <PeterHuewe@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2fa3fb0
    • rtc: convert rtc-cmos to dev_pm_ops from legacy pm_ops · a8a3808b
      Shuah Khan authored
      Convert drivers/rtc/rtc-cmos to use dev_pm_ops instead of legacy pm_ops.
      This patch depends on pnp driver bus ops change to invoke pnp_driver
      dev_pm_ops.
      Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
      Cc: Matthew Garrett <matthew.garrett@nebula.com>
      Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com>
      Cc: Ashley Lai <ashley@ashleylai.com>
      Cc: Rajiv Andrade <mail@srajiv.net>
      Cc: Marcel Selhorst <tpmdd@selhorst.net>
      Cc: Sirrix AG <tpmdd@sirrix.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Peter Hüwe <PeterHuewe@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8a3808b
    • pnp: change pnp bus pm_ops to invoke pnp driver dev_pm_ops if specified · 729377d5
      Shuah Khan authored
      pnp_bus_suspend() and pnp_bus_resume() invoke legacy pm_ops from
      pnp_driver.  Change pnp_bus_suspend() and pnp_bus_resume() to check
      whether the pnp driver provides dev_pm_ops and, if so, call those; if
      dev_pm_ops don't exist, fall back to the legacy pm_ops.  Without this
      change, a pnp_driver's dev_pm_ops will never get called.
      
      In addition to the pnp driver bus pm_ops change to invoke driver
      dev_pm_ops, this patch set contains changes to rtc-cmos, tpm_tis, and
      apple-gmux pnp drivers to convert from legacy pm_ops to dev_pm_ops.
      
      This patch (of 4):
      
      pnp_bus_suspend() and pnp_bus_resume() invoke legacy pm_ops from
      pnp_driver.  Change pnp_bus_suspend() and pnp_bus_resume() to check
      whether the pnp driver provides dev_pm_ops and, if so, call those; if
      dev_pm_ops don't exist, fall back to the legacy pm_ops.  Without this
      change, a pnp_driver's dev_pm_ops will never get called.
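      
      A sketch of the described dispatch logic (simplified and illustrative,
      not the literal diff):
      
        static int pnp_bus_suspend(struct device *dev, pm_message_t state)
        {
                struct pnp_dev *pnp_dev = to_pnp_dev(dev);
                struct pnp_driver *pnp_drv = pnp_dev->driver;
        
                if (!pnp_drv)
                        return 0;
        
                /* prefer the driver's dev_pm_ops when it provides them */
                if (pnp_drv->driver.pm && pnp_drv->driver.pm->suspend)
                        return pnp_drv->driver.pm->suspend(dev);
        
                /* otherwise fall back to the legacy pnp_driver callback */
                if (pnp_drv->suspend)
                        return pnp_drv->suspend(pnp_dev, state);
        
                return 0;
        }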
      Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
      Cc: Matthew Garrett <matthew.garrett@nebula.com>
      Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com>
      Cc: Ashley Lai <ashley@ashleylai.com>
      Cc: Rajiv Andrade <mail@srajiv.net>
      Cc: Marcel Selhorst <tpmdd@selhorst.net>
      Cc: Sirrix AG <tpmdd@sirrix.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Peter Hüwe <PeterHuewe@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      729377d5
    • memcg: fix multiple large threshold notifications · 2bff24a3
      Greg Thelen authored
      A memory cgroup with (1) multiple threshold notifications and (2) at least
      one threshold >=2G was not reliable.  Specifically the notifications would
      either not fire or would not fire in the proper order.
      
      The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
      thresholds in sorted order.  mem_cgroup_usage_register_event() sorts them
      with compare_thresholds(), which returns the difference of two 64 bit
      thresholds as an int.  If the difference is positive but has bit[31] set,
      then sort() treats the difference as negative and breaks sort order.
      
      This fix compares the two arbitrary 64 bit thresholds and returns the
      classic -1, 0, 1 result.
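      
      The sort() comparator therefore has to return only the sign of the
      comparison rather than the raw 64-bit difference truncated to int;
      roughly:
      
        static int compare_thresholds(const void *a, const void *b)
        {
                const struct mem_cgroup_threshold *_a = a;
                const struct mem_cgroup_threshold *_b = b;
        
                if (_a->threshold > _b->threshold)
                        return 1;
                if (_a->threshold < _b->threshold)
                        return -1;
                return 0;
        }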
      
      The test below sets two notifications (at 0x1000 and 0x81001000):
        cd /sys/fs/cgroup/memory
        mkdir x
        for x in 4096 2164264960; do
          cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
        done
        echo $$ > x/cgroup.procs
        anon_leaker 500M
      
      v3.11-rc7 fails to signal the 4096 event listener:
        Leaking...
        Done leaking pages.
      
      Patched v3.11-rc7 properly notifies:
        Leaking...
        4096 listener:2013:8:31:14:13:36
        Done leaking pages.
      
      The fixed bug is old.  It appears to date back to the introduction of
      memcg threshold notifications in v2.6.34-rc1-116-g2e72b634 "memcg:
      implement memory thresholds"
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bff24a3
    • mm/mempool.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) · 7b5219db
      Joe Perches authored
      Use the helper function instead of __GFP_ZERO.
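      
      I.e. (the surrounding names here are placeholders, not the exact hunk):
      
        /* before: zeroing requested through the gfp mask */
        pool = kmalloc_node(sizeof(*pool), gfp_mask | __GFP_ZERO, node_id);
        
        /* after: same semantics, clearer intent */
        pool = kzalloc_node(sizeof(*pool), gfp_mask, node_id);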
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b5219db
    • lib/genalloc.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) · ade34a35
      Joe Perches authored
      Use the helper function instead of __GFP_ZERO.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ade34a35
    • mm/mmap: remove unnecessary assignment · 2d8a1781
      Yanchuan Nian authored
      pgoff is not used after the statement "pgoff = vma->vm_pgoff;", so the
      assignment is redundant.
      Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d8a1781
    • writeback: fix race that cause writeback hung · 146d7009
      Junxiao Bi authored
      There is a race between marking an inode dirty and the writeback
      thread; see the following scenario.  In this case the writeback thread
      will not run even though there is dirty_io.
      
      __mark_inode_dirty()                                          bdi_writeback_workfn()
      	...                                                       	...
      	spin_lock(&inode->i_lock);
      	...
      	if (bdi_cap_writeback_dirty(bdi)) {
      	    <<< assume wb has dirty_io, so wakeup_bdi is false.
      	    <<< the following inode_dirty also have wakeup_bdi false.
      	    if (!wb_has_dirty_io(&bdi->wb))
      		    wakeup_bdi = true;
      	}
      	spin_unlock(&inode->i_lock);
      	                                                            <<< assume last dirty_io is removed here.
      	                                                            pages_written = wb_do_writeback(wb);
      	                                                            ...
      	                                                            <<< work_list empty and wb has no dirty_io,
      	                                                            <<< delayed_work will not be queued.
      	                                                            if (!list_empty(&bdi->work_list) ||
      	                                                                (wb_has_dirty_io(wb) && dirty_writeback_interval))
      	                                                                queue_delayed_work(bdi_wq, &wb->dwork,
      	                                                                    msecs_to_jiffies(dirty_writeback_interval * 10));
      	spin_lock(&bdi->wb.list_lock);
      	inode->dirtied_when = jiffies;
      	<<< new dirty_io is added.
      	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
      	spin_unlock(&bdi->wb.list_lock);
      
      	<<< though there is dirty_io, but wakeup_bdi is false,
      	<<< so writeback thread will not be waked up and
      	<<< the new dirty_io will not be flushed.
      	if (wakeup_bdi)
      	    bdi_wakeup_thread_delayed(bdi);
      
      Writeback will not run again until a new flush work is queued.  This
      may cause a lot of dirty pages to stay in memory for a long time.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      146d7009
    • mm/madvise.c:madvise_hwpoison(): remove local `ret' · 325c4ef5
      Andrew Morton authored
      madvise_hwpoison() has two locals called "ret".  Fix it all up.
      
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      325c4ef5
    • mm/madvise.c: fix return value of madvise_hwpoison() · 8302423b
      Wanpeng Li authored
      The return value outside the for loop is always zero, which means
      madvise_hwpoison() reports success even when soft_offline_page()
      returns a failure.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8302423b
    • mm/memory-failure.c: fix bug triggered by unpoisoning empty zero page · 3ba5eebc
      Wanpeng Li authored
        Injecting memory failure for page 0x19d0 at 0xb77d2000
        MCE 0x19d0: non LRU page recovery: Ignored
        MCE: Software-unpoisoned page 0x19d0
        BUG: Bad page state in process bash  pfn:019d0
        page:f3461a00 count:0 mapcount:0 mapping:  (null) index:0x0
        page flags: 0x40000404(referenced|reserved)
        Modules linked in: nfsd auth_rpcgss i915 nfs_acl nfs lockd video drm_kms_helper drm bnep rfcomm sunrpc bluetooth psmouse parport_pc ppdev lp serio_raw fscache parport gpio_ich lpc_ich mac_hid i2c_algo_bit tpm_tis wmi usb_storage hid_generic usbhid hid e1000e firewire_ohci firewire_core ahci ptp libahci pps_core crc_itu_t
        CPU: 3 PID: 2123 Comm: bash Not tainted 3.11.0-rc6+ #12
        Hardware name: LENOVO 7034DD7/        , BIOS 9HKT47AUS 01//2012
         00000000 00000000 e9625ea0 c15ec49b f3461a00 e9625eb8 c15ea119 c17cbf18
         ef084314 000019d0 f3461a00 e9625ed8 c110dc8a f3461a00 00000001 00000000
         f3461a00 40000404 00000000 e9625ef8 c110dcc1 f3461a00 f3461a00 000019d0
        Call Trace:
          dump_stack+0x41/0x52
          bad_page+0xcf/0xeb
          free_pages_prepare+0x12a/0x140
          free_hot_cold_page+0x21/0x110
          __put_single_page+0x21/0x30
          put_page+0x25/0x40
          unpoison_memory+0x107/0x200
          hwpoison_unpoison+0x20/0x30
          simple_attr_write+0xb6/0xd0
          vfs_write+0xa0/0x1b0
          SyS_write+0x4f/0x90
          sysenter_do_call+0x12/0x22
        Disabling lock debugging due to kernel taint
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <errno.h>
      
      #define PAGES_TO_TEST 1
      #define PAGE_SIZE	4096
      
      int main(void)
      {
      	char *mem;
      
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;
      
      	munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
      
      	return 0;
      }
      
      There is one page reference count for the default empty zero page, and
      madvise_hwpoison adds another one via get_user_pages_fast.
      memory_failure then drops one page reference count since it's a non-LRU
      page.  unpoison_memory releases the last page reference count and frees
      the empty zero page to the buddy system, which is not correct since the
      empty zero page has the PG_reserved flag set.  This patch fixes it by
      not reducing the page reference count below 1 for the empty zero page.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ba5eebc
    • mm/hwpoison-inject.c: change permission of corrupt-pfn/unpoison-pfn to 0200 · 2d1e8b3f
      Wanpeng Li authored
      Hwpoison injection doesn't implement a read method for the
      corrupt-pfn/unpoison-pfn attributes:
      
      # cat /sys/kernel/debug/hwpoison/corrupt-pfn
      cat: /sys/kernel/debug/hwpoison/corrupt-pfn: Permission denied
      # cat /sys/kernel/debug/hwpoison/unpoison-pfn
      cat: /sys/kernel/debug/hwpoison/unpoison-pfn: Permission denied
      
      This patch changes the permission of corrupt-pfn/unpoison-pfn to 0200.
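      
      In debugfs terms that means registering the files as write-only, along
      the lines of (the dentry and file_operations names below are
      illustrative):
      
        debugfs_create_file("corrupt-pfn", 0200, hwpoison_dir, NULL,
                            &corrupt_pfn_fops);
        debugfs_create_file("unpoison-pfn", 0200, hwpoison_dir, NULL,
                            &unpoison_pfn_fops);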
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d1e8b3f
    • mm/hwpoison.c: fix held reference count after unpoisoning empty zero page · 29b4eede
      Wanpeng Li authored
      madvise hwpoison inject will poison the read-only empty zero page if
      there is no write access before poisoning.  The empty zero page's
      reference count is increased for hwpoison; subsequent poisoning of the
      zero page returns directly since the page has already been marked
      PG_hwpoison, yet its reference count is still increased by
      get_user_pages_fast.  The unpoison process will unpoison the empty zero
      page and decrease the reference count successfully the first time;
      however, subsequent unpoisoning of the empty zero page returns directly
      since the page has already been unpoisoned, without decreasing the page
      reference count of the empty zero page.
      
      This patch fixes it by making madvise_hwpoison() put the page and
      return immediately (without calling memory_failure() or
      soft_offline_page()) when the page is already hwpoisoned.
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <errno.h>
      
      #define PAGES_TO_TEST 3
      #define PAGE_SIZE	4096
      
      int main(void)
      {
      	char *mem;
      	int i;
      
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;
      
      	munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
      
      	return 0;
      }
      
      Add printk to dump page reference count:
      
      [   93.075959] Injecting memory failure for page 0x19d0 at 0xb77d8000
      [   93.076207] MCE 0x19d0: non LRU page recovery: Ignored
      [   93.076209] pfn 0x19d0, page count = 1 after memory failure
      [   93.076220] Injecting memory failure for page 0x19d0 at 0xb77d9000
      [   93.076221] MCE 0x19d0: already hardware poisoned
      [   93.076222] pfn 0x19d0, page count = 2 after memory failure
      [   93.076224] Injecting memory failure for page 0x19d0 at 0xb77da000
      [   93.076224] MCE 0x19d0: already hardware poisoned
      [   93.076225] pfn 0x19d0, page count = 3 after memory failure
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29b4eede
    • mm/hwpoison: add '#' to madvise_hwpoison · b194b8cd
      Wanpeng Li authored
      Add the '#' flag to madvise_hwpoison's printk formats so the pfn and
      address are printed with a "0x" prefix.
      
      Before patch:
      
      [   95.892866] Injecting memory failure for page 19d0 at b7786000
      [   95.893151] MCE 0x19d0: non LRU page recovery: Ignored
      
      After patch:
      
      [   95.892866] Injecting memory failure for page 0x19d0 at 0xb7786000
      [   95.893151] MCE 0x19d0: non LRU page recovery: Ignored
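      
      The '#' conversion flag selects the alternate form, so %#lx prints the
      value with a leading "0x"; the format line becomes something like
      (illustrative, with placeholder variable names):
      
        pr_info("Injecting memory failure for page %#lx at %#lx\n",
                pfn, address);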
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b194b8cd
    • mm/hwpoison: drop forward reference declarations __soft_offline_page() · 86e05773
      Wanpeng Li authored
      Drop the forward declaration of __soft_offline_page().
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86e05773
    • mm/hwpoison: don't set migration type twice to avoid holding heavily contend zone->lock · 0be35096
      Wanpeng Li authored
      Setting the pageblock migration type takes zone->lock, which is heavily
      contended, to avoid races.  However, soft offline will set the
      pageblock migration type twice while getting the page if the page is
      in use, is not a hugetlbfs page and is not on the LRU list.  There is
      no need to set the pageblock migration type and take the heavily
      contended zone->lock again if the first round of getting the page has
      already set the pageblock to the right migration type.
      
      The trick here is that the migration type is MIGRATE_ISOLATE.  There
      are only two other places that can change MIGRATE_ISOLATE besides
      hwpoison.  One is memory hotplug, but we hold lock_memory_hotplug(),
      which avoids the race.  The second is CMA, which unmovable page
      allocation requests can't fall back to.  So it's safe here.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0be35096
    • mm/hwpoison: replace atomic_long_sub() with atomic_long_dec() · dd9538a5
      Wanpeng Li authored
      Replace atomic_long_sub() with atomic_long_dec(), since the page is a
      normal page rather than a hugetlbfs page or THP and so only one page
      needs to be accounted.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd9538a5
    • mm/hwpoison: fix race against poison thp · 0cea3fdc
      Wanpeng Li authored
      There is a race between poisoning and unpoisoning a page:
      memory_failure sets PageHWPoison and increases num_poisoned_pages
      without holding the page lock, and only one page is accounted in
      num_poisoned_pages for a thp.  However, unpoison can occur before
      memory_failure takes the page lock and splits the transparent
      hugepage; unpoison will then decrease num_poisoned_pages by
      1 << compound_order since memory_failure has not yet split the
      transparent hugepage with the page lock held.  That means we account
      one page for hwpoison and 1 << compound_order for unpoison.  This
      patch fixes it by inserting a PageTransHuge check before doing
      TestClearPageHWPoison; unpoisoning then fails without clearing
      PageHWPoison or decreasing num_poisoned_pages.
      
                  A                                                 	B
          	memory_failue
              TestSetPageHWPoison(p);
              if (PageHuge(p))
                  nr_pages = 1 << compound_order(hpage);
              else
                  nr_pages = 1;
              atomic_long_add(nr_pages, &num_poisoned_pages);
                                                                  unpoison_memory
      	                                                        nr_pages = 1<< compound_trans_order(page);
                                                                  if(TestClearPageHWPoison(p))
                                                                  atomic_long_sub(nr_pages, &num_poisoned_pages);
              lock page
              if (!PageHWPoison(p))
              	unlock page and return
              hwpoison_user_mappings
              if (PageTransHuge(hpage))
              	split_huge_page(hpage);
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cea3fdc
    • mm/hwpoison: don't need to hold compound lock for hugetlbfs page · f9121153
      Wanpeng Li authored
      The compound lock was introduced by commit e9da73d6 ("thp:
      compound_lock.") to serialize put_page against
      __split_huge_page_refcount().  In addition, transparent hugepages are
      split in the hwpoison handler and only one subpage is poisoned.  There
      is no need to hold the compound lock for a hugetlbfs page.  This patch
      replaces compound_trans_order with compound_order in the places where
      the page is a hugetlbfs page.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9121153
    • mm/hwpoison: fix loss of PG_dirty for errors on mlocked pages · 841fcc58
      Wanpeng Li authored
      memory_failure() stores the page flags of the error page before doing
      the unmap, and (only) if the first check against the live page flags
      decided the error page is unknown, it does a second check with the
      stored page flags, since memory_failure() unmaps the error pages before
      calling page_action().  This unmapping changes the page state; in
      particular page_remove_rmap() (called from try_to_unmap_one()) clears
      PG_mlocked, so page_action() can't catch mlocked pages after that.
      
      However, memory_failure() can't handle memory errors on dirty mlocked
      pages correctly: try_to_unmap_one() moves the dirty bit from the pte to
      the physical page, and the second check loses it because it looks at
      the stored page flags.  This patch fixes it by restoring the PG_dirty
      flag into the stored page flags if the page is dirty.
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <errno.h>
      
      #define PAGES_TO_TEST 2
      #define PAGE_SIZE	4096
      
      int main(void)
      {
      	char *mem;
      	int i;
      
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0, 0);
      
      	for (i = 0; i < PAGES_TO_TEST; i++)
      		mem[i * PAGE_SIZE] = 'a';
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;
      
      	return 0;
      }
      
      Before patch:
      
      [  912.839247] Injecting memory failure for page 7dfb8 at 7f6b4e37b000
      [  912.839257] MCE 0x7dfb8: clean mlocked LRU page recovery: Recovered
      [  912.845550] MCE 0x7dfb8: clean mlocked LRU page still referenced by 1 users
      [  912.852586] Injecting memory failure for page 7e6aa at 7f6b4e37c000
      [  912.852594] MCE 0x7e6aa: clean mlocked LRU page recovery: Recovered
      [  912.858936] MCE 0x7e6aa: clean mlocked LRU page still referenced by 1 users
      
      After patch:
      
      [  163.590225] Injecting memory failure for page 91bc2f at 7f9f5b0e5000
      [  163.590264] MCE 0x91bc2f: dirty mlocked LRU page recovery: Recovered
      [  163.596680] MCE 0x91bc2f: dirty mlocked LRU page still referenced by 1 users
      [  163.603831] Injecting memory failure for page 91cdd3 at 7f9f5b0e6000
      [  163.603852] MCE 0x91cdd3: dirty mlocked LRU page recovery: Recovered
      [  163.610305] MCE 0x91cdd3: dirty mlocked LRU page still referenced by 1 users
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      841fcc58
    • hwpoison: always unset MIGRATE_ISOLATE before returning from soft_offline_page() · 0d6fdbdb
      Naoya Horiguchi authored
      Soft offline code expects that MIGRATE_ISOLATE is set on the target page
      only during soft offlining work.  But currently it doesn't work as
      expected when get_any_page() fails and returns a negative value.  As a
      result, end users can be left with unexpectedly isolated pages.  This
      patch just fixes it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d6fdbdb
    • mm: correct the comment about the value for buddy _mapcount · cf6fe945
      Wang Sheng-Hui authored
      The comment should say that _mapcount is set to
      PAGE_BUDDY_MAPCOUNT_VALUE to mark a page as a buddy, not to the magic
      number -2.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf6fe945
    • mm: make sure _PAGE_SWP_SOFT_DIRTY bit is not set on present pte · fa0f281c
      Cyrill Gorcunov authored
      The _PAGE_SWP_SOFT_DIRTY bit should never be set on a present pte, so
      add a VM_BUG_ON to catch any potential future abuse.
      
      Also add a comment on _PAGE_SWP_SOFT_DIRTY definition explaining scope of
      its usage.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa0f281c
    • mm/page-writeback.c: add strictlimit feature · 5a537485
      Maxim Patlasov authored
      The feature prevents mistrusted filesystems (i.e. FUSE mounts created
      by unprivileged users) from growing a large number of dirty pages
      before throttling.  For such filesystems balance_dirty_pages always
      checks bdi counters against bdi limits.  I.e. even if the global
      "nr_dirty" is under "freerun", it's not allowed to skip bdi checks.
      The only use case for now is fuse: it sets bdi max_ratio to 1% by
      default and system administrators are supposed to expect that this
      limit won't be exceeded.
      
      The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
      filesystem may set the flag when it initializes its BDI.
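      
      A sketch of how a filesystem would opt in when setting up its
      backing_dev_info (illustrative only; where exactly the flag is set
      depends on the filesystem):
      
        /* e.g. alongside the other capability flags during bdi setup */
        bdi->capabilities |= BDI_CAP_STRICTLIMIT;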
      
      The problematic scenario comes from the fact that nobody pays attention to
      the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
      writeback).  The implementation of fuse writeback releases original page
      (by calling end_page_writeback) almost immediately.  A fuse request queued
      for real processing bears a copy of original page.  Hence, if userspace
      fuse daemon doesn't finalize write requests in timely manner, an
      aggressive mmap writer can pollute virtually all memory by those temporary
      fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
      nobody cares.
      
      To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
      problem" as a shortcut for "a possibility of uncontrolled grow of amount
      of RAM consumed by temporary pages allocated by kernel fuse to process
      writeback".
      
      The problem was very easy to reproduce.  There is a trivial example
      filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
      added "sleep(1);" to the write methods, then recompiled and mounted it.
      Then I created a huge file on the mount point and ran a simple program
      which mmap-ed the file to a memory region, then wrote data to the
      region.  An hour later I observed almost all RAM consumed by fuse
      writeback.  Since
      then some unrelated changes in kernel fuse made it more difficult to
      reproduce, but it is still possible now.
      
      Putting this theoretical happens-in-the-lab thing aside, there is another
      thing that really hurts real world (FUSE) users.  This is write-through
      page cache policy FUSE currently uses.  I.e.  handling write(2), kernel
      fuse populates page cache and flushes user data to the server
      synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
      ("writeback cache policy") solve the problem, but they also make resolving
      NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
      a huge file to a fuse mount would result in memory starvation.  Miklos,
      the maintainer of FUSE, believes the strictlimit feature is the way to go.
      
      And finally, putting FUSE topics aside, there is one more use-case for
      the strictlimit feature.  Using a slow USB stick (mass storage) in a
      machine with a huge amount of RAM installed is a well-known pain.
      Let's make a simple computation.  Assuming 64GB of RAM installed, the
      existing implementation of balance_dirty_pages will start throttling
      only after 9.6GB of RAM becomes dirty (freerun == 15% of total RAM).
      So the command "cp 9GB_file /media/my-usb-storage/" may return in a
      few seconds, but the subsequent "umount /media/my-usb-storage/" will
      take more than two hours if the effective throughput of the storage
      is, say, 1MB/sec.
      
      After inclusion of the strictlimit feature, it will be trivial to add a
      knob (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
      demand, manually or via a udev rule.  Maybe I'm wrong, but it seems to
      be quite a natural desire to limit the amount of dirty memory for
      devices we do not fully trust (in the sense of sustainable throughput).
      
      [akpm@linux-foundation.org: fix warning in page-writeback.c]
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a537485
    • mm/backing-dev.c: check user buffer length before copying data to the related user buffer · 4c3bffc2
      Chen Gang authored
      '*lenp' may be less than "sizeof(kbuf)", so we must check this before
      the next copy_to_user().
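      
      In other words, the copy length needs to be clamped to the
      caller-supplied size, along the lines of (a sketch, not necessarily the
      literal hunk):
      
        if (copy_to_user(buffer, kbuf, min(sizeof(kbuf), *lenp)))
                return -EFAULT;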
      
      pdflush_proc_obsolete() is called by sysctl for the
      "nr_pdflush_threads" procname; if the user passes a buffer length less
      than "sizeof(kbuf)", copying the full "sizeof(kbuf)" would overrun the
      user-supplied buffer.
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c3bffc2
    • mm/mremap.c: call pud_free() after fail calling pmd_alloc() · 1ecfd533
      Chen Gang authored
      In alloc_new_pmd(), if pud_alloc() was called successfully, but
      pmd_alloc() fails, avoid leaking `pud'.
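      
      The error path becomes, roughly (a sketch of the idea):
      
        pmd = pmd_alloc(mm, pud, addr);
        if (!pmd) {
                pud_free(mm, pud);
                return NULL;
        }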
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ecfd533
    • mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area · 762216ab
      Wanpeng Li authored
      Use wrapper function get_vm_area_size to calculate size of vm area.
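      
      For reference, the wrapper subtracts the guard page from the stored
      area size, roughly:
      
        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                /* return the usable size, excluding the trailing guard page */
                return area->size - PAGE_SIZE;
        }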
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      762216ab
    • mm/writeback: make writeback_inodes_wb static · 7d9f073b
      Wanpeng Li authored
      It's not used globally and could be static.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d9f073b
    • mm/sparse: introduce alloc_usemap_and_memmap · 18732093
      Wanpeng Li authored
      After commit 9bdac914 ("sparsemem: Put mem map for one node
      together."), the vmemmap for one node is allocated together; its logic
      is similar to the memory allocation for pageblock flags.  This patch
      introduces alloc_usemap_and_memmap to extract the common logic of
      memory allocation for pageblock flags and vmemmap.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18732093
    • mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Lisa Du authored
      This patch is based on KOSAKI's work and I add a little more
      description; please refer to https://lkml.org/lkml/2012/6/14/74.
      
      Currently, I found the system can enter a state where there are lots of
      free pages in a zone but only order-0 and order-1 pages, which means
      the zone is heavily fragmented; then a high-order allocation could make
      the direct reclaim path stall for a long time (e.g. 60 seconds),
      especially in a no-swap and no-compaction environment.  This problem
      happened on v3.4, but it seems the issue still lives in the current
      tree; the reason is that do_try_to_free_pages enters a livelock:
      
      kswapd will go to sleep if the zones have been fully scanned and are
      still not balanced, since kswapd thinks there's little point in trying
      all over again and wants to avoid an infinite loop.  Instead it changes
      order from high-order to 0-order because kswapd thinks order-0 is the
      most important.  Look at 73ce02e9 in detail.  If watermarks are ok,
      kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
      It assumes high-order users can still perform direct reclaim if they
      wish.
      
      Direct reclaim continues to reclaim for a high order which is not a
      COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
      zone->all_unreclaimable.  This is to avoid a too-early oom-kill.  So it
      means direct reclaim depends on kswapd to break this loop.
      
      In the worst case, direct reclaim may continue page reclaim forever
      while kswapd sleeps forever, until something like a watchdog detects it
      and finally kills the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from the direct reclaim path
      because the direct reclaim path doesn't take any lock and this way is
      racy.  Thus this patch removes the zone->all_unreclaimable field
      completely and recalculates the zone reclaimable state every time.
      
      Note: we can't take the approach of having direct reclaim look at
      zone->pages_scanned directly while kswapd continues to use
      zone->all_unreclaimable, because that is racy.  Commit 929bea7c
      ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
      describes the detail.
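      
      The recomputed reclaimable state boils down to a small helper along
      these lines (a sketch of the idea, not necessarily the literal code):
      
        bool zone_reclaimable(struct zone *zone)
        {
                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
        }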
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lisa Du <cldu@marvell.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e543d57
    • mm: munlock: manual pte walk in fast path instead of follow_page_mask() · 7a8010cd
      Vlastimil Babka authored
      Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
      each individual struct page.  This entails repeated full page table
      translations and page table lock taken for each page separately.
      
      This patch avoids the costly follow_page_mask() where possible, by
      iterating over ptes within single pmd under single page table lock.  The
      first pte is obtained by get_locked_pte() for non-THP page acquired by the
      initial follow_page_mask().  The rest of the on-stack pagevec for munlock
      is filled up using pte_walk as long as pte_present() and vm_normal_page()
      are sufficient to obtain the struct page.
      
      After this patch, a 14% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a8010cd
    • mm: munlock: remove redundant get_page/put_page pair on the fast path · 5b40998a
      Vlastimil Babka authored
      The performance of the fast path in munlock_vma_range() can be further
      improved by avoiding atomic ops of a redundant get_page()/put_page() pair.
      
      When calling get_page() during page isolation, we already have the pin
      from follow_page_mask().  This pin will be then returned by
      __pagevec_lru_add(), after which we do not reference the pages anymore.
      
      After this patch, an 8% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b40998a
    • mm: munlock: bypass per-cpu pvec for putback_lru_page · 56afe477
      Vlastimil Babka authored
      After introducing batching by pagevecs into munlock_vma_range(), we can
      further improve performance by bypassing the copying into per-cpu pagevec
      and the get_page/put_page pair associated with that.  Instead we perform
      LRU putback directly from our pagevec.  However, this is possible only for
      single-mapped pages that are evictable after munlock.  Unevictable pages
      require rechecking after being put on the unevictable list, so for those
      we fall back to putback_lru_page(), which handles that.
      
      After this patch, a 13% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      
      [akpm@linux-foundation.org:clarify comment]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56afe477
    • mm: munlock: batch NR_MLOCK zone state updates · 1ebb7cc6
      Vlastimil Babka authored
      Building on the previous patch, which introduced batched isolation in
      munlock_vma_range(), we can also batch the updates of the NR_MLOCK page
      stats.  After the whole pagevec is processed for page isolation, the
      stats are updated only once with the number of successful isolations.
      There were, however, no measurable performance gains.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ebb7cc6
    • mm: munlock: batch non-THP page isolation and munlock+putback using pagevec · 7225522b
      Vlastimil Babka authored
      Currently, munlock_vma_range() calls munlock_vma_page on each page in a
      loop, which results in repeated taking and releasing of the lru_lock
      spinlock for isolating pages one by one.  This patch batches the munlock
      operations using an on-stack pagevec, so that isolation is done under
      single lru_lock.  For THP pages, the old behavior is preserved as they
      might be split while putting them into the pagevec.  After this patch, a
      9% speedup was measured for munlocking a 56GB large memory area with THP
      disabled.
      
      A new function __munlock_pagevec() is introduced that takes a pagevec
      and:
      
      1) clears PageMlocked and isolates all pages under lru_lock.  Zone page
         stats can also be updated using the variant which assumes disabled
         interrupts.
      
      2) finishes the munlock and lru putback on all pages under their
         lock_page.  Note that previously, lock_page covered also the
         PageMlocked clearing and page isolation, but it is not needed for
         those operations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7225522b
    • mm: munlock: remove unnecessary call to lru_add_drain() · 586a32ac
      Vlastimil Babka authored
      In munlock_vma_range(), lru_add_drain() is currently called in a loop
      before each munlock_vma_page() call.
      
      This is suboptimal for performance when munlocking many pages.  The
      benefits of per-cpu pagevec for batching the LRU putback are removed since
      the pagevec only holds at most one page from the previous loop's
      iteration.
      
      The lru_add_drain() call also does not serve any purpose for correctness
      - it does not even drain the pagevecs of all cpus.  The munlock code already
      expects and handles situations where a page cannot be isolated from the
      LRU (e.g.  because it is on some per-cpu pagevec).
      
      The history of the (uncommented) call also suggests that it appears
      there as an oversight rather than intentionally.  Before commit ff6a6da6
      ("mm: accelerate munlock() treatment of THP pages") the call happened
      only once upon entering the function.  The commit moved the call into
      the while loop.  So while the other changes in the commit improved
      munlock performance for THP pages, it introduced the abovementioned
      suboptimal per-cpu pagevec usage.
      
      Further back in history, before commit 408e82b7 ("mm: munlock use
      follow_page"), munlock_vma_pages_range() was just a wrapper around
      __mlock_vma_pages_range which performed both mlock and munlock depending
      on a flag.  However, before ba470de4 ("mmap: handle mlocked pages during
      map, remap, unmap") the function handled only mlock, not munlock.  The
      lru_add_drain call thus comes from the implementation in commit b291f000
      ("mlock: mlocked pages are unevictable") and was intended only for
      mlocking, not munlocking.  The original intention of draining the LRU
      pagevec at mlock time was to ensure the pages were on the LRU before the
      lock operation so that they could be placed on the unevictable list
      immediately.  There is very little motivation to do the same in the
      munlock path, particularly for every single page.
      
      This patch therefore removes the call completely.  After removing the
      call, a 10% speedup was measured for munlock() of a 56GB large memory area
      with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      586a32ac
    • Vlastimil Babka's avatar
      mm: putback_lru_page: remove unnecessary call to page_lru_base_type() · 0ec3b74c
      Vlastimil Babka authored
      The goal of this patch series is to improve performance of munlock() of
      large mlocked memory areas on systems without THP.  This is motivated by
      reported very long times of crash recovery of processes with such areas,
      where munlock() can take several seconds.  See
      http://lwn.net/Articles/548108/
      
      The work was driven by a simple benchmark (to be included in mmtests) that
      mmaps() e.g.  56GB with MAP_LOCKED | MAP_POPULATE and measures the time of
      munlock().  Profiling was performed by attaching operf --pid to the
      process, sending a signal to trigger the munlock() part, and then notifying
      the monitoring wrapper back to stop operf, so that only munlock() appears
      in the profile.
      
      The profiles have shown that CPU time is spent mostly on atomic operations
      and repeated locking of single pages.  This series aims to reduce both,
      starting with simpler changes and moving to more complex ones.
      
      Patch 1 performs a simple cleanup in putback_lru_page() so that the page
      	lru base type is not determined when it is not actually needed.
      
      Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu
      	pagevec after each munlocked page is put there.
      
      Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating
      	multiple non-THP pages under a single lru_lock instead of locking and
      	processing each page separately (see the sketch after this list).
      
      Patch 4 changes the NR_MLOCK accounting to be performed only once per
      	pagevec introduced by the previous patch.
      
      Patch 5 uses the introduced pagevec to also batch the work of putback_lru_page
      	when possible, bypassing the per-cpu pvec and associated overhead.
      
      Patch 6 removes a redundant get_page/put_page pair which saves costly atomic
      	operations.
      
      Patch 7 avoids calling follow_page_mask() on each individual page, and obtains
      	multiple page references under a single page table lock where possible.
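      
      As an illustration of the batching idea behind patches 3-5, the loop can
      collect non-THP pages into an on-stack pagevec and hand the full batch to
      a helper that takes the zone's lru_lock once.  This is only a rough
      sketch; __munlock_pagevec_batch is a hypothetical helper name used here
      purely for illustration:
      
      	struct pagevec pvec;
      	struct page *page;
      
      	pagevec_init(&pvec, 0);
      	while (start < end) {
      		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
      		if (page && !IS_ERR(page) && !PageTransHuge(page)) {
      			/* collect; flush the batch once the pagevec is full */
      			if (!pagevec_add(&pvec, page))
      				__munlock_pagevec_batch(&pvec);
      		}
      		start += PAGE_SIZE;
      	}
      	/* flush any leftover pages in a partially filled pagevec */
      	if (pagevec_count(&pvec))
      		__munlock_pagevec_batch(&pvec);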
      
      Measurements were made using 3.11-rc3 as a baseline.  The first set of
      measurements shows the possibly ideal conditions where batching should
      help the most.  All memory is allocated from a single NUMA node and THP is
      disabled.
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           3.38 (  0.00%)        3.39 ( -0.13%)        3.00 ( 11.33%)        2.70 ( 20.20%)        2.67 ( 21.11%)        2.37 ( 29.88%)        2.20 ( 34.91%)        1.91 ( 43.59%)
      Elapsed mean          3.39 (  0.00%)        3.40 ( -0.23%)        3.01 ( 11.33%)        2.70 ( 20.26%)        2.67 ( 21.21%)        2.38 ( 29.88%)        2.21 ( 34.93%)        1.92 ( 43.46%)
      Elapsed stddev        0.01 (  0.00%)        0.01 (-43.09%)        0.01 ( 15.42%)        0.01 ( 23.42%)        0.00 ( 89.78%)        0.01 ( -7.15%)        0.00 ( 76.69%)        0.02 (-91.77%)
      Elapsed max           3.41 (  0.00%)        3.43 ( -0.52%)        3.03 ( 11.29%)        2.72 ( 20.16%)        2.67 ( 21.63%)        2.40 ( 29.50%)        2.21 ( 35.21%)        1.96 ( 42.39%)
      Elapsed range         0.03 (  0.00%)        0.04 (-51.16%)        0.02 (  6.27%)        0.02 ( 14.67%)        0.00 ( 88.90%)        0.03 (-19.18%)        0.01 ( 73.70%)        0.06 (-113.35%)
      
      The second set of measurements simulates the worst possible conditions for
      batching by using numactl --interleave, so that there is in fact only one
      page per pagevec.  Even in this case the series seems to improve
      performance thanks to reduced atomic operations and removal of
      lru_add_drain().
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           4.00 (  0.00%)        4.04 ( -0.93%)        3.87 (  3.37%)        3.72 (  6.94%)        3.81 (  4.72%)        3.69 (  7.82%)        3.64 (  8.92%)        3.41 ( 14.81%)
      Elapsed mean          4.17 (  0.00%)        4.15 (  0.51%)        4.03 (  3.49%)        3.89 (  6.84%)        3.86 (  7.48%)        3.89 (  6.69%)        3.70 ( 11.27%)        3.48 ( 16.59%)
      Elapsed stddev        0.16 (  0.00%)        0.08 ( 50.76%)        0.10 ( 41.58%)        0.16 (  4.59%)        0.05 ( 72.38%)        0.19 (-12.91%)        0.05 ( 68.09%)        0.06 ( 66.03%)
      Elapsed max           4.34 (  0.00%)        4.32 (  0.56%)        4.19 (  3.62%)        4.12 (  5.15%)        3.91 (  9.88%)        4.12 (  5.25%)        3.80 ( 12.58%)        3.56 ( 18.08%)
      Elapsed range         0.34 (  0.00%)        0.28 ( 17.91%)        0.32 (  6.45%)        0.40 (-15.73%)        0.10 ( 70.06%)        0.43 (-24.84%)        0.15 ( 55.32%)        0.15 ( 56.16%)
      
      For completeness, a third set of measurements shows the situation where
      THP is enabled and allocations are again done on a single NUMA node.  Here
      munlock() is already very fast thanks to huge pages, and this series does
      not compromise that performance.  It seems that the removal of call to
      lru_add_drain() still helps a bit.
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           0.01 (  0.00%)        0.01 ( -0.11%)        0.01 (  6.59%)        0.01 (  5.41%)        0.01 (  5.45%)        0.01 (  5.03%)        0.01 (  6.08%)        0.01 (  5.20%)
      Elapsed mean          0.01 (  0.00%)        0.01 ( -0.27%)        0.01 (  6.39%)        0.01 (  5.30%)        0.01 (  5.32%)        0.01 (  5.03%)        0.01 (  5.97%)        0.01 (  5.22%)
      Elapsed stddev        0.00 (  0.00%)        0.00 ( -9.59%)        0.00 ( 10.77%)        0.00 (  3.24%)        0.00 ( 24.42%)        0.00 ( 31.86%)        0.00 ( -7.46%)        0.00 (  6.11%)
      Elapsed max           0.01 (  0.00%)        0.01 ( -0.01%)        0.01 (  6.83%)        0.01 (  5.42%)        0.01 (  5.79%)        0.01 (  5.53%)        0.01 (  6.08%)        0.01 (  5.26%)
      Elapsed range         0.00 (  0.00%)        0.00 (  7.30%)        0.00 ( 24.38%)        0.00 (  6.10%)        0.00 ( 30.79%)        0.00 ( 42.52%)        0.00 (  6.11%)        0.00 ( 10.07%)
      
      This patch (of 7):
      
      In putback_lru_page(), since commit c53954a0 ("mm: remove lru parameter
      from __lru_cache_add and lru_cache_add_lru") it is no longer needed to
      determine the lru list via page_lru_base_type().
      
      This patch replaces it with a simple flag is_unevictable which says that
      the page was put on the unevictable list.  This is the only information
      that matters in subsequent tests.
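      
      A simplified sketch of the resulting logic (an illustration of the idea,
      not the exact diff):
      
      	bool is_unevictable;
      
      	ClearPageUnevictable(page);
      	if (page_evictable(page)) {
      		/* goes to an evictable list via the per-cpu pagevec */
      		is_unevictable = false;
      		lru_cache_add(page);
      	} else {
      		/* placed directly on the unevictable list */
      		is_unevictable = true;
      		add_page_to_unevictable_list(page);
      	}
      
      	/*
      	 * The later re-check only needs to know whether the page went to
      	 * the unevictable list; the flag answers that without a call to
      	 * page_lru_base_type().
      	 */
      	if (is_unevictable && page_evictable(page)) {
      		/* page became evictable again: retry the putback */
      	}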
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarJörn Engel <joern@logfs.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ec3b74c