1. 14 Jan, 2011 40 commits
    • block cfq: make queue preempt work for queues from different workload · f8ae6e3e
      Shaohua Li authored
      I got this:
                   fio-874   [007]  2157.724514:   8,32   m   N cfq874 preempt
                   fio-874   [007]  2157.724519:   8,32   m   N cfq830 slice expired t=1
                   fio-874   [007]  2157.724520:   8,32   m   N cfq830 sl_used=1 disp=0 charge=1 iops=0 sect=0
                   fio-874   [007]  2157.724521:   8,32   m   N cfq830 set_active wl_prio:0 wl_type:0
                   fio-874   [007]  2157.724522:   8,32   m   N cfq830 Not idling. st->count:1
      
      cfq830 is an async queue that was preempted by the sync queue cfq874. But because
      of the cfqg->saved_workload_slice mechanism, the preempt is a nop.
      It looks like our preempt is currently broken whenever the two queues are not
      from the same workload type.
      The patch below fixes it. This might cause async queue starvation, but that is
      what the old code did before cgroup support was added.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      f8ae6e3e
    • Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 · 52cfd503
      Linus Torvalds authored
      * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
        ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
        ACPI: fix resource check message
        ACPI / Battery: Update information on info notification and resume
        ACPI: Drop device flag wake_capable
        ACPI: Always check if _PRW is present before trying to evaluate it
        ACPI / PM: Check status of power resources under mutexes
        ACPI / PM: Rename acpi_power_off_device()
        ACPI / PM: Drop acpi_power_nocheck
        ACPI / PM: Drop acpi_bus_get_power()
        Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
        ACPI / Fan: Rework the handling of power resources
        ACPI / PM: Register power resource devices as soon as they are needed
        ACPI / PM: Register acpi_power_driver early
        ACPI / PM: Add function for updating device power state consistently
        ACPI / PM: Add function for device power state initialization
        ACPI / PM: Introduce __acpi_bus_get_power()
        ACPI / PM: Introduce function for refcounting device power resources
        ACPI / PM: Add functions for manipulating lists of power resources
        ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
        ACPICA: Update version to 20101209
        ...
      52cfd503
    • Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 · dc8e7e3e
      Linus Torvalds authored
      * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
        cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer
        intel_idle: open broadcast clock event
        cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific
        cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle
        cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions
        SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW
        cpuidle: delete NOP CPUIDLE_FLAG_POLL
        ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs
        cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL
        ACPI, intel_idle: Cleanup idle= internal variables
        cpuidle: Make cpuidle_enable_device() call poll_idle_init()
        intel_idle: update Sandy Bridge core C-state residency targets
      dc8e7e3e
    • Merge branch 'vfs-scale-working' of... · db9effe9
      Linus Torvalds authored
      Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
      
      * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
        fs: fix do_last error case when need_reval_dot
        nfs: add missing rcu-walk check
        fs: hlist UP debug fixup
        fs: fix dropping of rcu-walk from force_reval_path
        fs: force_reval_path drop rcu-walk before d_invalidate
        fs: small rcu-walk documentation fixes
      
      Fixed up trivial conflicts in Documentation/filesystems/porting
      db9effe9
    • fs: fix do_last error case when need_reval_dot · f20877d9
      J. R. Okajima authored
      When open(2) without O_DIRECTORY opens an existing dir, it should return
      EISDIR. In do_last(), the variable 'error' is initialized to EISDIR, but it
      is then overwritten by d_revalidate(), which returns a positive value to
      mean 'the target dir is valid.'

      We should keep and return the initialized 'error' in this case.
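
      The pattern described above can be sketched in userspace C.  This is only a
      model, not the kernel's do_last(): fake_revalidate(), open_dir_buggy() and
      open_dir_fixed() are hypothetical names, and the negative errno convention
      is assumed.

```c
#include <errno.h>

/* Hypothetical stand-in for d_revalidate(): > 0 means "dentry is valid". */
static int fake_revalidate(void)
{
    return 1;
}

/* Buggy pattern: the revalidate result overwrites the preset error. */
int open_dir_buggy(void)
{
    int error = -EISDIR;

    error = fake_revalidate();   /* clobbers -EISDIR with 1 */
    return error;
}

/* Fixed pattern: keep the revalidate result in a separate variable. */
int open_dir_fixed(void)
{
    int error = -EISDIR;
    int status = fake_revalidate();

    if (status < 0)
        return status;           /* a real revalidation failure */
    return error;                /* still -EISDIR */
}
```

      With the separate status variable, a "valid" answer from revalidation no
      longer hides the EISDIR result.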
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      f20877d9
    • nfs: add missing rcu-walk check · 657e94b6
      Nick Piggin authored
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      657e94b6
    • Merge branch 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen · 9c4bc1c2
      Linus Torvalds authored
      * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
        xen/p2m: Fix module linking error.
        xen p2m: clear the old pte when adding a page to m2p_override
        xen gntdev: use gnttab_map_refs and gnttab_unmap_refs
        xen: introduce gnttab_map_refs and gnttab_unmap_refs
        xen p2m: transparently change the p2m mappings in the m2p override
        xen/gntdev: Fix circular locking dependency
        xen/gntdev: stop using "token" argument
        xen: gntdev: move use of GNTMAP_contains_pte next to the map_op
        xen: add m2p override mechanism
        xen: move p2m handling to separate file
        xen/gntdev: add VM_PFNMAP to vma
        xen/gntdev: allow usermode to map granted pages
        xen: define gnttab_set_map_op/unmap_op
      
      Fix up trivial conflict in drivers/xen/Kconfig
      9c4bc1c2
    • Merge branch 'stable/platform-pci-fixes' of... · 2c0076d8
      Linus Torvalds authored
      Merge branch 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
      
      * 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
        xen-platform: Fix compile errors if CONFIG_PCI is not enabled.
        xen: rename platform-pci module to xen-platform-pci.
        xen-platform: use PCI interfaces to request IO and MEM resources.
      2c0076d8
    • fs: hlist UP debug fixup · 2c675598
      Nick Piggin authored
      Po-Yu Chuang <ratbert.chuang@gmail.com> noticed that hlist_bl_set_first could
      crash on a UP system when LIST_BL_LOCKMASK is 0, because
      
      	LIST_BL_BUG_ON(!((unsigned long)h->first & LIST_BL_LOCKMASK));
      
      always evaluates to true.
      
      Fix the expression, and also avoid a dependency between bit spinlock
      implementation and list bl code (list code shouldn't know anything
      except that bit 0 is set when adding and removing elements). Eventually
      if a good use case comes up, we might use this list to store 1 or more
      arbitrary bits of data, so it really shouldn't be tied to locking either,
      but for now they are helpful for debugging.
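
      To see why the check misfires, the condition can be evaluated in isolation.
      The helper below is a hypothetical userspace model of the quoted expression;
      a lock mask of 1 stands for "bit 0", as the message describes.

```c
/*
 * Returns nonzero when the original debug condition would trigger,
 * i.e. when !(first & lockmask) is true.
 */
int original_check_fires(unsigned long first, unsigned long lockmask)
{
    return !(first & lockmask);
}
```

      With a mask of 0 the AND is always 0, so the negation is always true, no
      matter what the pointer value holds.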
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      2c675598
    • fs: fix dropping of rcu-walk from force_reval_path · 90dbb77b
      Nick Piggin authored
      As J. R. Okajima noted, force_reval_path passes in the same dentry to
      d_revalidate as the one in the nameidata structure (other callers pass in a
      child), so the locking breaks. This can oops with a chrooted nfs mount, for
      example. Similarly there can be other problems with revalidating a dentry
      which is already in nameidata of the path walk.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      90dbb77b
    • fs: force_reval_path drop rcu-walk before d_invalidate · bb20c18d
      Nick Piggin authored
      d_revalidate can return in rcu-walk mode even when it returns 0.  We can't just
      call any old dcache function on rcu-walk dentry (the dentry is unstable, so
      even though d_lock can safely be taken, the result may no longer be what we
      expect -- careful re-checks would be required). So just drop rcu in this case.
      
      (I missed this conversion when switching to the rcu-walk convention that Linus
      suggested)
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      bb20c18d
    • fs: small rcu-walk documentation fixes · a82416da
      Nick Piggin authored
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      a82416da
    • memcg: fix memory migration of shmem swapcache · 50de1dd9
      Daisuke Nishimura authored
      In the current implementation mem_cgroup_end_migration() decides whether
      the page migration has succeeded or not by checking "oldpage->mapping".
      
      But if we are trying to migrate a shmem swapcache page, its page->mapping
      is NULL from the beginning, so the check is invalid.  As a result,
      mem_cgroup_end_migration() assumes the migration has succeeded even when
      it has not, so "newpage" would be freed while it is still charged.
      
      This patch fixes it by passing mem_cgroup_end_migration() the result of
      the page migration.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50de1dd9
    • memcg: use [kv]zalloc[_node] rather than [kv]malloc+memset · 17295c88
      Jesper Juhl authored
      In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
      followed by memset() to zero the memory.  This can be more efficiently
      achieved by using kzalloc() and vzalloc().  There's also one situation
      where we can use kzalloc_node() - this is what's new in this version of
      the patch.
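
      The same idea has a userspace analogue, sketched here with malloc()+memset()
      versus calloc().  This is not the kernel code: struct counters and both
      function names are made up for illustration, and calloc() merely plays the
      role that kzalloc()/vzalloc() play in the kernel.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure standing in for the object being allocated. */
struct counters {
    long hits;
    long misses;
};

/* Two-step pattern: allocate, then clear. */
struct counters *alloc_two_step(void)
{
    struct counters *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    memset(c, 0, sizeof(*c));
    return c;
}

/* One-step pattern: the allocator hands back zeroed memory. */
struct counters *alloc_zeroed(void)
{
    return calloc(1, sizeof(struct counters));
}
```

      Both functions return an object with every field set to zero; the second
      simply leaves the clearing to the allocator.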
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17295c88
    • memcg: fix deadlock between cpuset and memcg · dfe076b0
      Daisuke Nishimura authored
      Commit b1dd693e ("memcg: avoid deadlock between move charge and
      try_charge()") can cause another deadlock about mmap_sem on task migration
      if cpuset and memcg are mounted onto the same mount point.
      
      After the commit, cgroup_attach_task() has sequence like:
      
      cgroup_attach_task()
        ss->can_attach()
          cpuset_can_attach()
          mem_cgroup_can_attach()
            down_read(&mmap_sem)        (1)
        ss->attach()
          cpuset_attach()
            mpol_rebind_mm()
              down_write(&mmap_sem)     (2)
              up_write(&mmap_sem)
            cpuset_migrate_mm()
              do_migrate_pages()
                down_read(&mmap_sem)
                up_read(&mmap_sem)
          mem_cgroup_move_task()
            mem_cgroup_clear_mc()
              up_read(&mmap_sem)
      
      This can deadlock at (2) because we have already acquired the mmap_sem at (1).
      
      But the commit itself is necessary to fix deadlocks which have existed
      before the commit like:
      
      Ex.1)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |  down_write(&mmap_sem)
            mc.moving_task = current          |    ..
            mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
              mem_cgroup_count_precharge()    |    prepare_to_wait()
                down_read(&mmap_sem)          |    if (mc.moving_task)
                -> cannot acquire the lock    |    -> true
                                              |      schedule()
                                              |      -> move charge should wake it up
      
      Ex.2)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |
            mc.moving_task = current          |
            mem_cgroup_precharge_mc()         |
              mem_cgroup_count_precharge()    |
                down_read(&mmap_sem)          |
                ..                            |
                up_read(&mmap_sem)            |
                                              |  down_write(&mmap_sem)
          mem_cgroup_move_task()              |    ..
            mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
              down_read(&mmap_sem)            |    prepare_to_wait()
              -> cannot acquire the lock      |    if (mc.moving_task)
                                              |    -> true
                                              |      schedule()
                                              |      -> move charge should wake it up
      
      This patch fixes all of these problems by:
      1. revert the commit.
      2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge()
         has released the mmap_sem.
      3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in
         mem_cgroup_move_charge() and, if it has failed to acquire the lock, cancel
         all extra charges, wake up all waiters, and retry trylock.
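
      Step 3 can be pictured with a small single-threaded model.  Everything in it
      is hypothetical (struct move_ctx, model_read_trylock(), move_charge_model());
      the real code uses down_read_trylock() on mmap_sem, while here the lock is
      simulated by a counter of attempts that will fail.

```c
#include <stdbool.h>

/* Hypothetical model state; none of these names are kernel symbols. */
struct move_ctx {
    int busy_attempts;   /* how many trylock attempts will fail */
    int precharge;       /* extra charges reserved for the move */
    int wakeups;         /* how many times waiters were woken */
};

/* Stand-in for down_read_trylock(): fails while busy_attempts > 0. */
static bool model_read_trylock(struct move_ctx *ctx)
{
    if (ctx->busy_attempts > 0) {
        ctx->busy_attempts--;
        return false;
    }
    return true;
}

/* Returns the number of retries needed before the lock was taken. */
int move_charge_model(struct move_ctx *ctx)
{
    int retries = 0;

    while (!model_read_trylock(ctx)) {
        ctx->precharge = 0;   /* cancel all extra charges */
        ctx->wakeups++;       /* wake up all waiters so they can progress */
        retries++;
    }
    /* Lock held here: the real code would move charges, then unlock. */
    return retries;
}
```

      The point of the loop is that the mover never sleeps on the lock while
      others are waiting on it; each failed attempt releases what the waiters
      need before trying again.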
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reported-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfe076b0
    • memcg: fix unit mismatch in memcg oom limit calculation · f3e8eb70
      Johannes Weiner authored
      Adding the number of swap pages to the byte limit of a memory control
      group makes no sense.  Convert the pages to bytes before adding them.
      
      The only user of this code is the OOM killer, and the way it is used means
      that the error results in a higher OOM badness value.  Since the cgroup
      limit is the same for all tasks in the cgroup, the error should have no
      practical impact at the moment.
      
      But let's not wait for future or changing users to trip over it.
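
      The unit mismatch is easy to reproduce in a few lines.  The sketch below is
      illustrative only: the function names are invented, and MODEL_PAGE_SHIFT
      assumes 4 KiB pages.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Buggy: adds a page count directly to a byte count. */
uint64_t limit_buggy(uint64_t limit_bytes, uint64_t swap_pages)
{
    return limit_bytes + swap_pages;
}

/* Fixed: convert pages to bytes before adding. */
uint64_t limit_fixed(uint64_t limit_bytes, uint64_t swap_pages)
{
    return limit_bytes + (swap_pages << MODEL_PAGE_SHIFT);
}
```

      For a 1 MiB limit and 256 swap pages, the buggy sum barely moves while the
      fixed one doubles, since 256 pages of 4 KiB are themselves 1 MiB.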
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3e8eb70
    • memcg: add lock to synchronize page accounting and migration · dbd4ea78
      KAMEZAWA Hiroyuki authored
      Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
      accounting and migration code.  This reworks the locking scheme of
      _update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK,
      which is always taken under IRQ disable.
      
      1. If pages are being migrated from a memcg, then updates to that
         memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
         move_lock_page_cgroup().  In an upcoming commit, memcg dirty page
         accounting will be updating memcg page accounting (specifically: num
         writeback pages) from IRQ context (softirq).  Avoid a deadlocking
         nested spin lock attempt by disabling irq on the local processor when
         grabbing the PCG_MOVE_LOCK.
      
      2. lock for update_page_stat is used only for avoiding race with
         move_account().  So, IRQ awareness of lock_page_cgroup() itself is not
         a problem.  The problem is between mem_cgroup_update_page_stat() and
         mem_cgroup_move_account_page().
      
      Trade-off:
        * Changing lock_page_cgroup() to always disable IRQ (or
          local_bh) has some impacts on performance and I think
          it's bad to disable IRQ when it's not necessary.
        * Adding a new lock makes move_account() slower.  The measured
          cost is shown below.
      
      Performance Impact: moving a 8G anon process.
      
      Before:
      	real    0m0.792s
      	user    0m0.000s
      	sys     0m0.780s
      
      After:
      	real    0m0.854s
      	user    0m0.000s
      	sys     0m0.842s
      
      This score is bad but planned patches for optimization can reduce
      this impact.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Andrea Righi <arighi@develer.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbd4ea78
    • memcg: create extensible page stat update routines · 2a7106f2
      Greg Thelen authored
      Replace usage of the mem_cgroup_update_file_mapped() memcg
      statistic update routine with two new routines:
      * mem_cgroup_inc_page_stat()
      * mem_cgroup_dec_page_stat()
      
      As before, only the file_mapped statistic is managed.  However, these more
      general interfaces allow for new statistics to be more easily added.  New
      statistics are added with memcg dirty page accounting.
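
      One way to picture such an extensible interface is an enum-indexed counter
      array with thin inc/dec wrappers.  The sketch below is a userspace model with
      invented names (model_memcg, model_inc_page_stat(), and so on); it does not
      reproduce the kernel's signatures or locking.

```c
/* Hypothetical statistic indices; new entries go before MODEL_STAT_COUNT. */
enum model_page_stat {
    MODEL_STAT_FILE_MAPPED,
    MODEL_STAT_COUNT
};

/* Hypothetical per-group counters. */
struct model_memcg {
    long stat[MODEL_STAT_COUNT];
};

/* Common update path shared by the inc and dec wrappers. */
static void model_update_page_stat(struct model_memcg *m,
                                   enum model_page_stat idx, int val)
{
    m->stat[idx] += val;
}

void model_inc_page_stat(struct model_memcg *m, enum model_page_stat idx)
{
    model_update_page_stat(m, idx, 1);
}

void model_dec_page_stat(struct model_memcg *m, enum model_page_stat idx)
{
    model_update_page_stat(m, idx, -1);
}
```

      Adding a statistic in this model means adding one enum value; callers keep
      using the same two wrappers.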
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrea Righi <arighi@develer.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a7106f2
    • memcg: document cgroup dirty memory interfaces · ece72400
      Greg Thelen authored
      Document cgroup dirty memory interfaces and statistics.
      
      [akpm@linux-foundation.org: fix use_hierarchy description]
      Signed-off-by: Andrea Righi <arighi@develer.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ece72400
    • memcg: add page_cgroup flags for dirty page tracking · db16d5ec
      Greg Thelen authored
      This patchset provides the ability for each cgroup to have independent
      dirty page limits.
      
      Limiting dirty memory is like fixing the max amount of dirty (hard to
      reclaim) page cache used by a cgroup.  So, in case of multiple cgroup
      writers, they will not be able to consume more than their designated share
      of dirty pages and will be forced to perform write-out if they cross that
      limit.
      
      The patches are based on a series proposed by Andrea Righi in Mar 2010.
      
      Overview:
      
      - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
        unstable.
      
      - Extend mem_cgroup to record the total number of pages in each of the
        interesting dirty states (dirty, writeback, unstable_nfs).
      
      - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
        limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
        via cgroupfs control files.
      
      - Consider both system and per-memcg dirty limits in page writeback when
        deciding to queue background writeback or block for foreground writeback.
      
      Known shortcomings:
      
      - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
        writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
        just inodes contributing dirty pages to the cgroup exceeding its limit.
      
      - When memory.use_hierarchy is set, then dirty limits are disabled.  This is an
        implementation detail.  An enhanced implementation is needed to check the
        chain of parents to ensure that no dirty limit is exceeded.
      
      Performance data:
      - A page fault microbenchmark workload was used to measure performance, which
        can be called in read or write mode:
              f = open(foo. $cpu)
              truncate(f, 4096)
              alarm(60)
              while (1) {
                      p = mmap(f, 4096)
                      if (write)
                              *p = 1
                      else
                              x = *p
                      munmap(p)
              }
      
      - The workload was called for several points in the patch series in different
        modes:
        - s_read is a single threaded reader
        - s_write is a single threaded writer
        - p_read is a 16 thread reader, each operating on a different file
        - p_write is a 16 thread writer, each operating on a different file
      
      - Measurements were collected on a 16 core non-numa system using "perf stat
        --repeat 3".  The -a option was used for parallel (p_*) runs.
      
      - All numbers are page fault rate (M/sec).  Higher is better.
      
      - To measure the cost of these patches on a kernel without memcg, compare the
        first and last rows; neither has memcg configured.  The first row does not
        include any of these memcg patches.

      - To compare the performance of using memcg dirty limits, compare the baseline
        (2nd row titled "w/ memcg") with the code and memcg enabled (2nd to last
        row titled "all patches").
      
                                 root_cgroup                    child_cgroup
                       s_read s_write p_read p_write   s_read s_write p_read p_write
      mmotm w/o memcg   0.428  0.390   0.429  0.388
      mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
      all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
      all patches       0.431  0.402   0.427  0.395
        w/o memcg
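
      The page fault microbenchmark outlined in the performance data could look
      roughly like this in C.  It is a sketch under assumptions, not the program
      that produced the numbers above: fault_loop() is a hypothetical name, the
      loop runs a fixed number of iterations instead of using alarm(60), and
      MAP_SHARED is a guess since the pseudocode does not name the mapping flags.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map, touch, and unmap one page 'iterations' times.
 * Returns the number of completed iterations, or -1 on error.
 */
long fault_loop(const char *path, int do_write, long iterations)
{
    volatile char sink = 0;
    long done = 0;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) != 0) {
        close(fd);
        return -1;
    }
    while (done < iterations) {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        if (p == MAP_FAILED)
            break;
        if (do_write)
            *p = 1;          /* first touch takes a write fault */
        else
            sink = *p;       /* first touch takes a read fault */
        munmap(p, 4096);
        done++;
    }
    (void)sink;
    close(fd);
    return done == iterations ? done : -1;
}
```

      Because every iteration maps the page afresh, each pass through the loop
      takes at least one page fault, which is the event rate the table reports.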
      
      This patch:
      
      Add additional flags to page_cgroup to track dirty pages within a
      mem_cgroup.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrea Righi <arighi@develer.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db16d5ec
    • mm: batch activate_page() to reduce lock contention · 744ed144
      Shaohua Li authored
      The zone->lru_lock is heavily contended in workloads where activate_page()
      is called frequently.  We can batch activate_page() calls to reduce the lock
      contention.  The batched pages are added to the zone list when the pool
      is full or when page reclaim tries to drain them.

      For example, on a 4 socket, 64 CPU system, create a sparse file and 64
      processes that each map the file shared.  Each process reads the
      whole file and then exits.  The process exit calls unmap_vmas(), which causes
      a lot of activate_page() calls.  In this workload, we saw about a 58%
      reduction in total time with the patch below.  Other workloads that call
      activate_page() heavily also benefit.
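
      The batching idea can be modelled without any kernel machinery.  In the
      sketch below all names are hypothetical and MODEL_BATCH is an arbitrary
      capacity chosen for illustration; the lock is represented only by a counter
      of how often it would have been taken.

```c
#include <stddef.h>

#define MODEL_BATCH 14   /* assumption: batch capacity picked for the example */

/* Hypothetical LRU state: counts lock round trips and activated pages. */
struct model_lru {
    int lock_acquisitions;
    int active_pages;
};

/* Hypothetical per-CPU pool of pages waiting to be activated. */
struct model_batch {
    int pending[MODEL_BATCH];
    size_t nr;
};

/* Move every pending page to the list under a single lock acquisition. */
void model_drain(struct model_lru *lru, struct model_batch *b)
{
    if (b->nr == 0)
        return;
    lru->lock_acquisitions++;
    lru->active_pages += (int)b->nr;
    b->nr = 0;
}

/* Queue a page; drain only when the pool is full. */
void model_activate_page(struct model_lru *lru, struct model_batch *b,
                         int page_id)
{
    b->pending[b->nr++] = page_id;
    if (b->nr == MODEL_BATCH)
        model_drain(lru, b);
}
```

      In this model, 28 activations cost two lock acquisitions instead of 28;
      an explicit model_drain() call plays the role of page reclaim flushing a
      partly filled pool.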
      
      I tested some microbenchmarks:
      case-anon-cow-rand-mt		0.58%
      case-anon-cow-rand		-3.30%
      case-anon-cow-seq-mt		-0.51%
      case-anon-cow-seq		-5.68%
      case-anon-r-rand-mt		0.23%
      case-anon-r-rand		0.81%
      case-anon-r-seq-mt		-0.71%
      case-anon-r-seq			-1.99%
      case-anon-rx-rand-mt		2.11%
      case-anon-rx-seq-mt		3.46%
      case-anon-w-rand-mt		-0.03%
      case-anon-w-rand		-0.50%
      case-anon-w-seq-mt		-1.08%
      case-anon-w-seq			-0.12%
      case-anon-wx-rand-mt		-5.02%
      case-anon-wx-seq-mt		-1.43%
      case-fork			1.65%
      case-fork-sleep			-0.07%
      case-fork-withmem		1.39%
      case-hugetlb			-0.59%
      case-lru-file-mmap-read-mt	-0.54%
      case-lru-file-mmap-read		0.61%
      case-lru-file-mmap-read-rand	-2.24%
      case-lru-file-readonce		-0.64%
      case-lru-file-readtwice		-11.69%
      case-lru-memcg			-1.35%
      case-mmap-pread-rand-mt		1.88%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq-mt		0.89%
      case-mmap-pread-seq		-69.72%
      case-mmap-xread-rand-mt		0.71%
      case-mmap-xread-seq-mt		0.38%
      
      The most significant are:
      case-lru-file-readtwice		-11.69%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq		-69.72%
      
      These cases use activate_page() heavily.  The other differences are basically
      run-to-run variation.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      744ed144
    • mm: simplify code of swap.c · d8505dee
      Shaohua Li authored
      Clean up code and remove duplicate code.  Next patch will use
      pagevec_lru_move_fn introduced here too.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8505dee
    • mm/page_alloc.c: don't cache `current' in a local · c06b1fca
      Andrew Morton authored
      It's old-fashioned and unneeded.
      
      akpm:/usr/src/25> size mm/page_alloc.o
         text    data     bss     dec     hex filename
        39884 1241317   18808 1300009  13d629 mm/page_alloc.o (before)
        39838 1241317   18808 1299963  13d5fb mm/page_alloc.o (after)
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c06b1fca
    • mm: fix hugepage migration · fd4a4663
      Hugh Dickins authored
      2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
      but its anon_vma handling was still based around the 2.6.35 conventions.
      Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
      drop_anon_vma in the same way as we're now changing unmap_and_move().
      
      I don't particularly like to propose this for stable when I've not seen
      its problems in practice nor tested the solution: but it's clearly out of
      synch at present.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: <stable@kernel.org> [2.6.37, 2.6.36]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd4a4663
    • mm: fix migration hangs on anon_vma lock · 1ce82b69
      Hugh Dickins authored
      Increased usage of page migration in mmotm reveals that the anon_vma
      locking in unmap_and_move() has been deficient since 2.6.36 (or even
      earlier).  Review at the time of f1819427
      ("mm: fix hang on anon_vma->root->lock") missed the issue here: the
      anon_vma to which we get a reference may already have been freed back to
      its slab (it is in use when we check page_mapped, but that can change),
      and so its anon_vma->root may be switched at any moment by reuse in
      anon_vma_prepare.
      
      Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
      not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
      then we don't need any rcu read locking over here.
      
      In removing the rcu_unlock label: since PageAnon is a bit in
      page->mapping, it's impossible for a !page->mapping page to be anon; but
      insert VM_BUG_ON in case the implementation ever changes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: <stable@kernel.org> [2.6.37, 2.6.36]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ce82b69
    • Hugh Dickins's avatar
      ksm: drain pagevecs to lru · 2919bfd0
      Hugh Dickins authored
      It was hard to explain the page counts which were causing new LTP tests
      of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: CAI Qian <caiqian@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2919bfd0
    • Eric B Munson's avatar
      hugetlb: fix handling of parse errors in sysfs · 73ae31e5
      Eric B Munson authored
      When parsing changes to the huge page pool sizes made from userspace via
      the sysfs interface, bogus input values are being covered up by
      nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
      when strict_strtoul returns an error.  This can cause an infinite loop in
      the nr_hugepages_store code.  This patch changes the return value for
      these functions to -EINVAL when strict_strtoul returns an error.
      Signed-off-by: Eric B Munson <emunson@mgebm.net>
      Reported-by: CAI Qian <caiqian@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73ae31e5
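The error-handling change described in 73ae31e5 above can be illustrated with a small userspace sketch. This is not the kernel code: `parse_pool_size` is a hypothetical name, and the standard `strtoul` stands in for the kernel's `strict_strtoul`. The point it shows is that a failed parse is reported as `-EINVAL` instead of being hidden behind a return value of 0.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the parse step of a sysfs store
 * handler. Returns 0 and fills *out on success, -EINVAL on bad input.
 * The caller's value is left untouched when parsing fails.
 */
int parse_pool_size(const char *buf, unsigned long *out)
{
    char *end;
    unsigned long val;

    /* Reject empty input, signs, and leading whitespace up front. */
    if (buf == NULL || !isdigit((unsigned char)buf[0]))
        return -EINVAL;

    errno = 0;
    val = strtoul(buf, &end, 10);
    if (errno != 0 || end == buf)
        return -EINVAL;

    /* Allow the single trailing newline that "echo" appends. */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    *out = val;
    return 0;
}
```

A caller would propagate the negative value directly, so that a write of bogus text fails immediately instead of being reported as zero bytes consumed and retried.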
    • Eric B Munson's avatar
      hugetlb: do not allow pagesize >= MAX_ORDER pool adjustment · adbe8726
      Eric B Munson authored
      Huge pages with order >= MAX_ORDER must be allocated at boot via the
      kernel command line; they cannot be allocated or freed once the kernel is
      up and running.  Currently we allow values to be written to the sysfs and
      sysctl files controlling pool size for these huge page sizes.  This patch
      makes the store functions for nr_hugepages and nr_overcommit_hugepages
      return -EINVAL when the pool for a page size >= MAX_ORDER is changed.
      
      [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
      [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
      Signed-off-by: Eric B Munson <emunson@mgebm.net>
      Reported-by: CAI Qian <caiqian@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      adbe8726
    • Michal Hocko's avatar
      hugetlb: check the return value of string conversion in sysctl handler · 08d4a246
      Michal Hocko authored
      proc_doulongvec_minmax may fail if the given buffer doesn't represent a
      valid number.  If we provide something invalid we will initialize the
      resulting value (nr_overcommit_huge_pages in this case) to a random value
      from the stack.
      
      The issue was introduced by a3d0c6aa when the default handler was
      replaced by a helper function in which we do not check the return value.
      
      Reproducer:
      echo "" > /proc/sys/vm/nr_overcommit_hugepages
      
      [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: CAI Qian <caiqian@redhat.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08d4a246
    • Stefan Hajnoczi's avatar
      fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc · cb9ef8d5
      Stefan Hajnoczi authored
      The sync_inodes_sb() function does not have a return value.  Remove the
      outdated documentation comment.
      Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb9ef8d5
    • Andrew Morton's avatar
      mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc() · 684265d4
      Andrew Morton authored
      As it stands this code will degenerate into a busy-wait if the calling task
      has signal_pending().
      
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      684265d4
    • Rolf Eike Beer's avatar
      mm/dmapool.c: take lock only once in dma_pool_free() · 84bc227d
      Rolf Eike Beer authored
      dma_pool_free() scans for the page to free in the pool list while holding
      the pool lock.  It then releases the lock, only to reacquire it
      immediately.  Modify the code to take the lock only once.
      
      This will do some additional loops and computations with the lock held
      if memory debugging is activated.  If it is not activated, the only new
      operations under this lock are one if and one subtraction.
      Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84bc227d
    • KyongHo Cho's avatar
      mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists · 43506fad
      KyongHo Cho authored
      The previous approach to calculating the combined index was
      
      	page_idx & ~(1 << order)
      
      but we get the same result with
      
      	page_idx & buddy_idx
      
      This slightly reduces the instruction count and improves readability.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix used-uninitialised warning]
      Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43506fad
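The equivalence claimed in 43506fad above can be checked with a short sketch. It assumes that the buddy index is obtained by flipping bit `order` of the page index (`page_idx ^ (1 << order)`), which the commit message does not spell out. All function names here are hypothetical, chosen for illustration only.

```c
/* Old form: clear bit `order` of the page index. */
unsigned long combined_old(unsigned long page_idx, unsigned int order)
{
    return page_idx & ~(1UL << order);
}

/* Assumed buddy computation: flip bit `order` of the page index. */
unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* New form: AND the page index with its buddy's index. */
unsigned long combined_new(unsigned long page_idx, unsigned int order)
{
    return page_idx & buddy_index(page_idx, order);
}

/* Returns 1 if both forms agree for every index and order in range. */
int forms_agree(unsigned long max_idx, unsigned int max_order)
{
    unsigned long idx;
    unsigned int order;

    for (order = 0; order <= max_order; order++)
        for (idx = 0; idx <= max_idx; idx++)
            if (combined_old(idx, order) != combined_new(idx, order))
                return 0;
    return 1;
}
```

The two forms agree because the page and its buddy differ only in bit `order`: that bit is set in exactly one of them, so the AND clears it, while every other bit is identical in both operands and passes through unchanged.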
    • Jiri Kosina's avatar
      brk: fix min_brk lower bound computation for COMPAT_BRK · 5520e894
      Jiri Kosina authored
      Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
      be overridden by the randomize_va_space sysctl.
      
      If this is the case, the min_brk computation in sys_brk() implementation
      is wrong, as it solely takes into account COMPAT_BRK setting, assuming
      that brk start is not randomized.  But that might not be the case if
      randomize_va_space sysctl has been set to '2' at the time the binary has
      been loaded from disk.
      
      In such a case, the check has to be done in the same way as in the
      !CONFIG_COMPAT_BRK case.
      
      In addition to that, the check for the COMPAT_BRK case introduced back in
      a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
      bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
      but mm->end_data instead, as that's where legacy applications expect the
      brk section to start (i.e. immediately after the last global variable).
      
      [akpm@linux-foundation.org: fix comment]
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5520e894
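The lower-bound selection described in 5520e894 above can be modelled as a small decision function. This is a simplified sketch and not the kernel's sys_brk() code: `struct mm_model`, `min_brk_for`, and the two boolean parameters are hypothetical stand-ins for the mm fields, the CONFIG_COMPAT_BRK setting, and whether the brk start was randomized when the binary was loaded.

```c
#include <stdbool.h>

/* Hypothetical model of the address-space fields the message mentions. */
struct mm_model {
    unsigned long end_code;   /* end of the text segment */
    unsigned long end_data;   /* end of initialized data */
    unsigned long start_brk;  /* where the heap actually starts */
};

/*
 * Pick the lowest address brk may be set to.
 * Only an unrandomized compat layout uses end_data; every other
 * case falls back to start_brk, as in the !CONFIG_COMPAT_BRK path.
 */
unsigned long min_brk_for(const struct mm_model *mm,
                          bool compat_brk, bool brk_randomized)
{
    if (compat_brk && !brk_randomized)
        return mm->end_data;
    return mm->start_brk;
}
```

Note that `end_code` is never returned: per the message, using it as the bound was the earlier mistake that this change corrects.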
    • Jesper Juhl's avatar
      mm/hugetlb.c: fix error-path memory leak in nr_hugepages_store_common() · 32d6fead
      Jesper Juhl authored
      The NODEMASK_ALLOC macro may dynamically allocate memory for its second
      argument ('nodes_allowed' in this context).
      
      In nr_hugepages_store_common() we may abort early if strict_strtoul()
      fails, but in that case we do not free the memory already allocated to
      'nodes_allowed', causing a memory leak.
      
      This patch closes the leak by freeing the memory in the error path.
      
      [akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim]
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32d6fead
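The error-path leak fixed in 32d6fead above follows a common pattern, sketched here in userspace. Nothing in this block is kernel code: `malloc`/`free` stand in for NODEMASK_ALLOC/NODEMASK_FREE, `strtoul` stands in for `strict_strtoul`, and `store_common`, `tracked_alloc`, `tracked_free`, and `live_allocation_count` are hypothetical names. The counter exists only so the absence of a leak can be observed.

```c
#include <errno.h>
#include <stdlib.h>

static int live_allocs; /* hypothetical counter of outstanding allocations */

static void *tracked_alloc(size_t n)
{
    void *p = malloc(n);

    if (p != NULL)
        live_allocs++;
    return p;
}

static void tracked_free(void *p)
{
    if (p != NULL) {
        live_allocs--;
        free(p);
    }
}

int live_allocation_count(void)
{
    return live_allocs;
}

/* Parse buf into *pool_size; every exit path releases the scratch mask. */
int store_common(const char *buf, unsigned long *pool_size)
{
    unsigned long *nodes_allowed;
    unsigned long count;
    char *end;
    int err = 0;

    nodes_allowed = tracked_alloc(sizeof(*nodes_allowed));
    if (nodes_allowed == NULL)
        return -ENOMEM;

    errno = 0;
    count = strtoul(buf, &end, 10);
    if (errno != 0 || end == buf) {
        err = -EINVAL;
        goto out;           /* early abort still reaches the free below */
    }

    *nodes_allowed = ~0UL;  /* stand-in for actually using the mask */
    *pool_size = count;
out:
    tracked_free(nodes_allowed);
    return err;
}
```

Routing the failure through a single `out` label keeps the release in one place, so a later early return cannot silently skip it.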
    • Mel Gorman's avatar
      mm: migration: use rcu_dereference_protected when dereferencing the radix tree... · 29c1f677
      Mel Gorman authored
      mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration
      
      migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
      anonymous pages, as introduced by git commit
      989f89c5 ("fix rcu_read_lock() in page
      migraton").  The point of the RCU protection there is part of getting a
      stable reference to anon_vma and is only held for anon pages as file pages
      are locked which is sufficient protection against freeing.
      
      However, while a file page's mapping is being migrated, the radix tree is
      double checked to ensure it is the expected page.  This uses
      radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held,
      which triggers the following warning.
      
      [  173.674290] ===================================================
      [  173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
      [  173.676016] ---------------------------------------------------
      [  173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
      [  173.676016]
      [  173.676016] other info that might help us debug this:
      [  173.676016]
      [  173.676016]
      [  173.676016] rcu_scheduler_active = 1, debug_locks = 0
      [  173.676016] 1 lock held by hugeadm/2899:
      [  173.676016]  #0:  (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
      [  173.676016]
      [  173.676016] stack backtrace:
      [  173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
      [  173.676016] Call Trace:
      [  173.676016]  [<c128cc01>] ? printk+0x14/0x1b
      [  173.676016]  [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
      [  173.676016]  [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
      [  173.676016]  [<c10e41ad>] migrate_page+0x23/0x39
      [  173.676016]  [<c10e491b>] buffer_migrate_page+0x22/0x107
      [  173.676016]  [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
      [  173.676016]  [<c10e425d>] move_to_new_page+0x9a/0x1ae
      [  173.676016]  [<c10e47e6>] migrate_pages+0x1e7/0x2fa
      
      This patch introduces radix_tree_deref_slot_protected() which calls
      rcu_dereference_protected().  Users of it must pass in the
      mapping->tree_lock that is protecting this dereference.  Holding the tree
      lock protects against parallel updaters of the radix tree, meaning that
      rcu_dereference_protected is allowable.
      
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>		[2.6.37.early]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29c1f677
    • Andrea Arcangeli's avatar
      thp: add compound_trans_head() helper · 22e5c47e
      Andrea Arcangeli authored
      Clean up some code with a common compound_trans_head helper.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22e5c47e
    • Andrea Arcangeli's avatar
      thp: KSM on THP · 29ad768c
      Andrea Arcangeli authored
      This makes KSM fully operational with THP pages.  Subpages are scanned
      while the hugepage is still in place and delivering max cpu performance,
      and only if there's a match and we're going to deduplicate memory is the
      single hugepage containing the matching subpage split.
      
      There will be no false sharing between ksmd and khugepaged.  khugepaged
      won't collapse 2m virtual regions with KSM pages inside.  ksmd also should
      only split pages when the checksum matches and we're likely to split a
      hugepage for some long-lived ksm page (usual ksm heuristic to avoid
      sharing pages that get de-cowed).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29ad768c