- 11 Sep, 2013 40 commits
-
-
Vlastimil Babka authored
The goal of this patch series is to improve performance of munlock() of large mlocked memory areas on systems without THP. This is motivated by reported very long times of crash recovery of processes with such areas, where munlock() can take several seconds. See http://lwn.net/Articles/548108/ The work was driven by a simple benchmark (to be included in mmtests) that mmaps() e.g. 56GB with MAP_LOCKED | MAP_POPULATE and measures the time of munlock(). Profiling was performed by attaching operf --pid to the process and sending a signal to trigger the munlock() part and then notify bach the monitoring wrapper to stop operf, so that only munlock() appears in the profile. The profiles have shown that CPU time is spent mostly by atomic operations and repeated locking per single pages. This series aims to reduce both, starting from simpler to more complex changes. Patch 1 performs a simple cleanup in putback_lru_page() so that page lru base type is not determined without being actually needed. Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu pagevec after each munlocked page is put there. Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating multiple non-THP pages under a single lru_lock instead of locking and processing each page separately. Patch 4 changes the NR_MLOCK accounting to be called only once per the pvec introduced by previous patch. Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page when possible, bypassing the per-cpu pvec and associated overhead. Patch 6 removes a redundant get_page/put_page pair which saves costly atomic operations. Patch 7 avoids calling follow_page_mask() on each individual page, and obtains multiple page references under a single page table lock where possible. Measurements were made using 3.11-rc3 as a baseline. The first set of measurements shows the possibly ideal conditions where batching should help the most. All memory is allocated from a single NUMA node and THP is disabled. timedmunlock 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 0 1 2 3 4 5 6 7 Elapsed min 3.38 ( 0.00%) 3.39 ( -0.13%) 3.00 ( 11.33%) 2.70 ( 20.20%) 2.67 ( 21.11%) 2.37 ( 29.88%) 2.20 ( 34.91%) 1.91 ( 43.59%) Elapsed mean 3.39 ( 0.00%) 3.40 ( -0.23%) 3.01 ( 11.33%) 2.70 ( 20.26%) 2.67 ( 21.21%) 2.38 ( 29.88%) 2.21 ( 34.93%) 1.92 ( 43.46%) Elapsed stddev 0.01 ( 0.00%) 0.01 (-43.09%) 0.01 ( 15.42%) 0.01 ( 23.42%) 0.00 ( 89.78%) 0.01 ( -7.15%) 0.00 ( 76.69%) 0.02 (-91.77%) Elapsed max 3.41 ( 0.00%) 3.43 ( -0.52%) 3.03 ( 11.29%) 2.72 ( 20.16%) 2.67 ( 21.63%) 2.40 ( 29.50%) 2.21 ( 35.21%) 1.96 ( 42.39%) Elapsed range 0.03 ( 0.00%) 0.04 (-51.16%) 0.02 ( 6.27%) 0.02 ( 14.67%) 0.00 ( 88.90%) 0.03 (-19.18%) 0.01 ( 73.70%) 0.06 (-113.35% The second set of measurements simulates the worst possible conditions for batching by using numactl --interleave, so that there is in fact only one page per pagevec. Even in this case the series seems to improve performance thanks to reduced atomic operations and removal of lru_add_drain(). timedmunlock 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 0 1 2 3 4 5 6 7 Elapsed min 4.00 ( 0.00%) 4.04 ( -0.93%) 3.87 ( 3.37%) 3.72 ( 6.94%) 3.81 ( 4.72%) 3.69 ( 7.82%) 3.64 ( 8.92%) 3.41 ( 14.81%) Elapsed mean 4.17 ( 0.00%) 4.15 ( 0.51%) 4.03 ( 3.49%) 3.89 ( 6.84%) 3.86 ( 7.48%) 3.89 ( 6.69%) 3.70 ( 11.27%) 3.48 ( 16.59%) Elapsed stddev 0.16 ( 0.00%) 0.08 ( 50.76%) 0.10 ( 41.58%) 0.16 ( 4.59%) 0.05 ( 72.38%) 0.19 (-12.91%) 0.05 ( 68.09%) 0.06 ( 66.03%) Elapsed max 4.34 ( 0.00%) 4.32 ( 0.56%) 4.19 ( 3.62%) 4.12 ( 5.15%) 3.91 ( 9.88%) 4.12 ( 5.25%) 3.80 ( 12.58%) 3.56 ( 18.08%) Elapsed range 0.34 ( 0.00%) 0.28 ( 17.91%) 0.32 ( 6.45%) 0.40 (-15.73%) 0.10 ( 70.06%) 0.43 (-24.84%) 0.15 ( 55.32%) 0.15 ( 56.16%) For completeness, a third set of measurements shows the situation where THP is enabled and allocations are again done on a single NUMA node. Here munlock() is already very fast thanks to huge pages, and this series does not compromise that performance. It seems that the removal of call to lru_add_drain() still helps a bit. timedmunlock 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 0 1 2 3 4 5 6 7 Elapsed min 0.01 ( 0.00%) 0.01 ( -0.11%) 0.01 ( 6.59%) 0.01 ( 5.41%) 0.01 ( 5.45%) 0.01 ( 5.03%) 0.01 ( 6.08%) 0.01 ( 5.20%) Elapsed mean 0.01 ( 0.00%) 0.01 ( -0.27%) 0.01 ( 6.39%) 0.01 ( 5.30%) 0.01 ( 5.32%) 0.01 ( 5.03%) 0.01 ( 5.97%) 0.01 ( 5.22%) Elapsed stddev 0.00 ( 0.00%) 0.00 ( -9.59%) 0.00 ( 10.77%) 0.00 ( 3.24%) 0.00 ( 24.42%) 0.00 ( 31.86%) 0.00 ( -7.46%) 0.00 ( 6.11%) Elapsed max 0.01 ( 0.00%) 0.01 ( -0.01%) 0.01 ( 6.83%) 0.01 ( 5.42%) 0.01 ( 5.79%) 0.01 ( 5.53%) 0.01 ( 6.08%) 0.01 ( 5.26%) Elapsed range 0.00 ( 0.00%) 0.00 ( 7.30%) 0.00 ( 24.38%) 0.00 ( 6.10%) 0.00 ( 30.79%) 0.00 ( 42.52%) 0.00 ( 6.11%) 0.00 ( 10.07%) This patch (of 7): In putback_lru_page() since commit c53954a0 (""mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru") it is no longer needed to determine lru list via page_lru_base_type(). This patch replaces it with simple flag is_unevictable which says that the page was put on the inevictable list. This is the only information that matters in subsequent tests. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Cyrill Gorcunov authored
Pavel reported that in case if vma area get unmapped and then mapped (or expanded) in-place, the soft dirty tracker won't be able to recognize this situation since it works on pte level and ptes are get zapped on unmap, loosing soft dirty bit of course. So to resolve this situation we need to track actions on vma level, there VM_SOFTDIRTY flag comes in. When new vma area created (or old expanded) we set this bit, and keep it here until application calls for clearing soft dirty bit. Thus when user space application track memory changes now it can detect if vma area is renewed. Reported-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
SeungHun Lee authored
cpuset_zone_allowed is changed to cpuset_zone_allowed_softwall and the comment is moved to __cpuset_node_allowed_softwall. So fix this comment. Signed-off-by: SeungHun Lee <waydi1@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
In case when system contains no dirty pages, wakeup_flusher_threads() will submit WB_SYNC_NONE writeback for 0 pages so wb_writeback() exits immediately without doing anything, even though there are dirty inodes in the system. Thus sync(1) will write all the dirty inodes from a WB_SYNC_ALL writeback pass which is slow. Fix the problem by using get_nr_dirty_pages() in wakeup_flusher_threads() instead of calculating number of dirty pages manually. That function also takes number of dirty inodes into account. Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Paul Taysom <taysom@chromium.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Khalid Aziz authored
I am working with a tool that simulates oracle database I/O workload. This tool (orion to be specific - <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>) allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag. It then does aio into these pages from flash disks using various common block sizes used by database. I am looking at performance with two of the most common block sizes - 1M and 64K. aio performance with these two block sizes plunged after Transparent HugePages was introduced in the kernel. Here are performance numbers: pre-THP 2.6.39 3.11-rc5 1M read 8384 MB/s 5629 MB/s 6501 MB/s 64K read 7867 MB/s 4576 MB/s 4251 MB/s I have narrowed the performance impact down to the overheads introduced by THP in __get_page_tail() and put_compound_page() routines. perf top shows >40% of cycles being spent in these two routines. Every time direct I/O to hugetlbfs pages starts, kernel calls get_page() to grab a reference to the pages and calls put_page() when I/O completes to put the reference away. THP introduced significant amount of locking overhead to get_page() and put_page() when dealing with compound pages because hugepages can be split underneath get_page() and put_page(). It added this overhead irrespective of whether it is dealing with hugetlbfs pages or transparent hugepages. This resulted in 20%-45% drop in aio performance when using hugetlbfs pages. Since hugetlbfs pages can not be split, there is no reason to go through all the locking overhead for these pages from what I can see. I added code to __get_page_tail() and put_compound_page() to bypass all the locking code when working with hugetlbfs pages. This improved performance significantly. Performance numbers with this patch: pre-THP 3.11-rc5 3.11-rc5 + Patch 1M read 8384 MB/s 6501 MB/s 8371 MB/s 64K read 7867 MB/s 4251 MB/s 6510 MB/s Performance with 64K read is still lower than what it was before THP, but still a 53% improvement. It does mean there is more work to be done but I will take a 53% improvement for now. Please take a look at the following patch and let me know if it looks reasonable. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
If kswapd was reclaiming for a high order and resets it to 0 due to fragmentation it will still call compact_pgdat. For the most part, this will fail a compaction_suitable() test and not compact but it is unnecessarily sloppy. It could be fixed in the caller but fix it in the API instead. [dhillf@gmail.com: pointed out that it was a potential problem] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <dhillf@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Vagin authored
The memcg_cache_params structure contains the common part and the union, which represents two different types of data: one for root cashes and another for child caches. The size of child data is fixed. The size of the memcg_caches array is calculated in runtime. Currently the size of memcg_cache_params for root caches is calculated incorrectly, because it includes the size of parameters for child caches. ssize_t size = memcg_caches_array_size(num_groups); size *= sizeof(void *); size += sizeof(struct memcg_cache_params); v2: Fix a typo in calculations Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yinghai Lu authored
Current early_pfn_to_nid() on arch that support memblock go over memblock.memory one by one, so will take too many try near the end. We can use existing memblock_search to find the node id for given pfn, that could save some time on bigger system that have many entries memblock.memory array. Here are the timing differences for several machines. In each case with the patch less time was spent in __early_pfn_to_nid(). 3.11-rc5 with patch difference (%) -------- ---------- -------------- UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%) UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%) UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%) UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%) Time in seconds. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: Russ Anderson <rja@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
new_vma_page() is called only by page migration called from do_mbind(), where pages to be migrated are queued into a pagelist by queue_pages_range(). queue_pages_range() confirms that a queued page belongs to some vma, so !vma case is not supposed to be happen. This patch adds BUG_ON() to catch this unexpected case. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
The function check_range() (and its family) is not well-named, because it does not only checking something, but moving pages from list to list to do page migration for them. So queue_pages_*range is more desirable name. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Now hugepage migration is enabled, although restricted on pmd-based hugepages for now (due to lack of testing.) So we should allocate migratable hugepages from ZONE_MOVABLE if possible. This patch makes GFP flags in hugepage allocation dependent on migration support, not only the value of hugepages_treat_as_movable. It provides no change on the behavior for architectures which do not support hugepage migration, Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Currently hugepage migration works well only for pmd-based hugepages (mainly due to lack of testing,) so we had better not enable migration of other levels of hugepages until we are ready for it. Some users of hugepage migration (mbind, move_pages, and migrate_pages) do page table walk and check pud/pmd_huge() there, so they are safe. But the other users (softoffline and memory hotremove) don't do this, so without this patch they can try to migrate unexpected types of hugepages. To prevent this, we introduce hugepage_migration_support() as an architecture dependent check of whether hugepage are implemented on a pmd basis or not. And on some architecture multiple sizes of hugepages are available, so hugepage_migration_support() also checks hugepage size. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Until now we can't offline memory blocks which contain hugepages because a hugepage is considered as an unmovable page. But now with this patch series, a hugepage has become movable, so by using hugepage migration we can offline such memory blocks. What's different from other users of hugepage migration is that we need to decompose all the hugepages inside the target memory block into free buddy pages after hugepage migration, because otherwise free hugepages remaining in the memory block intervene the memory offlining. For this reason we introduce new functions dissolve_free_huge_page() and dissolve_free_huge_pages(). Other than that, what this patch does is straightforwardly to add hugepage migration code, that is, adding hugepage code to the functions which scan over pfn and collect hugepages to be migrated, and adding a hugepage allocation function to alloc_migrate_target(). As for larger hugepages (1GB for x86_64), it's not easy to do hotremove over them because it's larger than memory block. So we now simply leave it to fail as it is. [yongjun_wei@trendmicro.com.cn: remove duplicated include] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Enable hugepage migration from migrate_pages(2), move_pages(2), and mbind(2). Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with mbind(2) after applying the enablement patch which comes later in this series. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Extend move_pages() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with move_pages(2) after applying the enablement patch which comes later in this series. We avoid getting refcount on tail pages of hugepage, because unlike thp, hugepage is not split and we need not care about races with splitting. And migration of larger (1GB for x86_64) hugepage are not enabled. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Extend check_range() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with migrate_pages(2) after applying the enablement patch which comes later in this series. Note that for larger hugepages (covered by pud entries, 1GB for x86_64 for example), we simply skip it now. Note that using pmd_huge/pud_huge assumes that hugepages are pointed to by pmd/pud. This is not true in some architectures implementing hugepage with other mechanisms like ia64, but it's OK because pmd_huge/pud_huge simply return 0 in such arch and page walker simply ignores such hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Currently migrate_huge_page() takes a pointer to a hugepage to be migrated as an argument, instead of taking a pointer to the list of hugepages to be migrated. This behavior was introduced in commit 189ebff2 ("hugetlb: simplify migrate_huge_page()"), and was OK because until now hugepage migration is enabled only for soft-offlining which migrates only one hugepage in a single call. But the situation will change in the later patches in this series which enable other users of page migration to support hugepage migration. They can kick migration for both of normal pages and hugepages in a single call, so we need to go back to original implementation which uses linked lists to collect the hugepages to be migrated. With this patch, soft_offline_huge_page() switches to use migrate_pages(), and migrate_huge_page() is not used any more. So let's remove it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
Currently hugepage migration is available only for soft offlining, but it's also useful for some other users of page migration (clearly because users of hugepage can enjoy the benefit of mempolicy and memory hotplug.) So this patchset tries to extend such users to support hugepage migration. The target of this patchset is to enable hugepage migration for NUMA related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and memory hotplug. This patchset does not add hugepage migration for memory compaction, because users of memory compaction mainly expect to construct thp by arranging raw pages, and there's little or no need to compact hugepages. CMA, another user of page migration, can have benefit from hugepage migration, but is not enabled to support it for now (just because of lack of testing and expertise in CMA.) Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in x86_64, or hugepages in architectures like ia64) is not enabled for now (again, because of lack of testing.) As for how these are achived, I extended the API (migrate_pages()) to handle hugepage (with patch 1 and 2) and adjusted code of each caller to check and collect movable hugepages (with patch 3-7). Remaining 2 patches are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is about making sure that we only migrate pmd-based hugepages. And patch 9 is about choosing appropriate zone for hugepage allocation. My test is mainly functional one, simply kicking hugepage migration via each entry point and confirm that migration is done correctly. Test code is available here: git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git And I always run libhugetlbfs test when changing hugetlbfs's code. With this patchset, no regression was found in the test. This patch (of 9): Before enabling each user of page migration to support hugepage, this patch enables the list of pages for migration to link not only LRU pages, but also hugepages. As a result, putback_movable_pages() and migrate_pages() can handle both of LRU pages and hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
If we fail with a reserved page, just calling put_page() is not sufficient, because put_page() invoke free_huge_page() at last step and it doesn't know whether a page comes from a reserved pool or not. So it doesn't do anything related to reserved count. This makes reserve count lower than how we need, because reserve count already decrease in dequeue_huge_page_vma(). This patch fix this situation. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
We don't need to grab a page_table_lock when we try to release a page. So, defer to grab a page_table_lock. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
is_vma_resv_set(vma, HPAGE_RESV_OWNER) implys that this mapping is for private. So we don't need to check whether this mapping is for shared or not. This patch is just for clean-up. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
If we alloc hugepage with avoid_reserve, we don't dequeue reserved one. So, we should check subpool counter when avoid_reserve. This patch implement it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
'reservations' is so long name as a variable and we use 'resv_map' to represent 'struct resv_map' in other place. To reduce confusion and unreadability, change it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
Don't use the reserve pool when soft offlining a hugepage. Check we have free pages outside the reserve pool before we dequeue the huge page. Otherwise, we can steal other's reserve page. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Toshi Kani authored
lock_device_hotplug() serializes hotplug & online/offline operations. The lock is held in common sysfs online/offline interfaces and ACPI hotplug code paths. And here are the code paths: - CPU & Mem online/offline via sysfs online store_online()->lock_device_hotplug() - Mem online via sysfs state: store_mem_state()->lock_device_hotplug() - ACPI CPU & Mem hot-add: acpi_scan_bus_device_check()->lock_device_hotplug() - ACPI CPU & Mem hot-delete: acpi_scan_hot_remove()->lock_device_hotplug() try_offline_node() off-lines a node if all memory sections and cpus are removed on the node. It is called from acpi_processor_remove() and acpi_memory_remove_memory()->remove_memory() paths, both of which are in the ACPI hotplug code. try_offline_node() calls stop_machine() to stop all cpus while checking all cpu status with the assumption that the caller is not protected from CPU hotplug or CPU online/offline operations. However, the caller is always serialized with lock_device_hotplug(). Also, the code needs to be properly serialized with a lock, not by stopping all cpus at a random place with stop_machine(). This patch removes the use of stop_machine() in try_offline_node() and adds comments to try_offline_node() and remove_memory() that lock_device_hotplug() is required. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Toshi Kani authored
add_memory() and remove_memory() can only handle a memory range aligned with section. There are problems when an unaligned range is added and then deleted as follows: - add_memory() with an unaligned range succeeds, but __add_pages() called from add_memory() adds a whole section of pages even though a given memory range is less than the section size. - remove_memory() to the added unaligned range hits BUG_ON() in __remove_pages(). This patch changes add_memory() and remove_memory() to check if a given memory range is aligned with section at the beginning. As the result, add_memory() fails with -EINVAL when a given range is unaligned, and does not add such memory range. This prevents remove_memory() to be called with an unaligned range as well. Note that remove_memory() has to use BUG_ON() since this function cannot fail. [akpm@linux-foundation.org: avoid printk warnings] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Davidlohr Bueso authored
Explicitly mention/recommend using the libhugetlbfs test cases when changing related kernel code. Developers that are unaware of the project can easily miss this and introduce potential regressions that may or may not be caught by community review. Also do some cleanups that make the document visually easier to view at a first glance. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
This helps performance on moderately dense random reads on SSD. Transaction-Per-Second numbers provided by Taobao: QPS case ------------------------------------------------------- 7536 disable context readahead totally w/ patch: 7129 slower size rampup and start RA on the 3rd read 6717 slower size rampup w/o patch: 5581 unmodified context readahead Before, readahead will be started whenever reading page N+1 when it happen to read N recently. After patch, we'll only start readahead when *three* random reads happen to access pages N, N+1, N+2. The probability of this happening is extremely low for pure random reads, unless they are very dense, which actually deserves some readahead. Also start with a smaller readahead window. The impact to interleaved sequential reads should be small, because for a long run stream, the the small readahead window rampup phase is negletable. The context readahead actually benefits clustered random reads on HDD whose seek cost is pretty high. However as SSD is increasingly used for random read workloads it's better for the context readahead to concentrate on interleaved sequential reads. Another SSD rand read test from Miao # file size: 2GB # read IO amount: 625MB sysbench --test=fileio \ --max-requests=10000 \ --num-threads=1 \ --file-num=1 \ --file-block-size=64K \ --file-test-mode=rndrd \ --file-fsync-freq=0 \ --file-fsync-end=off run shows the performance of btrfs grows up from 69MB/s to 121MB/s, ext4 from 104MB/s to 121MB/s. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Tested-by: Tao Ma <tm@tao.ma> Tested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Xishi Qiu authored
Use "zone_is_initialized()" instead of "if (zone->wait_table)". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Xishi Qiu authored
Use "zone_is_empty()" instead of "if (zone->spanned_pages)". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Xishi Qiu authored
Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages". Simplify the code, no functional change. [akpm@linux-foundation.org: fix build] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonyoung Shim authored
In struct gen_pool_chunk, end_addr means the end address of memory chunk (inclusive), but in the implementation it is treated as address + size of memory chunk (exclusive), so it points to the address plus one instead of correct ending address. The ending address of memory chunk plus one will cause overflow on the memory chunk including the last address of memory map, e.g. when starting address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending address will be 0x100000000. Use correct ending address like starting address + size - 1. [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jianguo Wu authored
Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Xishi Qiu authored
I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(), because in memory_block_action() "nr_pages = PAGES_PER_SECTION * sections_per_block" is always greater than 0. memory_block_action() offline_pages() __offline_pages() BUG_ON(start_pfn >= end_pfn) In v2.6.32, If info->length==0, this way may hit this BUG_ON(). acpi_memory_disable_device() remove_memory(info->start_addr, info->length) offline_pages() A later Fujitsu patch renamed this function and the BUG_ON() is unnecessary. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
Our intention in here is to find last_bit within the region to flush. There is well-defined function, find_last_bit() for this purpose and its performance may be slightly better than current implementation. So change it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joonsoo Kim authored
vbq in vmap_block isn't used. So remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
Disabling interrupts repeatedly can be avoided in the inner loop if we use a this_cpu operation. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
Both functions that update global counters use the same mechanism. Create a function that contains the common code. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics. This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [akpm@linux-foundation.org: fix UP build] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-