[PATCH] readahead: multiple performance fixes

From: Ram Pai <linuxram@us.ibm.com>

I have enclosed a patch that fixes a bunch of performance bugs in the
readahead code. Below is a brief summary of the problems noticed and the
proposed fixes, with some results:

Problem 1: The readahead code closes the readahead window and goes into
the slow-read path if a file is first accessed at an offset not equal to
zero. With databases (especially DB2), a file may not be accessed at
offset 0 the first time, even though the I/Os are sequential.

Fix to Problem 1:

     	min = get_min_readahead(ra);
     	orig_next_size = ra->next_size;

    -	if (ra->next_size == 0 && offset == 0) {
    +	if (ra->next_size == 0) {

------------------------------------------------------------------------

Problem 2: After fixing Problem 1, the readahead window still does not
open up the first time if all the requested pages are already in the
page cache. This time the window closes because of pagecache hits
instead of misses. To fix this we put in these changes:

    -	check_ra_success(ra, ra->size, actual, orig_next_size);
    +	if(!first_access) {
    +		check_ra_success(ra, ra->size, actual, orig_next_size);
    +	}

------------------------------------------------------------------------

Problem 3: In the case of large random reads, the readahead window is
read in the moment there is a hit in the active window. In most cases
that readahead window then gets scrapped, because the next large random
read does not touch any of its pages. We fixed this by introducing lazy
readahead: we wait until the last page in the active window gets a hit,
and only then read in the readahead window. This fix gave a tremendous
boost in performance. The changes we put in were:

     	/*
     	 * This read request is within the current window.  It is time
     	 * to submit I/O for the ahead window while the application is
     	 * crunching through the current window.
     	 */
    -	if (ra->ahead_start == 0) {
    +	if (ra->ahead_start == 0 && offset == (ra->start + ra->size -1)) {

------------------------------------------------------------------------

Problem 4: If the requested page does not fall in the active window and
is not the first page of the readahead window, we scrap both the active
window and the readahead window and read in a new active window. But the
number of pages we read into that active window is based on the
'projected readahead window size' (the next_size variable), so we end up
using part of the active window and wasting the rest. We put in a fix
where the number of pages read into the active window is based on the
number of pages used in the recent past. Again this gave us another big
boost in performance, and ended up beating the performance of the AIO
patch on a DSS workload. The fix is:

     	 * ahead window and get some I/O underway for the new
     	 * current window.
     	 */
    +	if (!first_access && preoffset >= ra->start &&
    +			preoffset < (ra->start + ra->size)) {
    +		ra->size = preoffset - ra->start + 2;
    +	} else {
    +		ra->size = ra->next_size;

------------------------------------------------------------------------

Problem 5: With all the above fixes there is a very low chance that the
readahead window will close. If it does, however, we found that the slow
read path is really slow. Any loss of sequentiality in the slow read
path is penalized heavily by closing the window back to zero. We fixed
this by decreasing the window size by one any time we lose sequentiality
and increasing it by one if we don't.

     	if (offset != ra->prev_page + 1) {
    -		ra->size = 0;			/* Not sequential */
    +		ra->size = ra->size?ra->size-1:0; /* Not sequential */

------------------------------------------------------------------------

With the above set of fixes we got about 28% improvement on a DSS
workload, which is about 5% more than what we got with Suparna's AIO
patch. On an aio-stress run this patch performed on par with Suparna's
AIO patch, and it fared better than the AIO patch for large random I/O.

We have yet to run a bunch of other benchmarks to evaluate this patch.
We would like to get your input on this patch and any suggestions you
may have to improve it. I have enclosed a patch with all these changes,
along with some changes to the comments that reflect the new behaviour.

NOTE: the above patch reverts Suparna's AIO patch.
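To make the lazy-readahead test (Problem 3) and the slow-path decay (Problem 5) easier to reason about outside the kernel, here is a minimal user-space sketch. The struct ra_model type and both function names are hypothetical stand-ins invented for illustration; they only mirror the fields and conditions quoted above and are not the kernel's struct file_ra_state or its functions.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the kernel's readahead state. */
struct ra_model {
	unsigned long start;       /* first page of the current window */
	unsigned long size;        /* window size, or slow-path sequential credit */
	unsigned long ahead_start; /* first page of the ahead window, 0 if none */
	unsigned long prev_page;   /* page touched by the previous read */
};

/*
 * Problem 3 (lazy readahead): submit I/O for the ahead window only when
 * no ahead window exists yet and the read hits the last page of the
 * current window.
 */
int should_submit_ahead(const struct ra_model *ra, unsigned long offset)
{
	assert(ra->size > 0);	/* an empty window has no last page */
	return ra->ahead_start == 0 && offset == ra->start + ra->size - 1;
}

/*
 * Problem 5 (slow path): a non-sequential read costs one unit of credit
 * instead of resetting the counter to zero; a sequential read earns one.
 * Returns the updated counter.
 */
unsigned long slow_path_resize(struct ra_model *ra, unsigned long offset)
{
	if (offset != ra->prev_page + 1)
		ra->size = ra->size ? ra->size - 1 : 0;	/* not sequential */
	else
		ra->size++;				/* sequential */
	ra->prev_page = offset;
	return ra->size;
}
```

Starting from prev_page = 10 and size = 0, reads at pages 11, 12, 50 and 51 leave the counter at 1, 2, 1 and 2 in this model. Under the old reset-to-zero rule the same reads would give 1, 2, 0 and 1, so a single stray access would throw away all accumulated credit.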

Commit 2fbe6496 authored Dec 29, 2003 by Andrew Morton Committed by Linus Torvalds Dec 29, 2003

Showing with 44 additions and 8 deletions

mm/filemap.c +10 -2

mm/readahead.c +34 -6

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -587,13 +587,22 @@ void do_generic_mapping_read(struct address_space *mapping,
 			     read_actor_t actor)
 {
 	struct inode *inode = mapping->host;
-	unsigned long index, offset;
+	unsigned long index, offset, last;
 	struct page *cached_page;
 	int error;

 	cached_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
+	last = (*ppos + desc->count) >> PAGE_CACHE_SHIFT;
+
+	/*
+	 * Let the readahead logic know upfront about all
+	 * the pages we'll need to satisfy this request
+	 */
+	for (; index < last; index++)
+		page_cache_readahead(mapping, ra, filp, index);
+	index = *ppos >> PAGE_CACHE_SHIFT;

 	for (;;) {
 		struct page *page;
@@ -612,7 +621,6 @@ void do_generic_mapping_read(struct address_space *mapping,
 		}

 		cond_resched();
-		page_cache_readahead(mapping, ra, filp, index);

 		nr = nr - offset;
 find_page:

--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -347,6 +347,8 @@ page_cache_readahead(struct address_space *mapping, struct file_ra_state *ra,
 	unsigned min;
 	unsigned orig_next_size;
 	unsigned actual;
+	int first_access=0;
+	unsigned long preoffset=0;

 	/*
 	 * Here we detect the case where the application is performing
@@ -370,16 +372,18 @@ page_cache_readahead(struct address_space *mapping, struct file_ra_state *ra,
 	min = get_min_readahead(ra);
 	orig_next_size = ra->next_size;

-	if (ra->next_size == 0 && offset == 0) {
+	if (ra->next_size == 0) {
 		/*
-		 * Special case - first read from first page.
+		 * Special case - first read.
 		 * We'll assume it's a whole-file read, and
 		 * grow the window fast.
 		 */
+		first_access=1;
 		ra->next_size = max / 2;
 		goto do_io;
 	}

+	preoffset = ra->prev_page;
 	ra->prev_page = offset;

 	if (offset >= ra->start && offset <= (ra->start + ra->size)) {
@@ -439,20 +443,44 @@ page_cache_readahead(struct address_space *mapping, struct file_ra_state *ra,
 		 * ahead window and get some I/O underway for the new
 		 * current window.
 		 */
+		if (!first_access && preoffset >= ra->start &&
+				preoffset < (ra->start + ra->size)) {
+			 /* Heuristic:  If 'n' pages were
+			  * accessed in the current window, there
+			  * is a high probability that around 'n' pages
+			  * shall be used in the next current window.
+			  *
+			  * To minimize lazy-readahead triggered
+			  * in the next current window, read in
+			  * an extra page.
+			  */
+			ra->next_size = preoffset - ra->start + 2;
+		}
 		ra->start = offset;
 		ra->size = ra->next_size;
 		ra->ahead_start = 0;		/* Invalidate these */
 		ra->ahead_size = 0;
 		actual = do_page_cache_readahead(mapping, filp, offset,
 						 ra->size);
+		if(!first_access) {
+			/*
+			 * do not adjust the readahead window size the first
+			 * time, the ahead window might get closed if all
+			 * the pages are already in the cache.
+			 */
 			check_ra_success(ra, ra->size, actual, orig_next_size);
+		}
 	} else {
 		/*
 		 * This read request is within the current window.  It is time
 		 * to submit I/O for the ahead window while the application is
-		 * crunching through the current window.
+		 * about to step into the ahead window.
+		 * Heuristic: Defer reading the ahead window till we hit
+		 * the last page in the current window. (lazy readahead)
+		 * If we read in earlier we run the risk of wasting
+		 * the ahead window.
 		 */
-		if (ra->ahead_start == 0) {
+		if (ra->ahead_start == 0 && offset == (ra->start + ra->size -1)) {
 			ra->ahead_start = ra->start + ra->size;
 			ra->ahead_size = ra->next_size;
 			actual = do_page_cache_readahead(mapping, filp,
@@ -488,7 +516,7 @@ void handle_ra_miss(struct address_space *mapping,
 		const unsigned long max = get_max_readahead(ra);

 		if (offset != ra->prev_page + 1) {
-			ra->size = 0;			/* Not sequential */
+			ra->size = ra->size?ra->size-1:0; /* Not sequential */
 		} else {
 			ra->size++;			/* A sequential read */
 			if (ra->size >= max) {		/* Resume readahead */