- 28 Jun, 2003 1 commit
Andrew Morton authored
Make sure that the address_space is capable of performing the readahead before going in and allocating the pages.
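A minimal sketch of the kind of guard this describes, in C, assuming the 2.5-era address_space_operations layout (the exact call site is not named in the message):

    /* Don't allocate readahead pages for a mapping that cannot read them in. */
    static int mapping_can_readahead(struct address_space *mapping)
    {
            return mapping->a_ops &&
                   (mapping->a_ops->readpage || mapping->a_ops->readpages);
    }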
- 09 Apr, 2003 1 commit
Andrew Morton authored
Spinlocks don't have a buslocked unlock and are faster. On a P4, time to write a 4M file with 4M one-byte-write()s:
Before:
    0.72s user 5.47s system  99% cpu 6.227 total
    0.76s user 5.40s system 100% cpu 6.154 total
    0.77s user 5.38s system 100% cpu 6.146 total
After:
    1.09s user 4.92s system  99% cpu 6.014 total
    0.74s user 5.28s system  99% cpu 6.023 total
    1.03s user 4.97s system 100% cpu 5.991 total
- 08 Mar, 2003 1 commit
Andrew Morton authored
Some workloads really, really want to have no readahead: databases performing small synchronous I/Os against a file with extremely poor layout, for example. Any readahead at all is a loss here. But the current readahead code refuses to adapt that low. Fix it up so that we can indeed adaptively disable readahead altogether, and not start it again until we have seen max_readahead()'s worth of consecutive reads.
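Illustrative sketch only (the counter and flag on file_ra_state are invented names, and max_readahead() is assumed to be the existing per-mapping helper): once readahead has been adaptively switched off, it stays off until enough strictly sequential reads have been seen.

    if (ra->ra_off) {                               /* readahead currently disabled */
            if (offset == ra->prev_page + 1)
                    ra->consec_reads++;             /* another sequential read */
            else
                    ra->consec_reads = 0;
            if (ra->consec_reads >= max_readahead(mapping))
                    ra->ra_off = 0;                 /* seen enough: re-enable readahead */
    }
    ra->prev_page = offset;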
- 06 Feb, 2003 1 commit
Andrew Morton authored
We don't need these with self-unplugging queues. The patch also contains a couple of microopts suggested by Andrea: we don't need to run sync_page() if the page just came unlocked.
- 04 Feb, 2003 1 commit
Andrew Morton authored
Patch from Nikita Danilov <Nikita@Namesys.COM>: read_cache_pages() is passed a bunch of pages to start I/O against, and it is supposed to consume all those pages. But if there is an I/O error, someone needs to throw away the unused pages. At present the single user of read_cache_pages() (nfs_readpages) does that cleanup by hand, but it should be done in the core kernel.
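A sketch of the cleanup being moved into the core, assuming the pages which never had I/O started against them are still linked on the list_head that was passed in:

    /* Throw away pages that read_cache_pages() did not consume because of an I/O error. */
    static void release_unread_pages(struct list_head *pages)
    {
            while (!list_empty(pages)) {
                    struct page *victim = list_entry(pages->prev, struct page, lru);

                    list_del(&victim->lru);
                    page_cache_release(victim);     /* drop the reference readahead took */
            }
    }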
- 14 Jan, 2003 1 commit
Andrew Morton authored
max_sane_readahead() permits the user to readahead up to half-the-inactive-list's worth of pages. Which is totally wrong if most of memory is free. So make the limit be (nr_inactive + nr_free) / 2
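The new limit, sketched in C; how the inactive and free counts are actually obtained in the patch is an assumption here:

    /* Clamp a readahead request to half of (inactive + free) memory. */
    unsigned long max_sane_readahead(unsigned long nr)
    {
            unsigned long inactive = nr_inactive_pages();   /* assumed helper */
            unsigned long free = nr_free_pages();

            return min(nr, (inactive + free) / 2);
    }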
- 05 Jan, 2003 1 commit
Andrew Morton authored
This patch uses the radix_tree_preload() API in add_to_page_cache(). A new gfp_mask argument is added to add_to_page_cache(), which is then passed on to radix_tree_preload(). It's pretty simple. In the case of adding pages to swapcache we're still using GFP_ATOMIC, so these addition attempts can still fail. That's OK, because the error is handled and, unlike file pages, it will not cause user application failures. This codepath (radix-tree node exhaustion on swapout) was well tested in the days when the swapper_space radix tree was fragmented all over the place due to unfortunate swp_entry bit layout.
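Roughly the shape of the change, assuming the 2.5 radix-tree preload API; the lock name and the rest of the insertion path are assumptions or elided:

    int add_to_page_cache(struct page *page, struct address_space *mapping,
                          unsigned long offset, int gfp_mask)
    {
            int error = radix_tree_preload(gfp_mask);  /* GFP_ATOMIC callers may still fail */

            if (error == 0) {
                    spin_lock(&mapping->page_lock);    /* lock name is an assumption */
                    error = radix_tree_insert(&mapping->page_tree, offset, page);
                    if (error == 0) {
                            page_cache_get(page);
                            /* ... hook the page into the mapping ... */
                    }
                    spin_unlock(&mapping->page_lock);
                    radix_tree_preload_end();
            }
            return error;
    }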
- 14 Dec, 2002 2 commits
Andrew Morton authored
madvise_willneed() currently has a very strange check on how much readahead it is prepared to do. It is based on the user's rss limit. But this is usually enormous, and the user isn't necessarily going to map all that memory at the same time anyway. And the logic is wrong - it is comparing rss (which is in bytes) with `end - start', which is in pages. And it returns -EIO on error, which is not mentioned in the Open Group spec and doesn't make sense. This patch takes it all out and applies the same upper limit as is used in sys_readahead() - half the inactive list.
Andrew Morton authored
readahead allocates all the pages before starting I/O. Potentially bad if someone is performing huge reads with madvise or sys_readahead(). So the patch just busts that up into two-megabyte units.
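A sketch of the chunking, as a fragment; the helper name follows the readahead code of the era, but its exact signature is an assumption:

    /* Issue a large readahead in ~2MB slices so the pages are never all pinned at once. */
    unsigned long max_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;

    while (nr_to_read) {
            unsigned long this_chunk = min(nr_to_read, max_chunk);

            if (do_page_cache_readahead(mapping, filp, offset, this_chunk) < 0)
                    break;
            offset += this_chunk;
            nr_to_read -= this_chunk;
    }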
- 05 Nov, 2002 2 commits
Trond Myklebust authored
- Add the library function read_cache_pages(), which is used in a similar fashion to the single-page read_cache_page(). It hides the details of the LRU cache etc. from a filesystem that wants to populate an address space with a list of pages.
- Fix NFS so that readahead uses the ->readpages() interface. This means that we can immediately schedule an RPC call in order to complete the I/O, rather than relying on somebody later triggering it by calling lock_page() (and hence sync_page()). The sync_page() method is race-prone, since the waiting page may try to call it before we've finished initializing the 'struct nfs_page'.
- Clear out nfs_sync_page(), the nfs_inode->read list, and friends. Once I/O completion gets scheduled in ->readpage()/->readpages(), they have no reason to exist.
Trond Myklebust authored
The following patch makes the ->readpages() address_space_operation take a struct file argument just like ->readpage().
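The prototype change, as it would read in struct address_space_operations (2.5-era argument layout assumed):

    struct address_space_operations {
            /* ... */
            int (*readpage)(struct file *, struct page *);
            int (*readpages)(struct file *filp, struct address_space *mapping,
                             struct list_head *pages, unsigned nr_pages);
            /* ... */
    };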
- 30 Oct, 2002 2 commits
Andrew Morton authored
Add a `cold' hint to struct pagevec, and teach truncate and page reclaim to use it. Empirical testing showed that truncate's pages tend to be hot. And page reclaim's are certainly cold.
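A schematic view of the hint; the exact field layout of struct pagevec is an assumption:

    struct pagevec {
            unsigned long nr;
            unsigned long cold;     /* hint: these pages are probably cache-cold */
            struct page *pages[PAGEVEC_SIZE];
    };

    /* Callers declare what they expect at init time: */
    pagevec_init(&pvec, 0);         /* truncate: its pages tend to be hot */
    pagevec_init(&pvec, 1);         /* page reclaim: its pages are cold   */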
Andrew Morton authored
It is usually the case that pagecache reads use busmastering hardware to transfer the data into pagecache, which invalidates the CPU cache of the pagecache pages. So use cache-cold pages for pagecache reads, to avoid wasting cache-hot pages.
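Illustratively, a cache-cold allocation helper for pagecache reads could look like this; the helper name mirrors page_cache_alloc() and, along with the gfp flag, should be treated as an assumption:

    static inline struct page *page_cache_alloc_cold(struct address_space *x)
    {
            /* ask the allocator's cold list for a page nobody's cache is warm on */
            return alloc_pages(x->gfp_mask | __GFP_COLD, 0);
    }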
- 29 Oct, 2002 1 commit
Andrew Morton authored
Provide a function in core kernel to initialise a file_ra_state structure. Previously this was all taken care of by the fact that new struct files are all zeroed out. But now a file_ra_state may be independently allocated, and we don't want users of it to have to know how to initialise it.
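A sketch of the initialiser; where the per-device default comes from (here read through the ra_pages pointer described in the Apr 2002 entries further down) is an assumption:

    void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
    {
            memset(ra, 0, sizeof(*ra));             /* matches the old zeroed-struct-file behaviour */
            ra->ra_pages = *mapping->ra_pages;      /* pick up the device's readahead default */
    }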
- 17 Oct, 2002 1 commit
David Howells authored
This does the following three things:
(1) Makes the functions in mm/readahead.c only use struct file* to pass to readpage(). address_mapping* and file_ra_state* are used instead to keep track of readahead stuff.
(2) Adds a new function do_generic_mapping_read() that is similar to do_generic_file_read(), except that it uses a mapping pointer and a readahead state pointer to access a file. The file* is only used to pass to readpage().
(3) Turns do_generic_file_read() into an inline function in linux/fs.h that simply wraps do_generic_mapping_read() (see the sketch below).
This should mean that it is no longer necessary to have a struct file to access a file in this manner; just an inode or address_space should be sufficient. It also means alternate read-ahead structures can be maintained. The reason I want this is that I'm writing a general cache manager for filesystems such as AFS, NFSv4, and Lustre. Block devices are made available to the "cache manager" by means of a filesystem that can be mounted. I'm storing metadata in an inode in the cache, but to scan this at the moment I need to gain a "struct file" to use with do_generic_file_read(). This involves either creating a dummy dentry and struct file (which will cause Al Viro to come looking for me with a shotgun), or using an extra auxiliary filesystem mounted with do_kern_mount(), neither of which is particularly appealing. This patch is the alternative... it provides a function that I can pass an address_space to. This also allows me to make use of readahead semantics without having to reinvent them for myself.
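Point (3), sketched as the inline wrapper it describes; the argument list follows the 2.5 read path and should be treated as an approximation:

    static inline void do_generic_file_read(struct file *filp, loff_t *ppos,
                                            read_descriptor_t *desc, read_actor_t actor)
    {
            do_generic_mapping_read(filp->f_dentry->d_inode->i_mapping,
                                    &filp->f_ra,    /* per-file readahead state */
                                    filp, ppos, desc, actor);
    }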
- 15 Sep, 2002 1 commit
Andrew Morton authored
read_pages() is dropping the page refcount before running ->readpage(). Which just happens to work, because the page is in pagecache and locked. But it breaks under some unconventional things which reiser4 is doing, and it's better/safer/saner this way anyway.
- 15 Aug, 2002 1 commit
Andrew Morton authored
The patch goes through the various places which were calling lru_cache_add() against bulk pages and batches them up. Also. This whole patch series improves the behaviour of the system under heavy writeback load. There is a reduction in page allocation failures, some reduction in loss of interactivity due to page allocators getting stuck on writeback from the VM. (This is still bad though). I think it's due to the change here in mpage_writepages(). That function was originally unconditionally refiling written-back pages to the head of the inactive list. The theory being that they should be moved out of the way of page allocators, who would end up waiting on them. It appears that this simply had the effect of pushing dirty, unwritten data closer to the tail of the inactive list, making things worse. So instead, if the caller is (typically) balance_dirty_pages() then leave the pages where they are on the LRU. If the caller is PF_MEMALLOC then the pages *have* to be refiled. This is because VM writeback is clustered along mapping->dirty_pages, and it's almost certain that the pages which are being written are near the tail of the LRU. If they were left there, page allocators would block on them too soon. It would effectively become a synchronous write.
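The batching idiom being applied, shown as a fragment (the pagevec helpers are the ones of the time; the surrounding loop is assumed):

    struct pagevec lru_pvec;

    pagevec_init(&lru_pvec, 0);
    /* for each page being instantiated: */
    if (!pagevec_add(&lru_pvec, page))      /* returns 0 once the vector is full */
            __pagevec_lru_add(&lru_pvec);   /* add the whole batch to the LRU in one go */
    /* after the loop: */
    pagevec_lru_add(&lru_pvec);             /* flush whatever is left */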
- 01 Aug, 2002 1 commit
Hugh Dickins authored
ISO C99 designated initializers by Art Haas for mm.
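The syntax change in question, on a hypothetical ops table:

    static int example_readpage(struct file *file, struct page *page)
    {
            return 0;       /* stub, for illustration only */
    }

    /* Before (GNU extension):  readpage: example_readpage,  */
    /* After, ISO C99 designated initialiser:                */
    static struct address_space_operations example_aops = {
            .readpage = example_readpage,
    };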
- 19 Jul, 2002 2 commits
Andrew Morton authored
Been looking at a workload which involves several processes which seek around and read from a large file. There are a few problems: generic_file_lseek is bouncing i_sem around like mad, and readahead is doing lots of pointless pagecache probing. This patch addresses readahead. Presumably the change will be larger on machines which have higher bandwidth memory than my test box, of which there are many. This patch teaches readahead to detect the situation where no IO is actually being performed as a result of its actions. Now, we don't want to sacrifice IO efficiency to save a bit of CPU, so the code is very cautious. But eventually, after some tens of consecutive readahead attempts were found to perform no I/O at all, readahead will turn itself off. readahead will be turned on again when either generic_file_read() or filemap_nopage() get a pagecache miss. The function handle_ra_thrashing() has been renamed to handle_ra_miss() to reflect its widened role. A performance bug in page_cache_readround() was fixed - if ra->next_size is zero, that function needs to leave it well alone, because next_size==0 is a magic value meaning that the file has just been opened and that readahead needs to get aggressive. This change makes a `make dep' run at the same speed as in the 2.4 kernel. It used to take 4x as long... `make dep' is an interesting test because it uses mmap to read the files.
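Illustrative only (the counter, flag and threshold are invented names), but this is the kind of bookkeeping that lets readahead notice it is performing no I/O and switch itself off:

    #define RA_USELESS_MAX  60              /* "some tens of consecutive readahead attempts" */

    if (actual_io == 0)                     /* every page was already in pagecache */
            ra->useless_count++;
    else
            ra->useless_count = 0;

    if (ra->useless_count >= RA_USELESS_MAX)
            ra->ra_off = 1;                 /* stay off until handle_ra_miss() sees a pagecache miss */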
Andrew Morton authored
A tasty patch from Hugh Dickins. radix_tree_insert() fails if something was already present at the target index, so that error can be propagated back through add_to_page_cache(). Hence add_to_page_cache_unique() is obsolete. Hugh's patch removes add_to_page_cache_unique() and cleans up a bunch of stuff.
- 12 Jun, 2002 1 commit
Andrew Morton authored
Silly oversight - read_pages needs to pass the file * down to ->readpage().
- 28 May, 2002 1 commit
Jens Axboe authored
This patch provides the ability for a block driver to signal it's too busy to receive more work and temporarily halt the request queue. In concept it's similar to the networking netif_{start,stop}_queue helpers. To do this cleanly, I've ripped out the old tq_disk task queue. Instead an internal list of plugged queues is maintained which will honor the current queue state (see QUEUE_FLAG_STOPPED bit). Execution of request_fn has been moved to tasklet context. blk_run_queues() provides similar functionality to the old run_task_queue(&tq_disk). Now, this only works at the request_fn level and not at the make_request_fn level. This is on purpose: drivers working at the make_request_fn level are essentially providing a piece of the block level infrastructure themselves. There are basically two reasons for doing make_request_fn style setups:
o block remappers. start/stop functionality will be done at the target device in this case, which is the level that will signal hardware full (or continue) anyways.
o drivers who wish to receive single entities of "buffers" and not merged requests etc. This could use the start/stop functionality. I'd suggest _still_ using a request_fn for these, but set the queue options so that no merging etc ever takes place. This has the added bonus of providing the usual request depletion throttling at the block level.
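A hedged sketch of how a driver might use the new interface; blk_stop_queue() and blk_start_queue() come from the patch description, while the request-fetch call and the mydrv_* helpers are schematic assumptions:

    static void mydrv_request_fn(request_queue_t *q)
    {
            struct request *rq;

            while ((rq = elv_next_request(q)) != NULL) {
                    if (mydrv_controller_full()) {  /* hypothetical hardware check */
                            blk_stop_queue(q);      /* no more work until we restart the queue */
                            return;
                    }
                    mydrv_dispatch(rq);             /* hypothetical: hand the request to hardware */
            }
    }

    /* from the completion interrupt, once the controller has room again: */
    static void mydrv_done_irq(request_queue_t *q)
    {
            blk_start_queue(q);     /* block layer will re-run the request_fn */
    }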
- 27 May, 2002 1 commit
Andrew Morton authored
Implements BIO-based multipage reads into the pagecache, and turns this on for ext2. CPU load for `cat large_file > /dev/null' is reduced by approximately 15%. Similar reductions for tiobench with a single thread. (Earlier claims of 25% were exaggerated - they were measured with slab debug enabled. But 15% isn't bad for a load which is dominated by copy_*_user costs). With 2, 4 and 8 tiobench threads, throughput is increased as well, which was unexpected. It's due to request queue weirdness. (Generally the request queueing is doing bad things under certain workloads - that's a separate issue.) BIOs of up to 64 kbytes are assembled and submitted for readahead and for single-page reads. So the work involved in reading 32 pages has gone from:
- allocate and attach 32 buffer_heads
- submit 32 buffer_heads
- allocate 32 bios
- submit 32 bios
to:
- allocate 2 bios
- submit 2 bios
These pages never have buffers attached. Buffers will be attached later if the application writes to these pages (file overwrite). The first version of this code (in the "delayed allocation" patches) tries to handle everything - bios which start mid-page, bios which end mid-page and pages which are covered by multiple bios. It is very complex code and in fact appears to be incorrect: out-of-order BIO completion could cause a page to come unlocked at the wrong time. This implementation is much simpler: if things get complex, it just falls back to the buffer-based block_read_full_page(), which isn't going away, and which understands all that complexity. There's no point in doing this in two places. This code will bypass the buffer layer for:
- fully-mapped pages which are on-disk contiguous
- fully unmapped pages (holes)
- partially unmapped pages, where the unmappedness is at the end of the page (end-of-file)
and everything else falls back to buffers. This means that with blocksize == PAGE_CACHE_SIZE, 100% of pages are handed direct to BIO. With a heavy 10-minute dbench run on 4k PAGE_CACHE_SIZE and 1k blocks, 95% of pages were handed direct to BIO. Almost all of the other 5% were passed to block_read_full_page() because they were already partially uptodate from an earlier sub-page write(). This ratio will fall if PAGE_CACHE_SIZE/blocksize is greater than four. But if that's the case, CPU efficiency is far from the main concern - there are significant seek and bandwidth problems just at 4 blocks per page. This code will stress out the block layer somewhat - RAID0 doesn't like multipage BIOs, and there are probably others. RAID0 seems to struggle along - readahead fails but read falls back to single-page reads, which succeed. Such problems may be worked around by setting MPAGE_BIO_MAX_SIZE to PAGE_CACHE_SIZE in fs/mpage.c. It is trivial to enable multipage reads for many other filesystems. We can do that after completion of external testing of ext2.
- 19 May, 2002 2 commits
Andrew Morton authored
Collision avoidance for pdflush threads. Turns the request_queue-based `unsigned long ra_pages' into a structure which contains ra_pages as well as a longword. That longword is used to record the fact that a pdflush thread is currently writing something back against this request_queue. Avoids the situation where several pdflush threads are sleeping on the same request_queue. This patch provides only the infrastructure for the pdflush exclusion. This infrastructure gets used in pdflush-single.patch
Andrew Morton authored
Anton Blanchard has a workload (the SDET benchmark) which is showing some moderate lock contention in do_pagecache_readahead(). Seems that SDET has many threads performing seeky reads against a cached file. The average number of pagecache probes in a single do_pagecache_readahead() is six, which seems reasonable. The patch (from Anton) flips the locking around to optimise for the fast case (page was present). So the kernel takes the lock less often, and does more work once it has been acquired.
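The general shape of the flip, sketched as a fragment; the lock and lookup names follow the era's pagecache and are assumptions:

    read_lock(&mapping->page_lock);
    for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
            struct page *page = radix_tree_lookup(&mapping->page_tree, offset + page_idx);

            if (page)
                    continue;                       /* fast path: already cached, keep the lock */

            read_unlock(&mapping->page_lock);       /* slow path: drop the lock only to allocate */
            page = page_cache_alloc(mapping);
            /* ... add the new page to the list to be read ... */
            read_lock(&mapping->page_lock);
    }
    read_unlock(&mapping->page_lock);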
- 30 Apr, 2002 2 commits
Andrew Morton authored
- Fixes a performance problem - callers of prepare_write/commit_write, etc are locking pages, which synchronises them behind writeback, which also locks these pages. Significant slowdowns for some workloads.
- So pages are no longer locked while under writeout. Introduce a new PG_writeback and associated infrastructure to support this design change.
- Pages which are under read I/O still use PageLocked. Pages which are under write I/O have PageWriteback() true. I considered creating Page_IO instead of PageWriteback, and marking both readin and writeout pages as PageIO(), so pages are unlocked during both read and write. There just doesn't seem a need to do this - nobody ever needs unblocking access to a page which is under read I/O.
- Pages under swapout (brw_page) are PageLocked, not PageWriteback, so their treatment is unchanged. It's not obvious that pages which are under swapout actually need the more asynchronous behaviour of PageWriteback. I was setting the swapout pages PageWriteback and unlocking them prior to submitting the buffers in brw_page(). This led to deadlocks on the exit_mmap->zap_page_range->free_swap_and_cache path. These functions call block_flushpage under spinlock. If the page is unlocked but has locked buffers, block_flushpage->discard_buffer() sleeps. Under spinlock. So that will need fixing if for some reason we want swapout to use PageWriteback. The kernel has called block_flushpage() under spinlock for a long time, assuming that a locked page will never have locked buffers. This appears to be true, but it's ugly.
- Adds new function wait_on_page_writeback(). Renames wait_on_page() to wait_on_page_locked() to remind people that they need to call the appropriate one.
- Renames filemap_fdatasync() to filemap_fdatawrite(). It's more accurate - "sync" implies, if anything, writeout and wait. (fsync, msync) Or writeout. It's not clear.
- Subtly changes the filemap_fdatawrite() internals - this function used to do a lock_page(), waiting for any other user of the page to let go before submitting new I/O against the page. It has been changed to simply skip over any pages which are currently under writeback. This is the right thing to do for memory-cleansing reasons. But it's the wrong thing to do for data consistency operations (eg, fsync()). For those operations we must ensure that all data which was dirty *at the time of the system call* is tight on disk before the call returns. So all places which care about this have been converted to do: filemap_fdatawait(mapping); /* Wait for current writeback */ filemap_fdatawrite(mapping); /* Write all dirty pages */ filemap_fdatawait(mapping); /* Wait for I/O to complete */ (see the sketch after this list).
- Fixes a truncate_inode_pages problem - truncate currently will block when it hits a locked page, so it ends up getting into lockstep behind writeback and all of the file is pointlessly written back. One fix for this is for truncate to simply walk the page list in the opposite direction from writeback. I chose to use a separate cleansing pass. It is more CPU-intensive, but it is surer and clearer. This is because there is no reason why the per-address_space ->vm_writeback and ->writeback_mapping functions *have* to perform writeout in ->dirty_pages order. They may choose to do something totally different. (set_page_dirty() is an a_op now, so address_spaces could almost privatise the whole dirty-page handling thing. Except truncate_inode_pages and invalidate_inode_pages assume that the pages are on the address_space lists. Hmm. So making truncate_inode_pages and invalidate_inode_pages a_ops would make some sense.)
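The data-integrity calling sequence spelled out above, as callers such as fsync()/msync() now use it:

    /* Ensure everything that was dirty at the time of the call is on disk before returning. */
    filemap_fdatawait(mapping);     /* wait for writeback already in flight */
    filemap_fdatawrite(mapping);    /* start writeout of all remaining dirty pages */
    filemap_fdatawait(mapping);     /* wait for that I/O to complete */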
Andrew Morton authored
Changes the way in which the readahead code locates the readahead setting for the underlying device:
- struct block_device and struct address_space gain a *pointer* to the current readahead tunable.
- The tunable lives in the request queue and is altered with the traditional ioctl.
- The value gets *copied* into the struct file at open() time. So a fcntl() mode to modify it per-fd is simple.
- Filesystems which are not request_queue-backed get the address of the global `default_ra_pages'. If we want, this can become a tunable.
- Filesystems are at liberty to alter address_space.ra_pages to point at some other fs-private default at new_inode/read_inode/alloc_inode time.
- The ra_pages pointer can become a structure pointer if, at some time in the future, high-level code needs more detailed information about device characteristics. In fact, it'll need to become a struct pointer for use by writeback: my current writeback code has the problem that multiple pdflush threads can get stuck on the same request queue. That's a waste of resources. I currently have a silly flag in the superblock to try to avoid this. The proper way to get this exclusion is for the high-level writeback code to be able to do a test-and-set against a per-request_queue flag. That flag can live in a structure alongside ra_pages, conveniently accessible at the pagemap level.
One thing still to-be-done is going into all callers of blk_init_queue and blk_queue_make_request and making sure that they're setting up a sensible default. ATA wants 248 sectors, and floppy drives don't want 128 kbytes, I suspect. Later.
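The resulting plumbing, sketched as a few call-site fragments; the field names are illustrative, not lifted from the patch:

    /* request_queue-backed filesystems: point the mapping at the queue's tunable */
    inode->i_mapping->ra_pages = &q->ra_pages;

    /* filesystems with no request queue: share one global default */
    inode->i_mapping->ra_pages = &default_ra_pages;

    /* at open() time: snapshot the current value into the struct file */
    file->f_ra.ra_pages = *inode->i_mapping->ra_pages;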
- 26 Apr, 2002 2 commits
Linus Torvalds authored
Andrew Morton authored
- Initialise the per-request_queue readahead parameter properly, rather than the dopey "if it's zero you get the default" approach.
- Permit zero-length readahead.
- 80-columnify mm/readahead.c
- 25 Apr, 2002 1 commit
Alexander Viro authored
- switch blk_{get,set}_readahead() to struct block_device *
- 10 Apr, 2002 1 commit
Andrew Morton authored
I'd like to be able to claim amazing speedups, but the best benchmark I could find was diffing two 256 megabyte files, which is about 10% quicker. And that is probably due to the window size being effectively 50% larger. Fact is, any disk worth owning nowadays has a segmented 2-megabyte cache, and OS-level readahead mainly seems to save on CPU cycles rather than overall throughput. Once you start reading more streams than there are segments in the disk cache we start to win. Still. The main motivation for this work is to clean the code up, and to create a central point at which many pages are marshalled together so that they can all be encapsulated into the smallest possible number of BIOs, and injected into the request layer. A number of filesystems were poking around inside the readahead state variables. I'm not really sure what they were up to, but I took all that out. The readahead code manages its own state autonomously and should not need any hints.
- Unifies the current three readahead functions (mmap reads, read(2) and sys_readahead) into a single implementation.
- More aggressive in building up the readahead windows.
- More conservative in tearing them down.
- Special start-of-file heuristics.
- Preallocates the readahead pages, to avoid the (never demonstrated, but potentially catastrophic) scenario where allocation of readahead pages causes the allocator to perform VM writeout.
- Gets all the readahead pages gathered together in one spot, so they can be marshalled into big BIOs.
- Reinstates the readahead ioctls, so hdparm(8) and blockdev(8) are working again. The readahead settings are now per-request-queue, and the drivers never have to know about it. I use blockdev(8). It works in units of 512 bytes.
- Identifies readahead thrashing, and also attempts to handle it. Certainly the changes here delay the onset of catastrophic readahead thrashing by quite a lot, and decrease its seriousness as we get more deeply into it, but it's still pretty bad.