  1. 28 Jun, 2003 1 commit
  2. 09 Apr, 2003 1 commit
    • [PATCH] Replace the radix-tree rwlock with a spinlock · 8e98702b
      Andrew Morton authored
       Spinlocks don't have a bus-locked unlock and are faster.
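
       Roughly, the change is this (a sketch, not the patch itself; the lock and
       tree field names are illustrative):

       	/* before: reader/writer lock around the radix tree */
       	read_lock(&mapping->page_lock);
       	page = radix_tree_lookup(&mapping->page_tree, index);
       	read_unlock(&mapping->page_lock);

       	/* after: a spinlock - the unlock is a plain store, no locked bus cycle */
       	spin_lock(&mapping->page_lock);
       	page = radix_tree_lookup(&mapping->page_tree, index);
       	spin_unlock(&mapping->page_lock);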
      
      On a P4, time to write a 4M file with 4M one-byte-write()s:
      
      Before:
      	0.72s user 5.47s system 99% cpu 6.227 total
      	0.76s user 5.40s system 100% cpu 6.154 total
      	0.77s user 5.38s system 100% cpu 6.146 total
      
      After:
      	1.09s user 4.92s system 99% cpu 6.014 total
      	0.74s user 5.28s system 99% cpu 6.023 total
      	1.03s user 4.97s system 100% cpu 5.991 total
      8e98702b
  3. 08 Mar, 2003 1 commit
    • [PATCH] Allow VFS readahead to fall to zero · bc858911
      Andrew Morton authored
       Some workloads really, really want to have no readahead: databases which are
       performing small synchronous I/Os against a file which has extremely poor
       layout.  Any readahead at all is a loss here.
      
      But the current readahead code refuses to adapt that low.
      
      Fix it up so that we can indeed adaptively disable readahead altogether, and
      do not start it again until we have seen max_readahead()'s worth of
      consecutive reads.
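
       A minimal sketch of the idea (the field names here are illustrative, not
       the actual patch):

       	if (!sequential_read) {
       		/* random access: let the window shrink all the way to zero */
       		if (ra->next_size)
       			ra->next_size--;
       	} else if (++ra->serial_cnt >= max_readahead(mapping)) {
       		/* re-enable only after max_readahead()'s worth of sequential reads */
       		ra->next_size = max_readahead(mapping);
       	}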
      bc858911
  4. 06 Feb, 2003 1 commit
  5. 04 Feb, 2003 1 commit
    • [PATCH] cleanup in read_cache_pages() · 99c88bc2
      Andrew Morton authored
      Patch from Nikita Danilov <Nikita@Namesys.COM>
      
      read_cache_pages() is passed a bunch of pages to start I/O against and it is
      supposed to consume all those pages.  But if there is an I/O error, someone
       needs to throw away the unused pages.
      
      At present the single user of read_cache_pages() (nfs_readpages) does that
      cleanup by hand.  But it should be done in the core kernel.
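
       The cleanup that moves into read_cache_pages() is basically this (a sketch,
       assuming the unconsumed pages are still linked on the list):

       	/* on error, release every page we were handed but never started I/O on */
       	while (!list_empty(pages)) {
       		struct page *victim = list_entry(pages->prev, struct page, lru);

       		list_del(&victim->lru);
       		page_cache_release(victim);
       	}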
      99c88bc2
  6. 14 Jan, 2003 1 commit
  7. 05 Jan, 2003 1 commit
    • [PATCH] handle radix_tree_node allocation failures · c3ed96a7
      Andrew Morton authored
      This patch uses the radix_tree_preload() API in add_to_page_cache().
      
      A new gfp_mask argument is added to add_to_page_cache(), which is then passed
      on to radix_tree_preload().   It's pretty simple.
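
       In outline it looks like this (a sketch with error handling and the usual
       list/flag updates trimmed; the tree-lock name is illustrative):

       	int add_to_page_cache(struct page *page, struct address_space *mapping,
       			unsigned long offset, int gfp_mask)
       	{
       		int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

       		if (error == 0) {
       			write_lock(&mapping->page_lock);
       			error = radix_tree_insert(&mapping->page_tree, offset, page);
       			if (error == 0)
       				page_cache_get(page);
       			write_unlock(&mapping->page_lock);
       			radix_tree_preload_end();
       		}
       		return error;
       	}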
      
      In the case of adding pages to swapcache we're still using GFP_ATOMIC, so
      these addition attempts can still fail.  That's OK, because the error is
       handled and, unlike file pages, it will not cause user application failures.
      This codepath (radix-tree node exhaustion on swapout) was well tested in the
      days when the swapper_space radix tree was fragmented all over the place due
      to unfortunate swp_entry bit layout.
      c3ed96a7
  8. 14 Dec, 2002 2 commits
    • [PATCH] madvise_willneed() maximum readahead checking · 654107b9
      Andrew Morton authored
      madvise_willneed() currently has a very strange check on how much readahead
      it is prepared to do.
      
        It is based on the user's rss limit.  But this is usually enormous, and
        the user isn't necessarily going to map all that memory at the same time
        anyway.
      
        And the logic is wrong - it is comparing rss (which is in bytes) with
        `end - start', which is in pages.
      
        And it returns -EIO on error, which is not mentioned in the Open Group
        spec and doesn't make sense.
      
      
      This patch takes it all out and applies the same upper limit as is used in
      sys_readahead() - half the inactive list.
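
       Sketch of the new clamp ("inactive_pages" stands in for whatever counter
       tracks the size of the inactive list):

       	unsigned long max = inactive_pages / 2;	/* same cap as sys_readahead() */

       	if (end - start > max)			/* both sides are in pages now */
       		end = start + max;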
      654107b9
    • [PATCH] limit pinned memory due to readahead · 234931ab
      Andrew Morton authored
      readahead allocates all the pages before starting I/O.  Potentially bad
      if someone is performing huge reads with madvise or sys_readahead().
      
      So the patch just busts that up into two-megabyte units.
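
       The chunking amounts to something like this (a sketch; the helper name
       follows the readahead code of that era):

       	while (nr_to_read) {
       		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;

       		if (this_chunk > nr_to_read)
       			this_chunk = nr_to_read;
       		if (__do_page_cache_readahead(mapping, filp, offset, this_chunk) < 0)
       			break;
       		offset += this_chunk;
       		nr_to_read -= this_chunk;
       	}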
      234931ab
  9. 05 Nov, 2002 2 commits
    • [PATCH] Convert NFS client to use ->readpages() · b9a2dd76
      Trond Myklebust authored
        - Add the library function read_cache_pages(), which is used in a
          similar fashion to the single page 'read_cache_page()'. It hides
           the details of the LRU cache etc. from a filesystem that wants
           to populate an address space with a list of pages.
      
        - Fix NFS so that readahead uses the ->readpages() interface. Means
          that we can immediately schedule an RPC call in order to complete
          the I/O, rather than relying on somebody later triggering it by
          calling lock_page() (and hence sync_page()). The sync_page()
          method is race-prone, since the waiting page may try to call it
          before we've finished initializing the 'struct nfs_page'.
      
        - Clear out nfs_sync_page(), the nfs_inode->read list, and
          friends. When the I/O completion gets scheduled in ->readpage(),
          ->readpages(), they have no reason to exist.
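
       A filesystem's ->readpages() can then be little more than a call into the
       new helper (a sketch; the "myfs" names are hypothetical):

       	static int myfs_readpages(struct file *filp, struct address_space *mapping,
       			struct list_head *pages, unsigned nr_pages)
       	{
       		/* filler is called once per page; the helper deals with the LRU */
       		return read_cache_pages(mapping, pages, myfs_readpage_filler, filp);
       	}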
      b9a2dd76
    • [PATCH] Make ->readpages palatable to NFS · b729e488
      Trond Myklebust authored
      The following patch makes the ->readpages() address_space_operation
      take a struct file argument just like ->readpage().
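
       i.e. the address_space_operations hook becomes, roughly:

       	int (*readpages)(struct file *filp, struct address_space *mapping,
       			struct list_head *pages, unsigned nr_pages);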
      b729e488
  10. 30 Oct, 2002 2 commits
    • [PATCH] hot-n-cold pages: free and allocate hints · 8d6282a1
      Andrew Morton authored
      Add a `cold' hint to struct pagevec, and teach truncate and page
      reclaim to use it.
      
      Empirical testing showed that truncate's pages tend to be hot.  And page
      reclaim's are certainly cold.
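
       The hint is just the second argument to pagevec_init(); a sketch of the
       release-side usage (the wrapper function is hypothetical):

       	static void release_cold_pages(struct page **pages, int nr)
       	{
       		struct pagevec pvec;
       		int i;

       		pagevec_init(&pvec, 1);		/* 1 = these pages are cache-cold */
       		for (i = 0; i < nr; i++)
       			if (!pagevec_add(&pvec, pages[i]))
       				__pagevec_release(&pvec);
       		pagevec_release(&pvec);
       	}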
      8d6282a1
    • [PATCH] hot-n-cold pages: use cold pages for readahead · 5019ce29
      Andrew Morton authored
      It is usually the case that pagecache reads use busmastering hardware
      to transfer the data into pagecache.  This invalidates the CPU cache of
      the pagecache pages.
      
      So use cache-cold pages for pagecache reads.  To avoid wasting
      cache-hot pages.
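
       i.e. the readahead allocation becomes, roughly:

       	page = page_cache_alloc_cold(mapping);	/* hand back a cache-cold page */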
      5019ce29
  11. 29 Oct, 2002 1 commit
    • [PATCH] add a file_ra_state init function · 6b390b3b
      Andrew Morton authored
      Provide a function in core kernel to initialise a file_ra_state structure.
      
       Previously this was all taken care of by the fact that new struct
       files are all zeroed out.  But now a file_ra_state may be
      independently allocated, and we don't want users of it to have to know
      how to initialise it.
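
       A caller that allocates its own file_ra_state would do something like:

       	struct file_ra_state ra;

       	file_ra_state_init(&ra, inode->i_mapping);	/* sane defaults, ra_pages etc. */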
      6b390b3b
  12. 17 Oct, 2002 1 commit
    • [PATCH] do_generic_file_read / readahead adjustments · 9de05205
      David Howells authored
      This does the following three things:
      
       (1) Makes the functions in mm/readahead.c only use struct file* to pass to
            readpage(). address_space* and file_ra_state* are used instead to keep
           track of readahead stuff.
      
       (2) Adds a new function do_generic_mapping_read() that is similar to
           do_generic_file_read(), except that it uses a mapping pointer and a
           readahead state pointer to access a file. The file* is only used to pass
           to readpage().
      
       (3) Turns do_generic_file_read() into an inline function in linux/fs.h that
           simply wraps do_generic_mapping_read().
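
       The wrapper in (3) ends up looking roughly like this (the exact path to the
       mapping may differ):

       	static inline void do_generic_file_read(struct file *filp, loff_t *ppos,
       			read_descriptor_t *desc, read_actor_t actor)
       	{
       		do_generic_mapping_read(filp->f_dentry->d_inode->i_mapping,
       					&filp->f_ra, filp, ppos, desc, actor);
       	}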
      
      This should mean that it is no longer necessary to have a struct file to
      access a file in this manner. Just an inode or address space should be
      sufficient.
      
      It also means alternate read-ahead structures can be maintained.
      
      The reason I want this is that I'm writing a general cache manager for
      filesystems such as AFS, NFSv4, and Lustre. Block devices are made available
      to the "cache manager" by means of a filesystem that can be mounted. I'm
      storing meta data in an inode in the cache, but to scan this at the moment I
      need to gain a "struct file" to use with do_generic_file_read().
      
      This involves either creating a dummy dentry and struct file (which will cause
       Al Viro to come looking for me with a shotgun), or to use an extra auxiliary
      filesystem mounted with do_kern_mount(), neither of which are particularly
      appealing.
      
      This patch is the alternative... it provides a function that I can pass an
      address_space to. This also allows me to make use of readahead semantics
      without having to reinvent them for myself.
      9de05205
  13. 15 Sep, 2002 1 commit
    • [PATCH] hold the page ref across ->readpage · f3b3dc81
      Andrew Morton authored
      read_pages() is dropping the page refcount before running ->readpage().
      Which just happens to work, because the page is in pagecache and
      locked.
      
      But it breaks under some unconventional things which reiser4 is doing,
      and it's better/safer/saner this way anyway.
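
       The fix in read_pages() is essentially a reordering (sketch):

       	if (!add_to_page_cache(page, mapping, page->index))
       		mapping->a_ops->readpage(filp, page);
       	page_cache_release(page);	/* drop the ref only after ->readpage() returns */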
      f3b3dc81
  14. 15 Aug, 2002 1 commit
    • [PATCH] batched addition of pages to the LRU · 9eb76ee2
      Andrew Morton authored
      The patch goes through the various places which were calling
      lru_cache_add() against bulk pages and batches them up.
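
       The batching follows the usual pagevec pattern, roughly (next_page_to_add()
       stands in for wherever the pages actually come from):

       	struct pagevec lru_pvec;

       	pagevec_init(&lru_pvec, 0);
       	while ((page = next_page_to_add()) != NULL) {
       		if (!pagevec_add(&lru_pvec, page))
       			__pagevec_lru_add(&lru_pvec);	/* full: add the batch under one lock */
       	}
       	pagevec_lru_add(&lru_pvec);			/* flush whatever is left over */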
      
      Also.  This whole patch series improves the behaviour of the system
      under heavy writeback load.  There is a reduction in page allocation
      failures, some reduction in loss of interactivity due to page
      allocators getting stuck on writeback from the VM.  (This is still bad
      though).
      
      I think it's due to the change here in mpage_writepages().  That
      function was originally unconditionally refiling written-back pages to
      the head of the inactive list.  The theory being that they should be
      moved out of the way of page allocators, who would end up waiting on
      them.
      
      It appears that this simply had the effect of pushing dirty, unwritten
      data closer to the tail of the inactive list, making things worse.
      
      So instead, if the caller is (typically) balance_dirty_pages() then
      leave the pages where they are on the LRU.
      
      If the caller is PF_MEMALLOC then the pages *have* to be refiled.  This
      is because VM writeback is clustered along mapping->dirty_pages, and
      it's almost certain that the pages which are being written are near the
      tail of the LRU.  If they were left there, page allocators would block
      on them too soon.  It would effectively become a synchronous write.
      9eb76ee2
  15. 01 Aug, 2002 1 commit
  16. 19 Jul, 2002 2 commits
    • [PATCH] readahead optimisations · b6938a7b
      Andrew Morton authored
      Been looking at a workload which involves several processes which seek
      around and read from a large file.  There are a few problems:
      generic_file_lseek is bouncing i_sem around like mad, and readahead is
      doing lots of pointless pagecache probing.
      
      This patch addresses readahead.
      
      Presumably the change will be larger on machines which have higher
      bandwidth memory than my test box, of which there are many.
      
      This patch teaches readahead to detect the situation where no IO is
      actually being performed as a result of its actions.  Now, we don't
      want to sacrifice IO efficiency to save a bit of CPU, so the code is
      very cautious.  But eventually, after some tens of consecutive
      readahead attempts were found to perform no I/O at all, readahead will
      turn itself off.
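
       In outline, with illustrative field names (not the literal patch):

       	if (page_was_already_cached)
       		ra->cache_hits++;
       	else
       		ra->cache_hits = 0;

       	if (ra->cache_hits > MAX_USELESS_READAHEADS) {
       		ra->next_size = 0;	/* stop probing the pagecache entirely... */
       		return;			/* ...until a pagecache miss turns it back on */
       	}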
      
      readahead will be turned on again when either generic_file_read() or
      filemap_nopage() get a pagecache miss.  The function
      handle_ra_thrashing() has been renamed to handle_ra_miss() to reflect
      its widened role.
      
      A performance bug in page_cache_readround() was fixed - if
      ra->next_size is zero, that function needs to leave it well alone,
      because next_size==0 is a magic value meaning that the file has just
      been opened and that readahead needs to get aggressive.  This change
      makes a `make dep' run at the same speed as in the 2.4 kernel.  It used
      to take 4x as long...
      
      `make dep' is an interesting test because it uses mmap to read the files.
      b6938a7b
    • [PATCH] remove add_to_page_cache_unique() · cad46d66
      Andrew Morton authored
       A tasty patch from Hugh Dickins.  radix_tree_insert() fails if something
      was already present at the target index, so that error can be
      propagated back through add_to_page_cache().  Hence
      add_to_page_cache_unique() is obsolete.
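
       That is, the insert itself reports the collision, roughly:

       	error = radix_tree_insert(&mapping->page_tree, offset, page);
       	if (error == -EEXIST)
       		/* something already lives at this index; the separate
       		 * add_to_page_cache_unique() probe is now redundant */
       		return error;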
      
      Hugh's patch removes add_to_page_cache_unique() and cleans up a bunch of
      stuff.
      cad46d66
  17. 12 Jun, 2002 1 commit
  18. 28 May, 2002 1 commit
    • [PATCH] block plugging reworked · eba5b46c
      Jens Axboe authored
      This patch provides the ability for a block driver to signal it's too
      busy to receive more work and temporarily halt the request queue. In
      concept it's similar to the networking netif_{start,stop}_queue helpers.
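
       A driver would use it along these lines (a sketch; device_is_full() is a
       stand-in for the driver's own resource check):

       	/* in the request_fn: the hardware can take no more work right now */
       	if (device_is_full(dev)) {
       		blk_stop_queue(q);
       		return;
       	}

       	/* later, from the completion handler, once resources free up */
       	blk_start_queue(q);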
      
      To do this cleanly, I've ripped out the old tq_disk task queue. Instead
      an internal list of plugged queues is maintained which will honor the
      current queue state (see QUEUE_FLAG_STOPPED bit). Execution of
      request_fn has been moved to tasklet context. blk_run_queues() provides
      similar functionality to the old run_task_queue(&tq_disk).
      
      Now, this only works at the request_fn level and not at the
      make_request_fn level. This is on purpose: drivers working at the
      make_request_fn level are essentially providing a piece of the block
      level infrastructure themselves. There are basically two reasons for
      doing make_request_fn style setups:
      
      o block remappers. start/stop functionality will be done at the target
        device in this case, which is the level that will signal hardware full
        (or continue) anyways.
      
      o drivers who wish to receive single entities of "buffers" and not
        merged requests etc. This could use the start/stop functionality. I'd
        suggest _still_ using a request_fn for these, but set the queue
        options so that no merging etc ever takes place. This has the added
        bonus of providing the usual request depletion throttling at the block
        level.
      eba5b46c
  19. 27 May, 2002 1 commit
    • [PATCH] direct-to-BIO readahead · bc67de55
      Andrew Morton authored
      Implements BIO-based multipage reads into the pagecache, and turns this
      on for ext2.
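
       For ext2 the wiring is essentially one line in its
       address_space_operations (sketch):

       	static int ext2_readpages(struct file *file, struct address_space *mapping,
       			struct list_head *pages, unsigned nr_pages)
       	{
       		return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
       	}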
      
      CPU load for `cat large_file > /dev/null' is reduced by approximately
      15%.  Similar reductions for tiobench with a single thread.  (Earlier
      claims of 25% were exaggerated - they were measured with slab debug
      enabled.  But 15% isn't bad for a load which is dominated by copy_*_user
      costs).
      
      With 2, 4 and 8 tiobench threads, throughput is increased as well, which was
      unexpected.  It's due to request queue weirdness.  (Generally the
      request queueing is doing bad things under certain workloads - that's a
      separate issue.)
      
      BIOs of up to 64 kbytes are assembled and submitted for readahead and
      for single-page reads.  So the work involved in reading 32 pages has gone
      from:
      
      	- allocate and attach 32 buffer_heads
      	- submit 32 buffer_heads
      	- allocate 32 bios
      	- submit 32 bios
      
      to:
      
      	- allocate 2 bios
      	- submit 2 bios
      
      These pages never have buffers attached.  Buffers will be attached
      later if the application writes to these pages (file overwrite).
      
      The first version of this code (in the "delayed allocation" patches)
      tries to handle everything - bios which start mid-page, bios which end
      mid-page and pages which are covered by multiple bios.  It is very
      complex code and in fact appears to be incorrect: out-of-order BIO
      completion could cause a page to come unlocked at the wrong time.
      
      This implementation is much simpler: if things get complex, it just
      falls back to the buffer-based block_read_full_page(), which isn't
      going away, and which understands all that complexity.  There's no
      point in doing this in two places.
      
      This code will bypass the buffer layer for
      
       - fully-mapped pages which are on-disk contiguous.
      
        - fully unmapped pages (holes)
      
       - partially unmapped pages, where the unmappedness is at the end of
         the page (end-of-file).
      
      and everything else falls back to buffers.
      
      This means that with blocksize == PAGE_CACHE_SIZE, 100% of pages are
      handed direct to BIO.  With a heavy 10-minute dbench run on 4k
      PAGE_CACHE_SIZE and 1k blocks, 95% of pages were handed direct to BIO.
      Almost all of the other 5% were passed to block_read_full_page()
      because they were already partially uptodate from an earlier sub-page
      write().  This ratio will fall if PAGE_CACHE_SIZE/blocksize is greater
      than four.  But if that's the case, CPU efficiency is far from the main
      concern - there are significant seek and bandwidth problems just at 4
      blocks per page.
      
      This code will stress out the block layer somewhat - RAID0 doesn't like
      multipage BIOs, and there are probably others.  RAID0 seems to struggle
      along - readahead fails but read falls back to single-page reads, which
      succeed.  Such problems may be worked around by setting MPAGE_BIO_MAX_SIZE
      to PAGE_CACHE_SIZE in fs/mpage.c.
      
      It is trivial to enable multipage reads for many other filesystems.  We
      can do that after completion of external testing of ext2.
      bc67de55
  20. 19 May, 2002 2 commits
    • [PATCH] pdflush exclusion infrastructure · 1f6acea0
      Andrew Morton authored
      Collision avoidance for pdflush threads.
      
      Turns the request_queue-based `unsigned long ra_pages' into a structure
      which contains ra_pages as well as a longword.
      
      That longword is used to record the fact that a pdflush thread is
      currently writing something back against this request_queue.
      
      Avoids the situation where several pdflush threads are sleeping on the
      same request_queue.
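
       The intended use of that longword is a simple test-and-set (a sketch; the
       names here are illustrative stand-ins):

       	if (!test_and_set_bit(BDI_pdflush, &bdi->state)) {
       		writeback_this_queue();			/* we are the only pdflush here */
       		clear_bit(BDI_pdflush, &bdi->state);
       	} else {
       		/* another pdflush thread already owns this request_queue */
       	}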
      
      This patch provides only the infrastructure for the pdflush exclusion.
      This infrastructure gets used in pdflush-single.patch
      1f6acea0
    • [PATCH] reduce lock contention in do_pagecache_readahead · cd016d80
      Andrew Morton authored
      Anton Blanchard has a workload (the SDET benchmark) which is showing some
      moderate lock contention in do_pagecache_readahead().
      
      Seems that SDET has many threads performing seeky reads against a
      cached file.  The average number of pagecache probes in a single
      do_pagecache_readahead() is six, which seems reasonable.
      
      The patch (from Anton) flips the locking around to optimise for the
      fast case (page was present).  So the kernel takes the lock less often,
      and does more work once it has been acquired.
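
       Roughly, the probe loop becomes (a sketch, using the rwlock of that era):

       	read_lock(&mapping->page_lock);
       	for (i = 0; i < nr_to_read; i++) {
       		page = radix_tree_lookup(&mapping->page_tree, offset + i);
       		if (page)
       			continue;			/* fast path: already cached, keep the lock */

       		read_unlock(&mapping->page_lock);	/* slow path only: drop it to allocate */
       		page = page_cache_alloc(mapping);
       		/* ... queue the new page for reading ... */
       		read_lock(&mapping->page_lock);
       	}
       	read_unlock(&mapping->page_lock);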
      cd016d80
  21. 30 Apr, 2002 2 commits
    • [PATCH] page writeback locking update · a2bcb3a0
      Andrew Morton authored
      - Fixes a performance problem - callers of
        prepare_write/commit_write, etc are locking pages, which synchronises
        them behind writeback, which also locks these pages.  Significant
        slowdowns for some workloads.
      
      - So pages are no longer locked while under writeout.  Introduce a
        new PG_writeback and associated infrastructure to support this design
        change.
      
      - Pages which are under read I/O still use PageLocked.  Pages which
        are under write I/O have PageWriteback() true.
      
        I considered creating Page_IO instead of PageWriteback, and marking
        both readin and writeout pages as PageIO().  So pages are unlocked
        during both read and write.  There just doesn't seem a need to do
        this - nobody ever needs unblocking access to a page which is under
        read I/O.
      
      - Pages under swapout (brw_page) are PageLocked, not PageWriteback.
         So their treatment is unchanged.
      
        It's not obvious that pages which are under swapout actually need
        the more asynchronous behaviour of PageWriteback.
      
        I was setting the swapout pages PageWriteback and unlocking them
        prior to submitting the buffers in brw_page().  This led to deadlocks
        on the exit_mmap->zap_page_range->free_swap_and_cache path.  These
        functions call block_flushpage under spinlock.  If the page is
        unlocked but has locked buffers, block_flushpage->discard_buffer()
        sleeps.  Under spinlock.  So that will need fixing if for some reason
        we want swapout to use PageWriteback.
      
        Kernel has called block_flushpage() under spinlock for a long time.
         It is assuming that a locked page will never have locked buffers.
        This appears to be true, but it's ugly.
      
      - Adds new function wait_on_page_writeback().  Renames wait_on_page()
        to wait_on_page_locked() to remind people that they need to call the
        appropriate one.
      
      - Renames filemap_fdatasync() to filemap_fdatawrite().  It's more
        accurate - "sync" implies, if anything, writeout and wait.  (fsync,
        msync) Or writeout.  it's not clear.
      
      - Subtly changes the filemap_fdatawrite() internals - this function
        used to do a lock_page() - it waited for any other user of the page
        to let go before submitting new I/O against a page.  It has been
        changed to simply skip over any pages which are currently under
        writeback.
      
        This is the right thing to do for memory-cleansing reasons.
      
        But it's the wrong thing to do for data consistency operations (eg,
        fsync()).  For those operations we must ensure that all data which
        was dirty *at the time of the system call* are tight on disk before
        the call returns.
      
        So all places which care about this have been converted to do:
      
      	filemap_fdatawait(mapping);	/* Wait for current writeback */
      	filemap_fdatawrite(mapping);	/* Write all dirty pages */
      	filemap_fdatawait(mapping);	/* Wait for I/O to complete */
      
      - Fixes a truncate_inode_pages problem - truncate currently will
        block when it hits a locked page, so it ends up getting into lockstep
        behind writeback and all of the file is pointlessly written back.
      
        One fix for this is for truncate to simply walk the page list in the
        opposite direction from writeback.
      
        I chose to use a separate cleansing pass.  It is more
        CPU-intensive, but it is surer and clearer.  This is because there is
        no reason why the per-address_space ->vm_writeback and
        ->writeback_mapping functions *have* to perform writeout in
        ->dirty_pages order.  They may choose to do something totally
        different.
      
        (set_page_dirty() is an a_op now, so address_spaces could almost
        privatise the whole dirty-page handling thing.  Except
        truncate_inode_pages and invalidate_inode_pages assume that the pages
        are on the address_space lists.  hmm.  So making truncate_inode_pages
        and invalidate_inode_pages a_ops would make some sense).
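
       Pulling the PG_writeback pieces above together, the writeout life-cycle now
       looks roughly like this (submit_the_io() is a stand-in for starting the
       actual write):

       	lock_page(page);			/* brief exclusion against other users */
       	wait_on_page_writeback(page);		/* any previous writeout must finish */
       	SetPageWriteback(page);
       	unlock_page(page);			/* page stays "busy" via PG_writeback only */

       	submit_the_io(page);

       	/* ...and in the I/O completion handler: */
       	end_page_writeback(page);		/* clears PG_writeback, wakes waiters */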
      a2bcb3a0
    • [PATCH] readahead fix · 00d6555e
      Andrew Morton authored
      Changes the way in which the readahead code locates the readahead
      setting for the underlying device.
      
      - struct block_device and struct address_space gain a *pointer* to the
        current readahead tunable.
      
      - The tunable lives in the request queue and is altered with the
        traditional ioctl.
      
      - The value gets *copied* into the struct file at open() time.  So a
        fcntl() mode to modify it per-fd is simple.
      
      - Filesystems which are not request_queue-backed get the address of the
        global `default_ra_pages'.  If we want, this can become a tunable.
      
      - Filesystems are at liberty to alter address_space.ra_pages to point
        at some other fs-private default at new_inode/read_inode/alloc_inode
        time.
      
      - The ra_pages pointer can become a structure pointer if, at some time
        in the future, high-level code needs more detailed information about
        device characteristics.
      
        In fact, it'll need to become a struct pointer for use by
        writeback: my current writeback code has the problem that multiple
        pdflush threads can get stuck on the same request queue.  That's a
        waste of resources.  I currently have a silly flag in the superblock
        to try to avoid this.
      
        The proper way to get this exclusion is for the high-level
        writeback code to be able to do a test-and-set against a
        per-request_queue flag.  That flag can live in a structure alongside
        ra_pages, conveniently accessible at the pagemap level.
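
       Sketched with the names used in the description above (not necessarily the
       final code), the per-file copy amounts to:

       	/* at open() time: snapshot the device's current readahead setting */
       	file->f_ra.ra_pages = *inode->i_mapping->ra_pages;

       	/* a later fcntl() could then adjust file->f_ra.ra_pages for this fd alone */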
      
      One thing still to-be-done is going into all callers of blk_init_queue
      and blk_queue_make_request and making sure that they're setting up a
      sensible default.  ATA wants 248 sectors, and floppy drives don't want
      128kbytes, I suspect.  Later.
      00d6555e
  22. 26 Apr, 2002 2 commits
  23. 25 Apr, 2002 1 commit
  24. 10 Apr, 2002 1 commit
    • [PATCH] readahead · 8fa49846
      Andrew Morton authored
      I'd like to be able to claim amazing speedups, but
      the best benchmark I could find was diffing two
      256 megabyte files, which is about 10% quicker.  And
      that is probably due to the window size being effectively
      50% larger.
      
      Fact is, any disk worth owning nowadays has a segmented
      2-megabyte cache, and OS-level readahead mainly seems
      to save on CPU cycles rather than overall throughput.
      Once you start reading more streams than there are segments
      in the disk cache we start to win.
      
      Still.  The main motivation for this work is to
      clean the code up, and to create a central point at
      which many pages are marshalled together so that
      they can all be encapsulated into the smallest possible
      number of BIOs, and injected into the request layer.
      
      A number of filesystems were poking around inside the
      readahead state variables.  I'm not really sure what they
      were up to, but I took all that out.  The readahead
      code manages its own state autonomously and should not
      need any hints.
      
      - Unifies the current three readahead functions (mmap reads, read(2)
        and sys_readhead) into a single implementation.
      
      - More aggressive in building up the readahead windows.
      
      - More conservative in tearing them down.
      
      - Special start-of-file heuristics.
      
      - Preallocates the readahead pages, to avoid the (never demonstrated,
        but potentially catastrophic) scenario where allocation of readahead
        pages causes the allocator to perform VM writeout.
      
      - Gets all the readahead pages gathered together in
        one spot, so they can be marshalled into big BIOs.
      
      - reinstates the readahead ioctls, so hdparm(8) and blockdev(8)
        are working again.  The readahead settings are now per-request-queue,
        and the drivers never have to know about it.  I use blockdev(8).
        It works in units of 512 bytes.
      
      - Identifies readahead thrashing.
      
        Also attempts to handle it.  Certainly the changes here
        delay the onset of catastrophic readahead thrashing by
         quite a lot, and decrease its seriousness as we get more
        deeply into it, but it's still pretty bad.
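
       The reinstated ioctls can be driven from user space roughly like this
       (512-byte sector units, as noted above):

       	long ra;

       	ioctl(fd, BLKRAGET, &ra);	/* read the current per-queue readahead */
       	ioctl(fd, BLKRASET, 256);	/* set it to 256 sectors = 128 kbytes */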
      8fa49846