  1. 28 Jun, 2003 1 commit
  2. 09 Apr, 2003 1 commit
    • [PATCH] Replace the radix-tree rwlock with a spinlock · 8e98702b
      Andrew Morton authored
       Spinlocks don't have a bus-locked unlock and are faster.
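
       Roughly, the change is this (a sketch, not the patch itself; the lock and
       tree field names are illustrative):

       	/* before: reader/writer lock around the radix tree */
       	read_lock(&mapping->page_lock);
       	page = radix_tree_lookup(&mapping->page_tree, index);
       	read_unlock(&mapping->page_lock);

       	/* after: a spinlock - the unlock is a plain store, no locked bus cycle */
       	spin_lock(&mapping->page_lock);
       	page = radix_tree_lookup(&mapping->page_tree, index);
       	spin_unlock(&mapping->page_lock);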
      
      On a P4, time to write a 4M file with 4M one-byte-write()s:
      
      Before:
      	0.72s user 5.47s system 99% cpu 6.227 total
      	0.76s user 5.40s system 100% cpu 6.154 total
      	0.77s user 5.38s system 100% cpu 6.146 total
      
      After:
      	1.09s user 4.92s system 99% cpu 6.014 total
      	0.74s user 5.28s system 99% cpu 6.023 total
      	1.03s user 4.97s system 100% cpu 5.991 total
      8e98702b
  3. 08 Mar, 2003 1 commit
    • [PATCH] Allow VFS readahead to fall to zero · bc858911
      Andrew Morton authored
       Some workloads really, really want to have no readahead: databases which are
       performing small synchronous I/Os against a file which has extremely poor
       layout.  Any readahead at all is a loss here.
      
      But the current readahead code refuses to adapt that low.
      
      Fix it up so that we can indeed adaptively disable readahead altogether, and
      do not start it again until we have seen max_readahead()'s worth of
      consecutive reads.
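
       A minimal sketch of the idea (the field names here are illustrative, not
       the actual patch):

       	if (!sequential_read) {
       		/* random access: let the window shrink all the way to zero */
       		if (ra->next_size)
       			ra->next_size--;
       	} else if (++ra->serial_cnt >= max_readahead(mapping)) {
       		/* re-enable only after max_readahead()'s worth of sequential reads */
       		ra->next_size = max_readahead(mapping);
       	}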
      bc858911
  4. 06 Feb, 2003 1 commit
  5. 04 Feb, 2003 1 commit
    • [PATCH] cleanup in read_cache_pages() · 99c88bc2
      Andrew Morton authored
      Patch from Nikita Danilov <Nikita@Namesys.COM>
      
      read_cache_pages() is passed a bunch of pages to start I/O against and it is
      supposed to consume all those pages.  But if there is an I/O error, someone
       needs to throw away the unused pages.
      
      At present the single user of read_cache_pages() (nfs_readpages) does that
      cleanup by hand.  But it should be done in the core kernel.
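
       The cleanup that moves into read_cache_pages() is basically this (a sketch,
       assuming the unconsumed pages are still linked on the list):

       	/* on error, release every page we were handed but never started I/O on */
       	while (!list_empty(pages)) {
       		struct page *victim = list_entry(pages->prev, struct page, lru);

       		list_del(&victim->lru);
       		page_cache_release(victim);
       	}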
      99c88bc2
  6. 14 Jan, 2003 1 commit
  7. 05 Jan, 2003 1 commit
    • [PATCH] handle radix_tree_node allocation failures · c3ed96a7
      Andrew Morton authored
      This patch uses the radix_tree_preload() API in add_to_page_cache().
      
      A new gfp_mask argument is added to add_to_page_cache(), which is then passed
      on to radix_tree_preload().   It's pretty simple.
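
       In outline it looks like this (a sketch with error handling and the usual
       list/flag updates trimmed; the tree-lock name is illustrative):

       	int add_to_page_cache(struct page *page, struct address_space *mapping,
       			unsigned long offset, int gfp_mask)
       	{
       		int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);

       		if (error == 0) {
       			write_lock(&mapping->page_lock);
       			error = radix_tree_insert(&mapping->page_tree, offset, page);
       			if (error == 0)
       				page_cache_get(page);
       			write_unlock(&mapping->page_lock);
       			radix_tree_preload_end();
       		}
       		return error;
       	}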
      
      In the case of adding pages to swapcache we're still using GFP_ATOMIC, so
      these addition attempts can still fail.  That's OK, because the error is
       handled and, unlike file pages, it will not cause user application failures.
      This codepath (radix-tree node exhaustion on swapout) was well tested in the
      days when the swapper_space radix tree was fragmented all over the place due
      to unfortunate swp_entry bit layout.
      c3ed96a7
  8. 14 Dec, 2002 2 commits
    • [PATCH] madvise_willneed() maximum readahead checking · 654107b9
      Andrew Morton authored
      madvise_willneed() currently has a very strange check on how much readahead
      it is prepared to do.
      
        It is based on the user's rss limit.  But this is usually enormous, and
        the user isn't necessarily going to map all that memory at the same time
        anyway.
      
        And the logic is wrong - it is comparing rss (which is in bytes) with
        `end - start', which is in pages.
      
        And it returns -EIO on error, which is not mentioned in the Open Group
        spec and doesn't make sense.
      
      
      This patch takes it all out and applies the same upper limit as is used in
      sys_readahead() - half the inactive list.
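
       Sketch of the new clamp ("inactive_pages" stands in for whatever counter
       tracks the size of the inactive list):

       	unsigned long max = inactive_pages / 2;	/* same cap as sys_readahead() */

       	if (end - start > max)			/* both sides are in pages now */
       		end = start + max;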
      654107b9
    • [PATCH] limit pinned memory due to readahead · 234931ab
      Andrew Morton authored
      readahead allocates all the pages before starting I/O.  Potentially bad
      if someone is performing huge reads with madvise or sys_readahead().
      
      So the patch just busts that up into two-megabyte units.
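
       The chunking amounts to something like this (a sketch; the helper name
       follows the readahead code of that era):

       	while (nr_to_read) {
       		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;

       		if (this_chunk > nr_to_read)
       			this_chunk = nr_to_read;
       		if (__do_page_cache_readahead(mapping, filp, offset, this_chunk) < 0)
       			break;
       		offset += this_chunk;
       		nr_to_read -= this_chunk;
       	}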
      234931ab
  9. 05 Nov, 2002 2 commits
    • [PATCH] Convert NFS client to use ->readpages() · b9a2dd76
      Trond Myklebust authored
        - Add the library function read_cache_pages(), which is used in a
          similar fashion to the single page 'read_cache_page()'. It hides
           the details of the LRU cache etc. from a filesystem that wants
           to populate an address space with a list of pages.
      
        - Fix NFS so that readahead uses the ->readpages() interface. Means
          that we can immediately schedule an RPC call in order to complete
          the I/O, rather than relying on somebody later triggering it by
          calling lock_page() (and hence sync_page()). The sync_page()
          method is race-prone, since the waiting page may try to call it
          before we've finished initializing the 'struct nfs_page'.
      
        - Clear out nfs_sync_page(), the nfs_inode->read list, and
          friends. When the I/O completion gets scheduled in ->readpage(),
          ->readpages(), they have no reason to exist.
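
       A filesystem's ->readpages() can then be little more than a call into the
       new helper (a sketch; the "myfs" names are hypothetical):

       	static int myfs_readpages(struct file *filp, struct address_space *mapping,
       			struct list_head *pages, unsigned nr_pages)
       	{
       		/* filler is called once per page; the helper deals with the LRU */
       		return read_cache_pages(mapping, pages, myfs_readpage_filler, filp);
       	}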
      b9a2dd76
    • [PATCH] Make ->readpages palatable to NFS · b729e488
      Trond Myklebust authored
      The following patch makes the ->readpages() address_space_operation
      take a struct file argument just like ->readpage().
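
       i.e. the address_space_operations hook becomes, roughly:

       	int (*readpages)(struct file *filp, struct address_space *mapping,
       			struct list_head *pages, unsigned nr_pages);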
      b729e488
  10. 30 Oct, 2002 2 commits
    • [PATCH] hot-n-cold pages: free and allocate hints · 8d6282a1
      Andrew Morton authored
      Add a `cold' hint to struct pagevec, and teach truncate and page
      reclaim to use it.
      
      Empirical testing showed that truncate's pages tend to be hot.  And page
      reclaim's are certainly cold.
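
       The hint is just the second argument to pagevec_init(); a sketch of the
       release-side usage (the wrapper function is hypothetical):

       	static void release_cold_pages(struct page **pages, int nr)
       	{
       		struct pagevec pvec;
       		int i;

       		pagevec_init(&pvec, 1);		/* 1 = these pages are cache-cold */
       		for (i = 0; i < nr; i++)
       			if (!pagevec_add(&pvec, pages[i]))
       				__pagevec_release(&pvec);
       		pagevec_release(&pvec);
       	}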
      8d6282a1
    • [PATCH] hot-n-cold pages: use cold pages for readahead · 5019ce29
      Andrew Morton authored
      It is usually the case that pagecache reads use busmastering hardware
      to transfer the data into pagecache.  This invalidates the CPU cache of
      the pagecache pages.
      
      So use cache-cold pages for pagecache reads.  To avoid wasting
      cache-hot pages.
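
       i.e. the readahead allocation becomes, roughly:

       	page = page_cache_alloc_cold(mapping);	/* hand back a cache-cold page */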
      5019ce29
  11. 29 Oct, 2002 1 commit
    • [PATCH] add a file_ra_state init function · 6b390b3b
      Andrew Morton authored
      Provide a function in core kernel to initialise a file_ra_state structure.
      
       Previously this was all taken care of by the fact that new struct
       files are all zeroed out.  But now a file_ra_state may be
      independently allocated, and we don't want users of it to have to know
      how to initialise it.
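
       A caller that allocates its own file_ra_state would do something like:

       	struct file_ra_state ra;

       	file_ra_state_init(&ra, inode->i_mapping);	/* sane defaults, ra_pages etc. */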
      6b390b3b
  12. 17 Oct, 2002 1 commit
    • [PATCH] do_generic_file_read / readahead adjustments · 9de05205
      David Howells authored
      This does the following three things:
      
       (1) Makes the functions in mm/readahead.c only use struct file* to pass to
            readpage(). address_space* and file_ra_state* are used instead to keep
           track of readahead stuff.
      
       (2) Adds a new function do_generic_mapping_read() that is similar to
           do_generic_file_read(), except that it uses a mapping pointer and a
           readahead state pointer to access a file. The file* is only used to pass
           to readpage().
      
       (3) Turns do_generic_file_read() into an inline function in linux/fs.h that
           simply wraps do_generic_mapping_read().
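
       The wrapper in (3) ends up looking roughly like this (the exact path to the
       mapping may differ):

       	static inline void do_generic_file_read(struct file *filp, loff_t *ppos,
       			read_descriptor_t *desc, read_actor_t actor)
       	{
       		do_generic_mapping_read(filp->f_dentry->d_inode->i_mapping,
       					&filp->f_ra, filp, ppos, desc, actor);
       	}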
      
      This should mean that it is no longer necessary to have a struct file to
      access a file in this manner. Just an inode or address space should be
      sufficient.
      
      It also means alternate read-ahead structures can be maintained.
      
      The reason I want this is that I'm writing a general cache manager for
      filesystems such as AFS, NFSv4, and Lustre. Block devices are made available
      to the "cache manager" by means of a filesystem that can be mounted. I'm
      storing meta data in an inode in the cache, but to scan this at the moment I
      need to gain a "struct file" to use with do_generic_file_read().
      
      This involves either creating a dummy dentry and struct file (which will cause
       Al Viro to come looking for me with a shotgun), or to use an extra auxiliary
      filesystem mounted with do_kern_mount(), neither of which are particularly
      appealing.
      
      This patch is the alternative... it provides a function that I can pass an
      address_space to. This also allows me to make use of readahead semantics
      without having to reinvent them for myself.
      9de05205
  13. 15 Sep, 2002 1 commit
    • [PATCH] hold the page ref across ->readpage · f3b3dc81
      Andrew Morton authored
      read_pages() is dropping the page refcount before running ->readpage().
      Which just happens to work, because the page is in pagecache and
      locked.
      
      But it breaks under some unconventional things which reiser4 is doing,
      and it's better/safer/saner this way anyway.
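
       The fix in read_pages() is essentially a reordering (sketch):

       	if (!add_to_page_cache(page, mapping, page->index))
       		mapping->a_ops->readpage(filp, page);
       	page_cache_release(page);	/* drop the ref only after ->readpage() returns */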
      f3b3dc81
  14. 15 Aug, 2002 1 commit
    • [PATCH] batched addition of pages to the LRU · 9eb76ee2
      Andrew Morton authored
      The patch goes through the various places which were calling
      lru_cache_add() against bulk pages and batches them up.
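
       The batching follows the usual pagevec pattern, roughly (next_page_to_add()
       stands in for wherever the pages actually come from):

       	struct pagevec lru_pvec;

       	pagevec_init(&lru_pvec, 0);
       	while ((page = next_page_to_add()) != NULL) {
       		if (!pagevec_add(&lru_pvec, page))
       			__pagevec_lru_add(&lru_pvec);	/* full: add the batch under one lock */
       	}
       	pagevec_lru_add(&lru_pvec);			/* flush whatever is left over */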
      
      Also.  This whole patch series improves the behaviour of the system
      under heavy writeback load.  There is a reduction in page allocation
      failures, some reduction in loss of interactivity due to page
      allocators getting stuck on writeback from the VM.  (This is still bad
      though).
      
      I think it's due to the change here in mpage_writepages().  That
      function was originally unconditionally refiling written-back pages to
      the head of the inactive list.  The theory being that they should be
      moved out of the way of page allocators, who would end up waiting on
      them.
      
      It appears that this simply had the effect of pushing dirty, unwritten
      data closer to the tail of the inactive list, making things worse.
      
      So instead, if the caller is (typically) balance_dirty_pages() then
      leave the pages where they are on the LRU.
      
      If the caller is PF_MEMALLOC then the pages *have* to be refiled.  This
      is because VM writeback is clustered along mapping->dirty_pages, and
      it's almost certain that the pages which are being written are near the
      tail of the LRU.  If they were left there, page allocators would block
      on them too soon.  It would effectively become a synchronous write.
      9eb76ee2
  15. 01 Aug, 2002 1 commit
  16. 19 Jul, 2002 2 commits
    • [PATCH] readahead optimisations · b6938a7b
      Andrew Morton authored
      Been looking at a workload which involves several processes which seek
      around and read from a large file.  There are a few problems:
      generic_file_lseek is bouncing i_sem around like mad, and readahead is
      doing lots of pointless pagecache probing.
      
      This patch addresses readahead.
      
      Presumably the change will be larger on machines which have higher
      bandwidth memory than my test box, of which there are many.
      
      This patch teaches readahead to detect the situation where no IO is
      actually being performed as a result of its actions.  Now, we don't
      want to sacrifice IO efficiency to save a bit of CPU, so the code is
      very cautious.  But eventually, after some tens of consecutive
      readahead attempts were found to perform no I/O at all, readahead will
      turn itself off.
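
       In outline, with illustrative field names (not the literal patch):

       	if (page_was_already_cached)
       		ra->cache_hits++;
       	else
       		ra->cache_hits = 0;

       	if (ra->cache_hits > MAX_USELESS_READAHEADS) {
       		ra->next_size = 0;	/* stop probing the pagecache entirely... */
       		return;			/* ...until a pagecache miss turns it back on */
       	}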
      
      readahead will be turned on again when either generic_file_read() or
      filemap_nopage() get a pagecache miss.  The function
      handle_ra_thrashing() has been renamed to handle_ra_miss() to reflect
      its widened role.
      
      A performance bug in page_cache_readround() was fixed - if
      ra->next_size is zero, that function needs to leave it well alone,
      because next_size==0 is a magic value meaning that the file has just
      been opened and that readahead needs to get aggressive.  This change
      makes a `make dep' run at the same speed as in the 2.4 kernel.  It used
      to take 4x as long...
      
      `make dep' is an interesting test because it uses mmap to read the files.
      b6938a7b
    • [PATCH] remove add_to_page_cache_unique() · cad46d66
      Andrew Morton authored
       A tasty patch from Hugh Dickins.  radix_tree_insert() fails if something
      was already present at the target index, so that error can be
      propagated back through add_to_page_cache().  Hence
      add_to_page_cache_unique() is obsolete.
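
       That is, the insert itself reports the collision, roughly:

       	error = radix_tree_insert(&mapping->page_tree, offset, page);
       	if (error == -EEXIST)
       		/* something already lives at this index; the separate
       		 * add_to_page_cache_unique() probe is now redundant */
       		return error;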
      
      Hugh's patch removes add_to_page_cache_unique() and cleans up a bunch of
      stuff.
      cad46d66
  17. 12 Jun, 2002 1 commit
  18. 28 May, 2002 1 commit
    • [PATCH] block plugging reworked · eba5b46c
      Jens Axboe authored
      This patch provides the ability for a block driver to signal it's too
      busy to receive more work and temporarily halt the request queue. In
      concept it's similar to the networking netif_{start,stop}_queue helpers.
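
       A driver would use it along these lines (a sketch; device_is_full() is a
       stand-in for the driver's own resource check):

       	/* in the request_fn: the hardware can take no more work right now */
       	if (device_is_full(dev)) {
       		blk_stop_queue(q);
       		return;
       	}

       	/* later, from the completion handler, once resources free up */
       	blk_start_queue(q);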
      
      To do this cleanly, I've ripped out the old tq_disk task queue. Instead
      an internal list of plugged queues is maintained which will honor the
      current queue state (see QUEUE_FLAG_STOPPED bit). Execution of
      request_fn has been moved to tasklet context. blk_run_queues() provides
      similar functionality to the old run_task_queue(&tq_disk).
      
      Now, this only works at the request_fn level and not at the
      make_request_fn level. This is on purpose: drivers working at the
      make_request_fn level are essentially providing a piece of the block
      level infrastructure themselves. There are basically two reasons for
      doing make_request_fn style setups:
      
      o block remappers. start/stop functionality will be done at the target
        device in this case, which is the level that will signal hardware full
        (or continue) anyways.
      
      o drivers who wish to receive single entities of "buffers" and not
        merged requests etc. This could use the start/stop functionality. I'd
        suggest _still_ using a request_fn for these, but set the queue
        options so that no merging etc ever takes place. This has the added
        bonus of providing the usual request depletion throttling at the block
        level.
      eba5b46c
  19. 27 May, 2002 1 commit
    • [PATCH] direct-to-BIO readahead · bc67de55
      Andrew Morton authored
      Implements BIO-based multipage reads into the pagecache, and turns this
      on for ext2.
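
       For ext2 the wiring is essentially one line in its
       address_space_operations (sketch):

       	static int ext2_readpages(struct file *file, struct address_space *mapping,
       			struct list_head *pages, unsigned nr_pages)
       	{
       		return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
       	}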
      
      CPU load for `cat large_file > /dev/null' is reduced by approximately
      15%.  Similar reductions for tiobench with a single thread.  (Earlier
      claims of 25% were exaggerated - they were measured with slab debug
      enabled.  But 15% isn't bad for a load which is dominated by copy_*_user
      costs).
      
      With 2, 4 and 8 tiobench threads, throughput is increased as well, which was
      unexpected.  It's due to request queue weirdness.  (Generally the
      request queueing is doing bad things under certain workloads - that's a
      separate issue.)
      
      BIOs of up to 64 kbytes are assembled and submitted for readahead and
      for single-page reads.  So the work involved in reading 32 pages has gone
      from:
      
      	- allocate and attach 32 buffer_heads
      	- submit 32 buffer_heads
      	- allocate 32 bios
      	- submit 32 bios
      
      to:
      
      	- allocate 2 bios
      	- submit 2 bios
      
      These pages never have buffers attached.  Buffers will be attached
      later if the application writes to these pages (file overwrite).
      
      The first version of this code (in the "delayed allocation" patches)
      tries to handle everything - bios which start mid-page, bios which end
      mid-page and pages which are covered by multiple bios.  It is very
      complex code and in fact appears to be incorrect: out-of-order BIO
      completion could cause a page to come unlocked at the wrong time.
      
      This implementation is much simpler: if things get complex, it just
      falls back to the buffer-based block_read_full_page(), which isn't
      going away, and which understands all that complexity.  There's no
      point in doing this in two places.
      
      This code will bypass the buffer layer for
      
       - fully-mapped pages which are on-disk contiguous.
      
        - fully unmapped pages (holes)
      
       - partially unmapped pages, where the unmappedness is at the end of
         the page (end-of-file).
      
      and everything else falls back to buffers.
      
      This means that with blocksize == PAGE_CACHE_SIZE, 100% of pages are
      handed direct to BIO.  With a heavy 10-minute dbench run on 4k
      PAGE_CACHE_SIZE and 1k blocks, 95% of pages were handed direct to BIO.
      Almost all of the other 5% were passed to block_read_full_page()
      because they were already partially uptodate from an earlier sub-page
      write().  This ratio will fall if PAGE_CACHE_SIZE/blocksize is greater
      than four.  But if that's the case, CPU efficiency is far from the main
      concern - there are significant seek and bandwidth problems just at 4
      blocks per page.
      
      This code will stress out the block layer somewhat - RAID0 doesn't like
      multipage BIOs, and there are probably others.  RAID0 seems to struggle
      along - readahead fails but read falls back to single-page reads, which
      succeed.  Such problems may be worked around by setting MPAGE_BIO_MAX_SIZE
      to PAGE_CACHE_SIZE in fs/mpage.c.
      
      It is trivial to enable multipage reads for many other filesystems.  We
      can do that after completion of external testing of ext2.
      bc67de55
  20. 19 May, 2002 2 commits
    • [PATCH] pdflush exclusion infrastructure · 1f6acea0
      Andrew Morton authored
      Collision avoidance for pdflush threads.
      
      Turns the request_queue-based `unsigned long ra_pages' into a structure
      which contains ra_pages as well as a longword.
      
      That longword is used to record the fact that a pdflush thread is
      currently writing something back against this request_queue.
      
      Avoids the situation where several pdflush threads are sleeping on the
      same request_queue.
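
       The intended use of that longword is a simple test-and-set (a sketch; the
       names here are illustrative stand-ins):

       	if (!test_and_set_bit(BDI_pdflush, &bdi->state)) {
       		writeback_this_queue();			/* we are the only pdflush here */
       		clear_bit(BDI_pdflush, &bdi->state);
       	} else {
       		/* another pdflush thread already owns this request_queue */
       	}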
      
      This patch provides only the infrastructure for the pdflush exclusion.
      This infrastructure gets used in pdflush-single.patch
      1f6acea0
    • [PATCH] reduce lock contention in do_pagecache_readahead · cd016d80
      Andrew Morton authored
      Anton Blanchard has a workload (the SDET benchmark) which is showing some
      moderate lock contention in do_pagecache_readahead().
      
      Seems that SDET has many threads performing seeky reads against a
      cached file.  The average number of pagecache probes in a single
      do_pagecache_readahead() is six, which seems reasonable.
      
      The patch (from Anton) flips the locking around to optimise for the
      fast case (page was present).  So the kernel takes the lock less often,
      and does more work once it has been acquired.
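
       Roughly, the probe loop becomes (a sketch, using the rwlock of that era):

       	read_lock(&mapping->page_lock);
       	for (i = 0; i < nr_to_read; i++) {
       		page = radix_tree_lookup(&mapping->page_tree, offset + i);
       		if (page)
       			continue;			/* fast path: already cached, keep the lock */

       		read_unlock(&mapping->page_lock);	/* slow path only: drop it to allocate */
       		page = page_cache_alloc(mapping);
       		/* ... queue the new page for reading ... */
       		read_lock(&mapping->page_lock);
       	}
       	read_unlock(&mapping->page_lock);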
      cd016d80
  21. 30 Apr, 2002 2 commits
    • [PATCH] page writeback locking update · a2bcb3a0
      Andrew Morton authored
      - Fixes a performance problem - callers of
        prepare_write/commit_write, etc are locking pages, which synchronises
        them behind writeback, which also locks these pages.  Significant
        slowdowns for some workloads.
      
      - So pages are no longer locked while under writeout.  Introduce a
        new PG_writeback and associated infrastructure to support this design
        change.
      
      - Pages which are under read I/O still use PageLocked.  Pages which
        are under write I/O have PageWriteback() true.
      
        I considered creating Page_IO instead of PageWriteback, and marking
        both readin and writeout pages as PageIO().  So pages are unlocked
        during both read and write.  There just doesn't seem a need to do
        this - nobody ever needs unblocking access to a page which is under
        read I/O.
      
      - Pages under swapout (brw_page) are PageLocked, not PageWriteback.
         So their treatment is unchanged.
      
        It's not obvious that pages which are under swapout actually need
        the more asynchronous behaviour of PageWriteback.
      
        I was setting the swapout pages PageWriteback and unlocking them
        prior to submitting the buffers in brw_page().  This led to deadlocks
        on the exit_mmap->zap_page_range->free_swap_and_cache path.  These
        functions call block_flushpage under spinlock.  If the page is
        unlocked but has locked buffers, block_flushpage->discard_buffer()
        sleeps.  Under spinlock.  So that will need fixing if for some reason
        we want swapout to use PageWriteback.
      
        Kernel has called block_flushpage() under spinlock for a long time.
         It is assuming that a locked page will never have locked buffers.
        This appears to be true, but it's ugly.
      
      - Adds new function wait_on_page_writeback().  Renames wait_on_page()
        to wait_on_page_locked() to remind people that they need to call the
        appropriate one.
      
      - Renames filemap_fdatasync() to filemap_fdatawrite().  It's more
        accurate - "sync" implies, if anything, writeout and wait.  (fsync,
        msync) Or writeout.  it's not clear.
      
      - Subtly changes the filemap_fdatawrite() internals - this function
        used to do a lock_page() - it waited for any other user of the page
        to let go before submitting new I/O against a page.  It has been
        changed to simply skip over any pages which are currently under
        writeback.
      
        This is the right thing to do for memory-cleansing reasons.
      
        But it's the wrong thing to do for data consistency operations (eg,
        fsync()).  For those operations we must ensure that all data which
        was dirty *at the time of the system call* are tight on disk before
        the call returns.
      
        So all places which care about this have been converted to do:
      
      	filemap_fdatawait(mapping);	/* Wait for current writeback */
      	filemap_fdatawrite(mapping);	/* Write all dirty pages */
      	filemap_fdatawait(mapping);	/* Wait for I/O to complete */
      
      - Fixes a truncate_inode_pages problem - truncate currently will
        block when it hits a locked page, so it ends up getting into lockstep
        behind writeback and all of the file is pointlessly written back.
      
        One fix for this is for truncate to simply walk the page list in the
        opposite direction from writeback.
      
        I chose to use a separate cleansing pass.  It is more
        CPU-intensive, but it is surer and clearer.  This is because there is
        no reason why the per-address_space ->vm_writeback and
        ->writeback_mapping functions *have* to perform writeout in
        ->dirty_pages order.  They may choose to do something totally
        different.
      
        (set_page_dirty() is an a_op now, so address_spaces could almost
        privatise the whole dirty-page handling thing.  Except
        truncate_inode_pages and invalidate_inode_pages assume that the pages
        are on the address_space lists.  hmm.  So making truncate_inode_pages
        and invalidate_inode_pages a_ops would make some sense).
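
       Pulling the PG_writeback pieces above together, the writeout life-cycle now
       looks roughly like this (submit_the_io() is a stand-in for starting the
       actual write):

       	lock_page(page);			/* brief exclusion against other users */
       	wait_on_page_writeback(page);		/* any previous writeout must finish */
       	SetPageWriteback(page);
       	unlock_page(page);			/* page stays "busy" via PG_writeback only */

       	submit_the_io(page);

       	/* ...and in the I/O completion handler: */
       	end_page_writeback(page);		/* clears PG_writeback, wakes waiters */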
      a2bcb3a0
    • [PATCH] readahead fix · 00d6555e
      Andrew Morton authored
      Changes the way in which the readahead code locates the readahead
      setting for the underlying device.
      
      - struct block_device and struct address_space gain a *pointer* to the
        current readahead tunable.
      
      - The tunable lives in the request queue and is altered with the
        traditional ioctl.
      
      - The value gets *copied* into the struct file at open() time.  So a
        fcntl() mode to modify it per-fd is simple.
      
      - Filesystems which are not request_queue-backed get the address of the
        global `default_ra_pages'.  If we want, this can become a tunable.
      
      - Filesystems are at liberty to alter address_space.ra_pages to point
        at some other fs-private default at new_inode/read_inode/alloc_inode
        time.
      
      - The ra_pages pointer can become a structure pointer if, at some time
        in the future, high-level code needs more detailed information about
        device characteristics.
      
        In fact, it'll need to become a struct pointer for use by
        writeback: my current writeback code has the problem that multiple
        pdflush threads can get stuck on the same request queue.  That's a
        waste of resources.  I currently have a silly flag in the superblock
        to try to avoid this.
      
        The proper way to get this exclusion is for the high-level
        writeback code to be able to do a test-and-set against a
        per-request_queue flag.  That flag can live in a structure alongside
        ra_pages, conveniently accessible at the pagemap level.
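
       Sketched with the names used in the description above (not necessarily the
       final code), the per-file copy amounts to:

       	/* at open() time: snapshot the device's current readahead setting */
       	file->f_ra.ra_pages = *inode->i_mapping->ra_pages;

       	/* a later fcntl() could then adjust file->f_ra.ra_pages for this fd alone */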
      
      One thing still to-be-done is going into all callers of blk_init_queue
      and blk_queue_make_request and making sure that they're setting up a
      sensible default.  ATA wants 248 sectors, and floppy drives don't want
      128kbytes, I suspect.  Later.
      00d6555e
  22. 26 Apr, 2002 2 commits
  23. 25 Apr, 2002 1 commit
  24. 10 Apr, 2002 1 commit
    • [PATCH] readahead · 8fa49846
      Andrew Morton authored
      I'd like to be able to claim amazing speedups, but
      the best benchmark I could find was diffing two
      256 megabyte files, which is about 10% quicker.  And
      that is probably due to the window size being effectively
      50% larger.
      
      Fact is, any disk worth owning nowadays has a segmented
      2-megabyte cache, and OS-level readahead mainly seems
      to save on CPU cycles rather than overall throughput.
      Once you start reading more streams than there are segments
      in the disk cache we start to win.
      
      Still.  The main motivation for this work is to
      clean the code up, and to create a central point at
      which many pages are marshalled together so that
      they can all be encapsulated into the smallest possible
      number of BIOs, and injected into the request layer.
      
      A number of filesystems were poking around inside the
      readahead state variables.  I'm not really sure what they
      were up to, but I took all that out.  The readahead
      code manages its own state autonomously and should not
      need any hints.
      
      - Unifies the current three readahead functions (mmap reads, read(2)
        and sys_readhead) into a single implementation.
      
      - More aggressive in building up the readahead windows.
      
      - More conservative in tearing them down.
      
      - Special start-of-file heuristics.
      
      - Preallocates the readahead pages, to avoid the (never demonstrated,
        but potentially catastrophic) scenario where allocation of readahead
        pages causes the allocator to perform VM writeout.
      
      - Gets all the readahead pages gathered together in
        one spot, so they can be marshalled into big BIOs.
      
      - reinstates the readahead ioctls, so hdparm(8) and blockdev(8)
        are working again.  The readahead settings are now per-request-queue,
        and the drivers never have to know about it.  I use blockdev(8).
        It works in units of 512 bytes.
      
      - Identifies readahead thrashing.
      
        Also attempts to handle it.  Certainly the changes here
        delay the onset of catastrophic readahead thrashing by
         quite a lot, and decrease its seriousness as we get more
        deeply into it, but it's still pretty bad.
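
       The reinstated ioctls can be driven from user space roughly like this
       (512-byte sector units, as noted above):

       	long ra;

       	ioctl(fd, BLKRAGET, &ra);	/* read the current per-queue readahead */
       	ioctl(fd, BLKRASET, 256);	/* set it to 256 sectors = 128 kbytes */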
      8fa49846