1. 30 Apr, 2002 6 commits
    • [PATCH] remove buffer unused_list · 4beda7c1
      Andrew Morton authored
      Removes the buffer_head unused list.  Use a mempool instead.
      
      The reduced lock contention provided about a 10% boost on Anton's
      12-way.
    • [PATCH] writeback from address spaces · 090da372
      Andrew Morton authored
      [ I reversed the order in which writeback walks the superblock's
        dirty inodes.  It sped up dbench's unlink phase greatly.  I'm
        such a sleaze ]
      
      The core writeback patch.  Switches file writeback from the dirty
      buffer LRU over to address_space.dirty_pages.
      
      - The buffer LRU is removed
      
      - The buffer hash is removed (uses blockdev pagecache lookups)
      
      - The bdflush and kupdate functions are implemented against
        address_spaces, via pdflush.
      
      - The relationship between pages and buffers is changed.
      
        - If a page has dirty buffers, it is marked dirty
        - If a page is marked dirty, it *may* have dirty buffers.
        - A dirty page may be "partially dirty".  block_write_full_page
          discovers this.
      
      - A bunch of consistency checks of the form
      
      	if (!something_which_should_be_true())
      		buffer_error();
      
        have been introduced.  These fog the code up but are important for
        ensuring that the new buffer/page code is working correctly.
      
      - New locking (inode.i_bufferlist_lock) is introduced for exclusion
        from try_to_free_buffers().  This is needed because set_page_dirty
        is called under spinlock, so it cannot lock the page.  But it
        needs access to page->buffers to set them all dirty.
      
        i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
      
      - fs/inode.c has been split: all the code related to file data writeback
        has been moved into fs/fs-writeback.c
      
      - Code related to file data writeback at the address_space level is in
        the new mm/page-writeback.c
      
      - try_to_free_buffers() is now non-blocking
      
      - Switches vmscan.c over to understand that all pages with dirty data
        are now marked dirty.
      
      - Introduces a new a_op for VM writeback:
      
      	->vm_writeback(struct page *page, int *nr_to_write)
      
        This is a bit half-baked at present.  The intent is that the address_space
        is given the opportunity to perform clustered writeback: to
        opportunistically write out disk-contiguous dirty data which may be in
        other zones, and to allow delayed-allocate filesystems to get good disk
        layout.
      
      - Added address_space.io_pages, which holds pages that are being prepared
        for writeback.  This is here for two reasons:
      
        1: It will be needed later, when BIOs are assembled direct
           against pagecache, bypassing the buffer layer.  It avoids a
           deadlock which would occur if someone moved the page back onto the
           dirty_pages list after it was added to the BIO, but before it was
           submitted.  (hmm.  This may not be a problem with PG_writeback logic).
      
        2: Avoids a livelock which would occur if some other thread is continually
           redirtying pages.
      
      - There are two known performance problems in this code:
      
        1: Pages which are locked for writeback cause undesirable
           blocking when they are being overwritten.  A patch which leaves
           pages unlocked during writeback comes later in the series.
      
        2: While inodes are under writeback, they are locked.  This
           causes namespace lookups against the file to get unnecessarily
           blocked in wait_on_inode().  This is a fairly minor problem.
      
           I don't have a fix for this at present - I'll fix this when I
           attach dirty address_spaces direct to super_blocks.
      
      - The patch vastly increases the amount of dirty data which the
        kernel permits highmem machines to maintain.  This is because the
        balancing decisions are made against the amount of memory in the
        machine, not against the amount of buffercache-allocatable memory.
      
        This may be very wrong, although it works fine for me (2.5 gigs).
      
        We can trivially go back to the old-style throttling with
        s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
        balance_dirty_pages().  But better would be to allow blockdev
        mappings to use highmem (I'm thinking about this one, slowly).  And
        to move writer-throttling and writeback decisions into the VM (modulo
        the file-overwriting problem).
      
      - Drops 24 bytes from struct buffer_head.  More to come.
      
      - There's some gunk like super_block.flags:MS_FLUSHING which needs to
        be killed.  Need a better way of providing collision avoidance
        between pdflush threads, to prevent more than one pdflush thread
        working a disk at the same time.
      
        The correct way to do that is to put a flag in the request queue to
        say "there's a pdflush thread working this disk".  This is easy to
        do: just generalise the "ra_pages" pointer to point at a struct which
        includes ra_pages and the new collision-avoidance flag.
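
      The page/buffer dirtiness rules listed above can be illustrated with a
      small userspace sketch.  This is not the kernel code: every toy_* name
      and the four-buffers-per-page layout are assumptions made for the
      example.  It only shows the invariant (a dirty buffer implies a dirty
      page, but not the converse) and a write-out pass that discovers which
      buffers of a "partially dirty" page actually need I/O.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BUFFERS_PER_PAGE 4	/* hypothetical: four blocks per page */

struct toy_buffer {
	bool dirty;
};

struct toy_page {
	bool dirty;
	struct toy_buffer buffers[TOY_BUFFERS_PER_PAGE];
};

/* Dirtying a single buffer always marks the owning page dirty as well. */
static void toy_mark_buffer_dirty(struct toy_page *page, size_t idx)
{
	page->buffers[idx].dirty = true;
	page->dirty = true;
}

/* Invariant: a dirty buffer implies a dirty page (the converse may not hold). */
static bool toy_page_consistent(const struct toy_page *page)
{
	for (size_t i = 0; i < TOY_BUFFERS_PER_PAGE; i++)
		if (page->buffers[i].dirty && !page->dirty)
			return false;
	return true;
}

/*
 * Walk the page's buffers, "write" only the dirty ones and clean the page.
 * Returns the number of buffers that needed writing, which may be zero
 * even though the page itself was marked dirty.
 */
static size_t toy_write_full_page(struct toy_page *page)
{
	size_t written = 0;

	for (size_t i = 0; i < TOY_BUFFERS_PER_PAGE; i++) {
		if (page->buffers[i].dirty) {
			page->buffers[i].dirty = false;
			written++;
		}
	}
	page->dirty = false;
	return written;
}
```

      In this sketch a page with one dirty buffer out of four is "partially
      dirty": toy_write_full_page() reports a single write and leaves the page
      clean.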
    • [PATCH] readahead fix · 00d6555e
      Andrew Morton authored
      Changes the way in which the readahead code locates the readahead
      setting for the underlying device.
      
      - struct block_device and struct address_space gain a *pointer* to the
        current readahead tunable.
      
      - The tunable lives in the request queue and is altered with the
        traditional ioctl.
      
      - The value gets *copied* into the struct file at open() time.  So a
        fcntl() mode to modify it per-fd is simple.
      
      - Filesystems which are not request_queue-backed get the address of the
        global `default_ra_pages'.  If we want, this can become a tunable.
      
      - Filesystems are at liberty to alter address_space.ra_pages to point
        at some other fs-private default at new_inode/read_inode/alloc_inode
        time.
      
      - The ra_pages pointer can become a structure pointer if, at some time
        in the future, high-level code needs more detailed information about
        device characteristics.
      
        In fact, it'll need to become a struct pointer for use by
        writeback: my current writeback code has the problem that multiple
        pdflush threads can get stuck on the same request queue.  That's a
        waste of resources.  I currently have a silly flag in the superblock
        to try to avoid this.
      
        The proper way to get this exclusion is for the high-level
        writeback code to be able to do a test-and-set against a
        per-request_queue flag.  That flag can live in a structure alongside
        ra_pages, conveniently accessible at the pagemap level.
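
      The pointer-versus-copy arrangement in the list above might look roughly
      like this in a userspace model.  The toy_* types, the helper names and
      the default value of 32 pages are assumptions for illustration, not the
      kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures named in the text. */
struct toy_request_queue {
	unsigned long ra_pages;		/* the tunable lives here */
};

struct toy_address_space {
	unsigned long *ra_pages;	/* pointer to the current tunable */
};

struct toy_file {
	unsigned long ra_pages;		/* private copy taken at open() time */
};

/* Global fallback for mappings with no request queue (value is assumed). */
static unsigned long toy_default_ra_pages = 32;

/* Point the mapping at the queue's tunable, or at the global default. */
static void toy_mapping_init(struct toy_address_space *mapping,
			     struct toy_request_queue *q)
{
	mapping->ra_pages = (q != NULL) ? &q->ra_pages : &toy_default_ra_pages;
}

/* Copy the current value into the file, so later changes do not affect it. */
static void toy_open(struct toy_file *file,
		     const struct toy_address_space *mapping)
{
	file->ra_pages = *mapping->ra_pages;
}
```

      Because the mapping holds a pointer, retuning the queue is seen by every
      subsequent open, while files that are already open keep the value they
      copied.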
      
      One thing still to-be-done is going into all callers of blk_init_queue
      and blk_queue_make_request and making sure that they're setting up a
      sensible default.  ATA wants 248 sectors, and floppy drives don't want
      128kbytes, I suspect.  Later.
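
      The per-request_queue exclusion flag proposed above could be modelled as
      follows.  This is a sketch under stated assumptions: struct
      toy_backing_info and its helpers are hypothetical names, and C11
      atomic_flag stands in for whatever test-and-set primitive the kernel
      would actually use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical struct that the ra_pages pointer would be generalised to. */
struct toy_backing_info {
	unsigned long ra_pages;		/* readahead tunable */
	atomic_flag pdflush_active;	/* set while a thread works this queue */
};

static void toy_backing_info_init(struct toy_backing_info *bi,
				  unsigned long ra_pages)
{
	bi->ra_pages = ra_pages;
	atomic_flag_clear(&bi->pdflush_active);
}

/* Returns true if the caller won the queue, false if it is already taken. */
static bool toy_writeback_acquire(struct toy_backing_info *bi)
{
	return !atomic_flag_test_and_set(&bi->pdflush_active);
}

/* Called when the writeback thread is done with the queue. */
static void toy_writeback_release(struct toy_backing_info *bi)
{
	atomic_flag_clear(&bi->pdflush_active);
}
```

      A second writeback thread that fails toy_writeback_acquire() can move on
      to another queue instead of getting stuck behind the first one.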
    • [PATCH] page accounting · d878155c
      Andrew Morton authored
      This patch provides global accounting of locked and dirty pages.  It
      does this via lightweight per-CPU data structures.  The page_cache_size
      accounting has been changed to use this facility as well.
      
      Locked and dirty page accounting is needed for making writeback and
      throttling decisions.
      
      The patch also starts to move code which is related to page->flags
      out of linux/mm.h and into linux/page-flags.h
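
      A minimal userspace sketch of per-CPU accounting of this kind is shown
      below.  The toy_* names, the fixed CPU count and the explicit cpu
      argument are assumptions for the example; the patch's real data
      structures are not reproduced here.

```c
#define TOY_NR_CPUS 4	/* hypothetical CPU count */

/* One counter block per CPU, so updates never contend on a shared lock. */
struct toy_page_state {
	long nr_dirty;
	long nr_locked;
};

static struct toy_page_state toy_page_states[TOY_NR_CPUS];

/* Each CPU only ever touches its own slot. */
static void toy_inc_nr_dirty(int cpu)
{
	toy_page_states[cpu].nr_dirty++;
}

static void toy_dec_nr_dirty(int cpu)
{
	toy_page_states[cpu].nr_dirty--;
}

/*
 * The global figure is the sum over all CPUs.  An individual slot can go
 * negative (page dirtied on one CPU, cleaned on another); only the sum is
 * meaningful.
 */
static long toy_total_nr_dirty(void)
{
	long total = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		total += toy_page_states[cpu].nr_dirty;
	return total;
}
```

      Updates stay cheap because they touch only local data; the cost is moved
      to the comparatively rare reader that has to sum every slot.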
    • [PATCH] ext2 directory handling · aa4f3f28
      Andrew Morton authored
      Convert ext2 directory handling to not rely on the contents of pages
      outside i_size.
      
      This is because block_write_full_page (which is used for all writeback)
      zaps the page outside i_size.
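
      One way to avoid depending on page contents beyond i_size is to bound
      every directory scan by the number of valid bytes in the page.  The
      helper below is a hedged sketch of that idea: toy_valid_bytes and
      TOY_PAGE_SIZE are illustrative names and values, not the code from the
      patch.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size for the example */

/*
 * Number of bytes in page `page_index` that lie inside a directory of
 * `i_size` bytes.  Anything past this offset must not be interpreted.
 */
static size_t toy_valid_bytes(uint64_t i_size, uint64_t page_index)
{
	uint64_t start = page_index * TOY_PAGE_SIZE;
	uint64_t remaining;

	if (i_size <= start)
		return 0;		/* page is wholly beyond i_size */

	remaining = i_size - start;
	if (remaining > TOY_PAGE_SIZE)
		remaining = TOY_PAGE_SIZE;
	return (size_t)remaining;
}
```

      A directory walker would then stop at toy_valid_bytes() rather than at
      the end of the page, so it no longer matters what writeback leaves in
      the tail.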
    • [PATCH] page_alloc failure printk · 63b060c4
      Andrew Morton authored
      Emit a printk when a page allocation fails.  Considered useful for
      diagnosing crashes.
  2. 29 Apr, 2002 2 commits
    • [PATCH] Re: 2.5.11 breakage · 85d217f4
      Alexander Viro authored
      	OK, here it comes.  The patch below is an attempt to do the fastwalk
      stuff the right way, and so far it seems to be working.
      
       - dentry leak is plugged
       - locked/unlocked state of nameidata doesn't depend on history - it
         depends only on point in code.
       - LOOKUP_LOCKED is gone.
       - following mounts and .. doesn't drop dcache_lock
       - light-weight permission check distinguishes between "don't know" and
         "permission denied", so we don't call full-blown permission() unless
         we have to.
       - code that changes root/pwd holds dcache_lock _and_ write lock on
         current->fs->lock.  I.e. if we hold dcache_lock we can safely
         access our ->fs->{root,pwd}{,mnt}
       - __d_lookup() does not increment refcount; callers do dget_locked()
         if they need it (behaviour of d_lookup() didn't change, obviously).
       - link_path_walk() logic has been (somewhat) cleaned up.
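
      The three-way result of the light-weight permission check described
      above can be sketched like this.  All toy_* names and the simplified
      inode are assumptions for illustration; the real check looks at
      ownership, mode bits and capabilities, which are omitted here.

```c
#include <stdbool.h>

enum toy_verdict {
	TOY_ALLOWED,
	TOY_DENIED,
	TOY_DONT_KNOW,		/* fast path cannot decide */
};

/* Hypothetical, heavily simplified inode. */
struct toy_inode {
	bool exec_allowed;	/* stands in for the mode-bit test */
	bool has_custom_check;	/* filesystem supplies its own permission hook */
};

static int toy_slow_calls;	/* counts fallbacks to the full check */

/* Cheap check: answers only when the mode bits alone are decisive. */
static enum toy_verdict toy_permission_lite(const struct toy_inode *inode)
{
	if (inode->has_custom_check)
		return TOY_DONT_KNOW;
	return inode->exec_allowed ? TOY_ALLOWED : TOY_DENIED;
}

/* Stand-in for the full-blown permission() call. */
static bool toy_permission_full(const struct toy_inode *inode)
{
	toy_slow_calls++;
	return inode->exec_allowed;
}

/* Fall back to the expensive check only on "don't know". */
static bool toy_may_search(const struct toy_inode *inode)
{
	switch (toy_permission_lite(inode)) {
	case TOY_ALLOWED:
		return true;
	case TOY_DENIED:
		return false;
	default:
		return toy_permission_full(inode);
	}
}
```

      With a two-valued result, "denied" and "don't know" would both have to
      go through the full check; the third value lets a definite refusal
      return immediately.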
    • [PATCH] 2.5.10 IDE 45 · 7ca32047
      Martin Dalecki authored
      - Fix bogus set_multimode() change. I thought I had reverted it before diff-ing.
         This was causing hangs of hdparm -m8 /dev/hda and similar commands.
  3. 28 Apr, 2002 32 commits