1. 04 Jan, 2005 1 commit
    • [PATCH] VM routine fixes · 2954850e
      David Howells authored
      
      The attached patch fixes a number of problems in the VM routines:
      
       (1) Some inline funcs don't compile if CONFIG_MMU is not set.
      
       (2) swapper_pml4 needn't exist if CONFIG_MMU is not set.
      
       (3) __free_pages_ok() doesn't counter set_page_refs()'s different behaviour if
           CONFIG_MMU is not set.
      
       (4) swsusp.c invokes TLB flushing functions without including the header file
           that declares them.
      
      CONFIG_SHMEM semantics:
      
      - If MMU: Always enabled if !EMBEDDED
      
      - If MMU && EMBEDDED: configurable
      
      - If !MMU: disabled
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      2954850e
  2. 31 Aug, 2004 1 commit
  3. 23 Aug, 2004 1 commit
    • [PATCH] token based thrashing control · d4f9d02b
      Rik van Riel authored
      The following experimental patch implements token based thrashing
      protection, using the algorithm described in:
      
      	http://www.cs.wm.edu/~sjiang/token.htm
      
      
      
      When there are pageins going on, a task can grab a token that protects the
      task from pageout (except by itself) until it is no longer doing heavy
      pageins, or until the maximum hold time of the token is over.
      
      If the maximum hold time is exceeded, the task isn't eligible to hold the
      token again for a while, since it wasn't doing it much good anyway.
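      
      A minimal sketch of the idea (hypothetical helper names; the
      "not eligible again for a while" penalty is omitted):
      
      	/* One global token; it can only be stolen once the holder has
      	 * kept it past the maximum hold time. */
      	static struct mm_struct *token_mm;	/* current holder, NULL if free */
      	static unsigned long token_jiffies;	/* when it was grabbed */
      	#define TOKEN_TIMEOUT	(300 * HZ)
      
      	void grab_swap_token(struct mm_struct *mm)	/* called on heavy pagein */
      	{
      		if (token_mm == NULL ||
      		    time_after(jiffies, token_jiffies + TOKEN_TIMEOUT)) {
      			token_mm = mm;
      			token_jiffies = jiffies;
      		}
      	}
      
      	/* Page reclaim skips pages belonging to the holder, unless the
      	 * holder itself is the one doing the scanning. */
      	int protected_by_token(struct mm_struct *mm, struct mm_struct *scanner)
      	{
      		return mm == token_mm && scanner != token_mm;
      	}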
      
      I have run a very unscientific benchmark on my system to test the
      effectiveness of the patch, timing how long a 230MB two-process qsbench run
      takes, with and without the token thrashing protection present.
      
      normal 2.6.8-rc6:	6m45s
      2.6.8-rc6 + token:	4m24s
      
      This is a quick hack, implemented without having talked to the inventor of
      the algorithm.  He's copied on the mail and I suspect we'll be able to do
      better than my quick implementation ...
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d4f9d02b
  4. 22 May, 2004 2 commits
    • [PATCH] rmap 17: real prio_tree · 2fe9c14c
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Rajesh Venkatasubramanian's implementation of a radix priority search tree of
      vmas, to handle object-based reverse mapping corner cases well.
      
      Amongst the objections to object-based rmap were test cases by akpm and by
      mingo, in which large numbers of vmas mapping disjoint or overlapping parts of
      a file showed strikingly poor performance of the i_mmap lists.  Perhaps those
      tests are irrelevant in the real world?  We cannot be too sure: the prio_tree
      is well-suited to solving precisely that problem, so unless it turns out to
      bring too much overhead, let's include it.
      
      Why is this prio_tree.c placed in mm rather than lib?  See GET_INDEX: this
      implementation is geared throughout to use with vmas, though the first half of
      the file appears more general than the second half.
      
      Each node of the prio_tree is itself (contained within) a vma: might save
      memory by allocating distinct nodes from which to hang vmas, but wouldn't save
      much, and would complicate the usage with preallocations.  Off each node of
      the prio_tree itself hangs a list of like vmas, if any.
      
      The connection from node to list is a little awkward, but probably the best
      compromise: it would be more straightforward to list likes directly from the
      tree node, but that would use more memory per vma, for the list_head and to
      identify that head.  Instead, node's shared.vm_set.head points to next vma
      (whose shared.vm_set.head points back to node vma), and that next contains the
      list_head from which the rest hang - reusing fields already used in the
      prio_tree node itself.
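      
      A rough sketch of the arrangement just described (illustrative only;
      the real patch packs these fields into the existing "shared" union of
      vm_area_struct):
      
      	struct prio_tree_node {
      		struct prio_tree_node	*left, *right, *parent;
      		unsigned long		start_pgoff;	/* radix index */
      		unsigned long		last_pgoff;	/* heap index  */
      		struct vm_area_struct	*head;		/* first vma with the same range */
      	};
      
      	/* That first "like" vma points back at the node vma through its own
      	 * head pointer and carries the list_head from which any further
      	 * like vmas hang, so no extra per-vma storage is needed. */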
      
      Currently lacks prefetch: Rajesh hopes to add some soon.
      2fe9c14c
    • [PATCH] numa api: Core NUMA API code · d3b8924a
      Andrew Morton authored
      From: Andi Kleen <ak@suse.de>
      
      The following patches add support for configurable NUMA memory policy
      for user processes. It is based on the proposal from last kernel summit
      with feedback from various people.
      
      This NUMA API doesn't attempt to implement page migration or anything
      else complicated: all it does is police the allocation when a page
      is first allocated or when a page is reallocated after swapping. Currently
      only shared memory and anonymous memory are supported; policy for
      file based mappings is not implemented yet (although they get implicitly
      policied by the default process policy).
      
      It adds three new system calls: mbind to change the policy of a VMA,
      set_mempolicy to change the policy of a process, and get_mempolicy to
      retrieve the memory policy. User tools (numactl, libnuma, test programs,
      manpages) can be found in  ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz
      
      For details on the system calls see the manpages
      http://www.firstfloor.org/~andi/mbind.html
      http://www.firstfloor.org/~andi/set_mempolicy.html
      http://www.firstfloor.org/~andi/get_mempolicy.html
      Most user programs should actually not use the system calls directly,
      but use the higher level functions in libnuma
      (http://www.firstfloor.org/~andi/numa.html) or the command line tools
      (http://www.firstfloor.org/~andi/numactl.html).
      
      The system calls allow user programs and administrators to set various NUMA memory
      policies for putting memory on specific nodes. Here is a short description
      of the policies, copied from the kernel patch:
      
       * NUMA policy allows the user to give hints in which node(s) memory should
       * be allocated.
       *
       * Support four policies per VMA and per process:
       *
       * The VMA policy has priority over the process policy for a page fault.
       *
       * interleave     Allocate memory interleaved over a set of nodes,
       *                with normal fallback if it fails.
       *                For VMA based allocations this interleaves based on the
       *                offset into the backing object or offset into the mapping
       *                for anonymous memory. For process policy a per-process
       *                counter is used.
       * bind           Only allocate memory on a specific set of nodes,
       *                no fallback.
       * preferred      Try a specific node first before normal fallback.
       *                As a special case node -1 here means do the allocation
       *                on the local CPU. This is normally identical to default,
       *                but useful to set in a VMA when you have a non default
       *                process policy.
       * default        Allocate on the local node first, or when on a VMA
       *                use the process policy. This is what Linux always did
       *                in a NUMA aware kernel and still does by, ahem, default.
       *
       * The process policy is applied for most non interrupt memory allocations
       * in that process' context. Interrupts ignore the policies and always
       * try to allocate on the local CPU. The VMA policy is only applied for memory
       * allocations for a VMA in the VM.
       *
       * Currently there are a few corner cases in swapping where the policy
       * is not applied, but the majority should be handled. When process policy
       * is used it is not remembered over swap outs/swap ins.
       *
       * Only the highest zone in the zone hierarchy gets policied. Allocations
       * requesting a lower zone just use default policy. This implies that
       * on systems with highmem, kernel lowmem allocations don't get policied.
       * Same with GFP_DMA allocations.
       *
       * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
       * all users and remembered even when nobody has memory mapped.
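      
      For illustration, a user program might use the libnuma <numaif.h>
      wrappers roughly like this (simplified nodemask handling, no error
      checking, assumes at least two nodes):
      
      	#include <numaif.h>
      	#include <sys/mman.h>
      
      	int main(void)
      	{
      		unsigned long both = (1UL << 0) | (1UL << 1);	/* nodes 0 and 1 */
      		unsigned long node0 = 1UL << 0;
      		size_t len = 16 * 4096;
      
      		/* Process policy: interleave new allocations across nodes 0 and 1. */
      		set_mempolicy(MPOL_INTERLEAVE, &both, sizeof(both) * 8);
      
      		/* VMA policy: bind this anonymous mapping to node 0, no fallback. */
      		void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
      				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      		mbind(mem, len, MPOL_BIND, &node0, sizeof(node0) * 8, 0);
      		return 0;
      	}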
      
      
      
      
      This patch:
      
      This is the core NUMA API code. This includes NUMA policy aware
      wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels
      these are defined away.
      
      The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
      get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
      set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
      implemented here.
      
      Adds a vm_policy field to the VMA and to the process. The process
      also has a field for interleaving. VMA interleaving uses the offset
      into the VMA, but that's not possible for process allocations.
      
      From: Andi Kleen <ak@muc.de>
      
        > Andi, how come policy_vma() calls ->set_policy under i_shared_sem?
      
        I think this can be actually dropped now.  In an earlier version I did
        walk the vma shared list to change the policies of other mappings to the
        same shared memory region.  This turned out too complicated with all the
        corner cases, so I eventually gave in and added ->get_policy to the fast
        path.  Also there is still the mmap_sem which prevents races in the same MM.
         
      
        Patch to remove it attached.  Also adds documentation and removes the
        bogus __alloc_page_vma() prototype noticed by hch.
      
      From: Andi Kleen <ak@suse.de>
      
        A few incremental fixes for NUMA API.
      
        - Fix a few comments
      
        - Add a compat_ function for get_mem_policy.  I considered changing the
          ABI to avoid this, but that would have made the API too ugly.  I put it
          directly into the file because a mm/compat.c didn't seem worth it just for
          this.
      
        - Fix the algorithm for VMA interleave.
      
      From: Matthew Dobson <colpatch@us.ibm.com>
      
        1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA.
          The only references to the function are in NUMA code in mempolicy.c
      
        2) Remove the definitions of __alloc_page_vma().  They aren't used.
      
        3) Move forward declaration of struct vm_area_struct to top of file.
      d3b8924a
  5. 12 Apr, 2004 1 commit
    • [PATCH] hugetlb consolidation · c8b976af
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      The following patch consolidates redundant code in various hugetlb
      implementations.  I took the liberty of renaming a few things, since the
      code was all moved anyway, and it has the benefit of helping to catch
      missed conversions and/or consolidations.
      c8b976af
  6. 05 Sep, 2003 1 commit
    • [PATCH] Unpinned futexes v2: indexing changes · 968f11a8
      Jamie Lokier authored
      This changes the way futexes are indexed, so that they don't pin pages. 
      It also fixes some bugs with private mappings and COW pages.
      
      Currently, all futexes look up the page at the userspace address and pin
      it, using the pair (page,offset) as an index into a table of waiting
      futexes.  Any page with a futex waiting on it remains pinned in RAM,
      which is a problem when many futexes are used, especially with FUTEX_FD.
      
      Another problem is that the page is not always the correct one, if it
      can be changed later by a COW (copy on write) operation.  This can
      happen when waiting on a futex without writing to it after fork(),
      exec() or mmap(), if the page is then written to before attempting to
      wake a futex at the same address.
      
      There are two symptoms of the COW problem:
       - The wrong process can receive wakeups
       - A process can fail to receive required wakeups. 
      
      This patch fixes both by changing the indexing so that VM_SHARED
      mappings use the triple (inode,offset,index), and private mappings use
      the pair (mm,virtual_address).
      
      The former correctly handles all shared mappings, including tmpfs and
      therefore all kinds of shared memory (IPC shm, /dev/shm and
      MAP_ANON|MAP_SHARED).  This works because every mapping which is
      VM_SHARED has an associated non-zero vma->vm_file, and hence inode.
      (This is ensured in do_mmap_pgoff, where it calls shmem_zero_setup). 
      
      The latter handles all private mappings, both files and anonymous.  It
      isn't affected by COW, because it doesn't care about the actual pages,
      just the virtual address.
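      
      Roughly, the two kinds of key look like this (an illustrative sketch,
      not the exact structure from the patch):
      
      	union futex_key {
      		struct {				/* VM_SHARED mappings */
      			struct inode	*inode;		/* backing inode          */
      			unsigned long	pgoff;		/* page index in the file */
      			int		offset;		/* offset within the page */
      		} shared;
      		struct {				/* private mappings */
      			struct mm_struct *mm;		/* owning address space */
      			unsigned long	  uaddr;	/* virtual address      */
      		} private;
      	};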
      
      The patch has a few bonuses:
      
              1. It removes the vcache implementation, as only futexes were
                 using it, and they don't any more.
      
              2. Removing the vcache should make COW page faults a bit faster.
      
              3. Futex operations no longer take the page table lock, walk
                 the page table, fault in pages that aren't mapped in the
                 page table, or do a vcache hash lookup - they are mostly a
                 simple offset calculation with one hash for the futex
                 table.  So they should be noticeably faster.
      
      Special thanks to Hugh Dickins, Andrew Morton and Rusty Russell for
      insightful feedback.  All suggestions are included.
      968f11a8
  7. 04 Feb, 2003 1 commit
    • [PATCH] implement posix_fadvise64() · fccbe384
      Andrew Morton authored
      An implementation of posix_fadvise64().  It adds 368 bytes to my vmlinux and
      is worth it.
      
      I didn't bother doing posix_fadvise(), as userspace can implement that by
      calling fadvise64().
      
      The main reason for wanting this syscall is to provide userspace with the
      ability to explicitly shoot down pagecache when streaming large files.  This
      is what O_STREAMING does, only posix_fadvise() is standards-based, and harder
      to use.
      
      posix_fadvise() also subsumes sys_readahead().
      
      POSIX_FADV_WILLNEED will generally provide asynchronous readahead semantics
      for small amounts of I/O, as long as things like indirect blocks are already
      in core.
      
      POSIX_FADV_RANDOM gives unprivileged applications a way of disabling
      readahead on a per-fd basis, which may provide some benefit for super-seeky
      access patterns such as databases.
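      
      For illustration, the advice values discussed above are used from C
      roughly as follows (three independent examples, error handling omitted):
      
      	#include <fcntl.h>
      
      	void examples(int fd, off_t file_len)
      	{
      		/* Kick off asynchronous readahead of the first megabyte. */
      		posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);
      
      		/* Database-style access: turn readahead off for this fd. */
      		posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
      
      		/* After streaming the file, drop its pagecache. */
      		posix_fadvise(fd, 0, file_len, POSIX_FADV_DONTNEED);
      	}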
      
      
      
      The POSIX_FADV_* values are already implemented in glibc, and this patch
      ensures that they are in sync.
      
      A test app (fadvise.c) is available in ext3 CVS.  See
      
      	http://www.zip.com.au/~akpm/linux/ext3/
      
      for CVS details.
      
      Ulrich has reviewed this patch (thanks).
      fccbe384
  8. 03 Feb, 2003 1 commit
  9. 02 Dec, 2002 1 commit
  10. 01 Dec, 2002 1 commit
  11. 03 Nov, 2002 1 commit
    • [PATCH] make swap code conditional · abcb2f16
      Christoph Hellwig authored
      Make the swap code conditional on CONFIG_SWAP.  This is mostly for
      uClinux, but !CONFIG_SWAP compiles and boots fine for i386, too -
      the only problem I've seen is that X doesn't start; it's probably
      shm-related, thus it's disabled unconditionally for "normal" arches.
      
      The patch makes three files in mm/ conditional on CONFIG_SWAP, reorganizes
      include/linux/swap.h big time to provide stubs for the !CONFIG_SWAP case,
      moves the remaining /proc/swaps code to swapfile.c and cleans up some
      more MM code to compile fine without CONFIG_SWAP.
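      
      The reorganization follows the usual stub pattern, sketched here with a
      couple of illustrative prototypes (the real set of stubs is larger):
      
      	/* include/linux/swap.h, roughly: */
      	#ifdef CONFIG_SWAP
      	extern void swap_free(swp_entry_t entry);
      	extern int swap_duplicate(swp_entry_t entry);
      	#else
      	/* !CONFIG_SWAP: callers still compile, the calls become no-ops. */
      	static inline void swap_free(swp_entry_t entry) { }
      	static inline int swap_duplicate(swp_entry_t entry) { return 0; }
      	#endif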
      abcb2f16
  12. 31 Oct, 2002 1 commit
    • [PATCH] sys_remap_file_pages · d16dc20c
      Andrew Morton authored
      Ingo's remap_file_pages patch.  Supported on ia32, x86-64, sparc
      and sparc64.  Others will need to update mman.h and the syscall
      tables.
      d16dc20c
  13. 08 Oct, 2002 1 commit
    • [PATCH] free_area_init cleanup · 5b73f882
      Andrew Morton authored
      From Christoph Hellwig.
      
      If we always pass &contig_page_data into free_area_init_node for the
      non-discontig case we can merge both versions of that function into
      one.  Move that one to page_alloc.c and thus kill numa.c, which was
      totally misnamed, btw.
      5b73f882
  14. 03 Oct, 2002 1 commit
    • [PATCH] truncate/invalidate_inode_pages rewrite · 735a2573
      Andrew Morton authored
      Rewrite these functions to use gang lookup.
      
      - This probably has similar performance to the old code in the common case.
      
      - It will be vastly quicker than current code for the worst case
        (single-page truncate).
      
      - invalidate_inode_pages() has been changed.  It used to use
        page_count(page) as the "is it mapped into pagetables" heuristic.  It
        now uses the (page->pte.direct != 0) heuristic.
      
      - Removes the worst cause of scheduling latency in the kernel.
      
      - It's a big code cleanup.
      
      - invalidate_inode_pages() has been changed to take an address_space
        *, not an inode *.
      
      - the maximum hold times for mapping->page_lock are enormously reduced,
        making it quite feasible to turn this into an irq-safe lock.  Which, it
        seems, is a requirement for sane AIO<->direct-io integration, as well
        as possibly other AIO things.
      
      (Thanks Hugh for fixing a bug in this one as well).
      
      (Christoph added some stuff too)
      735a2573
  15. 27 Sep, 2002 1 commit
    • [PATCH] virtual => physical page mapping cache · 7c2149e9
      Ingo Molnar authored
      Implement a "mapping change" notification for virtual lookup caches, and
      make the futex code use that to keep the futex page pinning consistent
      across copy-on-write events in the VM space.
      7c2149e9
  16. 18 Sep, 2002 1 commit
  17. 17 Sep, 2002 1 commit
  18. 19 Jul, 2002 1 commit
    • [PATCH] minimal rmap · c48c43e6
      Andrew Morton authored
      This is the "minimal rmap" patch, written by Rik, ported to 2.5 by Craig
      Kulesa.
      
      Basically,
      
      before: When the page reclaim code decides that it has scanned too many
      unreclaimable pages on the LRU it does a scan of process virtual
      address spaces for pages to add to swapcache.  ptes pointing at the
      page are unmapped as the scan proceeds.  When all ptes referring to a
      page have been unmapped and it has been written to swap the page is
      reclaimable.
      
      after: When an anonymous page is encountered on the tail of the LRU we
      use the rmap to see if it hasn't been referenced lately.  If so then
      add it to swapcache.  When the page is again encountered on the LRU, if
      it is still unreferenced then try to unmap all ptes which refer to it
      in one hit, and if it is clean (ie: on swap) then free it.
      
      The rest of the VM - list management, the classzone concept, etc
      remains unchanged.
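      
      Conceptually, the reverse mapping added here looks like this (an
      illustrative sketch; the real structures and locking differ in detail):
      
      	/* Every mapped page carries a chain of the ptes that map it. */
      	struct pte_chain {
      		struct pte_chain *next;
      		pte_t		 *ptep;		/* one pte mapping the page */
      	};
      
      	/* struct page grows a pte_chain pointer, so try_to_unmap(page) can
      	 * walk the chain and clear each pte in one pass, instead of
      	 * scanning every mm in the system looking for mappings. */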
      
      There are a number of things which the per-page pte chain could be
      used for.  Bill Irwin has identified the following.
      
      
      (1)  page replacement no longer goes around randomly unmapping things
      
       (2)  referenced bits are more accurate because there aren't several ms
               or even seconds between finding the multiple ptes mapping a page
      
      (3)  reduces page replacement from O(total virtually mapped) to O(physical)
      
      (4)  enables defragmentation of physical memory
      
      (5)  enables cooperative offlining of memory for friendly guest instance
              behavior in UML and/or LPAR settings
      
      (6)  demonstrable benefit in performance of swapping which is common in
              end-user interactive workstation workloads (I don't like the word
              "desktop"). c.f. Craig Kulesa's post wrt. swapping performance
      
      (7)  evidence from 2.4-based rmap trees indicates approximate parity
              with mainline in kernel compiles with appropriate locking bits
      
       (8)  partitioning of physical memory can reduce the complexity of page
               replacement searches by scanning only the "interesting" zones
               (implemented and merged in 2.4-based rmap)
       
       (9)  partitioning of physical memory can increase the parallelism of page
               replacement searches by independently processing different zones
               (implemented, but not merged, in 2.4-based rmap)
      
      (10) the reverse mappings may be used for efficiently keeping pte cache
              attributes coherent
      
      (11) they may be used for virtual cache invalidation (with changes)
      
       (12) the reverse mappings enable proper RSS limit enforcement
               (implemented and merged in 2.4-based rmap)
      
      
      
      The code adds a pointer to struct page, consumes additional storage for
      the pte chains and adds computational expense to the page reclaim code
      (I measured it at 3% additional load during streaming I/O).  The
      benefits which we get back for all this are, I must say, theoretical
      and unproven.  If it has real advantages (or, indeed, disadvantages)
      then why has nobody demonstrated them?
      
      
      
      There are a number of things remaining to be done:
      
      1: Demonstrate the above advantages.
      
      2: Make it work with pte-highmem  (Bill Irwin is signed up for this)
      
      3: The don't-add-pte_chains-to-non-shared-pages optimisation (Dave
         McCracken's patch does this)
      
      4: Move the pte_chains into highmem too (Bill, I guess)
      
      5: per-cpu pte_chain freelists (Rik?)
      
      6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)
      
      7: multithread the page reclaim code.  (I have patches).
      
      8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
         often well-ordered-by-virtual-address on the LRU, so it "just
         works" for benchmarky loads.  But there may be some other loads...
      
      9: Fix bad IO latency in page reclaim (I have lame patches)
      
      10: Develop tuning tools, use them.
      
      11: The nightly updatedb run is still evicting everything.
      c48c43e6
  19. 30 Apr, 2002 1 commit
    • [PATCH] writeback from address spaces · 090da372
      Andrew Morton authored
      [ I reversed the order in which writeback walks the superblock's
        dirty inodes.  It sped up dbench's unlink phase greatly.  I'm
        such a sleaze ]
      
      The core writeback patch.  Switches file writeback from the dirty
      buffer LRU over to address_space.dirty_pages.
      
      - The buffer LRU is removed
      
      - The buffer hash is removed (uses blockdev pagecache lookups)
      
      - The bdflush and kupdate functions are implemented against
        address_spaces, via pdflush.
      
      - The relationship between pages and buffers is changed.
      
        - If a page has dirty buffers, it is marked dirty
        - If a page is marked dirty, it *may* have dirty buffers.
        - A dirty page may be "partially dirty".  block_write_full_page
          discovers this.
      
      - A bunch of consistency checks of the form
      
      	if (!something_which_should_be_true())
      		buffer_error();
      
        have been introduced.  These fog the code up but are important for
        ensuring that the new buffer/page code is working correctly.
      
      - New locking (inode.i_bufferlist_lock) is introduced for exclusion
        from try_to_free_buffers().  This is needed because set_page_dirty
        is called under spinlock, so it cannot lock the page.  But it
        needs access to page->buffers to set them all dirty.
      
        i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
      
      - fs/inode.c has been split: all the code related to file data writeback
        has been moved into fs/fs-writeback.c
      
      - Code related to file data writeback at the address_space level is in
        the new mm/page-writeback.c
      
      - try_to_free_buffers() is now non-blocking
      
      - Switches vmscan.c over to understand that all pages with dirty data
        are now marked dirty.
      
      - Introduces a new a_op for VM writeback:
      
      	->vm_writeback(struct page *page, int *nr_to_write)
      
        this is a bit half-baked at present.  The intent is that the address_space
        is given the opportunity to perform clustered writeback: to opportunistically
        write out disk-contiguous dirty data which may be in other zones, and to let
        delayed-allocate filesystems get good disk layout.  (A sketch of how a
        filesystem might supply this op follows at the end of this entry.)
      
      - Added address_space.io_pages.  Pages which are being prepared for
        writeback.  This is here for two reasons:
      
        1: It will be needed later, when BIOs are assembled direct
           against pagecache, bypassing the buffer layer.  It avoids a
           deadlock which would occur if someone moved the page back onto the
           dirty_pages list after it was added to the BIO, but before it was
           submitted.  (hmm.  This may not be a problem with PG_writeback logic).
      
        2: Avoids a livelock which would occur if some other thread is continually
           redirtying pages.
      
      - There are two known performance problems in this code:
      
        1: Pages which are locked for writeback cause undesirable
           blocking when they are being overwritten.  A patch which leaves
           pages unlocked during writeback comes later in the series.
      
        2: While inodes are under writeback, they are locked.  This
           causes namespace lookups against the file to get unnecessarily
           blocked in wait_on_inode().  This is a fairly minor problem.
      
           I don't have a fix for this at present - I'll fix this when I
           attach dirty address_spaces direct to super_blocks.
      
      - The patch vastly increases the amount of dirty data which the
        kernel permits highmem machines to maintain.  This is because the
        balancing decisions are made against the amount of memory in the
        machine, not against the amount of buffercache-allocatable memory.
      
        This may be very wrong, although it works fine for me (2.5 gigs).
      
        We can trivially go back to the old-style throttling with
        s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
        balance_dirty_pages().  But better would be to allow blockdev
        mappings to use highmem (I'm thinking about this one, slowly).  And
        to move writer-throttling and writeback decisions into the VM (modulo
        the file-overwriting problem).
      
      - Drops 24 bytes from struct buffer_head.  More to come.
      
      - There's some gunk like super_block.flags:MS_FLUSHING which needs to
        be killed.  Need a better way of providing collision avoidance
        between pdflush threads, to prevent more than one pdflush thread
        working a disk at the same time.
      
        The correct way to do that is to put a flag in the request queue to
        say "there's a pdflush thread working this disk".  This is easy to
        do: just generalise the "ra_pages" pointer to point at a struct which
        includes ra_pages and the new collision-avoidance flag.
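      
      The sketch promised above of how a filesystem might supply
      ->vm_writeback (the "myfs" names are made up, and only the simplest
      possible strategy is shown):
      
      	static int myfs_writepage(struct page *page);	/* the normal writepage */
      
      	static int myfs_vm_writeback(struct page *page, int *nr_to_write)
      	{
      		/* A smarter filesystem would also push out disk-contiguous
      		 * neighbouring dirty pages here; just write the one page. */
      		(*nr_to_write)--;
      		return myfs_writepage(page);
      	}
      
      	static struct address_space_operations myfs_aops = {
      		.writepage	= myfs_writepage,
      		.vm_writeback	= myfs_vm_writeback,
      	};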
      090da372
  20. 10 Apr, 2002 2 commits
    • [PATCH] writeback daemons · 1ed704e9
      Andrew Morton authored
      This patch implements a gang-of-threads which are designed to
      be used for dirty data writeback. "pdflush" -> dirty page
      flush, or something.
      
      The number of threads is dynamically managed by a simple
      demand-driven algorithm.
      
      "Oh no, more kernel threads".  Don't worry, kupdate and
      bdflush disappear later.
      
      The intent is that no two pdflush threads are ever performing
      writeback against the same request queue at the same time.
      It would be wasteful to do that.  My current patches don't
      quite achieve this; I need to move the state into the request
      queue itself...
      
      The driver for implementing the thread pool was to avoid the
      possibility that bdflush gets stuck on one device's get_request_wait()
      queue while lots of other disks sit idle.  Also generality,
      abstraction, and the need to have something in place to perform
      the address_space-based writeback when the buffer_head-based
      writeback disappears.
      
      There is no provision inside the pdflush code itself to prevent
      many threads from working against the same device.  That's
      the responsibility of the caller.
      
      The main API function, `pdflush_operation()' attempts to find
      a thread to do some work for you.  It is not reliable - it may
      return -1 and say "sorry, I didn't do that".  This happens if
      all threads are busy.
      
      One _could_ extend pdflush_operation() to queue the work so that
      it is guaranteed to happen.  If there's a need, that additional
      minor complexity can be added.
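      
      An illustrative caller, assuming the prototype is roughly
      pdflush_operation(void (*fn)(unsigned long), unsigned long arg0):
      
      	static void background_work(unsigned long arg)
      	{
      		printk(KERN_DEBUG "pdflush callback ran (arg=%lu)\n", arg);
      		/* ...the real writeback work would go here... */
      	}
      
      	void kick_background_work(void)
      	{
      		if (pdflush_operation(background_work, 0) < 0)
      			background_work(0);	/* all threads busy: do it ourselves */
      	}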
      1ed704e9
    • [PATCH] readahead · 8fa49846
      Andrew Morton authored
      I'd like to be able to claim amazing speedups, but
      the best benchmark I could find was diffing two
      256 megabyte files, which is about 10% quicker.  And
      that is probably due to the window size being effectively
      50% larger.
      
      Fact is, any disk worth owning nowadays has a segmented
      2-megabyte cache, and OS-level readahead mainly seems
      to save on CPU cycles rather than overall throughput.
      Once you start reading more streams than there are segments
      in the disk cache we start to win.
      
      Still.  The main motivation for this work is to
      clean the code up, and to create a central point at
      which many pages are marshalled together so that
      they can all be encapsulated into the smallest possible
      number of BIOs, and injected into the request layer.
      
      A number of filesystems were poking around inside the
      readahead state variables.  I'm not really sure what they
      were up to, but I took all that out.  The readahead
      code manages its own state autonomously and should not
      need any hints.
      
      - Unifies the current three readahead functions (mmap reads, read(2)
        and sys_readahead) into a single implementation.
      
      - More aggressive in building up the readahead windows.
      
      - More conservative in tearing them down.
      
      - Special start-of-file heuristics.
      
      - Preallocates the readahead pages, to avoid the (never demonstrated,
        but potentially catastrophic) scenario where allocation of readahead
        pages causes the allocator to perform VM writeout.
      
      - Gets all the readahead pages gathered together in
        one spot, so they can be marshalled into big BIOs.
      
      - reinstates the readahead ioctls, so hdparm(8) and blockdev(8)
        are working again.  The readahead settings are now per-request-queue,
        and the drivers never have to know about it.  I use blockdev(8).
        It works in units of 512 bytes.
      
      - Identifies readahead thrashing.
      
        Also attempts to handle it.  Certainly the changes here
        delay the onset of catastrophic readahead thrashing by
        quite a lot, and decrease its seriousness as we get more
        deeply into it, but it's still pretty bad.
      8fa49846
  21. 08 Mar, 2002 2 commits
  22. 19 Feb, 2002 1 commit
    • [PATCH] new struct page shrinkage · e5191c50
      Rik van Riel authored
      The patch has been changed like you wanted, with page->zone
      shoved into page->flags. I've also pulled the thing up to
      your latest changes from linux.bkbits.net so you should be
      able to just pull it into your tree from:
      
      Rik
      e5191c50
  23. 05 Feb, 2002 5 commits
    • v2.5.1 -> v2.5.1.1 · 0925bad3
      Linus Torvalds authored
      - me: revert the "kill(-1..)" change.  POSIX isn't that clear on the
      issue anyway, and the new behaviour breaks things.
      - Jens Axboe: more bio updates
      - Al Viro: rd_load cleanups. hpfs mount fix, mount cleanups
      - Ingo Molnar: more raid updates
      - Jakub Jelinek: fix Linux/x86 confusion about arg passing of "save_v86_state" and "do_signal"
      - Trond Myklebust: fix NFS client race conditions
      0925bad3
    • v2.5.0.9 -> v2.5.0.10 · 80044607
      Linus Torvalds authored
      - Jens Axboe: more bio stuff
      - Ingo Molnar: mempool for bio
      - Niibe Yutaka: Super-H update
      80044607
    • v2.4.13 -> v2.4.13.1 · 980adcb2
      Linus Torvalds authored
        - Michael Warfield: computone serial driver update
        - Alexander Viro: cdrom module race fixes
        - David Miller: Acenic driver fix
        - Andrew Grover: ACPI update
        - Kai Germaschewski: ISDN update
        - Tim Waugh: parport update
        - David Woodhouse: JFFS garbage collect sleep
      980adcb2
    • v2.4.6.6 -> v2.4.6.7 · 74f5133b
      Linus Torvalds authored
        - Andreas Dilger: various ext2 cleanups
        - Richard Gooch: devfs update
        - Johannes Erdfelt: USB updates
        - Alan Cox: merges
        - David Miller: fix SMP pktsched bootup deadlock (CONFIG_NET_SCHED)
        - Roman Zippel: AFFS update
        - Anton Altaparmakov: NTFS update
        - me: fix races in vfork() (semaphores are not good completion handlers)
        - Jeff Garzik: net driver updates, sysvfs update
      74f5133b
    • Import changeset · 7a2deb32
      Linus Torvalds authored
      7a2deb32