1. 24 Aug, 2004 40 commits
    • [PATCH] md: assorted minor md/raid1 fixes · f57252ba
      Neil Brown authored
      1/ rationalise read_balance and "map" in raid1.  Discard map and
         tidy up the interface to read_balance so it can be used instead.
      
      2/ use offsetof rather than a calculation to find the size of a
         structure with a var-length array at the end.
      
      3/ remove some meaningless #defines 
      
      4/ use printk_ratelimit to limit reports of failed sectors being redirected.
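
      Item 2 above can be sketched in user-space C.  The structure and field
      names below (raid_conf, mirror_info, mirrors) are hypothetical stand-ins,
      not the actual raid1 definitions; the sketch only shows offsetof giving
      the size of the fixed part, to which the array elements are added.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device record; not the real raid1 layout. */
struct mirror_info {
    long head_position;
};

/* Hypothetical configuration with a variable-length array at the end. */
struct raid_conf {
    int raid_disks;
    struct mirror_info mirrors[];   /* C99 flexible array member */
};

/* Bytes needed for a raid_conf that holds n mirrors. */
static size_t raid_conf_size(size_t n)
{
    assert(n <= 1024);  /* arbitrary sanity bound for this sketch */
    return offsetof(struct raid_conf, mirrors) + n * sizeof(struct mirror_info);
}
```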
      Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] md: assorted fixes/improvements to generic md resync code. · 32c31806
      Neil Brown authored
      1/ Introduce "mddev->resync_max_sectors" so that an md personality
      can ask for resync to cover a different address range than that of a
      single drive.  raid10 will use this.
      
      2/ fix is_mddev_idle so that if there seem to be a negative number
       of events, it doesn't immediately assume activity.
      
      3/ make "sync_io" (the count of IO sectors used for array resync)
       an atomic_t to avoid SMP races. 
      
      4/ Pass md_sync_acct a "block_device" rather than the containing "rdev",
        as the whole rdev isn't needed. Also make this an inline function.
      
      5/ Make sure recovery gets interrupted on any error.
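
      Item 3 above can be illustrated with a user-space analogue.  The kernel
      uses its own atomic_t API; this sketch substitutes C11 <stdatomic.h>, and
      the names disk_stats, sync_acct and sync_io_read are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-disk resync counter. */
struct disk_stats {
    atomic_int sync_io;   /* sectors of resync I/O issued so far */
};

/* Account nr_sectors of resync I/O; safe against concurrent callers. */
static void sync_acct(struct disk_stats *d, int nr_sectors)
{
    assert(d != NULL && nr_sectors >= 0);
    atomic_fetch_add(&d->sync_io, nr_sectors);
}

/* Read the current count without tearing. */
static int sync_io_read(struct disk_stats *d)
{
    assert(d != NULL);
    return atomic_load(&d->sync_io);
}
```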
      Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugetlb: permit executable mappings · b60e5e71
      William Lee Irwin III authored
      During the kernel summit, some discussion was had about the support
      requirements for a userspace program loader that loads executables into
      hugetlb on behalf of a major application (Oracle).  In order to support
      this in a robust fashion, the cleanup of the hugetlb must be robust in the
      presence of disorderly termination of the programs (e.g.  kill -9).  Hence,
      the cleanup semantics are those of System V shared memory, but Linux'
      System V shared memory needs one critical extension for this use:
      executability.
      
      The following microscopic patch enables this major application to provide
      robust hugetlb cleanup.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] x86 PAE swapspace expansion · 93ff3346
      William Lee Irwin III authored
      PAE is artificially limited in terms of swapspace to the same bitsplit as
      ordinary i386, a 5/24 split (32 swapfiles, 64GB max swapfile size), when a
      5/27 split (32 swapfiles, 512GB max swapfile size) is feasible.  This patch
      transparently removes that limitation by using more of the space available
      in PAE's wider ptes for swap ptes.
      
      While this is obviously not likely to be used directly, it is important
      from the standpoint of strict non-overcommit, where the swapspace must be
      potentially usable in order to be reserved for non-overcommit.  There are
      workloads with Committed_AS of over 256GB on ia32 PAE wanting strict
      non-overcommit to prevent being OOM killed.
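
      The 64GB and 512GB figures follow from the split, assuming 4 KiB pages:
      5 bits select the swapfile and the remaining bits count page-sized
      offsets.  A small sketch of the arithmetic (the helper names are
      hypothetical, not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u  /* 4 KiB pages, assumed for i386 */

/* Number of swapfiles addressable with type_bits bits. */
static unsigned max_swapfiles(unsigned type_bits)
{
    assert(type_bits < 32);
    return 1u << type_bits;
}

/* Largest swapfile, in bytes, addressable with offset_bits page offsets. */
static uint64_t max_swapfile_bytes(unsigned offset_bits)
{
    assert(offset_bits < 64 - PAGE_SHIFT);
    return (UINT64_C(1) << offset_bits) << PAGE_SHIFT;
}
```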
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] fix i386/x86_64 idle routine selection · 7177784e
      Zwane Mwaikambo authored
      This was broken when the mwait stuff went in since it executes after the
      initial idle_setup() has already selected an idle routine and overrides it
      with default_idle.
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove magic +1 from shm segment count · fefd81e1
      Manfred Spraul authored
      Michael Kerrisk found a bug in the shm accounting code: sysv shm allows
      SHMMNI+1 shared memory segments to be created, instead of SHMMNI.  The +1
      is probably from the first shared anonymous mapping implementation that
      used the sysv code to implement shared anon mappings.
      
      The implementation got replaced, it's now the other way around (sysv uses
      the shared anon code), but the +1 remained.
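
      The off-by-one can be shown with a minimal sketch.  The function names
      and the exact form of the comparison are hypothetical and are not taken
      from the shm code; only the effect of the stray +1 is illustrated.

```c
#include <assert.h>
#include <stdbool.h>

/* Limit check with the leftover +1: admits one segment too many. */
static bool may_create_with_plus_one(int in_use, int shmmni)
{
    assert(in_use >= 0 && shmmni >= 0);
    return in_use < shmmni + 1;
}

/* Limit check without the +1: at most shmmni segments exist. */
static bool may_create_fixed(int in_use, int shmmni)
{
    assert(in_use >= 0 && shmmni >= 0);
    return in_use < shmmni;
}
```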
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] OProfile/XScale fixes for PXA270/XScale2 · 11c61286
      Zwane Mwaikambo authored
      The wrong mask was being used when writing back to the PMNC
      write-only-zero bits, and the CCNT was only ticking every 64 processor
      cycles.  Tested on IOP331 and PXA270; I'm still looking for XScale1 users...
      Signed-off-by: Luca Rossato <l.rossato@tiscali.it>
      Signed-off-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kill CLONE_IDLETASK · 69c46983
      William Lee Irwin III authored
        The sole remaining usage of CLONE_IDLETASK is to determine whether pid
        allocation should be performed in copy_process().  This patch eliminates
        that last branch on CLONE_IDLETASK in the normal process creation path,
        removes the masking of CLONE_IDLETASK from clone_flags as it's now ignored
        under all circumstances, and furthermore eliminates the symbol
        CLONE_IDLETASK entirely.
      
      From: William Lee Irwin III <wli@holomorphy.com>
      
        Fix the fork-idle consolidation.  During that consolidation, the generic
        code was made to pass a pointer to on-stack pt_regs that had been memset()
        to 0.  ia64, however, requires a NULL pt_regs pointer argument and
        dispatches on that in its copy_thread() function to do SMP
        trampoline-specific RSE -related setup.  Passing pointers to zeroed pt_regs
        resulted in SMP wakeup -time deadlocks and exceptions.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sched: consolidate CLONE_IDLETASK masking · f4205a53
      William Lee Irwin III authored
      Every arch now bears the burden of sanitizing CLONE_IDLETASK out of the
      clone_flags passed to do_fork() by userspace.  This patch hoists the
      masking of CLONE_IDLETASK out of the system call entrypoints into
      do_fork(), and thereby removes some small overheads from do_fork(), as
      do_fork() may now assume that CLONE_IDLETASK has been cleared.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] improve speed of freeing bootmem · fe92ebf3
      Josh Aas authored
      Attached is a patch that greatly improves the speed of freeing boot memory.
       On ia64 machines with 2GB or more memory (I didn't test with less, but I
      can't imagine there being a problem), the speed improvement is about 75%
      for the function free_all_bootmem_core.  This translates to savings on the
      order of 1 minute / TB of memory during boot time.  That number comes from
      testing on a machine with 512GB, and extrapolating based on profiling of an
      unpatched 4TB machine.  For 4 and 8 TB machines, the time spent in this
      function is about 1 minute/TB, which is painful especially given that
      nothing is printed to the console to indicate what is going on (this issue
      may be addressed later).
      
      The basic idea is to free higher order pages instead of going through every
      single one.  Also, some unnecessary atomic operations are done away with
      and replaced with non-atomic equivalents, and prefetching is done where it
      helps the most.  For a more in-depth discussion of this patch, please see
      the linux-ia64 archives (topic is "free bootmem feedback patch").
      
      The patch is originally Tony Luck's, and I added some further optimizations
      (non-atomic ops improvements and prefetching).
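
      The basic idea of freeing higher order blocks can be sketched in user
      space.  This is not the bootmem code: the bitmap convention (a set bit
      means the page is free), the 32-bit word size and the function name are
      assumptions made for the sketch.  A fully set word is released as one
      32-page block instead of 32 single pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS 32u

/*
 * Walk a free-page bitmap and return how many free operations are needed.
 * The number of pages released is stored in *pages_freed.
 */
static size_t free_from_bitmap(const uint32_t *map, size_t nwords,
                               size_t *pages_freed)
{
    size_t ops = 0, pages = 0;

    assert(map != NULL && pages_freed != NULL);
    for (size_t i = 0; i < nwords; i++) {
        uint32_t v = map[i];

        if (v == UINT32_MAX) {
            /* Whole aligned run is free: one higher order free. */
            ops += 1;
            pages += WORD_BITS;
        } else {
            /* Mixed word: fall back to one free per set bit. */
            for (; v != 0; v &= v - 1) {
                ops++;
                pages++;
            }
        }
    }
    *pages_freed = pages;
    return ops;
}
```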
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Josh Aas <josha@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix mpage_readpage() for big requests · 13c61952
      Badari Pulavarty authored
      The problem is, if we increase our readahead size arbitrarily (say 2M), we
      call mpage_readpages() with 2M and when it tries to allocate a bio large
      enough to fit 2M it fails, then we kick it back to the "confused" code, which
      does 4K at a time.
      
      The fix is to ask for the maximum the driver can handle.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] x86: remove hard-coded numbers from ptr_ok() · 7b4731ff
      Roland Dreier authored
      Looks like arch/i386/kernel/doublefault.c is one place in the code that
      hardcodes the assumption that PAGE_OFFSET == 0xC0000000.  Here's a patch
      that fixes that.
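
      A rough sketch of the idea, expressed relative to PAGE_OFFSET rather
      than a literal address.  The 16 MiB window size and the function form
      are assumptions for illustration; the actual macro in doublefault.c may
      differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_BYTES UINT32_C(0x01000000)  /* 16 MiB window: an assumption */

/* True if x lies strictly inside the window that starts at page_offset. */
static bool ptr_ok_at(uint32_t page_offset, uint32_t x)
{
    /* Guard against wrap-around in page_offset + WINDOW_BYTES. */
    assert(page_offset <= UINT32_MAX - WINDOW_BYTES);
    return x > page_offset && x < page_offset + WINDOW_BYTES;
}
```

      With the base passed in (or taken from PAGE_OFFSET), the same check
      keeps working when the kernel is built with a different user/kernel
      split.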
      Signed-off-by: Roland Dreier <roland@topspin.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] emu10k1 maintainer update · a12d5822
      James Courtier-Dutton authored
      Rui Sousa has been unreachable for a long time now, so I have taken over
      the emu10k1 project on sf.net.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Correctly handle d_path error returns · 72c21479
      Andrea Arcangeli authored
      There are some minor bugs in the d_path handling (the nfsd one may not be the
      correct fix: there's no failure path for it, so I just terminate the
      string; and the last one in the audit subsystem is just a robustness
      cleanup in case somebody extends d_path in the future, right now it's a
      noop).
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] alloc_pages priority tuning · 29d15009
      Andrew Morton authored
      Fix up the logic which decides when the caller can dip into page reserves.
      
      - If the caller has realtime scheduling policy, or if the caller cannot run
        direct reclaim, then allow the caller to use up to a quarter of the page
        reserves.
      
      - If the caller has __GFP_HIGH then allow the caller to use up to half of
        the page reserves.
      
      - If the caller has PF_MEMALLOC then the caller can use 100% of the page
        reserves.
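
      The three rules can be sketched as a function returning how many pages
      of the reserve must stay untouched for a given caller.  The flag names
      are hypothetical, and letting the quarter and half reductions compound
      when both apply is an assumption of this sketch, not something the
      description above states.

```c
#include <assert.h>

/* Hypothetical caller properties for this sketch. */
enum {
    CALLER_RT_OR_NO_RECLAIM = 1 << 0, /* realtime, or cannot run direct reclaim */
    CALLER_GFP_HIGH         = 1 << 1, /* __GFP_HIGH allocation */
    CALLER_MEMALLOC         = 1 << 2  /* PF_MEMALLOC task */
};

/* Pages of the reserve the caller may NOT dip into. */
static unsigned long reserve_floor(unsigned long pages_min, unsigned flags)
{
    unsigned long min = pages_min;

    assert((flags & ~7u) == 0);
    if (flags & CALLER_MEMALLOC)
        return 0;                 /* may use 100% of the reserves */
    if (flags & CALLER_GFP_HIGH)
        min -= min / 2;           /* may use up to half */
    if (flags & CALLER_RT_OR_NO_RECLAIM)
        min -= min / 4;           /* may use up to a quarter of what is left */
    return min;
}
```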
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] vm: alloc_pages watermark fixes · ac12db05
      Nick Piggin authored
      Previously the ->protection[] logic was broken.  It was difficult to follow
      and basically didn't use the asynch reclaim watermarks (pages_min,
      pages_low, pages_high) properly.
      
      Now use ->protection *only* for lower-zone protection.  So the allocator
      now explicitly uses the ->pages_low, ->pages_min watermarks and adds
      ->protection on top of that, instead of trying to use ->protection for
      everything.
      
      Pages are allocated down to (->pages_low + ->protection).  Once this is
      reached, kswapd (the background reclaim) is started; after this, we can
      allocate down to (->pages_min + ->protection) without blocking; the memory
      below pages_min is reserved for __GFP_HIGH and PF_MEMALLOC allocations. 
      kswapd attempts to reclaim memory until ->pages_high is reached.
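
      The watermark ladder described above can be sketched as a small
      classifier.  The struct, enum and function names are hypothetical, and
      the treatment of the exact boundary values (strictly greater than) is an
      assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum alloc_state {
    ALLOC_FREELY,        /* above pages_low + protection */
    ALLOC_KSWAPD_AWAKE,  /* kswapd started, allocation still does not block */
    ALLOC_RESERVE_ONLY   /* left for __GFP_HIGH and PF_MEMALLOC callers */
};

struct zone_marks {
    unsigned long pages_min, pages_low, pages_high, protection;
};

/* Decide which regime a zone is in, given its current free page count. */
static enum alloc_state classify(const struct zone_marks *z,
                                 unsigned long free_pages)
{
    assert(z != NULL);
    assert(z->pages_min <= z->pages_low && z->pages_low <= z->pages_high);
    if (free_pages > z->pages_low + z->protection)
        return ALLOC_FREELY;
    if (free_pages > z->pages_min + z->protection)
        return ALLOC_KSWAPD_AWAKE;
    return ALLOC_RESERVE_ONLY;
}
```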
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] vm: writeout watermark tuning · 0d761325
      Nick Piggin authored
      Slightly change the writeout watermark calculations so we keep background
      and synchronous writeout watermarks in the same ratios after adjusting them
      for the amount of mapped memory.  This ensures we should always attempt to
      start background writeout before synchronous writeout and preserves the
      admin's desired background-versus-foreground ratios after we've
      auto-adjusted one of them.
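
      One way to keep the two watermarks in the same ratio is to scale the
      background value by the same factor that was applied to the synchronous
      one.  This is a sketch of that arithmetic only; the function name and
      the integer rounding are assumptions, and the patch's actual calculation
      may differ.

```c
#include <assert.h>

/*
 * Scale the background ratio by adjusted_dirty_ratio / dirty_ratio so the
 * admin's background-versus-foreground proportion is preserved.
 */
static unsigned scaled_background_ratio(unsigned background_ratio,
                                        unsigned dirty_ratio,
                                        unsigned adjusted_dirty_ratio)
{
    assert(dirty_ratio > 0 && adjusted_dirty_ratio <= dirty_ratio);
    return background_ratio * adjusted_dirty_ratio / dirty_ratio;
}
```

      For example, with a background ratio of 10 and a synchronous ratio of 40
      that gets lowered to 20, the background ratio becomes 5, so background
      writeout still starts first.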
      Signed-off-by: Nick Piggin <nickpiggin@cyberone.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] simple fs stop -ve dentries · 8a34e562
      Hugh Dickins authored
      A tmpfs user reported increasingly slow directory reads when repeatedly
      creating and unlinking in a mkstemp-like way.  The negative dentries
      accumulate alarmingly (until memory pressure finally frees them), and are
      just a hindrance to any in-memory filesystem.  simple_lookup set d_op to
      arrange for negative dentries to be deleted immediately.
      
      (But I failed to discover how it is that on-disk filesystems seem to keep
      their negative dentries within manageable bounds: this effect was gross
      with tmpfs or ramfs, but no problem at all with extN or reiser.)
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] clarify get_task_mm (mmgrab) · 7dbb1d67
      Hugh Dickins authored
      Clarify mmgrab by collapsing it into get_task_mm (in fork.c not inline),
      and commenting on the special case it is guarding against: when use_mm in
      an AIO daemon temporarily adopts the mm while it's on its way out.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] x86 bitops.h commentary on instruction reordering · c524e494
      Marcelo Tosatti authored
      Back when we were discussing the need for a memory barrier in sync_page(),
      it came to me (thanks Andrea!) that the bit operations can be perfectly
      reordered on architectures other than x86.
      
      I think the commentary in i386 bitops.h is misleading; it's worth noting
      that these operations are not guaranteed not to be reordered on
      different architectures.
      
      clear_bit() already does that:
      
       * clear_bit() is atomic and may not be reordered.  However, it does
       * not contain a memory barrier, so if it is used for locking purposes,
       * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
       * in order to ensure changes are visible on other processors.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] rmaplock: swapoff use anon_vma · 69929041
      Hugh Dickins authored
      Swapoff can make good use of a page's anon_vma and index, while it's still
      left in swapcache, or once it's brought back in and the first pte mapped back:
      unuse_vma can go directly to just one page of only those vmas with the same
      anon_vma.  And unuse_process can skip any vmas without an anon_vma (extending
      the hugetlb check: hugetlb vmas have no anon_vma).
      
      This just hacks in on top of the existing procedure, still going through all
      the vmas of all the mms in mmlist.  A more elegant procedure might replace
      mmlist by a list of anon_vmas: but that would be more work to implement, with
      apparently more overhead in the common paths.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] rmaplock: mm lock ordering · 9d9ae43b
      Hugh Dickins authored
      With page_map_lock out of the way, there's no need for page_referenced and
      try_to_unmap to use trylocks - provided we switch anon_vma->lock and
      mm->page_table_lock around in anon_vma_prepare.  Though I suppose it's
      possible that we'll find that vmscan makes better progress with trylocks than
      spinning - we're free to choose trylocks again if so.
      
      Try to update the mm lock ordering documentation in filemap.c.  But I still
      find it confusing, and I've no idea of where to stop.  So add an mm lock
      ordering list I can understand to rmap.c.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] rmaplock: SLAB_DESTROY_BY_RCU · 77631565
      Hugh Dickins authored
      With page_map_lock gone, how to stabilize page->mapping's anon_vma while
      acquiring anon_vma->lock in page_referenced_anon and try_to_unmap_anon?
      
      The page cannot actually be freed (vmscan holds reference), but however much
      we check page_mapped (which guarantees that anon_vma is in use - or would
      guarantee that if we added suitable barriers), there's no locking against page
      becoming unmapped the instant after, then anon_vma freed.
      
      It's okay to take anon_vma->lock after it's freed, so long as it remains a
      struct anon_vma (its list would become empty, or perhaps reused for an
      unrelated anon_vma: but no problem since we always check that the page located
      is the right one); but corruption if that memory gets reused for some other
      purpose.
      
      This is not unique: it's liable to be a problem whenever the kernel tries to
      approach a structure obliquely.  It's generally solved with an atomic
      reference count; but one advantage of anon_vma over anonmm is that it does not
      have such a count, and it would be a backward step to add one.
      
      Therefore...  implement SLAB_DESTROY_BY_RCU flag, to guarantee that such a
      kmem_cache_alloc'ed structure cannot get freed to other use while the
      rcu_read_lock is held i.e.  preempt disabled; and use that for anon_vma.
      
      Fix concerns raised by Manfred: this flag is incompatible with poisoning and
      destructor, and kmem_cache_destroy needs to synchronize_kernel.
      
      I hope SLAB_DESTROY_BY_RCU may be useful elsewhere; but though it's safe for
      little anon_vma, I'd be reluctant to use it on any caches whose immediate
      shrinkage under pressure is important to the system.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] rmaplock: kill page_map_lock · edcc56dc
      Hugh Dickins authored
      The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to
      lock its pte_chains.  We kept this (as page_map_lock: bit_spin_lock on
      PG_maplock) when we moved to objrmap.  But the file objrmap locks its vma tree
      with mapping->i_mmap_lock, and the anon objrmap locks its vma list with
      anon_vma->lock: so isn't the page_map_lock superfluous?
      
      Pretty much, yes.  The mapcount was protected by it, and needs to become an
      atomic: starting at -1 like page _count, so nr_mapped can be tracked precisely
      up and down.  The last page_remove_rmap can't clear anon page mapping any
      more, because of races with page_add_rmap; from which some BUG_ONs must go for
      the same reason, but they've served their purpose.
      
      vmscan decisions are naturally racy, little change there beyond removing
      page_map_lock/unlock.  But to stabilize the file-backed page->mapping against
      truncation while acquiring i_mmap_lock, page_referenced_file now needs page
      lock to be held even for refill_inactive_zone.  There's a similar issue in
      acquiring anon_vma->lock, where page lock doesn't help: which this patch
      pretends to handle, but actually it needs the next.
      
      Roughly 10% cut off lmbench fork numbers on my 2*HT*P4.  Must confess my
      testing failed to show the races even while they were knowingly exposed: would
      benefit from testing on racier equipment.
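
      The atomic mapcount that starts at -1 can be sketched with C11 atomics
      as a user-space analogue of the kernel's atomic_t.  The struct and
      function names are hypothetical.  Starting at -1 lets the first mapping
      and the last unmapping be detected from the counter's previous value, so
      nr_mapped moves exactly once in each direction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; mapcount of -1 means unmapped. */
struct page_sketch {
    atomic_int mapcount;
};

static atomic_long nr_mapped;   /* number of pages with at least one mapping */

static void page_sketch_init(struct page_sketch *p)
{
    assert(p != NULL);
    atomic_init(&p->mapcount, -1);
}

static void add_rmap(struct page_sketch *p)
{
    assert(p != NULL);
    /* Previous value -1 means this is the first mapping of the page. */
    if (atomic_fetch_add(&p->mapcount, 1) == -1)
        atomic_fetch_add(&nr_mapped, 1);
}

static void remove_rmap(struct page_sketch *p)
{
    assert(p != NULL);
    /* Previous value 0 means the last mapping has just gone away. */
    if (atomic_fetch_sub(&p->mapcount, 1) == 0)
        atomic_fetch_sub(&nr_mapped, 1);
}
```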
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] rmaplock: PageAnon in mapping · 6f055bc1
      Hugh Dickins authored
      First of a batch of five patches to eliminate rmap's page_map_lock, replace
      its trylocking by spinlocking, and use anon_vma to speed up swapoff.
      
      Patches updated from the originals against 2.6.7-mm7: nothing new so I won't
      spam the list, but including Manfred's SLAB_DESTROY_BY_RCU fixes, and omitting
      the unuse_process mmap_sem fix already in 2.6.8-rc3.
      
      
      This patch:
      
      Replace the PG_anon page->flags bit by setting the lower bit of the pointer in
      page->mapping when it's anon_vma: PAGE_MAPPING_ANON bit.
      
      We're about to eliminate the locking which kept the flags and mapping in
      synch: it's much easier to work on a local copy of page->mapping, than worry
      about whether flags and mapping are in synch (though I imagine it could be
      done, at greater cost, with some barriers).
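
      Storing the flag in the low bit of the pointer can be sketched as
      follows.  The struct and helper names are hypothetical simplifications,
      not the kernel's definitions; the sketch relies on anon_vma objects
      being at least 2-byte aligned so that the low bit is free.  Reading
      page->mapping once yields both the pointer and the anon flag, which is
      the "local copy" advantage described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON ((uintptr_t)1)

struct anon_vma_sketch { int placeholder; };   /* hypothetical */
struct tagged_page { void *mapping; };         /* hypothetical */

/* Point the page at an anon_vma and tag the pointer as anonymous. */
static void page_set_anon(struct tagged_page *p, struct anon_vma_sketch *av)
{
    assert(p != NULL && av != NULL);
    assert(((uintptr_t)av & PAGE_MAPPING_ANON) == 0);  /* low bit must be free */
    p->mapping = (void *)((uintptr_t)av | PAGE_MAPPING_ANON);
}

static bool page_is_anon(const struct tagged_page *p)
{
    assert(p != NULL);
    return ((uintptr_t)p->mapping & PAGE_MAPPING_ANON) != 0;
}

/* Return the untagged anon_vma pointer, or NULL if the page is not anon. */
static struct anon_vma_sketch *page_anon_vma(const struct tagged_page *p)
{
    uintptr_t m;

    assert(p != NULL);
    m = (uintptr_t)p->mapping;      /* one read gives pointer and flag */
    if (!(m & PAGE_MAPPING_ANON))
        return NULL;
    return (struct anon_vma_sketch *)(m & ~PAGE_MAPPING_ANON);
}
```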
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix /proc/pid/statm documentation · 52ad51e6
      Roger Luethi authored
      I really wanted /proc/pid/statm to die and I still believe the
      reasoning is valid.  As it doesn't look like that is going to happen,
      though, I offer this fix for the respective documentation.  Note: lrs/drs
      fields are switched.
      Signed-off-by: Roger Luethi <rl@hellgate.ch>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Automatically enable bigsmp on big HP machines · c178f392
      Arjan van de Ven authored
      This enables apic=bigsmp automatically on some big HP machines that need
      it.  This makes them boot without kernel parameters on a generic arch
      kernel.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] ia64: dma_mapping fix · a6843b89
      William Lee Irwin III authored
      We need to be able to dereference struct device in
      include/asm-ia64/dma-mapping.h.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] md: make MD no device warning KERN_WARNING · d033bbf5
      Andi Kleen authored
      Prevents some noise during boot up when no MD volumes are found.
      
      I think I picked it up from someone else, but I cannot remember from whom
      (sorry)
      
      Cc: Neil Brown <neilb@cse.unsw.edu.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Make MAX_INIT_ARGS 32 · 54d68822
      Pete Zaitcev authored
      We at Red Hat shipped a larger number of arguments for quite some time, it
      was required for installations on IBM mainframe (s390), which doesn't have
      a good way to pass arguments.
      
      There are a number of reasonable situations that go past the current limit
      of 8.  One that comes to mind is when you want to perform a manual vnc
      install on a headless machine using anaconda.  This requires passing in a
      number of parameters to get anaconda past the initial (no-gui) loader
      screens.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] AIO: workqueue context switch reduction · e84e486c
      Suparna Bhattacharya authored
      From: Chris Mason
      
      I compared the 2.6 pipetest results with the 2.4 suse kernel, and 2.6 was
      roughly 40% slower.  During the pipetest run, 2.6 generates ~600,000
      context switches per second while 2.4 generates 30 or so.
      
      aio-context-switch (attached) has a few changes that reduce our context
      switch rate, and bring performance back up to 2.4 levels.  These have only
      really been tested against pipetest, they might make other workloads worse.
      
      The basic theory behind the patch is that it is better for the userland
      process to call run_iocbs than it is to schedule away and let the worker
      thread do it.
      
                                                                                    
      1) on io_submit, use run_iocbs instead of run_iocb
      2) on io_getevents, call run_iocbs if no events were available.
      
      3) don't let two procs call run_iocbs for the same context at the same
         time.  They just end up bouncing on spinlocks.
      
      The first three optimizations got me down to 360,000 context switches per
      second, and they help build a little structure to allow optimization #4,
      which uses queue_delayed_work(HZ/10) instead of queue_work. 
      
      That brings down the number of context switches to 2.4 levels.
      
      Adds aio_run_all_iocbs so that normal processes can run all the pending
      retries on the run list.  This allows worker threads to keep using list
      splicing, but regular procs get to run the list until it stays empty.  The
      end result should be less work for the worker threads.
      
      I was able to trigger short stalls (1sec) with aio-stress, and with the
      current patch they are gone.  Could be wishful thinking on my part though,
      please let me know how this works for you.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] AIO: Splice runlist for fairness across io contexts · 068b52c1
      Suparna Bhattacharya authored
      This patch tries to be a little fairer across multiple io contexts in handling
      retries, helping make sure progress happens uniformly across different io
      contexts (especially if they are acting on independent queues).
      
      It splices the ioctx runlist before processing it in __aio_run_iocbs.  If
      new iocbs get added to the ctx in the meantime, it queues a fresh workqueue
      entry instead of handling them right away, so that other ioctxs' retries get
      a chance to be processed before the newer entries in the queue.
      
      This might make a difference in a situation where retries are getting
      queued very fast on one ioctx, while the workqueue entry for another ioctx
      is stuck behind it.  I've only seen this occasionally earlier and can't
      recreate it consistently, but it may be worth including anyway.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] AIO: retry infrastructure fixes and enhancements · 63b05203
      Suparna Bhattacharya authored
      From: Daniel McNeil <daniel@osdl.org>
      From: Chris Mason <mason@suse.com>
      
       Reorganises, comments and fixes the AIO retry logic. Fixes 
       and enhancements include:
      
         - Split iocb setup and execution in io_submit
              (also fixes io_submit error reporting)
         - Use aio workqueue instead of keventd for retries
         - Default high level retry methods
         - Subtle use_mm/unuse_mm fix
         - Code commenting
         - Fix aio process hang on EINVAL (Daniel McNeil)
         - Hold the context lock across unuse_mm
         - Acquire task_lock in use_mm()
         - Allow fops to override the retry method with their own
         - Elevated ref count for AIO retries (Daniel McNeil)
         - set_fs needed when calling use_mm
         - Flush workqueue on __put_ioctx (Chris Mason)
         - Fix io_cancel to work with retries (Chris Mason)
         - Read-immediate option for socket/pipe retry support
      
       Note on default high-level retry methods support
       ================================================
      
       High-level retry methods allow an AIO request to be executed as a series of
       non-blocking iterations, where each iteration retries the remaining part of
       the request from where the last iteration left off, by reissuing the
       corresponding AIO fop routine with modified arguments representing the
       remaining I/O.  The retries are "kicked" via the AIO waitqueue callback
       aio_wake_function() which replaces the default wait queue entry used for
       blocking waits.
      
       The high level retry infrastructure is responsible for running the
       iterations in the mm context (address space) of the caller, and ensures that
       only one retry instance is active at a given time, thus relieving the fops
       themselves from having to deal with potential races of that sort.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpqfc: add missing pci_enable_device() · 86b9159a
      Bjorn Helgaas authored
      Add pci_enable_device()/pci_disable_device().  In the past, drivers
      often worked without this, but it is now required in order to route
      PCI interrupts correctly.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] de4x5.c: add missing pci_enable_device() · 0ad8ac84
      Bjorn Helgaas authored
      Add pci_enable_device()/pci_disable_device().  In the past, drivers
      often worked without this, but it is now required in order to route
      PCI interrupts correctly.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] ioc3-eth.c: add missing pci_enable_device() · aa22e9a9
      Bjorn Helgaas authored
      Add pci_enable_device()/pci_disable_device().  In the past, drivers often
      worked without this, but it is now required in order to route PCI interrupts
      correctly.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hp100.c: add missing pci_enable_device() · 41b3f604
      Bjorn Helgaas authored
      Add pci_enable_device()/pci_disable_device().  In the past, drivers often
      worked without this, but it is now required in order to route PCI interrupts
      correctly.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] ibmasm: add missing pci_enable_device() · 94ae67e9
      Bjorn Helgaas authored
      Add pci_enable_device()/pci_disable_device().  In the past, drivers often
      worked without this, but it is now required in order to route PCI
      interrupts correctly.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] tpam_main.c: add missing pci_enable_device() · f09e59b4
      Bjorn Helgaas authored
      Add pci_enable_device()/pci_disable_device().  In the past, drivers
      often worked without this, but it is now required in order to route
      PCI interrupts correctly.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] ip2main.c: add missing pci_enable_device() · dcd769e6
      Bjorn Helgaas authored
      I don't have this hardware, so this has been compiled but not tested.
      
      Add pci_enable_device()/pci_disable_device().  In the past, drivers often worked
      without this, but it is now required in order to route PCI interrupts
      correctly.  In addition, this driver incorrectly used the IRQ value from PCI
      config space rather than the one in the struct pci_dev.
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>