  1. 14 May, 2004 1 commit
    • [PATCH] filtered wakeups · 2f242854
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      This patch series is solving the "thundering herd" problem that occurs in the
      mainline implementation of hashed waitqueues.  There are two sources of
      spurious wakeups in such arrangements:
      
      (a) Hash collisions that place waiters on different objects on the same
          waitqueue, which wakes threads falsely when any of the objects hashed to
          the same queue receives a wakeup; i.e., loss of information about
          which object a wakeup event is related to.
      
      (b) Loss of information about which object a given waiter is waiting on.
          This precludes wake-one semantics for mutual exclusion scenarios.  For
          instance, a lock bit may be slept on.  If there are any waiters on the
          object, a lock bit release event must wake at least one of them so as to
          prevent deadlock.  But without information as to which waiter is waiting
          on which object, we must resort to waking all waiters who could possibly
          be waiting on it.  Now, as the lock bit provides mutual exclusion, only
          one of the waiters woken can proceed, and the remainder will go back to
          sleep and wait for another event, creating unnecessary system load.  Once
          wake-one semantics are established, only one of the waiters waiting to
          acquire a lock bit needs to be woken, which measurably reduces system load
          and improves efficiency (i.e.  it's the subject of the benchmarking I've
          been sending to you).
      
      Even beyond the measurable efficiency gains, there are reasons of robustness
      and responsiveness to motivate addressing the issue of thundering herds.  In a
      real-life scenario I've been personally involved in resolving, the thundering
      herd issue caused powerful modern SMP machines with fast IO systems to be
      unresponsive to user input for a minute at a time or more.  Analogues of these
      patches for the distro kernels involved fully resolved the issue to the
      customer's satisfaction and obviated workarounds to limit the pagecache's
      size.
      
      The latest spin of these patches basically shoves more pieces of the logic
      into the wakeup functions, with some efficiency gains from sharing the hot
      codepath with the rest of the kernel, and a slightly larger diff than the
      patches with the newly-introduced entrypoint.  Writing these was motivated by
      the push to insulate sched.c from more of the details of wakeup semantics by
      putting more of the logic into the wakeup functions.  In order to accomplish
      this while still solving (b), the wakeup functions grew a new argument for
      communication about what object a wakeup event is related to, to be passed by
      the waker.
      
      =========
      
      This patch provides an additional argument to wakeup functions so that
      information may be passed from the waker to the waiter.  This is provided as a
      separate patch so that the overhead of the additional argument can be measured
      in isolation.  No change in performance was observable here.
  2. 10 May, 2004 2 commits
    • [PATCH] Move migrate_all_tasks to CPU_DEAD handling · ddea677b
      Andrew Morton authored
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
      migrate_all_tasks is currently run with the rest of the machine stopped.
      It iterates through the complete task table, turning off the cpu affinity
      of any task that it finds affine to the dying cpu. Depending on the task
      table size this can take considerable time, during which the machine is
      stopped, doing nothing.
      
      Stopping the machine for such extended periods can be avoided if we do
      task migration in CPU_DEAD notification and that's precisely what this patch
      does.
      
      The patch puts the idle task at the _front_ of the dying CPU's runqueue
      at the highest priority possible. This causes the idle thread to run
      _immediately_ after the kstopmachine thread yields. The idle thread
      notices that its cpu is offline and dies quickly. Task migration can then
      be done at leisure in the CPU_DEAD notification, when the rest of the
      CPUs are running.
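
      As a sketch of the flow (the notifier shape and the migrate_all_tasks
      signature are assumptions, not the patch text):

      /* Sketch: migrate tasks off the dead CPU from the hotplug notifier,
       * while the rest of the machine keeps running. */
      static int migration_call(struct notifier_block *nfb,
      			  unsigned long action, void *hcpu)
      {
      	int cpu = (long)hcpu;

      	switch (action) {
      	case CPU_DEAD:
      		/* The dying CPU's idle thread has already run and died;
      		 * now move its tasks elsewhere at leisure. */
      		migrate_all_tasks(cpu);	/* assumed to take the dead cpu */
      		break;
      	}
      	return NOTIFY_OK;
      }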
      
      Some advantages with this approach are:
      
      	- More scalable. Predictable amount of time that the machine is
      	  stopped.
      	- No changes to hot path/core code. We are just exploiting the
      	  scheduler rule which runs the next high-priority task on the
      	  runqueue. Also, since I put the idle task at the _front_ of the
      	  runqueue, there are no races when an equally high priority task
      	  is woken up and added to the runqueue. It gets in at the back of
      	  the runqueue, _after_ the idle task!
      	- The cpu_is_offline check that is presently required in
      	  try_to_wake_up, idle_balance and rebalance_tick can be removed,
      	  thus speeding them up a bit.
      
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
        Rusty mentioned that the unlikely hints against cpu_is_offline are
        redundant, since the macro already has that hint.  The patch below
        removes those redundant hints I added.
    • [PATCH] sched: balance-on-clone · 8c8cfc36
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      Implement balancing during clone().  It does the following things (a
      rough sketch follows the list):
      
      - introduces SD_BALANCE_CLONE that can serve as a tool for an
        architecture to limit the search-idlest-CPU scope on clone().
        E.g. the 512-CPU systems should rather not enable this.
      
      - uses the highest sd for the imbalance_pct, not this_rq (which didn't
        make sense).
      
      - unifies balance-on-exec and balance-on-clone via the find_idlest_cpu()
        function. Gets rid of sched_best_cpu() which was still a bit
        inconsistent IMO: it used 'min_load < load' as a condition for
        balancing - while a more correct approach would be to use half of the
        imbalance_pct, like passive balancing does.
      
      - the patch also reintroduces the possibility to do SD_BALANCE_EXEC on
        SMP systems, and activates it - to get testing.
      
      - NOTE: there's one thing in this patch that is slightly unclean: I
        introduced wake_up_forked_thread. I did this to make it easier to get
        rid of this patch later (wake_up_forked_process() has lots of
        dependencies in various architectures). If this capability remains in
        the kernel then I'll clean it up and introduce one function for
        wake_up_forked_process/thread.
      
      - NOTE2: I added the SD_BALANCE_CLONE flag to the NUMA CPU template too.
        Some NUMA architectures probably want to disable this.
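
      A rough sketch of the clone-time balancing (internals assumed; locking
      omitted):

      /* Sketch: place the new thread on the idlest CPU permitted by the
       * highest domain that has SD_BALANCE_CLONE set, falling back to the
       * parent's CPU. */
      void wake_up_forked_thread(struct task_struct *p)
      {
      	int this_cpu = task_cpu(current), cpu = this_cpu;
      	struct sched_domain *sd;

      	for_each_domain(this_cpu, sd)	/* walk up, keeping the highest */
      		if (sd->flags & SD_BALANCE_CLONE)
      			cpu = find_idlest_cpu(p, this_cpu, sd);

      	set_task_cpu(p, cpu);
      	activate_task(p, cpu_rq(cpu));	/* queue it on that runqueue */
      }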
  3. 30 Apr, 2004 1 commit
    • [PATCH] task_struct alignment fix · 978b7ac2
      Andrew Morton authored
      The recent slab alignment changes broke an unknown number of architectures
      (parisc and x86_64 for sure) by causing task_structs to be insufficiently
      aligned.
      
      We need good alignment because architectures do things like dumping FP state
      into the task_struct with instructions which require particular alignment (I
      think).
      
      So change the default alignment to L1_CACHE_BYTES, which is what we used to
      have, via SLAB_HW_CACHE_ALIGN.
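
      In essence (a sketch; the exact call site and arguments are assumed):

      /* Sketch: ask the slab allocator for cacheline-aligned task_structs. */
      task_struct_cachep = kmem_cache_create("task_struct",
      			sizeof(struct task_struct), 0,
      			SLAB_HW_CACHE_ALIGN, NULL, NULL);
      if (!task_struct_cachep)
      	panic("fork_init(): cannot create task_struct SLAB cache");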
  4. 12 Apr, 2004 8 commits
    • [PATCH] do_fork() error path memory leak · 23868940
      Andrew Morton authored
      From: <john.l.byrne@hp.com>
      
      In do_fork(), if an error occurs after the mm_struct for the child has been
      allocated, it is never freed.  The exit_mm() meant to free it increments
      the mm_count and this count is never decremented.  (For a running process
      that is exiting, schedule() takes care of this; however, the child process
      being cleaned up is not running.) In the CLONE_VM case, the parent's
      mm_struct will get an extra mm_count and so it will never be freed.
      
      This patch should fix both the CLONE_VM and non-CLONE_VM cases; the test
      of p->active_mm prevents a panic in the case that a kernel-thread is being
      cloned.
    • [PATCH] fix posix-timers to have proper per-process scope · 0e568881
      Andrew Morton authored
      From: Roland McGrath <roland@redhat.com>
      
      The posix-timers implementation associates timers with the creating thread
      and destroys timers when their creator thread dies.  POSIX clearly
      specifies that these timers are per-process, and a timer should not be torn
      down when the thread that created it exits.  I hope there won't be any
      controversy on what the correct semantics are here, since POSIX is clear
      and the Linux feature is called "posix-timers".
      
      The attached program built with NPTL -lrt -lpthread demonstrates the bug.
      The program is correct by POSIX, but fails on Linux.  Note that until
      just the other day, NPTL had a trivial bug that always disabled its use of
      kernel timer syscalls (check strace for lack of timer_create/SYS_259).  So
      unless you have built your own NPTL libs very recently, you probably won't
      see the kernel calls actually used by this program.
      
      Also attached is my patch to fix this.  It (you guessed it) moves the
      posix_timers field from task_struct to signal_struct.  Access is now
      governed by the siglock instead of the task lock.  exit_itimers is called
      from __exit_signal, i.e.  only on the death of the last thread in the
      group, rather than from do_exit for every thread.  Timers' it_process
      fields store the group leader's pointer, which won't die.  For the case of
      SIGEV_THREAD_ID, I hold a ref on the task_struct for it_process to stay
      robust in case the target thread dies; the ref is released and the dangling
      pointer cleared when the timer fires and the target thread is dead.  (This
      should only come up in a buggy user program, so no one cares exactly how
      the kernel handles that case.  But I think what I did is robust and
      sensible.)
      
      /* Test for bogus per-thread deletion of timers.  */
      
      #include <stdio.h>
      #include <error.h>
      #include <time.h>
      #include <signal.h>
      #include <stdint.h>
      #include <sys/time.h>
      #include <sys/resource.h>
      #include <unistd.h>
      #include <pthread.h>
      
      /* Creating timers in another thread should work too.  */
      static void *do_timer_create(void *arg)
      {
      	struct sigevent *const sigev = arg;
      	timer_t *const timerId = sigev->sigev_value.sival_ptr;
      	if (timer_create(CLOCK_REALTIME, sigev, timerId) < 0) {
      		perror("timer_create");
      		return NULL;
      	}
      	return timerId;
      }
      
      int main(void)
      {
      	int i, res;
      	timer_t timerId;
      	struct itimerspec itval;
      	struct sigevent sigev;
      
      	itval.it_interval.tv_sec = 2;
      	itval.it_interval.tv_nsec = 0;
      	itval.it_value.tv_sec = 2;
      	itval.it_value.tv_nsec = 0;
      
      	sigev.sigev_notify = SIGEV_SIGNAL;
      	sigev.sigev_signo = SIGALRM;
      	sigev.sigev_value.sival_ptr = (void *)&timerId;
      
      	for (i = 0; i < 100; i++) {
      		printf("cnt = %d\n", i);
      
      		pthread_t thr;
      		res = pthread_create(&thr, NULL, &do_timer_create, &sigev);
      		if (res) {
      			error(0, res, "pthread_create");
      			continue;
      		}
      		void *val;
      		res = pthread_join(thr, &val);
      		if (res) {
      			error(0, res, "pthread_join");
      			continue;
      		}
      		if (val == NULL)
      			continue;
      
      		res = timer_settime(timerId, 0, &itval, NULL);
      		if (res < 0)
      			perror("timer_settime");
      
      		res = timer_delete(timerId);
      		if (res < 0)
      			perror("timer_delete");
      	}
      
      	return 0;
      }
    • [PATCH] Light-weight Auditing Framework · f85a96f6
      Andrew Morton authored
      From: Rik Faith <faith@redhat.com>
      
      This patch provides a low-overhead system-call auditing framework for Linux
      that is usable by LSM components (e.g., SELinux).  This is an update of the
      patch discussed in this thread:
      
          http://marc.theaimsgroup.com/?t=107815888100001&r=1&w=2
      
      In brief, it provides for netlink-based logging of audit records that have
      been generated in other parts of the kernel (e.g., SELinux) as well as the
      ability to audit system calls, either independently (using simple
      filtering) or as a complement to the audit record that another part of the
      kernel generated.
      
      The main goals were to provide system call auditing with 1) as low overhead
      as possible, and 2) without duplicating functionality that is already
      provided by SELinux (and/or other security infrastructures).  This
      framework will work "stand-alone", but is not designed to provide, e.g.,
      CAPP functionality without another security component in place.
      
      This updated patch includes changes from feedback I have received,
      including the ability to compile without CONFIG_NET (and better use of
      tabs, so use -w if you diff against the older patch).
      
      Please see http://people.redhat.com/faith/audit/ for an early example
      user-space client (auditd-0.4.tar.gz) and instructions on how to try it.
      
      My future intentions at the kernel level include improving filtering (e.g.,
      syscall personality/exit codes) and syscall support for more architectures.
      First, though, I'm going to work on documentation, a (real) audit daemon,
      and patches for other user-space tools so that people can play with the
      framework and understand how it can be used with and without SELinux.
      
      
      Update:
      
      Light-weight Auditing Framework receive filter fixes
      From: Rik Faith <faith@redhat.com>
      
      Since audit_receive_filter() is only called with audit_netlink_sem held, it
      cannot race with either audit_del_rule() or audit_add_rule(), so the
      list_for_each_entry_rcu()s may be replaced by list_for_each_entry()s, and
      the rcu_read_{un,}lock()s removed.  A fix for this is part of the attached
      patch.
      
      Other features of the attached patch are:
      
      1) generalized the ability to test for inequality
      
      2) added syscall exit status reporting and testing
      
      3) added ability to report and test first 4 syscall arguments (this adds
         a large amount of flexibility for little cost; not implemented or tested
         on ppc64)
      
      4) added ability to report and test personality
      
      User-space demo program enhanced for new fields and inequality testing:
      http://people.redhat.com/faith/audit/auditd-0.5.tar.gz
    • [PATCH] fork vma ordering during fork · 424e44d1
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      First of six patches against 2.6.5-rc3, cleaning up mremap's move_vma, and
      fixing truncation orphan issues raised by Rajesh Venkatasubramanian. 
      Originally done as part of the anonymous objrmap work on mremap move, but
      useful fixes now extracted for mainline.  The mremap changes need some
      exposure in the -mm tree first, but the first (fork one-liner) is safe enough
      to go straight into 2.6.5.
      
      
      
      From: Rajesh Venkatasubramanian.  Despite the comment that child vma should
      be inserted just after parent vma, 2.5.6 did exactly the reverse: thus a
      racing vmtruncate may free the child's ptes, then advance to the parent, and
      meanwhile copy_page_range has propagated more ptes from the parent to the
      child, leaving file pages still mapped after truncation.
    • [PATCH] eliminate nswap and cnswap · 8398bcc6
      Andrew Morton authored
      From: Matt Mackall <mpm@selenic.com>
      
      The nswap and cnswap counters have never been incremented, as Linux
      doesn't do task swapping.
    • [PATCH] slab: updates for per-arch alignments · b9e55f3d
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      Description:
      
      Right now kmem_cache_create automatically decides about the alignment of
      allocated objects. The automatic decisions are sometimes wrong:
      
      - for some objects, it's better to keep them as small as possible to
        reduce the memory usage.  Ingo already added a parameter to
        kmem_cache_create for the sigqueue cache, but it wasn't implemented.
      
      - for s390, normal kmalloc must be 8-byte aligned.  With debugging
        enabled, the default alignment was 4 bytes.  This means that s390 cannot
        enable slab debugging.
      
      - arm26 needs 1 kB aligned objects.  Previously this was impossible to
        generate, therefore arm has its own allocator in
        arm26/machine/small_page.c
      
      - most objects should be cache line aligned, to avoid false sharing.  But
        the cache line size was set at compile time, often to 128 bytes for
        generic kernels.  This wastes memory.  The new code uses the runtime
        determined cache line size instead.
      
      - some caches want an explicit alignment.  One example are the pte_chain
        objects: they must find the start of the object with addr&mask.  Right
        now pte_chain objects are scaled to the cache line size, because that was
        the only alignment that could be generated reliably.
      
      The implementation reuses the "offset" parameter of kmem_cache_create and
      now uses it to pass in the requested alignment.  offset was ignored by the
      current implementation, and the only user I found is sigqueue, which
      intended to set the alignment.
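
      For example (illustrative only - the alignment value shown is an
      assumption), a cache with a hard alignment requirement can now request
      it directly:

      /* Sketch: pte_chain objects must locate the object start via
       * addr & mask, so request an explicit power-of-two alignment through
       * the formerly-ignored "offset" argument. */
      pte_chain_cache = kmem_cache_create("pte_chain",
      			sizeof(struct pte_chain),
      			sizeof(struct pte_chain),	/* requested alignment */
      			0, NULL, NULL);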
      
      In the long run, it might be interesting for the main tree: due to the
      128-byte alignment, only 7 inodes fit into one page; with 64-byte
      alignment, 9 inodes fit - 20% memory recovered for Athlon systems.
      
      
      
      For generic kernels running on P6 cpus (i.e. 32 byte cachelines), it means
      
      Number of objects per page:
      
       ext2_inode_cache: 8 instead of 7
       ext3_inode_cache: 8 instead of 7
       fat_inode_cache: 9 instead of 7
       rpc_tasks: 24 instead of 15
       tcp_tw_bucket: 40 instead of 30
       arp_cache: 40 instead of 30
       nfs_write_data: 9 instead of 7
    • [PATCH] move job control fields from task_struct to signal_struct · 7860b371
      Andrew Morton authored
      From: Roland McGrath <roland@redhat.com>
      
      This patch moves all the fields relating to job control from task_struct to
      signal_struct, so that all this info is properly per-process rather than
      being per-thread.
    • [PATCH] posix message queues: code move · c334f752
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      cleanup of sysv ipc as a preparation for posix message queues:
      
      - replace the !CONFIG_SYSVIPC wrappers for copy_semundo and exit_sem
        with static inline wrappers (sketched after this list).  Now the whole
        ipc/util.c file is only used if CONFIG_SYSVIPC is set; use makefile
        magic instead of #ifdef.
      
      - remove the prototypes for copy_semundo and exit_sem from
        kernel/fork.c - they belong in a header file.
      
      - create a new msgutil.c with the helper functions for message queues.
      
      - cleanup the helper functions: run Lindent, add __user tags.
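
      The !CONFIG_SYSVIPC wrappers from the first item presumably reduce to
      no-op inlines along these lines (a sketch):

      /* Sketch: with SysV IPC configured out, fork() and exit() still
       * compile against these stubs instead of #ifdefs around ipc/util.c. */
      #ifndef CONFIG_SYSVIPC
      static inline int copy_semundo(unsigned long clone_flags,
      			       struct task_struct *tsk)
      {
      	return 0;
      }
      static inline void exit_sem(struct task_struct *tsk)
      {
      }
      #endif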
  5. 08 Mar, 2004 1 commit
    • [PATCH] vma corruption fix · bc3d0059
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Fixes bugzilla #2219
      
      fork's dup_mmap leaves child mm_rb as copied from parent mm while doing all
      the copy_page_ranges, and then calls build_mmap_rb without holding
      page_table_lock.
      
      try_to_unmap_one's find_vma (holding page_table_lock not mmap_sem) coming
      on another cpu may cause mm mayhem.  It may leave the child's mmap_cache
      pointing to a vma of the parent mm.
      
      When the parent exits and the child faults, quite what happens rather
      depends on what junk then inhabits vm_page_prot, which gets set in the page
      table, with page_add_rmap adding the ptep, but the junk pte is likely to
      fail the tests for page_remove_rmap.
      
      Eventually the child exits, the page table is freed and try_to_unmap_one
      oopses on null ptep_to_mm (but in a kernel with rss limiting, usually
      page_referenced hits the null ptep_to_mm first).
      
      This took me days and days to unravel!  Big thanks to Matthieu for
      reporting it with a good test case.
  6. 06 Mar, 2004 1 commit
    • [PATCH] fastcall / regparm fixes · 20e39386
      Andrew Morton authored
      From: Gerd Knorr <kraxel@suse.de>
      
      Current versions of gcc error out if a function's declaration and
      definition disagree about the register passing convention.
      
      The patch adds a new `fastcall' declaration primitive, and uses that in all
      the FASTCALL functions which we could find.  A number of inconsistencies were
      fixed up along the way.
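
      On i386 the primitive amounts to a regparm attribute, roughly (the
      example declaration is illustrative):

      /* Sketch of the i386 definitions: declaration and definition can now
       * carry the same register-passing annotation, so gcc sees them agree. */
      #define fastcall	__attribute__((regparm(3)))
      #define FASTCALL(x)	x __attribute__((regparm(3)))

      /* illustrative use - both sides now say "fastcall": */
      fastcall unsigned long do_something(unsigned long arg);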
  7. 25 Feb, 2004 1 commit
    • [PATCH] add syscalls.h · 0bab0642
      Andrew Morton authored
      From: "Randy.Dunlap" <rddunlap@osdl.org>
      
      Add syscalls.h, which contains prototypes for the kernel's system calls.
      Replace open-coded declarations all over the place.  This patch found a
      couple of prior bugs.  It appears to be more important with -mregparm=3 as we
      discover more asmlinkage mismatches.
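
      The collected prototypes take the usual asmlinkage form, e.g. (a sketch
      of representative entries):

      /* Sketch of the kind of prototypes syscalls.h centralizes: */
      asmlinkage long sys_getpid(void);
      asmlinkage long sys_nanosleep(struct timespec __user *rqtp,
      			      struct timespec __user *rmtp);
      asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
      			  int options, struct rusage __user *ru);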
      
      Some syscalls have arch-dependent arguments, so their prototypes are in the
      arch-specific unistd.h.  Maybe it should have been asm/syscalls.h, but there
      were already arch-specific syscall prototypes in asm/unistd.h...
      
      Tested on x86, ia64, x86_64, ppc64, s390 and sparc64.  May cause
      trivial-to-fix build breakage on other architectures.
  8. 18 Feb, 2004 1 commit
    • [PATCH] NGROUPS 2.6.2rc2 + fixups · a937b06e
      Andrew Morton authored
      From: Tim Hockin <thockin@sun.com>,
            Neil Brown <neilb@cse.unsw.edu.au>,
            me
      
      New groups infrastructure.  task->groups and task->ngroups are replaced by
      task->group_info.  group_info is a refcounted, dynamic struct with an array
      of pages.  This allows for large numbers of groups.  The current limit of
      32 groups has been raised to 64k groups.  It can be raised further by
      changing the NGROUPS_MAX constant in limits.h.
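
      The shape of the structure is roughly the following (a sketch patterned
      on the description above; field names are assumptions):

      /* Sketch: a refcounted group list - small sets stored inline, large
       * sets spilled into an array of page-sized blocks. */
      #define NGROUPS_SMALL		32
      #define NGROUPS_PER_BLOCK	(PAGE_SIZE / sizeof(gid_t))

      struct group_info {
      	int ngroups;				/* number of groups */
      	atomic_t usage;				/* reference count */
      	gid_t small_block[NGROUPS_SMALL];	/* inline fast path */
      	int nblocks;
      	gid_t *blocks[0];			/* page-sized blocks */
      };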
  9. 04 Feb, 2004 1 commit
    • [PATCH] Fix more gcc 3.4 warnings · d75cb184
      Andrew Morton authored
      From: Andi Kleen <ak@muc.de>
      
      Just many more warning fixes for a gcc 3.4 snapshot.
      
      It warns about a lot of things now, e.g.  ?: and ({ ...  }) and casts
      used as lvalues, and functions marked inline in headers but with no body.

      Actually there are more warnings; I stopped fixing at some point.  Some of
      the warnings seem to be dubious (e.g.  the binfmt_elf.c one, which looks
      more like a compiler bug to me).
      
      I also fixed the _exit() prototype to be void because gcc was complaining
      about this.
  10. 19 Jan, 2004 4 commits
    • [PATCH] Remove CLONE_DETACHED · 8ce5870d
      Andrew Morton authored
      From: Andries.Brouwer@cwi.nl
      
      Remove the obsolete CLONE_DETACHED flag.
    • [PATCH] Use for_each_cpu() Where It's Meant To Be · 012061cc
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>
      
      Some places use cpu_online() where they should be using cpu_possible, most
      commonly for tallying statistics.  This makes no difference without hotplug
      CPU.
      
      Use the for_each_cpu() macro in those places, providing good examples (and
      making the external hotplug CPU patch smaller).
      
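      A typical corrected site looks like this (a sketch; the per-cpu counter
      name is an assumption):

      /* Sketch: tally a per-cpu statistic over every cpu that may ever come
       * online, not just those online right now. */
      int nr_processes(void)
      {
      	int cpu;
      	int total = 0;

      	for_each_cpu(cpu)
      		total += per_cpu(process_counts, cpu);
      	return total;
      }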
    • [PATCH] CPU scheduler cleanup · 2df40901
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      - move scheduling-state initialization from copy_process() to
        sched_fork() (Nick Piggin)
    • [PATCH] bdev: move i_mapping -> f_mapping conversions · df6a148f
      Andrew Morton authored
      From: viro@parcelfarce.linux.theplanet.co.uk
      
      More uses of ->i_mapping switched to uses of ->f_mapping - stuff that was not
      caught by the earlier f_mapping conversion.
  11. 08 Jan, 2004 1 commit
    • Fix subtle fork() race that Ingo noticed. · f7a1132c
      Linus Torvalds authored
      We must not mark the process TASK_STOPPED early, because
      that might allow a signal to wake it up before we actually
      got to the "wake_up_forked_process()" state. Total confusion
      would happen.
      
      Make wake_up_forked_process() verify the new world order.
  12. 29 Dec, 2003 1 commit
    • [PATCH] unshare_files · 02cda956
      Andrew Morton authored
      From: Chris Wright <chrisw@osdl.org>
      
      Introduce unshare_files as a helper for use during execve to eliminate
      a potential leak of the execve'd binary's fd.
  13. 09 Oct, 2003 1 commit
    • Revert the process group accessor functions. They are buggy, and · 06349d9d
      Linus Torvalds authored
      cause NULL pointer references in /proc.
      
      Moreover, it's questionable whether the whole thing makes sense at all. 
      Per-thread state is good.
      
      Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097
      Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200
      Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211
  14. 05 Oct, 2003 1 commit
    • [PATCH] move job control fields from task_struct to signal_struct · 1bd563fd
      Andrew Morton authored
      From: Roland McGrath <roland@redhat.com>
      
      This patch completes what was started with the `process_group' accessor
      function, moving all the job control-related fields from task_struct into
      signal_struct and using process_foo accessor functions to read them.  All
      these things are per-process in POSIX, none per-thread.  Off hand it's hard
      to come up with the hairy MT scenarios in which the existing code would do
      insane things, but trust me, they're there.  At any rate, having all the
      uses go through inline accessor functions now has got to be all good.
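
      The accessor pattern in question is simply (a sketch):

      /* Sketch: job control state is read through a per-process accessor
       * instead of a per-thread task_struct field. */
      static inline pid_t process_group(struct task_struct *tsk)
      {
      	return tsk->signal->pgrp;
      }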
      
      I did a "make allyesconfig" build and caught the few random drivers and
      whatnot that referred to these fields.  I was surprised to find how few
      references to ->tty there really were to fix up.  I'm sure there will be a
      few more fixups needed in non-x86 code.  The only actual testing of a
      running kernel with these patches I've done is on my normal minimal x86
      config.  Everything works fine as it did before as far as I can tell.
      
      One issue that may be of concern is the lack of any locking on multiple
      threads diddling these fields.  I don't think it really matters, though
      there might be some obscure races that could produce inconsistent job
      control results.  Nothing shattering, I'm sure; probably only something
      like a multi-threaded program calling setsid while its other threads do tty
      i/o, which never happens in reality.  This is the same situation we get by
      using ->group_leader->foo without other synchronization, which seemed to be
      the trend, and no one was worried about it.
  15. 21 Sep, 2003 3 commits
    • [PATCH] Handle init_new_context failures · 1cfc080a
      Andrew Morton authored
      From: Anton Blanchard <anton@samba.org>
      
      If init_new_context fails we definitely do not want to call mmput, because
      that will call destroy_context against an uninitialised context.  Instead
      we should back out what we did in init_mm.  Fixes some weird failures on
      ppc64 when running a fork bomb.
    • [PATCH] scheduler infrastructure · f221af36
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      The attached scheduler patch (against test2-mm2) adds the scheduling
      infrastructure items discussed on lkml.  I got good feedback - and while
      I don't expect it to solve all problems, it does solve a number of bad
      ones:
      
       - test_starve.c code from David Mosberger
      
       - thud.c making the system unusable due to unfairness
      
       - fair/accurate sleep average based on a finegrained clock
      
       - audio skipping way too easily
      
      other changes in sched-test2-mm2-A3:
      
       - ia64 sched_clock() code, from David Mosberger.
      
       - migration thread startup without relying on implicit scheduling
         behavior. The current 2.6 code is correct (due to the cpu-up code
         adding CPUs one by one), but it's also fragile - and this code cannot
         be carried over into the 2.4 backports. So adding this method would
         clean up the startup and would make it easier to do 2.4 backports.
      
      and here's the original changelog for the scheduler changes:
      
       - cycle accuracy (nanosec resolution) timekeeping within the scheduler.
         This fixes a number of audio artifacts (skipping) I've reproduced. I
         don't think we can get away without going cycle accurate - reading the
         cycle counter adds some overhead, but it's acceptable. The first
         nanosec-accuracy patch was done by Mike Galbraith - this patch is
         different but similar in nature. I went further in also changing the
         sleep_avg to be of nanosec resolution.
      
       - more finegrained timeslices: there's now a timeslice 'sub unit' of 50
         usecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level
         will roundrobin with this unit. This change is intended to make gaming
         latencies shorter.
      
       - include scheduling latency in the sleep bonus calculation. This change
         extends the sleep-average calculation to the period of time a task
         spends on the runqueue but doesn't get scheduled yet, right after
         wakeup. Note that tasks that were preempted (i.e. not woken up) and are
         still on the runqueue do not get this benefit. This change closes one
         of the last holes in the dynamic priority estimation; it should result
         in interactive tasks getting more priority under heavy load. This
         change also fixes the test-starve.c testcase from David Mosberger.
      
      
      The TSC-based scheduler clock is disabled on ia32 NUMA platforms (i.e.
      platforms that have unsynched TSCs for sure).  Those platforms should
      provide the proper code to rely on the TSC in a global way.  (No such
      infrastructure exists at the moment - the monotonic TSC-based clock
      doesn't deal with TSC offsets either, as far as I can tell.)
    • [PATCH] Fix setpgid and threads · feaecce4
      Andrew Morton authored
      From: Jeremy Fitzhardinge <jeremy@goop.org>
      
      I'm resending my patch to fix this problem.  To recap: every task_struct
      has its own copy of the thread group's pgrp.  Only the thread group
      leader is allowed to change the tgrp's pgrp, but it only updates its own
      copy of pgrp, while all the other threads in the tgrp use the old value
      they inherited on creation.
      
      This patch simply updates every other thread's pgrp when the tgrp leader
      changes pgrp.  Ulrich has already expressed reservations about
      this patch since it is (1) incomplete (it doesn't cover the case of
      other ids which have similar problems), (2) racy (it doesn't synchronize
      with other threads looking at the task pgrp, so they could see an
      inconsistent view) and (3) slow (it takes linear time with respect to
      the number of threads in the tgrp).
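
      The fix is essentially a walk over the thread group (a sketch; locking
      omitted):

      /* Sketch: when the group leader changes pgrp, copy the new value into
       * every thread of the group so no stale inherited copies remain. */
      struct task_struct *t = p;

      do {
      	t->pgrp = pgid;
      	t = next_thread(t);
      } while (t != p);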
      
      My reaction is that regarding (1), it fixes the actual bug I'm
      encountering in a real program.  Regarding (2), it doesn't really matter
      for pgrp, since it is mostly an issue with respect to the terminal
      job-control code (which is even more broken without this patch).
      Regarding (3), I think there are very few programs which have a large
      number of threads and change process group id on a regular basis (a
      heavily multi-threaded job-control shell?).
      
      Ulrich also said he has a (proposed?) much better fix, which I've been
      looking forward to.  I'm submitting this patch as a stop-gap fix for a
      real bug, and perhaps to prompt the improved patch.
      
      An alternative fix, at least for pgrp, is to change all references to
      ->pgrp to group_leader->pgrp.  This may be sufficient on its own, but it
      would be a reasonably intrusive patch (I count 95 instances in 32 files
      in the 2.6.0-test3-mm3 tree).
  16. 31 Aug, 2003 1 commit
    • [PATCH] add context switch counters · a776ac8d
      Andrew Morton authored
      From: Peter Chubb <peterc@gelato.unsw.edu.au>
      
      Currently, the context switch counters reported by getrusage() are
      always zero.  The appended patch adds fields to struct task_struct to
      count context switches, and adds code to do the counting.
      
      The patch adds 4 longs to struct task_struct, and a single addition to
      the fast path in schedule().
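
      The counting is the classic voluntary/involuntary split in schedule()
      (a sketch; the field names mirror getrusage()'s ru_nvcsw/ru_nivcsw):

      /* Sketch of the schedule() addition: bump the voluntary counter when
       * the task is blocking, the involuntary one when it was preempted. */
      unsigned long *switch_count;

      switch_count = &prev->nivcsw;		/* preempted */
      if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))
      	switch_count = &prev->nvcsw;	/* blocking voluntarily */

      /* ... pick the next task ... */

      if (likely(prev != next))
      	(*switch_count)++;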
  17. 20 Aug, 2003 1 commit
    • [PATCH] fix /proc mm_struct refcounting bug · 7d33101c
      Andrew Morton authored
      From: Suparna Bhattacharya <suparna@in.ibm.com>
      
      The /proc code's bare atomic_inc(&mm->mm_users) is racy against __exit_mm()'s
      mmput() on another CPU: it calls mmput() outside task_lock(tsk), and
      task_lock() isn't appropriate locking anyway.
      
      So what happens is:
      
      	CPU0			          CPU1
      
            mmput()
            ->atomic_dec_and_lock(mm->mm_users)
                                                atomic_inc(mm->mm_users)
            ->list_del(mm->mmlist)
                                                mmput()
                                                ->atomic_dec_and_lock(mm->mm_users)
                                                ->list_del(mm->mmlist)
      
      And the double list_del() of course goes splat.
      
      So we use mmlist_lock to synchronise these steps.
      
      The patch implements a new mmgrab() routine which increments mm_users only if
      the mm isn't already going away.  Changes get_task_mm() and proc_pid_stat()
      to call mmgrab() instead of a direct atomic_inc(&mm->mm_users).
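
      mmgrab() then looks roughly like this (a sketch following the
      description above):

      /* Sketch: pin the mm only if it isn't already on its way out; taking
       * mmlist_lock serializes this against mmput()'s list_del. */
      struct mm_struct *mmgrab(struct mm_struct *mm)
      {
      	spin_lock(&mmlist_lock);
      	if (!atomic_read(&mm->mm_users))
      		mm = NULL;		/* already going away */
      	else
      		atomic_inc(&mm->mm_users);
      	spin_unlock(&mmlist_lock);
      	return mm;
      }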
      
      Hugh, there's some cruft in swapoff which looks like it should be using
      mmgrab()...
  18. 18 Aug, 2003 1 commit
    • [PATCH] cpumask_t: allow more than BITS_PER_LONG CPUs · bf8cb61f
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      Contributions from:
      	Jan Dittmer <jdittmer@sfhq.hn.org>
      	Arnd Bergmann <arnd@arndb.de>
      	"Bryan O'Sullivan" <bos@serpentine.com>
      	"David S. Miller" <davem@redhat.com>
      	Badari Pulavarty <pbadari@us.ibm.com>
      	"Martin J. Bligh" <mbligh@aracnet.com>
      	Zwane Mwaikambo <zwane@linuxpower.ca>
      
      It has been tested on x86, sparc64, x86_64, ia64 (I think), ppc and ppc64.
      
      cpumask_t enables systems with NR_CPUS > BITS_PER_LONG to utilize all their
      cpus by creating an abstract data type dedicated to representing cpu
      bitmasks, similar to fd sets from userspace, and sweeping the appropriate
      code to update callers to the access API.  The fd set-like structure is
      according to Linus' own suggestion; the macro calling convention to ambiguate
      representations with minimal code impact is my own invention.
      
      Specifically, a new set of inline functions for manipulating arbitrary-width
      bitmaps is introduced with a relatively simple implementation, in tandem with
      a new data type representing bitmaps of width NR_CPUS, cpumask_t, whose
      accessor functions are defined in terms of the bitmap manipulation inlines.
      This bitmap ADT found an additional use in i386 arch code handling sparse
      physical APIC ID's, which was convenient to use in this case as the
      accounting structure was required to be wider to accommodate the physids
      consumed by larger numbers of cpus.
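
      In simplified form the ADT is (a sketch; the real accessors also cover
      the call-by-reference variants and the scalar fallback described below):

      /* Sketch: a fixed-width bitmap wide enough for NR_CPUS, with accessors
       * wrapping the ordinary bitmap operations. */
      typedef struct {
      	unsigned long bits[(NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG];
      } cpumask_t;

      #define cpu_set(cpu, dst)	set_bit((cpu), (dst).bits)
      #define cpu_clear(cpu, dst)	clear_bit((cpu), (dst).bits)
      #define cpu_isset(cpu, mask)	test_bit((cpu), (mask).bits)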
      
      For the sake of simplicity and low code impact, these cpu bitmasks are passed
      primarily by value; however, an additional set of accessors along with an
      auxiliary data type with const call-by-reference semantics is provided to
      address performance concerns raised in connection with very large systems,
      such as SGI's larger models, where copying and call-by-value overhead would
      be prohibitive.  Few (if any) users of the call-by-reference API are
      immediately introduced.
      
      Also, in order to avoid calling convention overhead on architectures where
      structures are required to be passed by value, NR_CPUS <= BITS_PER_LONG is
      special-cased so that cpumask_t falls back to an unsigned long and the
      accessors perform the usual bit twiddling on unsigned longs as opposed to
      arrays thereof.  Audits were done with the structure overhead in-place,
      restoring this special-casing only afterward so as to ensure a more complete
      API conversion while undergoing the majority of its end-user exposure in -mm.
      More -mm's were shipped after its restoration to be sure that was tested,
      too.
      
      The immediate users of this functionality are Sun sparc64 systems, SGI mips64
      and ia64 systems, and IBM ia32, ppc64, and s390 systems.  Of these, only the
      ppc64 machines needing the functionality have yet to be released; all others
      have had systems requiring it for full functionality for at least 6 months,
      and in some cases, since the initial Linux port to the affected architecture.
  19. 18 Jul, 2003 3 commits
    • [PATCH] CLONE_STOPPED · 074127b5
      Andrew Morton authored
      From: Ulrich Drepper <drepper@redhat.com>
      
      CLONE_STOPPED: start a thread in a stopped state.  Required for NPTL.
    • [PATCH] Fix two bugs with process limits (RLIMIT_NPROC) · 909cc4ae
      Andrew Morton authored
      From: Neil Brown <neilb@cse.unsw.edu.au>
      
      1/ If a setuid process swaps its real and effective uids and then forks,
       the fork fails if the new realuid has more processes
       than the original process was limited to.
       This is particularly a problem if a user with a process limit
       (e.g. 256) runs a setuid-root program which does setuid() + fork()
       (e.g. lprng) while root already has more than 256 processes (which
       is quite possible).
      
       The root problem here is that a limit which should be a per-user
       limit is being implemented as a per-process limit with
       per-process (e.g. CAP_SYS_RESOURCE) controls.
       Being a per-user limit, it should be that the root-user can over-ride
       it, not just some process with CAP_SYS_RESOURCE.
      
       This patch adds a test to ignore process limits if the real user is root.
      
      2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then
        calls setuid, the setuid should fail if the user would then be running
        more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't.  This patch
        adds an appropriate test.  With this patch, a per-user process limit
        imposed in cgiwrap really works.  (Both checks are sketched below.)
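
      Sketches of the two tests (shapes assumed; the real patch's capability
      checks may differ):

      /* 1/ in copy_process(): enforce the per-user limit, but let a root
       *    real-uid override it. */
      if (atomic_read(&p->user->processes) >= p->rlim[RLIMIT_NPROC].rlim_cur
          && p->user != &root_user && !capable(CAP_SYS_RESOURCE))
      	goto bad_fork_free;

      /* 2/ in set_user(), called from setuid(): refuse to switch to a user
       *    already running more processes than the limit allows. */
      if (atomic_read(&new_user->processes) >=
          current->rlim[RLIMIT_NPROC].rlim_cur)
      	return -EAGAIN;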
    • [PATCH] remove task_cache entirely · 4da99f75
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      kernel/fork.c contains a disabled cache for task structures.  Task
      structures are placed into the task cache only if "tsk==current", and
      "tsk==current" is impossible.  There is even a WARN_ON against that in
      __put_task_struct().
      
      So remove it entirely - it's dead code.
      
      One problem is that order-1 allocations are not cached per-cpu - we can
      use kmalloc for the stack.