  1. 14 May, 2004 1 commit
    • [PATCH] filtered wakeups · 2f242854
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      This patch series is solving the "thundering herd" problem that occurs in the
      mainline implementation of hashed waitqueues.  There are two sources of
      spurious wakeups in such arrangements:
      
      (a) Hash collisions that place waiters on different objects on the same
          waitqueue, which wakes threads falsely when any of the objects hashed to
          the same queue receives a wakeup; i.e., loss of information about
          which object a wakeup event is related to.
      
      (b) Loss of information about which object a given waiter is waiting on.
          This precludes wake-one semantics for mutual exclusion scenarios.  For
          instance, a lock bit may be slept on.  If there are any waiters on the
          object, a lock bit release event must wake at least one of them so as to
          prevent deadlock.  But without information as to which waiter is waiting
          on which object, we must resort to waking all waiters who could possibly
          be waiting on it.  Now, as the lock bit provides mutual exclusion, only
          one of the waiters woken can proceed, and the remainder will go back to
          sleep and wait for another event, creating unnecessary system load.  Once
          wake-one semantics are established, only one of the waiters waiting to
          acquire a lock bit needs to be woken, which measurably reduces system load
          and improves efficiency (i.e.  it's the subject of the benchmarking I've
          been sending to you).
      
      Even beyond the measurable efficiency gains, there are reasons of robustness
      and responsiveness to motivate addressing the issue of thundering herds.  In a
      real-life scenario I've been personally involved in resolving, the thundering
      herd issue caused powerful modern SMP machines with fast IO systems to be
      unresponsive to user input for a minute at a time or more.  Analogues of these
      patches for the distro kernels involved fully resolved the issue to the
      customer's satisfaction and obviated workarounds to limit the pagecache's
      size.
      
      The latest spin of these patches basically shoves more pieces of the logic
      into the wakeup functions, with some efficiency gains from sharing the hot
      codepath with the rest of the kernel, and a slightly larger diff than the
      patches with the newly-introduced entrypoint.  Writing these was motivated by
      the push to insulate sched.c from more of the details of wakeup semantics by
      putting more of the logic into the wakeup functions.  In order to accomplish
      this while still solving (b), the wakeup functions grew a new argument for
      communication about what object a wakeup event is related to, to be passed by
      the waker.
      
      =========
      
      This patch provides an additional argument to wakeup functions so that
      information may be passed from the waker to the waiter.  This is provided as a
      separate patch so that the overhead of the additional argument can be measured
      in isolation.  No change in performance was observable here.
  2. 10 May, 2004 2 commits
    • [PATCH] Move migrate_all_tasks to CPU_DEAD handling · ddea677b
      Andrew Morton authored
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
      migrate_all_tasks is currently run with the rest of the machine stopped.
      It iterates through the complete task table, turning off the cpu affinity
      of any task that it finds affine to the dying cpu. Depending on the task
      table size this can take considerable time, during which the machine is
      stopped, doing nothing.
      
      Stopping the machine for such extended periods can be avoided if we do
      task migration in CPU_DEAD notification and that's precisely what this patch
      does.
      
      The patch puts the idle task at the _front_ of the dying CPU's runqueue
      at the highest priority possible. This causes the idle thread to run
      _immediately_ after the kstopmachine thread yields. The idle thread
      notices that its cpu is offline and dies quickly. Task migration can then
      be done at leisure in the CPU_DEAD notification, when the rest of the
      CPUs are running.
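
      As a sketch of the flow (the notifier shape and the migrate_all_tasks
      signature are assumptions, not the patch text):

      /* Sketch: migrate tasks off the dead CPU from the hotplug notifier,
       * while the rest of the machine keeps running. */
      static int migration_call(struct notifier_block *nfb,
      			  unsigned long action, void *hcpu)
      {
      	int cpu = (long)hcpu;

      	switch (action) {
      	case CPU_DEAD:
      		/* The dying CPU's idle thread has already run and died;
      		 * now move its tasks elsewhere at leisure. */
      		migrate_all_tasks(cpu);	/* assumed to take the dead cpu */
      		break;
      	}
      	return NOTIFY_OK;
      }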
      
      Some advantages with this approach are:
      
      	- More scalable. Predictable amount of time that the machine is
      	  stopped.
      	- No changes to hot path/core code. We are just exploiting the
      	  scheduler rule which runs the next high-priority task on the
      	  runqueue. Also, since I put the idle task at the _front_ of the
      	  runqueue, there are no races when an equally high priority task
      	  is woken up and added to the runqueue. It gets in at the back of
      	  the runqueue, _after_ the idle task!
      	- The cpu_is_offline check that is presently required in
      	  try_to_wake_up, idle_balance and rebalance_tick can be removed,
      	  thus speeding them up a bit.
      
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
        Rusty mentioned that the unlikely hints against cpu_is_offline are
        redundant, since the macro already has that hint.  The patch below
        removes those redundant hints I added.
    • [PATCH] sched: balance-on-clone · 8c8cfc36
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      Implement balancing during clone().  It does the following things (a
      rough sketch follows the list):
      
      - introduces SD_BALANCE_CLONE that can serve as a tool for an
        architecture to limit the search-idlest-CPU scope on clone().
        E.g. the 512-CPU systems should rather not enable this.
      
      - uses the highest sd for the imbalance_pct, not this_rq (which didn't
        make sense).
      
      - unifies balance-on-exec and balance-on-clone via the find_idlest_cpu()
        function. Gets rid of sched_best_cpu() which was still a bit
        inconsistent IMO: it used 'min_load < load' as a condition for
        balancing - while a more correct approach would be to use half of the
        imbalance_pct, like passive balancing does.
      
      - the patch also reintroduces the possibility to do SD_BALANCE_EXEC on
        SMP systems, and activates it - to get testing.
      
      - NOTE: there's one thing in this patch that is slightly unclean: I
        introduced wake_up_forked_thread. I did this to make it easier to get
        rid of this patch later (wake_up_forked_process() has lots of
        dependencies in various architectures). If this capability remains in
        the kernel then I'll clean it up and introduce one function for
        wake_up_forked_process/thread.
      
      - NOTE2: I added the SD_BALANCE_CLONE flag to the NUMA CPU template too.
        Some NUMA architectures probably want to disable this.
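
      A rough sketch of the clone-time balancing (internals assumed; locking
      omitted):

      /* Sketch: place the new thread on the idlest CPU permitted by the
       * highest domain that has SD_BALANCE_CLONE set, falling back to the
       * parent's CPU. */
      void wake_up_forked_thread(struct task_struct *p)
      {
      	int this_cpu = task_cpu(current), cpu = this_cpu;
      	struct sched_domain *sd;

      	for_each_domain(this_cpu, sd)	/* walk up, keeping the highest */
      		if (sd->flags & SD_BALANCE_CLONE)
      			cpu = find_idlest_cpu(p, this_cpu, sd);

      	set_task_cpu(p, cpu);
      	activate_task(p, cpu_rq(cpu));	/* queue it on that runqueue */
      }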
  3. 30 Apr, 2004 1 commit
    • [PATCH] task_struct alignment fix · 978b7ac2
      Andrew Morton authored
      The recent slab alignment changes broke an unknown number of architectures
      (parisc and x86_64 for sure) by causing task_structs to be insufficiently
      aligned.
      
      We need good alignment because architectures do things like dumping FP state
      into the task_struct with instructions which require particular alignment (I
      think).
      
      So change the default alignment to L1_CACHE_BYTES, which is what we used to
      have, via SLAB_HW_CACHE_ALIGN.
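
      In essence (a sketch; the exact call site and arguments are assumed):

      /* Sketch: ask the slab allocator for cacheline-aligned task_structs. */
      task_struct_cachep = kmem_cache_create("task_struct",
      			sizeof(struct task_struct), 0,
      			SLAB_HW_CACHE_ALIGN, NULL, NULL);
      if (!task_struct_cachep)
      	panic("fork_init(): cannot create task_struct SLAB cache");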
  4. 12 Apr, 2004 8 commits
    • [PATCH] do_fork() error path memory leak · 23868940
      Andrew Morton authored
      From: <john.l.byrne@hp.com>
      
      In do_fork(), if an error occurs after the mm_struct for the child has been
      allocated, it is never freed.  The exit_mm() meant to free it increments
      the mm_count and this count is never decremented.  (For a running process
      that is exiting, schedule() takes care of this; however, the child process
      being cleaned up is not running.) In the CLONE_VM case, the parent's
      mm_struct will get an extra mm_count and so it will never be freed.
      
      This patch should fix both the CLONE_VM and non-CLONE_VM cases; the test
      of p->active_mm prevents a panic in the case that a kernel-thread is being
      cloned.
    • [PATCH] fix posix-timers to have proper per-process scope · 0e568881
      Andrew Morton authored
      From: Roland McGrath <roland@redhat.com>
      
      The posix-timers implementation associates timers with the creating thread
      and destroys timers when their creator thread dies.  POSIX clearly
      specifies that these timers are per-process, and a timer should not be torn
      down when the thread that created it exits.  I hope there won't be any
      controversy on what the correct semantics are here, since POSIX is clear
      and the Linux feature is called "posix-timers".
      
      The attached program built with NPTL -lrt -lpthread demonstrates the bug.
      The program is correct by POSIX, but fails on Linux.  Note that until
      just the other day, NPTL had a trivial bug that always disabled its use of
      kernel timer syscalls (check strace for lack of timer_create/SYS_259).  So
      unless you have built your own NPTL libs very recently, you probably won't
      see the kernel calls actually used by this program.
      
      Also attached is my patch to fix this.  It (you guessed it) moves the
      posix_timers field from task_struct to signal_struct.  Access is now
      governed by the siglock instead of the task lock.  exit_itimers is called
      from __exit_signal, i.e.  only on the death of the last thread in the
      group, rather than from do_exit for every thread.  Timers' it_process
      fields store the group leader's pointer, which won't die.  For the case of
      SIGEV_THREAD_ID, I hold a ref on the task_struct for it_process to stay
      robust in case the target thread dies; the ref is released and the dangling
      pointer cleared when the timer fires and the target thread is dead.  (This
      should only come up in a buggy user program, so no one cares exactly how
      the kernel handles that case.  But I think what I did is robust and
      sensible.)
      
      /* Test for bogus per-thread deletion of timers.  */
      
      #include <stdio.h>
      #include <error.h>
      #include <time.h>
      #include <signal.h>
      #include <stdint.h>
      #include <sys/time.h>
      #include <sys/resource.h>
      #include <unistd.h>
      #include <pthread.h>
      
      /* Creating timers in another thread should work too.  */
      static void *do_timer_create(void *arg)
      {
      	struct sigevent *const sigev = arg;
      	timer_t *const timerId = sigev->sigev_value.sival_ptr;
      	if (timer_create(CLOCK_REALTIME, sigev, timerId) < 0) {
      		perror("timer_create");
      		return NULL;
      	}
      	return timerId;
      }
      
      int main(void)
      {
      	int i, res;
      	timer_t timerId;
      	struct itimerspec itval;
      	struct sigevent sigev;
      
      	itval.it_interval.tv_sec = 2;
      	itval.it_interval.tv_nsec = 0;
      	itval.it_value.tv_sec = 2;
      	itval.it_value.tv_nsec = 0;
      
      	sigev.sigev_notify = SIGEV_SIGNAL;
      	sigev.sigev_signo = SIGALRM;
      	sigev.sigev_value.sival_ptr = (void *)&timerId;
      
      	for (i = 0; i < 100; i++) {
      		printf("cnt = %d\n", i);
      
      		pthread_t thr;
      		res = pthread_create(&thr, NULL, &do_timer_create, &sigev);
      		if (res) {
      			error(0, res, "pthread_create");
      			continue;
      		}
      		void *val;
      		res = pthread_join(thr, &val);
      		if (res) {
      			error(0, res, "pthread_join");
      			continue;
      		}
      		if (val == NULL)
      			continue;
      
      		res = timer_settime(timerId, 0, &itval, NULL);
      		if (res < 0)
      			perror("timer_settime");
      
      		res = timer_delete(timerId);
      		if (res < 0)
      			perror("timer_delete");
      	}
      
      	return 0;
      }
    • [PATCH] Light-weight Auditing Framework · f85a96f6
      Andrew Morton authored
      From: Rik Faith <faith@redhat.com>
      
      This patch provides a low-overhead system-call auditing framework for Linux
      that is usable by LSM components (e.g., SELinux).  This is an update of the
      patch discussed in this thread:
      
          http://marc.theaimsgroup.com/?t=107815888100001&r=1&w=2
      
      In brief, it provides for netlink-based logging of audit records that have
      been generated in other parts of the kernel (e.g., SELinux) as well as the
      ability to audit system calls, either independently (using simple
      filtering) or as a complement to the audit record that another part of the
      kernel generated.
      
      The main goals were to provide system call auditing with 1) as low overhead
      as possible, and 2) without duplicating functionality that is already
      provided by SELinux (and/or other security infrastructures).  This
      framework will work "stand-alone", but is not designed to provide, e.g.,
      CAPP functionality without another security component in place.
      
      This updated patch includes changes from feedback I have received,
      including the ability to compile without CONFIG_NET (and better use of
      tabs, so use -w if you diff against the older patch).
      
      Please see http://people.redhat.com/faith/audit/ for an early example
      user-space client (auditd-0.4.tar.gz) and instructions on how to try it.
      
      My future intentions at the kernel level include improving filtering (e.g.,
      syscall personality/exit codes) and syscall support for more architectures.
      First, though, I'm going to work on documentation, a (real) audit daemon,
      and patches for other user-space tools so that people can play with the
      framework and understand how it can be used with and without SELinux.
      
      
      Update:
      
      Light-weight Auditing Framework receive filter fixes
      From: Rik Faith <faith@redhat.com>
      
      Since audit_receive_filter() is only called with audit_netlink_sem held, it
      cannot race with either audit_del_rule() or audit_add_rule(), so the
      list_for_each_entry_rcu()s may be replaced by list_for_each_entry()s, and
      the rcu_read_{un,}lock()s removed.  A fix for this is part of the attached
      patch.
      
      Other features of the attached patch are:
      
      1) generalized the ability to test for inequality
      
      2) added syscall exit status reporting and testing
      
      3) added ability to report and test first 4 syscall arguments (this adds
         a large amount of flexibility for little cost; not implemented or tested
         on ppc64)
      
      4) added ability to report and test personality
      
      User-space demo program enhanced for new fields and inequality testing:
      http://people.redhat.com/faith/audit/auditd-0.5.tar.gz
    • [PATCH] fork vma ordering during fork · 424e44d1
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      First of six patches against 2.6.5-rc3, cleaning up mremap's move_vma, and
      fixing truncation orphan issues raised by Rajesh Venkatasubramanian. 
      Originally done as part of the anonymous objrmap work on mremap move, but
      useful fixes now extracted for mainline.  The mremap changes need some
      exposure in the -mm tree first, but the first (fork one-liner) is safe enough
      to go straight into 2.6.5.
      
      
      
      From: Rajesh Venkatasubramanian.  Despite the comment that child vma should
      be inserted just after parent vma, 2.5.6 did exactly the reverse: thus a
      racing vmtruncate may free the child's ptes, then advance to the parent, and
      meanwhile copy_page_range has propagated more ptes from the parent to the
      child, leaving file pages still mapped after truncation.
    • [PATCH] eliminate nswap and cnswap · 8398bcc6
      Andrew Morton authored
      From: Matt Mackall <mpm@selenic.com>
      
      The nswap and cnswap counters have never been incremented, as Linux
      doesn't do task swapping.
    • [PATCH] slab: updates for per-arch alignments · b9e55f3d
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      Description:
      
      Right now kmem_cache_create automatically decides about the alignment of
      allocated objects. The automatic decisions are sometimes wrong:
      
      - for some objects, it's better to keep them as small as possible to
        reduce the memory usage.  Ingo already added a parameter to
        kmem_cache_create for the sigqueue cache, but it wasn't implemented.
      
      - for s390, normal kmalloc must be 8-byte aligned.  With debugging
        enabled, the default alignment was 4 bytes.  This means that s390 cannot
        enable slab debugging.
      
      - arm26 needs 1 kB aligned objects.  Previously this was impossible to
        generate, therefore arm has its own allocator in
        arm26/machine/small_page.c
      
      - most objects should be cache line aligned, to avoid false sharing.  But
        the cache line size was set at compile time, often to 128 bytes for
        generic kernels.  This wastes memory.  The new code uses the runtime
        determined cache line size instead.
      
      - some caches want an explicit alignment.  One example are the pte_chain
        objects: they must find the start of the object with addr&mask.  Right
        now pte_chain objects are scaled to the cache line size, because that was
        the only alignment that could be generated reliably.
      
      The implementation reuses the "offset" parameter of kmem_cache_create and
      now uses it to pass in the requested alignment.  offset was ignored by the
      current implementation, and the only user I found is sigqueue, which
      intended to set the alignment.
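
      For example (illustrative only - the alignment value shown is an
      assumption), a cache with a hard alignment requirement can now request
      it directly:

      /* Sketch: pte_chain objects must locate the object start via
       * addr & mask, so request an explicit power-of-two alignment through
       * the formerly-ignored "offset" argument. */
      pte_chain_cache = kmem_cache_create("pte_chain",
      			sizeof(struct pte_chain),
      			sizeof(struct pte_chain),	/* requested alignment */
      			0, NULL, NULL);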
      
      In the long run, it might be interesting for the main tree: due to the
      128-byte alignment, only 7 inodes fit into one page; with 64-byte
      alignment, 9 inodes fit - 20% memory recovered for Athlon systems.
      
      
      
      For generic kernels running on P6 cpus (i.e. 32 byte cachelines), it means
      
      Number of objects per page:
      
       ext2_inode_cache: 8 instead of 7
       ext3_inode_cache: 8 instead of 7
       fat_inode_cache: 9 instead of 7
       rpc_tasks: 24 instead of 15
       tcp_tw_bucket: 40 instead of 30
       arp_cache: 40 instead of 30
       nfs_write_data: 9 instead of 7
    • [PATCH] move job control fields from task_struct to signal_struct · 7860b371
      Andrew Morton authored
      From: Roland McGrath <roland@redhat.com>
      
      This patch moves all the fields relating to job control from task_struct to
      signal_struct, so that all this info is properly per-process rather than
      being per-thread.
    • [PATCH] posix message queues: code move · c334f752
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      cleanup of sysv ipc as a preparation for posix message queues:
      
      - replace the !CONFIG_SYSVIPC wrappers for copy_semundo and exit_sem
        with static inline wrappers (sketched after this list).  Now the whole
        ipc/util.c file is only used if CONFIG_SYSVIPC is set; use makefile
        magic instead of #ifdef.
      
      - remove the prototypes for copy_semundo and exit_sem from
        kernel/fork.c - they belong in a header file.
      
      - create a new msgutil.c with the helper functions for message queues.
      
      - cleanup the helper functions: run Lindent, add __user tags.
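
      The !CONFIG_SYSVIPC wrappers from the first item presumably reduce to
      no-op inlines along these lines (a sketch):

      /* Sketch: with SysV IPC configured out, fork() and exit() still
       * compile against these stubs instead of #ifdefs around ipc/util.c. */
      #ifndef CONFIG_SYSVIPC
      static inline int copy_semundo(unsigned long clone_flags,
      			       struct task_struct *tsk)
      {
      	return 0;
      }
      static inline void exit_sem(struct task_struct *tsk)
      {
      }
      #endif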
  5. 08 Mar, 2004 1 commit
    • [PATCH] vma corruption fix · bc3d0059
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Fixes bugzilla #2219
      
      fork's dup_mmap leaves child mm_rb as copied from parent mm while doing all
      the copy_page_ranges, and then calls build_mmap_rb without holding
      page_table_lock.
      
      try_to_unmap_one's find_vma (holding page_table_lock not mmap_sem) coming
      on another cpu may cause mm mayhem.  It may leave the child's mmap_cache
      pointing to a vma of the parent mm.
      
      When the parent exits and the child faults, quite what happens rather
      depends on what junk then inhabits vm_page_prot, which gets set in the page
      table, with page_add_rmap adding the ptep, but the junk pte is likely to
      fail the tests for page_remove_rmap.
      
      Eventually the child exits, the page table is freed and try_to_unmap_one
      oopses on null ptep_to_mm (but in a kernel with rss limiting, usually
      page_referenced hits the null ptep_to_mm first).
      
      This took me days and days to unravel!  Big thanks to Matthieu for
      reporting it with a good test case.
  6. 06 Mar, 2004 1 commit
    • [PATCH] fastcall / regparm fixes · 20e39386
      Andrew Morton authored
      From: Gerd Knorr <kraxel@suse.de>
      
      Current versions of gcc error out if a function's declaration and
      definition disagree about the register passing convention.
      
      The patch adds a new `fastcall' declaration primitive, and uses that in all
      the FASTCALL functions which we could find.  A number of inconsistencies were
      fixed up along the way.
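
      On i386 the primitive amounts to a regparm attribute, roughly (the
      example declaration is illustrative):

      /* Sketch of the i386 definitions: declaration and definition can now
       * carry the same register-passing annotation, so gcc sees them agree. */
      #define fastcall	__attribute__((regparm(3)))
      #define FASTCALL(x)	x __attribute__((regparm(3)))

      /* illustrative use - both sides now say "fastcall": */
      fastcall unsigned long do_something(unsigned long arg);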
  7. 25 Feb, 2004 1 commit
    • [PATCH] add syscalls.h · 0bab0642
      Andrew Morton authored
      From: "Randy.Dunlap" <rddunlap@osdl.org>
      
      Add syscalls.h, which contains prototypes for the kernel's system calls.
      Replace open-coded declarations all over the place.  This patch found a
      couple of prior bugs.  It appears to be more important with -mregparm=3 as we
      discover more asmlinkage mismatches.
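
      The collected prototypes take the usual asmlinkage form, e.g. (a sketch
      of representative entries):

      /* Sketch of the kind of prototypes syscalls.h centralizes: */
      asmlinkage long sys_getpid(void);
      asmlinkage long sys_nanosleep(struct timespec __user *rqtp,
      			      struct timespec __user *rmtp);
      asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
      			  int options, struct rusage __user *ru);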
      
      Some syscalls have arch-dependent arguments, so their prototypes are in the
      arch-specific unistd.h.  Maybe it should have been asm/syscalls.h, but there
      were already arch-specific syscall prototypes in asm/unistd.h...
      
      Tested on x86, ia64, x86_64, ppc64, s390 and sparc64.  May cause
      trivial-to-fix build breakage on other architectures.
  8. 18 Feb, 2004 1 commit
    • [PATCH] NGROUPS 2.6.2rc2 + fixups · a937b06e
      Andrew Morton authored
      From: Tim Hockin <thockin@sun.com>,
            Neil Brown <neilb@cse.unsw.edu.au>,
            me
      
      New groups infrastructure.  task->groups and task->ngroups are replaced by
      task->group_info.  group_info is a refcounted, dynamic struct with an array
      of pages.  This allows for large numbers of groups.  The current limit of
      32 groups has been raised to 64k groups.  It can be raised further by
      changing the NGROUPS_MAX constant in limits.h.
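
      The shape of the structure is roughly the following (a sketch patterned
      on the description above; field names are assumptions):

      /* Sketch: a refcounted group list - small sets stored inline, large
       * sets spilled into an array of page-sized blocks. */
      #define NGROUPS_SMALL		32
      #define NGROUPS_PER_BLOCK	(PAGE_SIZE / sizeof(gid_t))

      struct group_info {
      	int ngroups;				/* number of groups */
      	atomic_t usage;				/* reference count */
      	gid_t small_block[NGROUPS_SMALL];	/* inline fast path */
      	int nblocks;
      	gid_t *blocks[0];			/* page-sized blocks */
      };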
  9. 04 Feb, 2004 1 commit
    • [PATCH] Fix more gcc 3.4 warnings · d75cb184
      Andrew Morton authored
      From: Andi Kleen <ak@muc.de>
      
      Just many more warning fixes for a gcc 3.4 snapshot.
      
      It warns about a lot of things now, e.g.  ?: and ({ ...  }) and casts
      used as lvalues, and functions marked inline in headers but with no body.

      Actually there are more warnings; I stopped fixing at some point.  Some of
      the warnings seem to be dubious (e.g.  the binfmt_elf.c one, which looks
      more like a compiler bug to me).
      
      I also fixed the _exit() prototype to be void because gcc was complaining
      about this.
  10. 19 Jan, 2004 4 commits
    • [PATCH] Remove CLONE_DETACHED · 8ce5870d
      Andrew Morton authored
      From: Andries.Brouwer@cwi.nl
      
      Remove the obsolete CLONE_DETACHED flag.
    • [PATCH] Use for_each_cpu() Where It's Meant To Be · 012061cc
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>
      
      Some places use cpu_online() where they should be using cpu_possible, most
      commonly for tallying statistics.  This makes no difference without hotplug
      CPU.
      
      Use the for_each_cpu() macro in those places, providing good examples (and
      making the external hotplug CPU patch smaller).
      
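      A typical corrected site looks like this (a sketch; the per-cpu counter
      name is an assumption):

      /* Sketch: tally a per-cpu statistic over every cpu that may ever come
       * online, not just those online right now. */
      int nr_processes(void)
      {
      	int cpu;
      	int total = 0;

      	for_each_cpu(cpu)
      		total += per_cpu(process_counts, cpu);
      	return total;
      }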
    • [PATCH] CPU scheduler cleanup · 2df40901
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      - move scheduling-state initialization from copy_process() to
        sched_fork() (Nick Piggin)
    • [PATCH] bdev: move i_mapping -> f_mapping conversions · df6a148f
      Andrew Morton authored
      From: viro@parcelfarce.linux.theplanet.co.uk
      
      More uses of ->i_mapping switched to uses of ->f_mapping - stuff that was not
      caught by the earlier f_mapping conversion.
  11. 08 Jan, 2004 1 commit
    • Fix subtle fork() race that Ingo noticed. · f7a1132c
      Linus Torvalds authored
      We must not mark the process TASK_STOPPED early, because
      that might allow a signal to wake it up before we actually
      got to the "wake_up_forked_process()" state. Total confusion
      would happen.
      
      Make wake_up_forked_process() verify the new world order.
  12. 29 Dec, 2003 1 commit
    • [PATCH] unshare_files · 02cda956
      Andrew Morton authored
      From: Chris Wright <chrisw@osdl.org>
      
      Introduce unshare_files as a helper for use during execve to eliminate
      a potential leak of the execve'd binary's fd.
  13. 09 Oct, 2003 1 commit
    • Revert the process group accessor functions. They are buggy, and · 06349d9d
      Linus Torvalds authored
      cause NULL pointer references in /proc.
      
      Moreover, it's questionable whether the whole thing makes sense at all. 
      Per-thread state is good.
      
      Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097
      Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200
      Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211
  14. 05 Oct, 2003 1 commit
    • [PATCH] move job control fields from task_struct to signal_struct · 1bd563fd
      Andrew Morton authored
      From: Roland McGrath <roland@redhat.com>
      
      This patch completes what was started with the `process_group' accessor
      function, moving all the job control-related fields from task_struct into
      signal_struct and using process_foo accessor functions to read them.  All
      these things are per-process in POSIX, none per-thread.  Off hand it's hard
      to come up with the hairy MT scenarios in which the existing code would do
      insane things, but trust me, they're there.  At any rate, having all the
      uses go through inline accessor functions now has got to be all good.
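
      The accessor pattern in question is simply (a sketch):

      /* Sketch: job control state is read through a per-process accessor
       * instead of a per-thread task_struct field. */
      static inline pid_t process_group(struct task_struct *tsk)
      {
      	return tsk->signal->pgrp;
      }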
      
      I did a "make allyesconfig" build and caught the few random drivers and
      whatnot that referred to these fields.  I was surprised to find how few
      references to ->tty there really were to fix up.  I'm sure there will be a
      few more fixups needed in non-x86 code.  The only actual testing of a
      running kernel with these patches I've done is on my normal minimal x86
      config.  Everything works fine as it did before as far as I can tell.
      
      One issue that may be of concern is the lack of any locking on multiple
      threads diddling these fields.  I don't think it really matters, though
      there might be some obscure races that could produce inconsistent job
      control results.  Nothing shattering, I'm sure; probably only something
      like a multi-threaded program calling setsid while its other threads do tty
      i/o, which never happens in reality.  This is the same situation we get by
      using ->group_leader->foo without other synchronization, which seemed to be
      the trend, and no one was worried about it.
  15. 21 Sep, 2003 3 commits
    • [PATCH] Handle init_new_context failures · 1cfc080a
      Andrew Morton authored
      From: Anton Blanchard <anton@samba.org>
      
      If init_new_context fails we definitely do not want to call mmput, because
      that will call destroy_context against an uninitialised context.  Instead
      we should back out what we did in init_mm.  Fixes some weird failures on
      ppc64 when running a fork bomb.
    • [PATCH] scheduler infrastructure · f221af36
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      The attached scheduler patch (against test2-mm2) adds the scheduling
      infrastructure items discussed on lkml.  I got good feedback - and while
      I don't expect it to solve all problems, it does solve a number of bad
      ones:
      
       - test_starve.c code from David Mosberger
      
       - thud.c making the system unusable due to unfairness
      
       - fair/accurate sleep average based on a finegrained clock
      
       - audio skipping way too easily
      
      other changes in sched-test2-mm2-A3:
      
       - ia64 sched_clock() code, from David Mosberger.
      
       - migration thread startup without relying on implicit scheduling
         behavior. The current 2.6 code is correct (due to the cpu-up code
         adding CPUs one by one), but it's also fragile - and this code cannot
         be carried over into the 2.4 backports. So adding this method would
         clean up the startup and would make it easier to do 2.4 backports.
      
      and here's the original changelog for the scheduler changes:
      
       - cycle accuracy (nanosec resolution) timekeeping within the scheduler.
         This fixes a number of audio artifacts (skipping) I've reproduced. I
         don't think we can get away without going cycle accurate - reading the
         cycle counter adds some overhead, but it's acceptable. The first
         nanosec-accuracy patch was done by Mike Galbraith - this patch is
         different but similar in nature. I went further in also changing the
         sleep_avg to be of nanosec resolution.
      
       - more finegrained timeslices: there's now a timeslice 'sub unit' of 50
         usecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level
         will roundrobin with this unit. This change is intended to make gaming
         latencies shorter.
      
       - include scheduling latency in the sleep bonus calculation. This change
         extends the sleep-average calculation to the period of time a task
         spends on the runqueue but doesn't get scheduled yet, right after
         wakeup. Note that tasks that were preempted (i.e. not woken up) and are
         still on the runqueue do not get this benefit. This change closes one
         of the last holes in the dynamic priority estimation; it should result
         in interactive tasks getting more priority under heavy load. This
         change also fixes the test-starve.c testcase from David Mosberger.
      
      
      The TSC-based scheduler clock is disabled on ia32 NUMA platforms (i.e.
      platforms that have unsynched TSCs for sure).  Those platforms should
      provide the proper code to rely on the TSC in a global way.  (No such
      infrastructure exists at the moment - the monotonic TSC-based clock
      doesn't deal with TSC offsets either, as far as I can tell.)
    • [PATCH] Fix setpgid and threads · feaecce4
      Andrew Morton authored
      From: Jeremy Fitzhardinge <jeremy@goop.org>
      
      I'm resending my patch to fix this problem.  To recap: every task_struct
      has its own copy of the thread group's pgrp.  Only the thread group
      leader is allowed to change the tgrp's pgrp, but it only updates its own
      copy of pgrp, while all the other threads in the tgrp use the old value
      they inherited on creation.
      
      This patch simply updates every other thread's pgrp when the tgrp leader
      changes pgrp.  Ulrich has already expressed reservations about
      this patch since it is (1) incomplete (it doesn't cover the case of
      other ids which have similar problems), (2) racy (it doesn't synchronize
      with other threads looking at the task pgrp, so they could see an
      inconsistent view) and (3) slow (it takes linear time with respect to
      the number of threads in the tgrp).
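
      The fix is essentially a walk over the thread group (a sketch; locking
      omitted):

      /* Sketch: when the group leader changes pgrp, copy the new value into
       * every thread of the group so no stale inherited copies remain. */
      struct task_struct *t = p;

      do {
      	t->pgrp = pgid;
      	t = next_thread(t);
      } while (t != p);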
      
      My reaction is that regarding (1), it fixes the actual bug I'm
      encountering in a real program.  Regarding (2), it doesn't really matter
      for pgrp, since it is mostly an issue with respect to the terminal
      job-control code (which is even more broken without this patch).
      Regarding (3), I think there are very few programs which have a large
      number of threads and change process group id on a regular basis (a
      heavily multi-threaded job-control shell?).
      
      Ulrich also said he has a (proposed?) much better fix, which I've been
      looking forward to.  I'm submitting this patch as a stop-gap fix for a
      real bug, and perhaps to prompt the improved patch.
      
      An alternative fix, at least for pgrp, is to change all references to
      ->pgrp to group_leader->pgrp.  This may be sufficient on its own, but it
      would be a reasonably intrusive patch (I count 95 instances in 32 files
      in the 2.6.0-test3-mm3 tree).
  16. 31 Aug, 2003 1 commit
    • [PATCH] add context switch counters · a776ac8d
      Andrew Morton authored
      From: Peter Chubb <peterc@gelato.unsw.edu.au>
      
      Currently, the context switch counters reported by getrusage() are
      always zero.  The appended patch adds fields to struct task_struct to
      count context switches, and adds code to do the counting.
      
      The patch adds 4 longs to struct task_struct, and a single addition to
      the fast path in schedule().
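
      The counting is the classic voluntary/involuntary split in schedule()
      (a sketch; the field names mirror getrusage()'s ru_nvcsw/ru_nivcsw):

      /* Sketch of the schedule() addition: bump the voluntary counter when
       * the task is blocking, the involuntary one when it was preempted. */
      unsigned long *switch_count;

      switch_count = &prev->nivcsw;		/* preempted */
      if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))
      	switch_count = &prev->nvcsw;	/* blocking voluntarily */

      /* ... pick the next task ... */

      if (likely(prev != next))
      	(*switch_count)++;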
  17. 20 Aug, 2003 1 commit
    • [PATCH] fix /proc mm_struct refcounting bug · 7d33101c
      Andrew Morton authored
      From: Suparna Bhattacharya <suparna@in.ibm.com>
      
      The /proc code's bare atomic_inc(&mm->mm_users) is racy against __exit_mm()'s
      mmput() on another CPU: it calls mmput() outside task_lock(tsk), and
      task_lock() isn't appropriate locking anyway.
      
      So what happens is:
      
      	CPU0			          CPU1
      
            mmput()
            ->atomic_dec_and_lock(mm->mm_users)
                                                atomic_inc(mm->mm_users)
            ->list_del(mm->mmlist)
                                                mmput()
                                                ->atomic_dec_and_lock(mm->mm_users)
                                                ->list_del(mm->mmlist)
      
      And the double list_del() of course goes splat.
      
      So we use mmlist_lock to synchronise these steps.
      
      The patch implements a new mmgrab() routine which increments mm_users only if
      the mm isn't already going away.  Changes get_task_mm() and proc_pid_stat()
      to call mmgrab() instead of a direct atomic_inc(&mm->mm_users).
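
      mmgrab() then looks roughly like this (a sketch following the
      description above):

      /* Sketch: pin the mm only if it isn't already on its way out; taking
       * mmlist_lock serializes this against mmput()'s list_del. */
      struct mm_struct *mmgrab(struct mm_struct *mm)
      {
      	spin_lock(&mmlist_lock);
      	if (!atomic_read(&mm->mm_users))
      		mm = NULL;		/* already going away */
      	else
      		atomic_inc(&mm->mm_users);
      	spin_unlock(&mmlist_lock);
      	return mm;
      }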
      
      Hugh, there's some cruft in swapoff which looks like it should be using
      mmgrab()...
  18. 18 Aug, 2003 1 commit
    • [PATCH] cpumask_t: allow more than BITS_PER_LONG CPUs · bf8cb61f
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      Contributions from:
      	Jan Dittmer <jdittmer@sfhq.hn.org>
      	Arnd Bergmann <arnd@arndb.de>
      	"Bryan O'Sullivan" <bos@serpentine.com>
      	"David S. Miller" <davem@redhat.com>
      	Badari Pulavarty <pbadari@us.ibm.com>
      	"Martin J. Bligh" <mbligh@aracnet.com>
      	Zwane Mwaikambo <zwane@linuxpower.ca>
      
      It has been tested on x86, sparc64, x86_64, ia64 (I think), ppc and ppc64.
      
      cpumask_t enables systems with NR_CPUS > BITS_PER_LONG to utilize all their
      cpus by creating an abstract data type dedicated to representing cpu
      bitmasks, similar to fd sets from userspace, and sweeping the appropriate
      code to update callers to the access API.  The fd set-like structure is
      according to Linus' own suggestion; the macro calling convention to ambiguate
      representations with minimal code impact is my own invention.
      
      Specifically, a new set of inline functions for manipulating arbitrary-width
      bitmaps is introduced with a relatively simple implementation, in tandem with
      a new data type representing bitmaps of width NR_CPUS, cpumask_t, whose
      accessor functions are defined in terms of the bitmap manipulation inlines.
      This bitmap ADT found an additional use in i386 arch code handling sparse
      physical APIC ID's, which was convenient to use in this case as the
      accounting structure was required to be wider to accommodate the physids
      consumed by larger numbers of cpus.
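
      In simplified form the ADT is (a sketch; the real accessors also cover
      the call-by-reference variants and the scalar fallback described below):

      /* Sketch: a fixed-width bitmap wide enough for NR_CPUS, with accessors
       * wrapping the ordinary bitmap operations. */
      typedef struct {
      	unsigned long bits[(NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG];
      } cpumask_t;

      #define cpu_set(cpu, dst)	set_bit((cpu), (dst).bits)
      #define cpu_clear(cpu, dst)	clear_bit((cpu), (dst).bits)
      #define cpu_isset(cpu, mask)	test_bit((cpu), (mask).bits)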
      
      For the sake of simplicity and low code impact, these cpu bitmasks are passed
      primarily by value; however, an additional set of accessors along with an
      auxiliary data type with const call-by-reference semantics is provided to
      address performance concerns raised in connection with very large systems,
      such as SGI's larger models, where copying and call-by-value overhead would
      be prohibitive.  Few (if any) users of the call-by-reference API are
      immediately introduced.
      
      Also, in order to avoid calling convention overhead on architectures where
      structures are required to be passed by value, NR_CPUS <= BITS_PER_LONG is
      special-cased so that cpumask_t falls back to an unsigned long and the
      accessors perform the usual bit twiddling on unsigned longs as opposed to
      arrays thereof.  Audits were done with the structure overhead in-place,
      restoring this special-casing only afterward so as to ensure a more complete
      API conversion while undergoing the majority of its end-user exposure in -mm.
      More -mm's were shipped after its restoration to be sure that was tested,
      too.
      
      The immediate users of this functionality are Sun sparc64 systems, SGI mips64
      and ia64 systems, and IBM ia32, ppc64, and s390 systems.  Of these, only the
      ppc64 machines needing the functionality have yet to be released; all others
      have had systems requiring it for full functionality for at least 6 months,
      and in some cases, since the initial Linux port to the affected architecture.
  19. 18 Jul, 2003 3 commits
    • [PATCH] CLONE_STOPPED · 074127b5
      Andrew Morton authored
      From: Ulrich Drepper <drepper@redhat.com>
      
      CLONE_STOPPED: start a thread in a stopped state.  Required for NPTL.
    • [PATCH] Fix two bugs with process limits (RLIMIT_NPROC) · 909cc4ae
      Andrew Morton authored
      From: Neil Brown <neilb@cse.unsw.edu.au>
      
      1/ If a setuid process swaps its real and effective uids and then forks,
       the fork fails if the new realuid has more processes
       than the original process was limited to.
       This is particularly a problem if a user with a process limit
       (e.g. 256) runs a setuid-root program which does setuid() + fork()
       (e.g. lprng) while root already has more than 256 processes (which
       is quite possible).
      
       The root problem here is that a limit which should be a per-user
       limit is being implemented as a per-process limit with
       per-process (e.g. CAP_SYS_RESOURCE) controls.
       Being a per-user limit, it should be that the root-user can over-ride
       it, not just some process with CAP_SYS_RESOURCE.
      
       This patch adds a test to ignore process limits if the real user is root.
      
      2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then
        calls setuid, the setuid should fail if the user would then be running
        more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't.  This patch
        adds an appropriate test.  With this patch, a per-user process limit
        imposed in cgiwrap really works.  (Both checks are sketched below.)
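
      Sketches of the two tests (shapes assumed; the real patch's capability
      checks may differ):

      /* 1/ in copy_process(): enforce the per-user limit, but let a root
       *    real-uid override it. */
      if (atomic_read(&p->user->processes) >= p->rlim[RLIMIT_NPROC].rlim_cur
          && p->user != &root_user && !capable(CAP_SYS_RESOURCE))
      	goto bad_fork_free;

      /* 2/ in set_user(), called from setuid(): refuse to switch to a user
       *    already running more processes than the limit allows. */
      if (atomic_read(&new_user->processes) >=
          current->rlim[RLIMIT_NPROC].rlim_cur)
      	return -EAGAIN;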
    • [PATCH] remove task_cache entirely · 4da99f75
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      kernel/fork.c contains a disabled cache for task structures.  Task
      structures are placed into the task cache only if "tsk==current", and
      "tsk==current" is impossible.  There is even a WARN_ON against that in
      __put_task_struct().
      
      So remove it entirely - it's dead code.
      
      One problem is that order-1 allocations are not cached per-cpu - we can
      use kmalloc for the stack.