  1. 29 Sep, 2003 1 commit
  2. 21 Sep, 2003 5 commits
    • [PATCH] might_sleep diagnostics · d6dbfa23
      Andrew Morton authored
      might_sleep() can be triggered by either local interrupts being disabled or
      by elevated preempt count.  Disambiguate them.
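
       A minimal sketch of what the disambiguated check might look like, assuming
       the in_atomic() and irqs_disabled() helpers of the time (the message text is
       illustrative, not the exact patch):

               void __might_sleep(char *file, int line)
               {
                       if (in_atomic() || irqs_disabled()) {
                               printk(KERN_ERR "Debug: sleeping function called from invalid"
                                      " context at %s:%d\n", file, line);
                               printk("in_atomic(): %d, irqs_disabled(): %d\n",
                                      in_atomic(), irqs_disabled());
                               dump_stack();
                       }
               }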
    • [PATCH] CPU scheduler interactivity changes · 2cf13d58
      Andrew Morton authored
      From: Con Kolivas <kernel@kolivas.org>
      
      Interactivity scheduler tweaks on top of Ingo's A3 interactivity patch.
      
      Interactive credit added to task struct to find truly interactive tasks and
      treat them differently.
      
      Extra #defines included as helpers for conversion to/from nanosecond timing,
      to work out an average timeslice for nice 0 tasks, and the effective dynamic
      priority bonuses that will be given to tasks.
      
      MAX_SLEEP_AVG modified to change dynamic priority by one for a nice 0 task
      sleeping or running for one full timeslice.
      
      CREDIT_LIMIT is the number of times a task earns sleep_avg over MAX_SLEEP_AVG
      before it is considered HIGH_CREDIT (truly interactive); and -CREDIT_LIMIT is
       LOW_CREDIT.
      
       TIMESLICE_GRANULARITY is modified to be more frequent for more
       interactive tasks (10 ms for the top 2 dynamic priorities, then halving
       for each priority below that) and less frequent per extra cpu.
      
       JUST_INTERACTIVE_SLEEP logic created: the sleep_avg consistent with giving
       a task just enough dynamic priority to remain on the active array.
      
      Task preemption of equal priority tasks is dropped as requeuing with
       TIMESLICE_GRANULARITY makes this unnecessary.
      
      Dynamic priority bonus simplified.
      
       User tasks that sleep a long time and are not waking from uninterruptible
       sleep are sought and categorised as idle. Their sleep_avg is limited in its
       rise to prevent them becoming high priority and suddenly turning into cpu
       hogs.
      
      Bonus for sleeping is proportionately higher the lower the dynamic priority of
      a task is; this allows for very rapid escalation to interactive status.
      
      Tasks that are LOW_CREDIT are limited in rise per sleep to one priority level.
      
      Non HIGH_CREDIT tasks waking from uninterruptible sleep are sought to detect
      cpu hogs waiting on I/O and their sleep_avg rise is limited to just
      interactive state to prevent cpu bound tasks from becoming interactive during
      I/O wait.
      
      Tasks that earn sleep_avg over MAX_SLEEP_AVG get interactive credits.
      
       The on-runqueue bonus is not given to non-HIGH_CREDIT tasks waking from
       uninterruptible sleep.
      
       Forked tasks and their parents get sleep_avg limited to the minimum necessary
       to maintain their effective dynamic priority, thus preventing repeated forking
       from being a way to become highly interactive while not penalising them
       noticeably otherwise.
      
      CAN_MIGRATE_TASK cleaned up and modified to work with nanosecond timestamps.
      
      Reverted Ingo's A3 Starvation limit change - it was making interactive tasks
      suffer more under increasing load. If a cpu is grossly overloaded and
      everyone is going to starve it may as well run interactive tasks
      preferentially.
      
       Task requeuing is limited to interactive tasks only (cpu bound tasks don't need
      low latency and derive benefit from longer timeslices), and they must have at
      least TIMESLICE_GRANULARITY remaining.
      
      HIGH_CREDIT tasks get penalised less sleep_avg the more interactive they are
      thus keeping them interactive for bursts but if they become sustained cpu hogs
      they will slide increasingly rapidly down the dynamic priority scale.
      
       Tasks that run out of sleep_avg, are still using up cpu time, and are not yet
       high or low credit get penalised interactive credits, to determine LOW_CREDIT
       tasks (cpu bound ones).
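
       As a rough illustration of the credit scheme described above, the
       classification could be expressed as a pair of helpers keyed off a per-task
       interactive_credit counter (names follow the changelog; treat the exact
       threshold value as an assumption):

               #define CREDIT_LIMIT    100

               /* earned sleep_avg over MAX_SLEEP_AVG many times: truly interactive */
               #define HIGH_CREDIT(p)  ((p)->interactive_credit > CREDIT_LIMIT)

               /* repeatedly burned full timeslices with no sleep_avg: cpu bound */
               #define LOW_CREDIT(p)   ((p)->interactive_credit < -CREDIT_LIMIT)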
    • [PATCH] CPU scheduler balancing fix · 875ee1e1
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      The patch changes the imbalance required before a balance to 25% from 50% -
      as the comments intend.  It also changes a case where the balancing
      wouldn't be done if the imbalance was >= 25% but only 1 task difference.
      
      The downside of the second change is that one task may bounce from one cpu
      to another for some loads.  This will only bounce once every 200ms, so it
      shouldn't be a big problem.
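
       In find_busiest_queue() terms the new threshold amounts to something like
       the following sketch (variable names illustrative, not the actual diff):

               imbalance = max_load - this_load;
               /* balance once the busiest queue is ~25% ahead, even if the
                * difference is only a single task */
               if (imbalance * 4 < max_load)
                       goto out_balanced;
               imbalance /= 2;         /* pull roughly half the difference */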
      
      (Benchmarking results are basically a wash - SDET is increased maybe 0.5%)
    • [PATCH] scheduler infrastructure · f221af36
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
       The attached scheduler patch (against test2-mm2) adds the scheduling
       infrastructure items discussed on lkml. I got good feedback - and while I
       don't expect it to solve all problems, it does solve a number of bad ones:
      
       - test_starve.c code from David Mosberger
      
        - thud.c making the system unusable due to unfairness
      
       - fair/accurate sleep average based on a finegrained clock
      
       - audio skipping way too easily
      
      other changes in sched-test2-mm2-A3:
      
       - ia64 sched_clock() code, from David Mosberger.
      
       - migration thread startup without relying on implicit scheduling
          behavior. While the current 2.6 code is correct (due to the cpu-up code
          adding CPUs one by one), it's also fragile - and this code cannot
         be carried over into the 2.4 backports. So adding this method would
         clean up the startup and would make it easier to have 2.4 backports.
      
      and here's the original changelog for the scheduler changes:
      
       - cycle accuracy (nanosec resolution) timekeeping within the scheduler.
          This fixes a number of audio artifacts (skipping) I've reproduced. I
          don't think we can get away without going to cycle accuracy - reading the
         cycle counter adds some overhead, but it's acceptable. The first
         nanosec-accuracy patch was done by Mike Galbraith - this patch is
         different but similar in nature. I went further in also changing the
         sleep_avg to be of nanosec resolution.
      
       - more finegrained timeslices: there's now a timeslice 'sub unit' of 50
         usecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level
         will roundrobin with this unit. This change is intended to make gaming
         latencies shorter.
      
       - include scheduling latency in sleep bonus calculation. This change
         extends the sleep-average calculation to the period of time a task
          spends on the runqueue but doesn't get scheduled yet, right after
          wakeup. Note that tasks that were preempted (ie. not woken up) and are
          still on the runqueue do not get this benefit. This change closes one
          of the last holes in the dynamic priority estimation; it should result
         in interactive tasks getting more priority under heavy load. This
         change also fixes the test-starve.c testcase from David Mosberger.
      
      
      The TSC-based scheduler clock is disabled on ia32 NUMA platforms.  (ie. 
      platforms that have unsynched TSC for sure.) Those platforms should provide
      the proper code to rely on the TSC in a global way.  (no such infrastructure
       exists at the moment - the monotonic TSC-based clock doesn't deal with TSC
       offsets either, as far as I can tell.)
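
       For reference, the nanosecond clock this builds on has roughly this shape on
       i386 (a sketch; the cycles-to-nanoseconds scaling factor is assumed to be
       set up from the measured CPU frequency at boot):

               unsigned long long sched_clock(void)
               {
                       unsigned long long cycles;

                       rdtscll(cycles);        /* read the TSC cycle counter */

                       /* scale cycles to nanoseconds */
                       return cycles * cyc2ns_scale >> CYC2NS_SHIFT;
               }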
    • [PATCH] real-time enhanced page allocator and throttling · 55b50278
      Andrew Morton authored
      From: Robert Love <rml@tech9.net>
      
      - Let real-time tasks dip further into the reserves than usual in
        __alloc_pages().  There are a lot of ways to special case this.  This
        patch just cuts z->pages_low in half, before doing the incremental min
        thing, for real-time tasks.  I do not do anything in the low memory slow
        path.  We can be a _lot_ more aggressive if we want.  Right now, we just
        give real-time tasks a little help.
      
      - Never ever call balance_dirty_pages() on a real-time task.  Where and
        how exactly we handle this is up for debate.  We could, for example,
        special case real-time tasks inside balance_dirty_pages().  This would
        allow us to perform some of the work (say, waking up pdflush) but not
        other work (say, the active throttling).  As it stands now, we do the
        per-processor accounting in balance_dirty_pages_ratelimited() but we
        never call balance_dirty_pages().  Lots of approaches work.  What we want
        to do is never engage the real-time task in forced writeback.
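
       A minimal sketch of the first change, against the 2.6-era zone loop in
       __alloc_pages() (surrounding code elided; only the real-time special case is
       shown):

               unsigned long min = z->pages_low;

               /* real-time tasks may dip further into the reserves */
               if (rt_task(p))
                       min /= 2;

               /* ...then the usual incremental-min test against z->free_pages... */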
  3. 09 Sep, 2003 1 commit
  4. 08 Sep, 2003 1 commit
    • [power] Add support for refrigerator to the migration_thread. · a803561d
      Patrick Mochel authored
      - The PM code currently must signal each kernel thread when suspending, and
        each thread must call refrigerator() to stop itself. This patch adds 
        support for this to migration_thread, which allows suspend states to work
        on an SMP-enabled kernel (though not necessarily an SMP machine).
      
      - Note I do not know why the process freezing code was designed in such a 
        way. One would think we could do it without having to call each thread
         individually, and fix up the threads that need special work individually.
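
       A hedged sketch of what this looks like inside the migration_thread() loop
       (the PF_FREEZE flag and refrigerator() calling convention are as in the
       2.5/2.6 PM code of the time, from memory):

               for (;;) {
                       if (current->flags & PF_FREEZE)
                               refrigerator(PF_FREEZE);        /* park here during suspend */

                       spin_lock_irq(&rq->lock);
                       /* ... normal migration request handling ... */
               }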
  5. 31 Aug, 2003 1 commit
    • [PATCH] add context switch counters · a776ac8d
      Andrew Morton authored
      From: Peter Chubb <peterc@gelato.unsw.edu.au>
      
      Currently, the context switch counters reported by getrusage() are
      always zero.  The appended patch adds fields to struct task_struct to
      count context switches, and adds code to do the counting.
      
       The patch adds 4 longs to struct task_struct, and a single addition to
      the fast path in schedule().
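
       A sketch of the shape of the change, assuming the usual voluntary versus
       involuntary split reported by getrusage() (field names as used there):

               /* in struct task_struct: the four new counters */
               unsigned long nvcsw, nivcsw;    /* this task's vol/invol switches */
               unsigned long cnvcsw, cnivcsw;  /* accumulated from reaped children */

               /* in schedule(): decide which counter to bump for 'prev' */
               switch_count = &prev->nivcsw;
               if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))
                       switch_count = &prev->nvcsw;
               /* ... and once we actually switch away: ++*switch_count; */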
  6. 18 Aug, 2003 1 commit
    • [PATCH] cpumask_t: allow more than BITS_PER_LONG CPUs · bf8cb61f
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      Contributions from:
      	Jan Dittmer <jdittmer@sfhq.hn.org>
      	Arnd Bergmann <arnd@arndb.de>
      	"Bryan O'Sullivan" <bos@serpentine.com>
      	"David S. Miller" <davem@redhat.com>
      	Badari Pulavarty <pbadari@us.ibm.com>
      	"Martin J. Bligh" <mbligh@aracnet.com>
      	Zwane Mwaikambo <zwane@linuxpower.ca>
      
       It has been tested on x86, sparc64, x86_64, ia64 (I think), ppc and ppc64.
      
      cpumask_t enables systems with NR_CPUS > BITS_PER_LONG to utilize all their
      cpus by creating an abstract data type dedicated to representing cpu
      bitmasks, similar to fd sets from userspace, and sweeping the appropriate
      code to update callers to the access API.  The fd set-like structure is
      according to Linus' own suggestion; the macro calling convention to ambiguate
      representations with minimal code impact is my own invention.
      
      Specifically, a new set of inline functions for manipulating arbitrary-width
      bitmaps is introduced with a relatively simple implementation, in tandem with
      a new data type representing bitmaps of width NR_CPUS, cpumask_t, whose
      accessor functions are defined in terms of the bitmap manipulation inlines.
      This bitmap ADT found an additional use in i386 arch code handling sparse
      physical APIC ID's, which was convenient to use in this case as the
      accounting structure was required to be wider to accommodate the physids
      consumed by larger numbers of cpus.
      
      For the sake of simplicity and low code impact, these cpu bitmasks are passed
      primarily by value; however, an additional set of accessors along with an
      auxiliary data type with const call-by-reference semantics is provided to
      address performance concerns raised in connection with very large systems,
      such as SGI's larger models, where copying and call-by-value overhead would
      be prohibitive.  Few (if any) users of the call-by-reference API are
      immediately introduced.
      
      Also, in order to avoid calling convention overhead on architectures where
      structures are required to be passed by value, NR_CPUS <= BITS_PER_LONG is
      special-cased so that cpumask_t falls back to an unsigned long and the
      accessors perform the usual bit twiddling on unsigned longs as opposed to
      arrays thereof.  Audits were done with the structure overhead in-place,
      restoring this special-casing only afterward so as to ensure a more complete
      API conversion while undergoing the majority of its end-user exposure in -mm.
       More -mm's were shipped after its restoration to be sure that was tested,
      too.
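
       The NR_CPUS <= BITS_PER_LONG special case boils down to something like this
       simplified sketch (accessor spelling is illustrative, not the full API):

               #if NR_CPUS > BITS_PER_LONG
               typedef struct {
                       unsigned long mask[(NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG];
               } cpumask_t;
               #else
               typedef unsigned long cpumask_t;        /* plain bit twiddling */
               #endif

               /* callers stay representation-agnostic via accessors, e.g. */
               #define cpu_isset(cpu, map)     test_bit(cpu, cpus_addr(map))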
      
      The immediate users of this functionality are Sun sparc64 systems, SGI mips64
      and ia64 systems, and IBM ia32, ppc64, and s390 systems.  Of these, only the
      ppc64 machines needing the functionality have yet to be released; all others
      have had systems requiring it for full functionality for at least 6 months,
      and in some cases, since the initial Linux port to the affected architecture.
  7. 17 Aug, 2003 1 commit
  8. 14 Aug, 2003 1 commit
    • [PATCH] fix task struct refcount bug · 4c0d7322
      Andrew Morton authored
      From: Manfred Spraul <manfred@colorfullife.com>
      
      (We think this might be the mystery bug which has been hanging about for
      months)
      
      
      We found a [the?] task struct refcount error: A task that dies sets
      tsk->state to TASK_ZOMBIE.  The next scheduled task checks prev->state, and
      if it's ZOMBIE, then it decrements the reference count of prev.  The
       prev->state & _ZOMBIE test is not atomic with schedule, thus if prev is
       scheduled again and dies between dropping the runqueue lock and checking
       prev->state, then the reference is dropped twice.
      
      This is possible with either preemption [schedule_tail is called by
      ret_from_fork with preemption count 1, finish_arch_switch drops it to 0] or
      profiling [profile_exit_mmap can sleep on profile_rwsem, called by
      mmdrop()] enabled.
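
       In schedule() terms the race is roughly the following (a sketch for
       illustration; the fix makes the ZOMBIE test and the drop atomic with the
       context switch):

               /* after switching away from 'prev': */
               if (prev->state & TASK_ZOMBIE)
                       put_task_struct(prev);  /* if prev ran again and died on
                                                  another CPU inside this window,
                                                  the reference is dropped twice */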
  9. 21 Jul, 2003 1 commit
    • ISDN: Export "kstat" · 7ec85ce3
      Kai Germaschewski authored
      This patch exports the kstat per-cpu variable, needed for
      hisax, which uses kstat_irqs() during card probing to make sure
       that irqs actually work. This could possibly be replaced by a
      private counter in the hisax ISRs, but that's really just
      unnecessary overhead, since the core kernel already does the work
      anyway.
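
       The export itself is a one-liner; since kstat is a per-cpu variable it needs
       the per-cpu export form (the macro spelling here is an assumption):

               DEFINE_PER_CPU(struct kernel_stat, kstat);
               EXPORT_PER_CPU_SYMBOL(kstat);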
  10. 18 Jul, 2003 1 commit
  11. 10 Jul, 2003 1 commit
  12. 07 Jul, 2003 2 commits
    • [PATCH] switch_mm and enter_lazy_tlb: remove cpu arg · 8a6879c6
      Rusty Russell authored
      switch_mm and enter_lazy_tlb take a CPU arg, which is always
      smp_processor_id().  This is misleading, and pointless if they use
      per-cpu variables or other optimizations.  gcc will eliminate
      redundant smp_processor_id() (in inline functions) anyway.
      
      This removes that arg from all the architectures.
    • [PATCH] Make kstat_this_cpu in terms of __get_cpu_var and use it · b993be7e
      Rusty Russell authored
      kstat_this_cpu() is defined in terms of per_cpu instead of __get_cpu_var.
      
      This patch changes that, and uses it everywhere appropriate.  The sched.c
      change puts it in a local variable, which helps gcc generate better code.
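
       The resulting definition is essentially (sketch):

               /* before: spell out the per-cpu lookup by hand */
               #define kstat_this_cpu  per_cpu(kstat, smp_processor_id())

               /* after: use the dedicated current-cpu accessor */
               #define kstat_this_cpu  __get_cpu_var(kstat)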
  13. 06 Jul, 2003 1 commit
    • [PATCH] use task_cpu() not ->thread_info->cpu in sched.c · 8ffcb67a
      Andrew Morton authored
      From: Mikael Pettersson <mikpe@csd.uu.se>
      
      This patch fixes two p->thread_info->cpu occurrences in kernel/sched.c to
      use the task_cpu(p) macro instead, which is optimised on UP.  Although one
      of the occurrences is under #ifdef CONFIG_SMP, it's bad style to use the
      raw non-optimisable form in non-arch code.
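
       For reference, the accessor being preferred here is roughly the following;
       the UP variant collapses to a constant, which is what makes it optimisable
       (a sketch):

               #ifdef CONFIG_SMP
               #define task_cpu(p)     ((p)->thread_info->cpu)
               #else
               #define task_cpu(p)     (0)
               #endif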
  14. 01 Jul, 2003 1 commit
  15. 25 Jun, 2003 2 commits
    • [PATCH] normalise node load for NUMA · 325a2824
      Andrew Morton authored
      From: Andrew Theurer <habanero@us.ibm.com>
      
      This patch ensures that when node loads are compared, the load value is
      normalised.  Without this, load balance across nodes of dissimilar cpu
      counts can cause unfairness and sometimes lower overall performance.
      
      For example, a 2 node system with 4 cpus in the first node and 2 cpus in
      the second.  A workload with 6 running tasks would have 3 tasks running on
      one node and 3 on the other, leaving one cpu idle in the first node and two
      tasks sharing a cpu in the second node.  The patch would ensure that 4
      tasks run in the first node and 2 in the second.
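
       In other words, node load is compared per cpu rather than as a raw task
       count; a sketch of the normalisation (helper names illustrative):

               /* e.g. 3 tasks on a 4-cpu node -> 75, 3 tasks on a 2-cpu node -> 150,
                * so new tasks are placed on the 4-cpu node until the ratio evens out */
               load = (nr_running_on_node(node) * 100) / nr_cpus_node(node);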
      
      I ran some kernel compiles comparing this patch on a 2 node 4 cpu/2 cpu
      system to show the benefits.  Without the patch I got 140 second elapsed
      time.  With the patch I get 132 seconds (6% better).
      
      Although it is not very common to have nodes with dissimilar cpu counts, it
      is already happening.  PPC64 systems with partitioning have this happen,
      and I expect it to be more common on ia32 as partitioning becomes more
      common.
    • [PATCH] setscheduler needs to force a reschedule · 71c19018
      Andrew Morton authored
      From: Robert Love <rml@tech9.net>
      
      Basically, the problem is that setscheduler() does not set need_resched
      when needed.  There are two basic cases where this is needed:
      
      	- the task is running, but now it is no longer the highest
      	  priority task on the rq
      	- the task is not running, but now it is the highest
      	  priority task on the rq
      
      In either case, we need to set need_resched to invoke the scheduler.
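
       A hedged sketch of the two cases at the end of setscheduler(), once the new
       priority has been set (runqueue/array juggling omitted):

               if (task_running(rq, p)) {
                       /* running, but perhaps no longer the highest priority */
                       if (p->prio > oldprio)
                               resched_task(rq->curr);
               } else if (p->prio < rq->curr->prio) {
                       /* not running, but now beats the current task */
                       resched_task(rq->curr);
               }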
  16. 21 Jun, 2003 1 commit
    • [PATCH] More care in sys_setaffinity · 7cd3f199
      Rusty Russell authored
      We currently mask off offline CPUs in both set_cpus_allowed and
      sys_sched_setaffinity.  This is firstly redundant, and secondly
      erroneous when more CPUs come online (eg. setting affinity to all 1s
      should mean all CPUs, including future ones).
      
       We mask with cpu_online_map in sys_sched_getaffinity *anyway* (which
       is another issue, since that is not valid either once the set of online
       cpus changes), so userspace won't see any difference.
      
      This patch makes set_cpus_allowed() return -errno, and check that in
      sys_sched_setaffinity.
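
       The shape of the change, as a sketch using the pre-cpumask_t unsigned long
       masks of the time:

               int set_cpus_allowed(task_t *p, unsigned long new_mask)
               {
                       if (!(new_mask & cpu_online_map))
                               return -EINVAL;         /* nothing online in the mask */
                       /* ... migrate p off a disallowed cpu as before ... */
                       return 0;
               }

               /* sys_sched_setaffinity() then just propagates the error code */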
  17. 20 Jun, 2003 1 commit
    • [PATCH] show_stack() portability and cleanup patch · 0d5ff9d0
      Andrew Morton authored
      From: David Mosberger <davidm@napali.hpl.hp.com>
      
      This is an attempt at sanitizing the interface for stack trace dumping
      somewhat.  It's basically the last thing which prevents 2.5.x from working
      out-of-the-box for ia64.  ia64 apparently cannot reasonably implement the
      show_stack interface declared in sched.h.
      
      Here is the rationale: modern calling conventions don't maintain a frame
      pointer and it's not possible to get a reliable stack trace with only a stack
      pointer as the starting point.  You really need more machine state to start
      with.  For a while, I thought the solution is to pass a task pointer to
      show_stack(), but it turns out that this would negatively impact x86 because
      it's sometimes useful to show only portions of a stack trace (e.g., starting
      from the point at which a trap occurred).  Thus, this patch _adds_ the task
      pointer instead:
      
       extern void show_stack(struct task_struct *tsk, unsigned long *sp);
      
      The idea here is that show_stack(tsk, sp) will show the backtrace of task
      "tsk", starting from the stack frame that "sp" is pointing to.  If tsk is
      NULL, the trace will be for the current task.  If "sp" is NULL, all stack
      frames of the task are shown.  If both are NULL, you'll get the full trace of
      the current task.
      
      I _think_ this should make everyone happy.
      
      The patch also removes the declaration of show_trace() in linux/sched.h (it
      never was a generic function; some platforms, in particular x86, may want to
      update accordingly).
      
      Finally, the patch replaces the one call to show_trace_task() with the
      equivalent call show_stack(task, NULL).
      
      The patch below is for Alpha and i386, since I can (compile-)test those (I'll
      provide the ia64 update through my regular updates).  The other arches will
      break visibly and updating the code should be trivial:
      
      - add a task pointer argument to show_stack() and pass NULL as the first
        argument where needed
      
      - remove show_trace_task()
      
      - declare show_trace() in a platform-specific header file if you really
        want to keep it around
  18. 14 Jun, 2003 3 commits
    • [PATCH] NUMA fixes · 1d292c60
      Andrew Morton authored
      From: Anton Blanchard <anton@samba.org>
      
      
      Anton has been testing odd setups:
      
      /* node 0 - no cpus, no memory */
      /* node 1 - 1 cpu, no memory */
      /* node 2 - 0 cpus, 1GB memory */
      /* node 3 - 3 cpus, 3GB memory */
      
      Two things tripped so far.  Firstly the ppc64 debug check for invalid cpus
      in cpu_to_node().  Fix that in kernel/sched.c:node_nr_running_init().
      
      The other problem concerned nodes with memory but no cpus.  kswapd tries to
      set_cpus_allowed(0) and bad things happen.  So we only set cpu affinity
      for kswapd if there are cpus in the node.
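
       A sketch of the kswapd side of that fix (helper spelling as best recalled;
       the mask was still an unsigned long at this point):

               /* only pin kswapd to its node if the node actually has cpus */
               cpumask = node_to_cpumask(pgdat->node_id);
               if (cpumask)
                       set_cpus_allowed(tsk, cpumask);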
    • [PATCH] sched.c neatening and fixes. · 03540697
      Rusty Russell authored
      1) Fix the comments for the migration_thread.  A while back Ingo
         agreed they were exactly wrong, IIRC. 8).
      
      2) Changed spin_lock_irqsave to spin_lock_irq, since it's in a
         kernel thread.
      
      3) Don't repeat if the task has moved off the original CPU, just finish.
         This is because we are simply trying to push the task off this CPU:
         if it's already moved, great.  Currently we might theoretically move
         a task which is actually running on another CPU, which is v. bad.
      
      4) Replace the __ffs(p->cpus_allowed) with any_online_cpu(), since
         that's what it's for, and __ffs() can give the wrong answer, eg. if
         there's no CPU 0.
      
      5) Move the core functionality of migrate_task into a separate function,
         move_task_away, which I want for the hotplug CPU patch.
    • [PATCH] Nuke check_highmem_ptes() · f3d844bc
      Benjamin Herrenschmidt authored
      It was broken on at least ppc32 & sparc32, and the debugging it
      offered wasn't worth it any more anyway.
  19. 10 Jun, 2003 1 commit
    • [PATCH] fix scheduler bug not passing idle · dd1b5a41
      Andrew Morton authored
      From: "Martin J. Bligh" <mbligh@aracnet.com>
      
      rebalance_tick is not properly passing the idle argument through to
      load_balance in one case.  The fix is trivial.  Pointed out by John Hawkes.
  20. 06 Jun, 2003 2 commits
    • [PATCH] Move cpu notifiers et al to cpu.h · 542f238e
      Rusty Russell authored
      Trivial patch: when these were introduced cpu.h didn't exist.
    • [PATCH] Don't let processes be scheduled on CPU-less nodes (3/3) · 946ac12e
      Andrew Morton authored
      From: Matthew Dobson <colpatch@us.ibm.com>
      
      This patch implements a generic version of the nr_cpus_node(node) macro
      implemented for ppc64 by the previous patch.
      
       The generic version simply computes an hweight of the bitmask returned by
       the node_to_cpumask(node) topology macro.
      
      This patch also adds a generic_hweight64() function and an hweight_long()
      function which are used as helpers for the generic nr_cpus_node() macro.
      
      This patch also adds a for_each_node_with_cpus() macro, which is used in
      sched_best_cpu() in kernel/sched.c to fix the original problem of
      scheduling processes on CPU-less nodes.  This macro should also be used in
      the future to avoid similar problems.
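
       A sketch of the generic helpers described above (the iteration bound
       spelling is an assumption):

               /* number of cpus in a node: popcount of its cpu bitmask */
               #define nr_cpus_node(node)      hweight_long(node_to_cpumask(node))

               /* visit only nodes that actually have cpus */
               #define for_each_node_with_cpus(node)                   \
                       for (node = 0; node < numnodes; node++)         \
                               if (nr_cpus_node(node))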
      
      Test compiled and booted by Andrew Theurer (habanero@us.ibm.com) on both
      x440 and ppc64.
  21. 27 May, 2003 1 commit
  22. 26 May, 2003 1 commit
    • [PATCH] signal latency improvement · ee2f48bc
      Ingo Molnar authored
      This further optimizes the 'kick wakeup' scheduler feature:
      
       - do not kick any CPU on UP
      
       - no need to mark the target task for reschedule - it's enough to send an
         interrupt to that CPU, that will initiate a signal processing pass.
  23. 19 May, 2003 4 commits
    • [PATCH] Fix lost scheduler rebalances · e7778aa6
      Ingo Molnar authored
      This fixes a race noticed by Mike Galbraith: the scheduler can lose a
      rebalance tick if some task happens to not be rescheduled in time.  This
      is not a fatal condition, but an inconsistency nevertheless.
    • [PATCH] sync wakeup on UP · 84205d05
      Ingo Molnar authored
      This fixes the scheduler's sync-wakeup code to be consistent on UP as
      well.
      
       Right now there's a behavioral difference between a UP kernel and an
      SMP kernel running on a UP box: sync wakeups (which are only activated
      on SMP) can cause a wakeup of a higher prio task, without preemption.
      On UP kernels this does not happen.  This difference in wakeup behavior
      is bad.
      
      This patch activates sync wakeups on UP as well - in the cases sync
      wakeups are done the waker knows that it will schedule away soon, so
      this 'delay preemption' decision is correct on UP as well.
    • [PATCH] scheduler cleanup · d1347e18
      Ingo Molnar authored
      This removes the unused requeueing code.
    • [PATCH] signal latency fixes · 79e4dd94
      Ingo Molnar authored
       This fixes an SMP window where the kernel could miss handling a signal
       and increase signal delivery latency by up to 200 msecs.  Sun has reported
      to Ulrich that their JVM sees occasional unexpected signal delays under
      Linux.  The more CPUs, the more delays.
      
      The cause of the problem is that the current signal wakeup
      implementation is racy in kernel/signal.c:signal_wake_up():
      
              if (t->state == TASK_RUNNING)
                      kick_if_running(t);
      	...
              if (t->state & mask) {
                      wake_up_process(t);
                      return;
              }
      
       If thread (or process) 't' is woken up on another CPU right after the
       TASK_RUNNING check, and the thread starts to run, then the wake_up_process()
       here will do nothing, and the signal stays pending until the thread next
       calls into the kernel - which can be up to 200 msecs later.
      
       The solution is to do the 'kicking' of a running thread on a remote CPU
       atomically with the wakeup.  For this I've added wake_up_process_kick().
       There is no slowdown for the other wakeup codepaths; the new flag to
       try_to_wake_up() is compiled off for them.  Some other subsystems might
      want to use this wakeup facility as well in the future (eg.  AIO).
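
       The fixed wakeup path in signal_wake_up() then becomes, roughly (a sketch;
       wake_up_process_kick() performs the wakeup and the cross-CPU kick while
       holding the runqueue lock, so the two can no longer race):

               if (t->state & mask) {
                       wake_up_process_kick(t);
                       return;
               }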
      
       In fact this race triggers quite often under Volanomark runs. With this
       change added, Volanomark performance is up from 500-800 to 2000-3000 on
       a 4-way x86 box.
  24. 12 May, 2003 1 commit
  25. 21 Apr, 2003 1 commit
    • [PATCH] trivial task_prio() fix · 7957f703
      Robert Love authored
      Here is a trivial fix for task_prio() in the case MAX_RT_PRIO !=
      MAX_USER_RT_PRIO.  In this case, all priorities are skewed by
      (MAX_RT_PRIO - MAX_USER_RT_PRIO).
      
      The fix is to subtract the full MAX_RT_PRIO value from p->prio, not just
      MAX_USER_RT_PRIO.  This makes sense, as the full priority range is
      unrelated to the maximum user value.  Only the real maximum RT value
      matters.
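
       In other words, the fix is simply (sketch):

               int task_prio(task_t *p)
               {
                       /* offset by the full RT range, not just the user-visible one */
                       return p->prio - MAX_RT_PRIO;
               }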
      
       This has been in Andrew's tree for a while, with no issue.  Also, Ingo
       acked it.
  26. 20 Apr, 2003 1 commit
    • [PATCH] Turn on NUMA rebalancing · 26fbf90f
      Andrew Morton authored
      From: "Martin J. Bligh" <mbligh@aracnet.com>
      
      I'd forgotten that I'd set this to only fire every 20s in the past, because
       it would rebalance too aggressively.  That seems to be fixed now, so we should
      turn it back on.
  27. 12 Apr, 2003 1 commit
  28. 08 Apr, 2003 1 commit