- 29 Sep, 2003 1 commit

Arnaldo Carvalho de Melo authored

- 21 Sep, 2003 5 commits

Andrew Morton authored
might_sleep() can be triggered by either local interrupts being disabled or by elevated preempt count. Disambiguate them.
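
For illustration, a sketch of the kind of disambiguated report described above (simplified, and not the actual kernel/sched.c hunk; the file/line parameters are assumed to come from a __might_sleep()-style checker):

    /* report which condition makes sleeping illegal at this call site */
    void might_sleep_report(const char *file, int line)
    {
            if (irqs_disabled())
                    printk(KERN_ERR "sleeping function called with irqs disabled "
                           "at %s:%d\n", file, line);
            else if (preempt_count())
                    printk(KERN_ERR "sleeping function called with preempt_count %x "
                           "at %s:%d\n", preempt_count(), file, line);
    }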

Andrew Morton authored
From: Con Kolivas <kernel@kolivas.org>

Interactivity scheduler tweaks on top of Ingo's A3 interactivity patch.

- Interactive credit added to the task struct to find truly interactive tasks and treat them differently.
- Extra #defines included as helpers for conversion to/from nanosecond timing, to work out an average timeslice for nice 0 tasks, and the effective dynamic priority bonuses that will be given to tasks.
- MAX_SLEEP_AVG modified to change dynamic priority by one for a nice 0 task sleeping or running for one full timeslice.
- CREDIT_LIMIT is the number of times a task earns sleep_avg over MAX_SLEEP_AVG before it is considered HIGH_CREDIT (truly interactive); -CREDIT_LIMIT is LOW_CREDIT.
- TIMESLICE_GRANULARITY is modified to be more frequent for more interactive tasks (10 ms for the top 2 dynamic priorities, halving for each priority below that) and less frequent per extra cpu.
- JUST_INTERACTIVE_SLEEP logic created to be a sleep_avg consistent with giving a task enough dynamic priority to remain on the active array.
- Task preemption of equal priority tasks is dropped, as requeuing with TIMESLICE_GRANULARITY makes this unnecessary.
- Dynamic priority bonus simplified.
- User tasks that sleep a long time and are not waking from uninterruptible sleep are sought and categorised as idle. Their sleep_avg is limited in its rise to prevent them becoming high priority and suddenly turning into cpu hogs.
- Bonus for sleeping is proportionately higher the lower the dynamic priority of a task is; this allows very rapid escalation to interactive status.
- Tasks that are LOW_CREDIT are limited in their rise per sleep to one priority level.
- Non-HIGH_CREDIT tasks waking from uninterruptible sleep are sought to detect cpu hogs waiting on I/O, and their sleep_avg rise is limited to just-interactive state to prevent cpu-bound tasks from becoming interactive during I/O wait.
- Tasks that earn sleep_avg over MAX_SLEEP_AVG get interactive credits.
- On-runqueue bonus is not given to non-HIGH_CREDIT tasks waking from uninterruptible sleep.
- Forked tasks and their parents get sleep_avg limited to the minimum necessary to maintain their effective dynamic priority, thus preventing repeated forking from being a way to become highly interactive, without penalising them noticeably otherwise.
- CAN_MIGRATE_TASK cleaned up and modified to work with nanosecond timestamps.
- Reverted Ingo's A3 starvation limit change - it was making interactive tasks suffer more under increasing load. If a cpu is grossly overloaded and everyone is going to starve, it may as well run interactive tasks preferentially.
- Task requeuing is limited to interactive tasks only (cpu-bound tasks don't need low latency and benefit from longer timeslices), and they must have at least TIMESLICE_GRANULARITY remaining.
- HIGH_CREDIT tasks are penalised less sleep_avg the more interactive they are, keeping them interactive for bursts; if they become sustained cpu hogs they will slide increasingly rapidly down the dynamic priority scale.
- Tasks that run out of sleep_avg, are still using up cpu time and are not yet high or low credit get penalised interactive credits, to determine LOW_CREDIT (cpu-bound) tasks.
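
A sketch of the credit bookkeeping described above (the limit value and exact expressions are assumptions for illustration, not the patch itself):

    #define CREDIT_LIMIT    100     /* illustrative value */

    #define HIGH_CREDIT(p)  ((p)->interactive_credit >  CREDIT_LIMIT)
    #define LOW_CREDIT(p)   ((p)->interactive_credit < -CREDIT_LIMIT)

    /* earning credit: each time sleep_avg would exceed MAX_SLEEP_AVG */
    if (p->sleep_avg >= MAX_SLEEP_AVG && !HIGH_CREDIT(p))
            p->interactive_credit++;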

Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> The patch changes the imbalance required before a balance to 25% from 50% - as the comments intend. It also changes a case where the balancing wouldn't be done if the imbalance was >= 25% but only 1 task difference. The downside of the second change is that one task may bounce from one cpu to another for some loads. This will only bounce once every 200ms, so it shouldn't be a big problem. (Benchmarking results are basically a wash - SDET is increased maybe 0.5%)

Andrew Morton authored
From: Ingo Molnar <mingo@elte.hu>

The attached scheduler patch (against test2-mm2) adds the scheduling infrastructure items discussed on lkml. I got good feedback - and while I don't expect it to solve all problems, it does solve a number of bad ones:
- test_starve.c code from David Mosberger
- thud.c making the system unusable due to unfairness
- fair/accurate sleep average based on a finegrained clock
- audio skipping way too easily

Other changes in sched-test2-mm2-A3:
- ia64 sched_clock() code, from David Mosberger.
- migration thread startup without relying on implicit scheduling behavior. The current 2.6 code is correct (due to the cpu-up code adding CPUs one by one), but it's also fragile - and this code cannot be carried over into the 2.4 backports. So adding this method would clean up the startup and make it easier to have 2.4 backports.

And here's the original changelog for the scheduler changes:
- cycle accuracy (nanosec resolution) timekeeping within the scheduler. This fixes a number of audio artifacts (skipping) I've reproduced. I don't think we can get away without going cycle accuracy - reading the cycle counter adds some overhead, but it's acceptable. The first nanosec-accuracy patch was done by Mike Galbraith - this patch is different but similar in nature. I went further in also changing the sleep_avg to be of nanosec resolution.
- more finegrained timeslices: there's now a timeslice 'sub unit' of 50 usecs (TIMESLICE_GRANULARITY) - CPU hogs on the same priority level will roundrobin with this unit. This change is intended to make gaming latencies shorter.
- include scheduling latency in the sleep bonus calculation. This change extends the sleep-average calculation to the period of time a task spends on the runqueue but doesn't get scheduled yet, right after wakeup. Note that tasks that were preempted (ie. not woken up) and are still on the runqueue do not get this benefit. This change closes one of the last holes in the dynamic priority estimation; it should result in interactive tasks getting more priority under heavy load. This change also fixes the test-starve.c testcase from David Mosberger.

The TSC-based scheduler clock is disabled on ia32 NUMA platforms (ie. platforms that have unsynched TSCs for sure). Those platforms should provide the proper code to rely on the TSC in a global way. (No such infrastructure exists at the moment - the monotonic TSC-based clock doesn't deal with TSC offsets either, as far as I can tell.)
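
A rough sketch of the nanosecond-resolution bookkeeping this describes (field and constant names are assumptions; sched_clock() returning nanoseconds is taken from the text above):

    unsigned long long now = sched_clock();             /* ns since boot */
    unsigned long long slept = now - p->timestamp;      /* sleep + runqueue wait */

    p->sleep_avg += slept;
    if (p->sleep_avg > NS_MAX_SLEEP_AVG)                 /* assumed cap */
            p->sleep_avg = NS_MAX_SLEEP_AVG;
    p->timestamp = now;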

Andrew Morton authored
From: Robert Love <rml@tech9.net>

- Let real-time tasks dip further into the reserves than usual in __alloc_pages(). There are a lot of ways to special case this. This patch just cuts z->pages_low in half, before doing the incremental min thing, for real-time tasks. I do not do anything in the low memory slow path. We can be a _lot_ more aggressive if we want. Right now, we just give real-time tasks a little help.
- Never ever call balance_dirty_pages() on a real-time task. Where and how exactly we handle this is up for debate. We could, for example, special case real-time tasks inside balance_dirty_pages(). This would allow us to perform some of the work (say, waking up pdflush) but not other work (say, the active throttling). As it stands now, we do the per-processor accounting in balance_dirty_pages_ratelimited() but we never call balance_dirty_pages(). Lots of approaches work. What we want to do is never engage the real-time task in forced writeback.
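
For illustration, minimal sketches of the two tweaks (variable names assumed, not the exact hunks):

    /* __alloc_pages(): let real-time tasks dip further into the reserves */
    min = z->pages_low;
    if (rt_task(current))
            min /= 2;

    /* writeback path: keep the ratelimit accounting, but never throttle an
     * RT task in balance_dirty_pages() */
    if (!rt_task(current))
            balance_dirty_pages(mapping);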

- 09 Sep, 2003 1 commit

Andrew Morton authored
From: Andrew Theurer <habanero@us.ibm.com>

This change: http://linux.bkbits.net:8080/linux-2.5/diffs/kernel/sched.c@1.202 does not seem to make sense:

    #define CAN_MIGRATE_TASK(p,rq,this_cpu) \
            ((!idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
             !task_running(rq, p) && \
             cpu_isset(this_cpu, (p)->cpus_allowed))

It should be just the opposite; an idle cpu should be able to have a more aggressive steal, and a busy cpu should not.
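
With the sense inverted as argued above, the check would read roughly as follows (a sketch, not necessarily the committed fix):

    /* idle cpus may steal aggressively; busy cpus only take cache-cold tasks */
    #define CAN_MIGRATE_TASK(p,rq,this_cpu) \
            ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
             !task_running(rq, p) && \
             cpu_isset(this_cpu, (p)->cpus_allowed))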

- 08 Sep, 2003 1 commit

Patrick Mochel authored
- The PM code currently must signal each kernel thread when suspending, and each thread must call refrigerator() to stop itself. This patch adds support for this to migration_thread, which allows suspend states to work on an SMP-enabled kernel (though not necessarily an SMP machine).
- Note: I do not know why the process freezing code was designed in such a way. One would think we could do it without having to call each thread individually, and fix up the threads that need special work individually.

- 31 Aug, 2003 1 commit

Andrew Morton authored
From: Peter Chubb <peterc@gelato.unsw.edu.au>

Currently, the context switch counters reported by getrusage() are always zero. The appended patch adds fields to struct task_struct to count context switches, and adds code to do the counting. The patch adds 4 longs to struct task_struct, and a single addition to the fast path in schedule().
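
A rough sketch of the kind of accounting described (field names and the exact test are assumptions, not the patch itself):

    /* in struct task_struct: voluntary/involuntary switch counts, plus the
     * totals accumulated from reaped children (4 longs in all) */
    unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;

    /* in schedule(), once prev is really being switched away */
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))
            prev->nvcsw++;          /* gave up the cpu voluntarily */
    else
            prev->nivcsw++;         /* was preempted */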

- 18 Aug, 2003 1 commit

Andrew Morton authored
From: William Lee Irwin III <wli@holomorphy.com>

Contributions from: Jan Dittmer <jdittmer@sfhq.hn.org>, Arnd Bergmann <arnd@arndb.de>, "Bryan O'Sullivan" <bos@serpentine.com>, "David S. Miller" <davem@redhat.com>, Badari Pulavarty <pbadari@us.ibm.com>, "Martin J. Bligh" <mbligh@aracnet.com>, Zwane Mwaikambo <zwane@linuxpower.ca>

It has been tested on x86, sparc64, x86_64, ia64 (I think), ppc and ppc64.

cpumask_t enables systems with NR_CPUS > BITS_PER_LONG to utilize all their cpus by creating an abstract data type dedicated to representing cpu bitmasks, similar to fd sets from userspace, and sweeping the appropriate code to update callers to the access API. The fd set-like structure is according to Linus' own suggestion; the macro calling convention to ambiguate representations with minimal code impact is my own invention.

Specifically, a new set of inline functions for manipulating arbitrary-width bitmaps is introduced with a relatively simple implementation, in tandem with a new data type representing bitmaps of width NR_CPUS, cpumask_t, whose accessor functions are defined in terms of the bitmap manipulation inlines. This bitmap ADT found an additional use in i386 arch code handling sparse physical APIC ID's, which was convenient to use in this case as the accounting structure was required to be wider to accommodate the physids consumed by larger numbers of cpus.

For the sake of simplicity and low code impact, these cpu bitmasks are passed primarily by value; however, an additional set of accessors along with an auxiliary data type with const call-by-reference semantics is provided to address performance concerns raised in connection with very large systems, such as SGI's larger models, where copying and call-by-value overhead would be prohibitive. Few (if any) users of the call-by-reference API are immediately introduced.

Also, in order to avoid calling convention overhead on architectures where structures are required to be passed by value, NR_CPUS <= BITS_PER_LONG is special-cased so that cpumask_t falls back to an unsigned long and the accessors perform the usual bit twiddling on unsigned longs as opposed to arrays thereof.

Audits were done with the structure overhead in-place, restoring this special-casing only afterward so as to ensure a more complete API conversion while undergoing the majority of its end-user exposure in -mm. More -mm's were shipped after its restoration to be sure that was tested, too.

The immediate users of this functionality are Sun sparc64 systems, SGI mips64 and ia64 systems, and IBM ia32, ppc64, and s390 systems. Of these, only the ppc64 machines needing the functionality have yet to be released; all others have had systems requiring it for full functionality for at least 6 months, and in some cases, since the initial Linux port to the affected architecture.
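
A minimal sketch of the shape described above (simplified; the real header provides a much fuller accessor set):

    #if NR_CPUS <= BITS_PER_LONG
    typedef unsigned long cpumask_t;            /* special case: one word suffices */
    #define cpu_isset(cpu, mask)    ((mask) & (1UL << (cpu)))
    #else
    typedef struct {
            unsigned long bits[(NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG];
    } cpumask_t;
    #define cpu_isset(cpu, mask)    test_bit((cpu), (mask).bits)
    #endif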

- 17 Aug, 2003 1 commit

Doug Ledford authored

- 14 Aug, 2003 1 commit

Andrew Morton authored
From: Manfred Spraul <manfred@colorfullife.com>

(We think this might be the mystery bug which has been hanging about for months.)

We found a [the?] task struct refcount error: a task that dies sets tsk->state to TASK_ZOMBIE. The next scheduled task checks prev->state, and if it's ZOMBIE, it decrements the reference count of prev. The prev->state & TASK_ZOMBIE test is not atomic with schedule, so if prev is scheduled again and dies between dropping the runqueue lock and checking prev->state, the reference is dropped twice.

This is possible with either preemption [schedule_tail is called by ret_from_fork with preemption count 1, finish_arch_switch drops it to 0] or profiling [profile_exit_mmap can sleep on profile_rwsem, called by mmdrop()] enabled.

- 21 Jul, 2003 1 commit

Kai Germaschewski authored
This patch exports the kstat per-cpu variable, needed for hisax, which uses kstat_irqs() during card probing to make sure that irqs actually work. This could possibly be replaced by a private counter in the hisax ISRs, but that's really just unnecessary overhead, since the core kernel already does the work anyway.

- 18 Jul, 2003 1 commit

Alan Cox authored

- 10 Jul, 2003 1 commit

Andrew Morton authored
From: Ingo Molnar <mingo@elte.hu> It makes hot-balancing happen in the 'busy tick' case as well, which should spread out processes more aggressively.

- 07 Jul, 2003 2 commits

Rusty Russell authored
switch_mm and enter_lazy_tlb take a CPU arg, which is always smp_processor_id(). This is misleading, and pointless if they use per-cpu variables or other optimizations. gcc will eliminate redundant smp_processor_id() (in inline functions) anyway. This removes that arg from all the architectures.

Rusty Russell authored
kstat_this_cpu() is defined in terms of per_cpu instead of __get_cpu_var. This patch changes that, and uses it everywhere appropriate. The sched.c change puts it in a local variable, which helps gcc generate better code.
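
Roughly, the two forms (a sketch):

    /* before: per_cpu() with an explicit smp_processor_id() lookup */
    #define kstat_this_cpu  per_cpu(kstat, smp_processor_id())

    /* after: the dedicated "this cpu's copy" accessor */
    #define kstat_this_cpu  __get_cpu_var(kstat)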

- 06 Jul, 2003 1 commit

Andrew Morton authored
From: Mikael Pettersson <mikpe@csd.uu.se> This patch fixes two p->thread_info->cpu occurrences in kernel/sched.c to use the task_cpu(p) macro instead, which is optimised on UP. Although one of the occurrences is under #ifdef CONFIG_SMP, it's bad style to use the raw non-optimisable form in non-arch code.
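
For reference, a simplified sketch of why task_cpu() is preferable here:

    #ifdef CONFIG_SMP
    #define task_cpu(p)     ((p)->thread_info->cpu)
    #else
    #define task_cpu(p)     (0)     /* constant-folds away on UP */
    #endif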

- 01 Jul, 2003 1 commit

Rusty Russell authored
Makes scheduler use per-cpu variables for the runqueues.
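
In outline, the construct involved (a sketch, not necessarily the exact patch):

    /* one runqueue per cpu, as declared in kernel/sched.c */
    static DEFINE_PER_CPU(struct runqueue, runqueues);

    #define cpu_rq(cpu)     (&per_cpu(runqueues, (cpu)))
    #define this_rq()       (&__get_cpu_var(runqueues))
    #define task_rq(p)      cpu_rq(task_cpu(p))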

- 25 Jun, 2003 2 commits

Andrew Morton authored
From: Andrew Theurer <habanero@us.ibm.com>

This patch ensures that when node loads are compared, the load value is normalised. Without this, load balancing across nodes with dissimilar cpu counts can cause unfairness and sometimes lower overall performance. For example, consider a 2 node system with 4 cpus in the first node and 2 cpus in the second: a workload with 6 running tasks would have 3 tasks running on one node and 3 on the other, leaving one cpu idle in the first node and two tasks sharing a cpu in the second node. The patch ensures that 4 tasks run in the first node and 2 in the second.

I ran some kernel compiles comparing this patch on a 2 node 4 cpu/2 cpu system to show the benefits. Without the patch I got 140 seconds elapsed time; with the patch I get 132 seconds (6% better). Although it is not very common to have nodes with dissimilar cpu counts, it is already happening: PPC64 systems with partitioning have this, and I expect it to become more common on ia32 as partitioning becomes more common.
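
A sketch of the kind of normalisation described (helper and array names are assumptions for illustration):

    #define NODE_LOAD_SCALE 16      /* arbitrary fixed-point factor */

    static int scaled_node_load(int node)
    {
            /* node_nr_running[] and nr_cpus_node() assumed as used elsewhere */
            return atomic_read(&node_nr_running[node]) * NODE_LOAD_SCALE
                    / nr_cpus_node(node);
    }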

Andrew Morton authored
From: Robert Love <rml@tech9.net>

Basically, the problem is that setscheduler() does not set need_resched when needed. There are two basic cases where this is needed:
- the task is running, but now it is no longer the highest priority task on the rq
- the task is not running, but now it is the highest priority task on the rq
In either case, we need to set need_resched to invoke the scheduler.
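
For illustration, a simplified sketch of the two cases (helper names follow the scheduler of the time; lower p->prio means higher priority):

    /* after updating p->prio inside setscheduler(), with the runqueue locked */
    if (task_running(rq, p)) {
            /* running, but its priority may have just been lowered */
            if (p->prio > oldprio)
                    resched_task(p);
    } else if (p->prio < rq->curr->prio) {
            /* not running, but it now outranks whatever is running */
            resched_task(rq->curr);
    }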

- 21 Jun, 2003 1 commit

Rusty Russell authored
We currently mask off offline CPUs in both set_cpus_allowed and sys_sched_setaffinity. This is firstly redundant, and secondly erroneous when more CPUs come online (eg. setting affinity to all 1s should mean all CPUs, including future ones). We mask with cpu_online_map() in sys_sched_getaffinity *anyway* (which is another issue, since this is not valid with changing of online cpus either), so userspace won't see any difference. This patch makes set_cpus_allowed() return -errno, and check that in sys_sched_setaffinity.

- 20 Jun, 2003 1 commit

Andrew Morton authored
From: David Mosberger <davidm@napali.hpl.hp.com>

This is an attempt at sanitizing the interface for stack trace dumping somewhat. It's basically the last thing which prevents 2.5.x from working out-of-the-box for ia64.

ia64 apparently cannot reasonably implement the show_stack interface declared in sched.h. Here is the rationale: modern calling conventions don't maintain a frame pointer and it's not possible to get a reliable stack trace with only a stack pointer as the starting point. You really need more machine state to start with. For a while, I thought the solution is to pass a task pointer to show_stack(), but it turns out that this would negatively impact x86 because it's sometimes useful to show only portions of a stack trace (e.g., starting from the point at which a trap occurred). Thus, this patch _adds_ the task pointer instead:

    extern void show_stack(struct task_struct *tsk, unsigned long *sp);

The idea here is that show_stack(tsk, sp) will show the backtrace of task "tsk", starting from the stack frame that "sp" is pointing to. If tsk is NULL, the trace will be for the current task. If "sp" is NULL, all stack frames of the task are shown. If both are NULL, you'll get the full trace of the current task. I _think_ this should make everyone happy.

The patch also removes the declaration of show_trace() in linux/sched.h (it never was a generic function; some platforms, in particular x86, may want to update accordingly). Finally, the patch replaces the one call to show_trace_task() with the equivalent call show_stack(task, NULL).

The patch below is for Alpha and i386, since I can (compile-)test those (I'll provide the ia64 update through my regular updates). The other arches will break visibly and updating the code should be trivial:
- add a task pointer argument to show_stack() and pass NULL as the first argument where needed
- remove show_trace_task()
- declare show_trace() in a platform-specific header file if you really want to keep it around
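
The calling convention spelled out above, as a quick reference (directly restating the description):

    show_stack(NULL, NULL);     /* full trace of the current task */
    show_stack(NULL, sp);       /* current task, starting at frame sp */
    show_stack(tsk, NULL);      /* full trace of task tsk */
    show_stack(tsk, sp);        /* task tsk, starting at frame sp */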

- 14 Jun, 2003 3 commits

Andrew Morton authored
From: Anton Blanchard <anton@samba.org>

Anton has been testing odd setups:

    /* node 0 - no cpus, no memory */
    /* node 1 - 1 cpu, no memory */
    /* node 2 - 0 cpus, 1GB memory */
    /* node 3 - 3 cpus, 3GB memory */

Two things tripped so far. Firstly the ppc64 debug check for invalid cpus in cpu_to_node(). Fix that in kernel/sched.c:node_nr_running_init(). The other problem concerned nodes with memory but no cpus: kswapd tries to set_cpus_allowed(0) and bad things happen. So we only set cpu affinity for kswapd if there are cpus in the node.

Rusty Russell authored
1) Fix the comments for the migration_thread. A while back Ingo agreed they were exactly wrong, IIRC. 8)
2) Changed spin_lock_irqsave to spin_lock_irq, since it's in a kernel thread.
3) Don't repeat if the task has moved off the original CPU, just finish. This is because we are simply trying to push the task off this CPU: if it's already moved, great. Currently we might theoretically move a task which is actually running on another CPU, which is v. bad.
4) Replace the __ffs(p->cpus_allowed) with any_online_cpu(), since that's what it's for, and __ffs() can give the wrong answer, eg. if there's no CPU 0.
5) Move the core functionality of migrate_task into a separate function, move_task_away, which I want for the hotplug CPU patch.

Benjamin Herrenschmidt authored
It was broken on at least ppc32 & sparc32, and the debugging it offered wasn't worth it any more anyway.

- 10 Jun, 2003 1 commit

Andrew Morton authored
From: "Martin J. Bligh" <mbligh@aracnet.com> rebalance_tick is not properly passing the idle argument through to load_balance in one case. The fix is trivial. Pointed out by John Hawkes.

- 06 Jun, 2003 2 commits

Rusty Russell authored
Trivial patch: when these were introduced cpu.h didn't exist.

Andrew Morton authored
From: Matthew Dobson <colpatch@us.ibm.com> This patch implements a generic version of the nr_cpus_node(node) macro implemented for ppc64 by the previous patch. The generic version simply computes an hweight of the bitmask returned by node_to_cpumask(node) topology macro. This patch also adds a generic_hweight64() function and an hweight_long() function which are used as helpers for the generic nr_cpus_node() macro. This patch also adds a for_each_node_with_cpus() macro, which is used in sched_best_cpu() in kernel/sched.c to fix the original problem of scheduling processes on CPU-less nodes. This macro should also be used in the future to avoid similar problems. Test compiled and booted by Andrew Theurer (habanero@us.ibm.com) on both x440 and ppc64.
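
A sketch of the generic macros described (simplified; the exact forms in the patch may differ, and numnodes as the node count is an assumption):

    /* count a node's cpus by popcounting its cpu mask */
    #define nr_cpus_node(node)      hweight_long(node_to_cpumask(node))

    /* visit only nodes that actually have cpus */
    #define for_each_node_with_cpus(node)                   \
            for ((node) = 0; (node) < numnodes; (node)++)   \
                    if (nr_cpus_node(node))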

- 27 May, 2003 1 commit

Steven Cole authored

- 26 May, 2003 1 commit

Ingo Molnar authored
This further optimizes the 'kick wakeup' scheduler feature:
- do not kick any CPU on UP
- no need to mark the target task for reschedule - it's enough to send an interrupt to that CPU, which will initiate a signal processing pass.

- 19 May, 2003 4 commits

Ingo Molnar authored
This fixes a race noticed by Mike Galbraith: the scheduler can lose a rebalance tick if some task happens to not be rescheduled in time. This is not a fatal condition, but an inconsistency nevertheless.

Ingo Molnar authored
This fixes the scheduler's sync-wakeup code to be consistent on UP as well. Right now there's a behavioral difference between an UP kernel and an SMP kernel running on a UP box: sync wakeups (which are only activated on SMP) can cause a wakeup of a higher prio task, without preemption. On UP kernels this does not happen. This difference in wakeup behavior is bad. This patch activates sync wakeups on UP as well - in the cases sync wakeups are done the waker knows that it will schedule away soon, so this 'delay preemption' decision is correct on UP as well.

Ingo Molnar authored
This removes the unused requeueing code.

Ingo Molnar authored
This fixes an SMP window where the kernel could fail to handle a signal, increasing signal delivery latency up to 200 msecs. Sun has reported to Ulrich that their JVM sees occasional unexpected signal delays under Linux. The more CPUs, the more delays.

The cause of the problem is that the current signal wakeup implementation is racy in kernel/signal.c:signal_wake_up():

    if (t->state == TASK_RUNNING)
            kick_if_running(t);
    ...
    if (t->state & mask) {
            wake_up_process(t);
            return;
    }

If thread (or process) 't' is woken up on another CPU right after the TASK_RUNNING check, and the thread starts to run, then the wake_up_process() here will do nothing, and the signal stays pending up until the thread calls into the kernel next time - which can be up to 200 msecs later.

The solution is to do the 'kicking' of a running thread on a remote CPU atomically with the wakeup. For this I've added wake_up_process_kick(). There is no slowdown for the other wakeup codepaths; the new flag to try_to_wake_up() is compiled off for them. Some other subsystems might want to use this wakeup facility as well in the future (eg. AIO).

In fact this race triggers quite often under Volanomark runs; with this change added, Volanomark performance is up from 500-800 to 2000-3000, on a 4-way x86 box.

- 12 May, 2003 1 commit

Steven Cole authored
Don't depend on undefined preprocessor symbols evaluating to zero.
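
A generic illustration of the pitfall (not a specific hunk from this patch; WANT_FEATURE and setup_feature() are made-up names):

    /* undefined symbols silently evaluate to 0 in #if, and -Wundef warns */
    #if WANT_FEATURE
            setup_feature();
    #endif

    /* explicit: compiled only when the symbol is actually defined */
    #ifdef WANT_FEATURE
            setup_feature();
    #endif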

- 21 Apr, 2003 1 commit

Robert Love authored
Here is a trivial fix for task_prio() in the case MAX_RT_PRIO != MAX_USER_RT_PRIO. In this case, all priorities are skewed by (MAX_RT_PRIO - MAX_USER_RT_PRIO). The fix is to subtract the full MAX_RT_PRIO value from p->prio, not just MAX_USER_RT_PRIO. This makes sense, as the full priority range is unrelated to the maximum user value; only the real maximum RT value matters. This has been in Andrew's tree for a while, with no issue. Also, Ingo acked it.
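
A sketch of the fixed helper as described (assuming the task_t typedef of the time):

    static inline int task_prio(task_t *p)
    {
            /* report priority relative to the full RT range, so user-visible
             * values are not skewed when MAX_RT_PRIO != MAX_USER_RT_PRIO */
            return p->prio - MAX_RT_PRIO;   /* was: p->prio - MAX_USER_RT_PRIO */
    }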

- 20 Apr, 2003 1 commit

Andrew Morton authored
From: "Martin J. Bligh" <mbligh@aracnet.com> I'd forgotten that I'd set this to only fire every 20s in the past, because it would rebalance too agressively. That seems to be fixed now, so we should turn it back on.

- 12 Apr, 2003 1 commit

Andrew Morton authored
I've had a warning in there for 4-5 months and it has never triggered. I think it's safe to remove this test.

- 08 Apr, 2003 1 commit

Linus Torvalds authored