1. 10 May, 2004 40 commits
    • [PATCH] mips: Simplify expression · 2cb5274c
      Andrew Morton authored
      From: Ralf Baechle <ralf@linux-mips.org>
      
      CONFIG_MIPS is always defined, for 32-bit and 64-bit.
    • [PATCH] mips: fix 2.6 fb setup · bbbc024a
      Andrew Morton authored
      From: Ralf Baechle <ralf@linux-mips.org>
    • [PATCH] MIPS update · 4b35ee7f
      Andrew Morton authored
      From: Ralf Baechle <ralf@linux-mips.org>
      
       - Kconfig cleanups:
          - enable DMA_NONCOHERENT, DMA_COHERENT or DMA_IP27 via reverse dependencies
          - untangle VRC4171 / VRC4173 selection
          - R10000 support enables PREFETCH
          - SEAD needs IRQ_CPU
       - Update defconfig against latest Kconfig files.
       - Fix computation of return address if syscall number was out of range
       - Add power management hooks in signal code.
       - Don't try to handle signals when previous context was not in user mode.
       - Fix serial interface setup for VR41xx systems.
       - Build fixes after CLEAR_BITMAP changed name.
       - Remove bogus comment from <asm/checksum.h>
       - <asm/hdreg.h> is dead.
       - Start collecting common definitions for PMON firmware in <asm/pmon.h>
       - Define ARCH_MIN_TASKALIGN to 8; we have 64-bit members even on 32-bit
         kernels if we're running on MIPS II or better.
    • [PATCH] Fix deadlock in journalled quota · 844ef7b9
      Andrew Morton authored
      From: Jan Kara <jack@ucw.cz>
      
      The attached patch should fix the reported deadlock in the journalled quota
      code.  The quotactl() call was violating the locking rules and didn't start
      a transaction when it should have.
      
      From: <raven@themaw.net>
      
        Found a couple of symbols not exported that were needed by the ext3.ko
        module.
    • [PATCH] migration_thread() race fix · 74499d32
      Andrew Morton authored
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
      Noticed that migration_thread can examine "kthread_should_stop()?" without
      setting its state to TASK_INTERRUPTIBLE first.  This can cause kthread_stop
      on that thread to block forever ...
      
      P.S.	- I assumed that having the task state set to TASK_INTERRUPTIBLE
      	  while it is doing active_load_balance is fine. It seemed to be
      	  the case earlier also.
    • [PATCH] sched_getaffinity vs cpu hotplug race fix · 870d3c0a
      Andrew Morton authored
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
      Fix the race in sys_sched_getaffinity.  The patch takes the cpu_hotplug lock
      before reading the cpus_allowed mask of a task.
    • [PATCH] Move migrate_all_tasks to CPU_DEAD handling · ddea677b
      Andrew Morton authored
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
      migrate_all_tasks is currently run with the rest of the machine stopped.
      It iterates through the complete task table, turning off cpu affinity of any
      task that it finds affine to the dying cpu. Depending on the task table
      size this can take considerable time. All this time the machine is stopped,
      doing nothing.
      
      Stopping the machine for such extended periods can be avoided if we do
      task migration in CPU_DEAD notification and that's precisely what this patch
      does.
      
      The patch puts the idle task at the _front_ of the dying CPU's runqueue at the
      highest priority possible. This causes the idle thread to run _immediately_
      after the kstopmachine thread yields. The idle thread notices that its cpu is
      offline and dies quickly. Task migration can then be done at leisure in the
      CPU_DEAD notification, when the rest of the CPUs are running.
      
      Some advantages with this approach are:
      
      	- More scalable. Predictable amount of time that the machine is stopped.
      	- No changes to hot path/core code. We are just exploiting scheduler
      	  rules which run the next high-priority task on the runqueue. Also
      	  since I put the idle task at the _front_ of the runqueue, there
      	  are no races when an equally high priority task is woken up
      	  and added to the runqueue. It gets in at the back of the runqueue,
      	  _after_ the idle task!
      	- The cpu_is_offline check that is presently required in try_to_wake_up,
      	  idle_balance and rebalance_tick can be removed, thus speeding them
      	  up a bit.
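
      The head-versus-tail ordering argument above can be illustrated with a
      small user-space sketch.  It is not the kernel code: the list type and the
      names enqueue_tail(), enqueue_head() and pick_next() are hypothetical
      stand-ins for a single priority queue, chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_node {
    struct list_node *prev, *next;
};

/* Hypothetical task record; only the fields needed for the demo. */
struct task {
    struct list_node run_list;   /* must stay the first member */
    const char *name;
};

static void queue_init(struct list_node *head)
{
    head->prev = head->next = head;
}

/* Link n between two neighbours that are already adjacent. */
static void link_between(struct list_node *n,
                         struct list_node *prev, struct list_node *next)
{
    n->prev = prev;
    n->next = next;
    prev->next = n;
    next->prev = n;
}

/* Ordinary wakeup: the task joins the back of the queue. */
static void enqueue_tail(struct task *t, struct list_node *head)
{
    assert(t != NULL && head != NULL);
    link_between(&t->run_list, head->prev, head);
}

/* Head insertion: the task becomes the next one to be picked. */
static void enqueue_head(struct task *t, struct list_node *head)
{
    assert(t != NULL && head != NULL);
    link_between(&t->run_list, head, head->next);
}

/* The picker takes the first entry of the queue; NULL if it is empty. */
static struct task *pick_next(struct list_node *head)
{
    if (head->next == head)
        return NULL;
    return (struct task *)head->next;   /* valid: run_list is first */
}
```

      With a task already queued, an "idle" entry added with enqueue_head() is
      returned by pick_next() first, and a task added later with enqueue_tail()
      still lands behind it.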
      
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      
        Rusty mentioned that the unlikely hints against cpu_is_offline are
        redundant since the macro already has that hint.  The patch below removes
        the redundant hints I added.
    • [PATCH] sched: Look at another CPU's domain · 4197ad87
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      The SMT wake_idle code really wants to look at a non-local CPU's domain in
      order to check for idle siblings.
      
      So change the domain attachment code a little bit so we continue to hold a
      runqueue's lock while attaching a new domain.  This means the locking rules
      have changed to: you may access your own domain without any lock, you must
      hold a remote runqueue's lock in order to view its domain.
    • [PATCH] sched: micro-optimisation for wake_up · 25de0902
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      This actually does produce better code, especially under the locked
      section.
      
      Turns a conditional + unconditional jump under the lock in the unlikely
      case into a cmov outside the lock.
    • [PATCH] sched: reduce idle time · 85841fc0
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      This makes NEWLY_IDLE balancing cause find_busiest_group to return the busiest
      available group even if there isn't an imbalance.  Basically - try a bit
      harder to prevent schedule from emptying the runqueue.
      
      It is quite aggressive, but that isn't so bad because we don't (by default)
      do NEWLY_IDLE balancing across NUMA nodes, and NEWLY_IDLE balancing is always
      restricted to cache_hot tasks.
      
      It picked up a little bit of idle time that dbt2-pgsql was seeing...
    • [PATCH] sched: balance-on-clone · 8c8cfc36
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      Implement balancing during clone().  It does the following things:
      
      - introduces SD_BALANCE_CLONE that can serve as a tool for an
        architecture to limit the search-idlest-CPU scope on clone().
        E.g. the 512-CPU systems should rather not enable this.
      
      - uses the highest sd for the imbalance_pct, not this_rq (which didn't
        make sense).
      
      - unifies balance-on-exec and balance-on-clone via the find_idlest_cpu()
        function. Gets rid of sched_best_cpu() which was still a bit
        inconsistent IMO, it used 'min_load < load' as a condition for
        balancing - while a more correct approach would be to use half of the
        imbalance_pct, like passive balancing does.
      
      - the patch also reintroduces the possibility to do SD_BALANCE_EXEC on
        SMP systems, and activates it - to get testing.
      
      - NOTE: there's one thing in this patch that is slightly unclean: i
        introduced wake_up_forked_thread. I did this to make it easier to get
        rid of this patch later (wake_up_forked_process() has lots of
        dependencies in various architectures). If this capability remains in
        the kernel then i'll clean it up and introduce one function for
        wake_up_forked_process/thread.
      
      - NOTE2: i added the SD_BALANCE_CLONE flag to the NUMA CPU template too.
        Some NUMA architectures probably want to disable this.
    • [PATCH] sched: cpu load management cleanup · a690c9b7
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      This does the source/target cleanup.  This is a no-functionality patch which
      also adds more comments to explain these functions.
    • [PATCH] sched: passive balancing damping · df65cdbf
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      This patch starts to balance woken processes when half the relevant domain's
      imbalance_pct is reached.  Previously balancing would start after a small,
      constant difference in waker/wakee runqueue loads was reached, which would
      cause too much process movement when there are lots of processes running.
      
      It also turns wake balancing into a domain flag while previously it was always
      on.  Now sched domains can "soft partition" an SMP system without using
      processor affinities.
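
      One possible reading of "half the relevant domain's imbalance_pct" is
      sketched below.  It assumes imbalance_pct is a percentage of at least 100
      (for example 125 for a 25% margin); that example value, the function name
      and the argument names are illustrative assumptions, not taken from the
      patch.

```c
#include <assert.h>

/*
 * Returns nonzero when busy_load exceeds idle_load by more than half of
 * the domain's imbalance margin.
 *
 * imbalance_pct: percentage >= 100, e.g. 125 means a 25% margin, so the
 *                halved threshold is 112% (integer division).
 */
static int wake_balance_wanted(unsigned long busy_load,
                               unsigned long idle_load,
                               unsigned int imbalance_pct)
{
    assert(imbalance_pct >= 100);

    unsigned int half_pct = 100 + (imbalance_pct - 100) / 2;

    /* Compare in percent units to avoid floating point. */
    return 100 * busy_load > half_pct * idle_load;
}
```

      Because the threshold scales with the loads being compared, a fixed
      difference of a few tasks stops triggering migration once both
      runqueues are long, which is the damping effect described above.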
    • [PATCH] sched: cleanups · 237eaf03
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      This re-adds cleanups which were lost in splitups of an earlier patch.
    • [PATCH] sched: lock cpu_attach_domain for hotplug · 2ce2e329
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      The attached patch is required to work correctly with the CPU hotplug
      framework.  John Hawkes reports successful booting with this.
    • [PATCH] sched: extend sync wakeups · 7dc12702
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      The attached patch extends sync wakeups to the process sys_exit() path too:
      the chldwait wakeup can be done sync, since we know that the process is
      going to exit (and thus deschedule).
      
      The most visible effect of this change is strace's behavior on SMP systems:
      it now stays on a single CPU, together with the traced child.  (previously
      it would run in parallel to the child, bouncing around madly.)
    • [PATCH] sched: add enqueue_task_head() · 3c6f29aa
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      Helper function for later patches
    • [PATCH] sched: uninlinings · 78650e1b
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      Uninline things
    • [PATCH] sched: minor cleanups · 2f16618a
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      Minor cleanups from Ingo's patch including task_hot (do it right in
      try_to_wake_up too).
    • [PATCH] sched: fix setup races · 80b19256
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      De-racify the sched domain setup code.  This involves creating a dummy
      "init" domain during sched_init (which is called early).
      
      When topology information becomes available, the sched domains are then
      built and attached.  The attach mechanism is asynchronous and uses the
      migration threads, which perform the switch with interrupts off.  This is a
      quiescent state, so domains can still be lockless on the read side.  It
      also allows us to change the domains at runtime without much more work. 
      This is something SGI is interested in to elegantly do soft partitioning of
      their systems without having to use hard cpu affinities (which cause
      balancing problems of their own).
      
      The current setup code also has a race somewhere because it is unable to
      boot on a 384 CPU system.
      
      
      
      From: Anton Blanchard <anton@samba.org>
      
         This is basically a mindless ppc64 merge of the x86 changes to sched
         domain init code.
      
         Actually if I produce a sibling_map[] then the x86 code and the ppc64
         will be identical.  Maybe we can merge it.
    • [PATCH] ARCH_HAS_SCHED_WAKE_BALANCE doesnt exist · 17d66773
      Andrew Morton authored
      From: Anton Blanchard <anton@samba.org>
      
      It seems someone has been making trivial changes without using grep.
    • [PATCH] ppc64: sched-domain support · 019bc3be
      Andrew Morton authored
      From: Anton Blanchard <anton@samba.org>
      
      Below are the diffs between the current ppc64 sched init stuff and x86.
      
      - Ignore the POWER5 specific stuff, I don't set up a sibling map yet.
      - What should I set cache_hot_time to?
      
      large cpumask typechecking requirements (perhaps useful on x86 as well):
      - cpu->cpumask = CPU_MASK_NONE -> cpus_clear(cpu->cpumask);
      - cpus_and(nodemask, node_to_cpumask(i), cpu_possible_map) doesn't work,
        need to use a temporary
    • [PATCH] sched: oops fix · a65fb1d0
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      After the for_each_domain change, the warn here won't trigger, instead it
      will oops in the if statement.  Also, make sure we don't pass an empty
      cpumask to for_each_cpu.
    • [PATCH] sched: altix tuning · fd7b7b0f
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      From: John Hawkes
      
      The following brings up performance on a 64-way Altix.  This system being on
      the smaller end of the scale should also be applicable to other NUMA systems.
    • [PATCH] sched: fix imbalance calculations · 50386fbc
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      Imbalance calculations were not right.  This would cause unneeded migration.
    • [PATCH] sched: wakeup balancing fixes · 713551bc
      Andrew Morton authored
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      Make affine wakes and "passive load balancing" more conservative.  Aggressive
      affine wakeups were causing huge regressions in dbt3-pgsql on 8-way non NUMA
      systems at OSDL's STP.
    • [PATCH] Hotplug CPU sched_balance_exec Fix · 91bc0bf7
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>
      
      From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      From: Andrew Morton <akpm@osdl.org>
      From: Rusty Russell <rusty@rustcorp.com.au>
      
      We want to get rid of lock_cpu_hotplug() in sched_migrate_task.  Found
      that lockless migration of an execing task is _extremely_ racy.  The
      races I hit are described below, along with probable solutions.
      
      Task migration done elsewhere should be safe (?) since they either
      hold the lock (sys_sched_setaffinity) or are done entirely with preemption 
      disabled (load_balance).
      
         sched_balance_exec does:
      
      	a. disables preemption
      	b. finds new_cpu for current
      	c. enables preemption
      	d. calls sched_migrate_task to migrate current to new_cpu
      
         and sched_migrate_task does:
      
      	e. task_rq_lock(p)
      	f. migrate_task(p, dest_cpu ..)
      		(if we have to wait for migration thread)
      		g. task_rq_unlock()
      		h. wake_up_process(rq->migration_thread)
      		i. wait_for_completion()
      
         Several things can happen here:
      
      	1. new_cpu can go down after h and before migration thread has
      	   got around to handle the request
      
      	   ==> we need to add a cpu_is_offline check in __migrate_task
      
      	2. new_cpu can go down between c and d or before f.
      
      	   ===> Even though this case is automatically handled by the above 
      	        change (migrate_task being called on a running task, current,
      		will delegate migration to migration thread), would it be 
      	 	good practice to avoid calling migrate_task in the first place
      		itself when dest_cpu is offline. This means adding another
      	 	cpu_is_offline check after e in sched_migrate_task
      
      	3. The 'current' task can get preempted _immediately_ after
      	   g and when it comes back, task_cpu(p) can be dead. In
      	   which case, it is invalid to do wake_up on a non-existent migration 
      	   thread.  (rq->migration_thread can be NULL).
      
      	   ===> We should disable preemption through g and h
      
      	4. Before migration thread gets around to handle the request, its cpu
      	   goes dead. This will leave unhandled migration requests in the dead 
      	   cpu. 
      
      	   ===> We need to wakeup sleeping requestors (if any) in CPU_DEAD
      	        notification.
      
      I really wonder if we can get rid of these issues by avoiding balancing at 
      exec time and instead have it balanced during load_balance.  Alternatively,
      if this is valuable and we want to retain it, I think we still need to
      consider a read/write sem, with sched_migrate_task doing down_read_trylock.
      This may eliminate the deadlock I hit between cpu_up and CPU_UP_PREPARE 
      notification, which had forced me away from r/w sem.
      
      Anyway, the patch below addresses the above races.  It's against 2.6.6-rc2-mm1
      and has been tested on a 4-way Intel Pentium SMP m/c.
      
      
      Rusty sez:
      
      Two other changes:
      1) I grabbed a reference to the thread, rather than using
      preempt_disable().  It's the more obvious way I think.
      
      2) Why the wait_to_die code?  It might be needed if we move tasks after
      stop_machine, but for now I don't see the problem with the migration
      thread running on the wrong CPU for a bit: nothing is on this runqueue
      so active_load_balance is safe, and __migrate_task will be a noop (due
      to cpu_is_offline() check).  If there is a problem, your fix is racy,
      because we could be preempted immediately afterwards.
      
      So I just stop the kthread then wakeup any remaining...
    • [PATCH] sched: trivial fixes, cleanups · 850f7d78
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      The trivial fixes.
      
      - added recent trivial bits from Nick's and my patches.
      - hotplug CPU fix
      - early init cleanup
    • [PATCH] Reduce TLB flushing during process migration · fa8f2c50
      Andrew Morton authored
      From: Martin Hicks <mort@wildopensource.com>
      
      Another optimization patch from Jack Steiner, intended to reduce TLB
      flushes during process migration.
      
      Most architectures should define tlb_migrate_prepare() to be flush_tlb_mm(),
      but on i386, it would be a wasted flush, because i386 disconnects previous
      cpus from the tlb flush automatically.
    • [PATCH] sched: add local load metrics · 1ec43096
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This patch removes the per runqueue array of NR_CPU arrays.  Each time we
      want to check a remote CPU's load we check nr_running as well anyway, so
      introduce a cpu_load which is the load of the local runqueue and is kept
      updated in the timer tick.  Put them in the same cacheline.
      
      This has additional benefits of having the cpu_load consistent across all
      CPUs and more up to date.  It is sampled better too, being updated once per
      timer tick.
      
      This shouldn't make much difference in scheduling behaviour, but all
      benchmarks are either as good or better on the 16-way NUMAQ: hackbench,
      reaim, volanomark are about the same, tbench and dbench are maybe a bit
      better.  kernbench is about one percent better.
      
      John reckons it isn't a big deal, but it does save 4K per CPU or 2MB total
      on his big systems, so I figure it must be a bit kinder on the caches.  I
      think it is just nicer in general anyway.
    • [PATCH] sched: SMT niceness handling · 47ad0fce
      Andrew Morton authored
      From: Con Kolivas <kernel@kolivas.org>
      
      This patch provides full per-package priority support for SMT processors
      (aka pentium4 hyperthreading) when combined with CONFIG_SCHED_SMT.
      
      It maintains cpu percentage distribution within each physical cpu package
      by limiting the time a lower priority task can run on a sibling cpu
      concurrently with a higher priority task.
      
      It introduces a new field into the scheduler domain:
      unsigned int per_cpu_gain;	/* CPU % gained by adding domain cpus */
      
      This is empirically set to 15% for pentium4 at the moment and can be
      modified to support different values dynamically as newer processors come
      out with improved SMT performance.  It should not matter how many siblings
      there are.
      
      How it works is it compares tasks running on sibling cpus and when a lower
      static priority task is running it will delay it till
      high_priority_timeslice * (100 - per_cpu_gain) / 100 <= low_prio_timeslice
      
      E.g. a nice 19 task's timeslice is 10ms and a nice 0 task's timeslice is 102ms.
      On vanilla the nice 0 task runs on one logical cpu while the nice 19 task runs
      unabated on the other logical cpu.  With smtnice the nice 0 task runs on one
      logical cpu for 102ms and the nice 19 task sleeps till the nice 0 task has
      12ms remaining and then will schedule.
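
      The delay rule and the 12ms figure can be checked with a short sketch.
      It assumes that "high_priority_timeslice" in the inequality means the
      time the higher priority task still has left, and that the arithmetic
      is integer arithmetic; the function and parameter names are made up
      for this example.

```c
#include <assert.h>

/*
 * Returns nonzero when the lower priority task may run on the sibling.
 *
 * hi_remaining_ms: timeslice the higher priority task still has left
 * lo_slice_ms:     full timeslice of the lower priority task
 * per_cpu_gain:    CPU % gained by the extra sibling (15 in the text)
 */
static int low_prio_may_run(unsigned int hi_remaining_ms,
                            unsigned int lo_slice_ms,
                            unsigned int per_cpu_gain)
{
    assert(per_cpu_gain <= 100);

    /* Integer division truncates, as it would without floating point. */
    return hi_remaining_ms * (100 - per_cpu_gain) / 100 <= lo_slice_ms;
}
```

      With per_cpu_gain at 15 and a 10ms low priority timeslice, the test is
      false at 102ms and at 13ms remaining (13 * 85 / 100 truncates to 11)
      and first becomes true at 12ms (12 * 85 / 100 truncates to 10), which
      agrees with the example above.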
      
      Real time tasks and kernel threads are not altered by this code, and kernel
      threads do not delay lower priority user tasks.
      
      with lots of thanks to Zwane Mwaikambo and Nick Piggin for help with the
      coding of this version.
      
      If this is merged, it is probably best to delay pushing this upstream in
      mainline till sched_domains gets tested for at least one major release.
    • [PATCH] sched_domains: use cpu_possible_map · a5f39fd8
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This changes sched domains to contain all possible CPUs, and check for
      online as needed.  It's in order to play nicely with CPU hotplug.
    • [PATCH] sched-group-power · 482b9933
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      The following patch implements a cpu_power member to struct sched_group.
      
      This allows special casing to be removed for SMT groups in the balancing
      code.  It does not take CPU hotplug into account yet, but that shouldn't be
      too hard.
      
      I have tested it on the NUMAQ by pretending it has SMT.  Works as expected.
      Active balances across nodes.
    • [PATCH] sched_balance_exec(): don't fiddle with the cpus_allowed mask · 3de8a6b4
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>,
            Nick Piggin <piggin@cyberone.com.au>
      
      The current sched_balance_exec() sets the task's cpus_allowed mask
      temporarily to move it to a different CPU.  This has several issues,
      including the fact that a task will see its affinity at a bogus value.
      
      So we change the migration_req_t to explicitly specify a destination CPU,
      rather than the migration thread deriving it from cpus_allowed.  If the
      requested CPU is no longer valid (racing with another set_cpus_allowed,
      say), it can be ignored: if the task is not allowed on this CPU, there will
      be another migration request pending.
      
      This change allows sched_balance_exec() to tell the migration thread what
      to do without changing the cpus_allowed mask.
      
      So we rename __set_cpus_allowed() to move_task(), as the cpus_allowed mask
      is now set by the caller.  And move_task_away(), which the migration thread
      uses to actually perform the move, is renamed __move_task().
      
      I also ignore offline CPUs in sched_best_cpu(), so sched_migrate_task()
      doesn't need to check for offline CPUs.
      
      Ulterior motive: this approach also plays well with CPU Hotplug.
      Previously that patch might have seen a task with cpus_allowed only
      containing the dying CPU (temporarily due to sched_balance_exec) and
      forcibly reset it to all cpus, which might be wrong.  The other approach is
      to hold the cpucontrol sem around sched_balance_exec(), which is too much
      of a bottleneck.
    • [PATCH] sched: handle inter-CPU jiffies skew · db05a192
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      John Hawkes described this problem to me:
      
      There *is* a small problem in this area, though, that SuSE avoids.
      "jiffies" gets updated by cpu0.  The other CPUs may, over time, get out of
      sync (and they're initialized on ia64 to start out being out of sync), so
      it's no guarantee that every CPU will wake up from its timer interrupt and
      see a "jiffies" value that is guaranteed to be last_jiffies+1.  Sometimes
      the jiffies value may be unchanged since the last wakeup.  Sometimes the
      jiffies value may have incremented by 2 (or more, especially if cpu0's
      interrupts are disabled for long stretches of time).  So an algorithm that
      says, "I'll call load_balance() only when jiffies is *exactly* N" is going
      to fail on occasion, either by calling load_balance() too often or not
      often enough.  ***
      
      I fixed this by adding a last_balance field to struct sched_domain, and
      working off that.
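
      The difference between testing for an exact jiffies value and working
      off a stored last_balance can be sketched as follows.  The struct is a
      hypothetical, cut-down stand-in (only last_balance is named in the
      text; balance_interval and both function names are assumptions), and
      resetting last_balance to the observed time is just one possible
      policy.

```c
#include <assert.h>

/* Hypothetical, cut-down domain record: only the fields used here. */
struct demo_domain {
    unsigned long last_balance;      /* jiffies seen at the last balance */
    unsigned long balance_interval;  /* minimum jiffies between balances */
};

/* Fragile variant: fires only if this CPU observes an exact multiple. */
static int due_exact(unsigned long now, unsigned long interval)
{
    assert(interval != 0);
    return now % interval == 0;
}

/*
 * Robust variant: fires once at least balance_interval jiffies have
 * elapsed since the last balance.  The unsigned subtraction also stays
 * correct when the jiffies counter wraps around.
 */
static int due_since_last(struct demo_domain *sd, unsigned long now)
{
    if (now - sd->last_balance < sd->balance_interval)
        return 0;
    sd->last_balance = now;
    return 1;
}
```

      A CPU that observes jiffies values 7, 9, 9, 17 with an interval of 8
      never satisfies due_exact(), while due_since_last() fires at the first
      9 and at 17, and does not fire twice for the repeated 9.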
    • [PATCH] sched: implement domains for i386 HT · e18e19ad
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      The following patch builds a scheduling description for the i386
      architecture using cpu_sibling_map to set up SMT if CONFIG_SCHED_SMT is
      set.
      
      It could be made more fancy and collapse degenerate domains at runtime (ie.
      1 sibling per CPU, or 1 NUMA node in the computer).
      
      
      From: Zwane Mwaikambo <zwane@arm.linux.org.uk>
      
         This fixes an oops due to cpu_sibling_map being uninitialised when a
         system with no MP table (most UP boxen) boots a CONFIG_SMT kernel.  What
         also happens is that the cpu_group lists end up not being terminated
         properly, but this oops kills it first.  Patch tested on UP w/o MP table,
         2x P2 and UP Xeon w/ no siblings.
      
      From: "Martin J. Bligh" <mbligh@aracnet.com>,
            Nick Piggin <piggin@cyberone.com.au>
      
         Change arch_init_sched_domains to use cpu_online_map
      
      From: Anton Blanchard <anton@samba.org>
      
         Fix build with NR_CPUS > BITS_PER_LONG
    • [PATCH] sched: cpu_sibling_map to cpu_mask · 7a1dc0ea
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This is a (somewhat) trivial patch which converts cpu_sibling_map from an
      array of CPUs to an array of cpumasks.  Needed for >2 siblings per package,
      but it actually can simplify code as it allows the cpu_sibling_map to be
      set up even when there is 1 sibling per package.  Intel want this, I use it
      in the next patch to build scheduling domains for the P4 HT.
      
      From: Thomas Schlichter <thomas.schlichter@web.de>
      
         Build fix
      
      From: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
      
         Fix to handle more than 2 siblings per package.
    • [PATCH] scheduler domain balancing improvements · 3dfa303d
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This patch gets the sched_domain scheduler working better WRT balancing.
      It's been tested on the NUMAQ.  Among other things it changes the way SMT
      load calculation works so as not to active load balance when it shouldn't.
      
      It still has a problem with SMT and NUMA: it will put a task on each
      sibling in a node before moving tasks to another node.  It should probably
      start moving tasks after each *physical* CPU is filled.
      
      To fix, you need "how much CPU power in this domain?" At the moment we
      approximate # runqueues == CPU power, and hack around it at the CPU
      physical domain by counting all sibling runqueues as 1.
      
      It isn't hard to correctly work the CPU power out, but once CPU hotplug is
      in the equation it becomes much more hotplug events.  If anyone is actually
      interested in getting this fixed, that is.
    • [PATCH] sched_domain debugging · b45bb339
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      Anton was attempting to make a sched domain topology for his POWER5 and was
      having some trouble.
      
      This patch only includes code which is ifdefed out, but hopefully it will
      be of some use to implementors.
    • [PATCH] sched: scheduler domain support · 8c136f71
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This is the core sched domains patch.  It can handle any number of levels
      in a scheduling hierarchy, and allows architectures to easily customize how
      the scheduler behaves.  It also provides progressive balancing backoff
      needed by SGI on their large systems (although they have not yet tested
      it).
      
      It is built on top of (well, uses ideas from) my previous SMP/NUMA work, and
      gets results very similar to them when using the default scheduling
      description.
      
      Benchmarks
      ==========
      
      Martin was seeing I think 10-20% better system times in kernbench on the 32
      way.  I was seeing improvements in dbench, tbench, kernbench, reaim,
      hackbench on a 16-way NUMAQ.  Hackbench in fact had a non linear element
      which is all but eliminated.  Large improvements in volanomark.
      
      Cross node task migration was decreased in all above benchmarks, sometimes by
      a factor of 100!!  Cross CPU migration was also generally decreased.  See
      this post:
      http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2
      
      Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues
      patch (which is a big improvement).
      
      Some examples on the 16-way NUMAQ (this is slightly older sched domain code):
      
       http://www.kerneltrap.org/~npiggin/w26/hbench.png
       http://www.kerneltrap.org/~npiggin/w26/vmark.html
      
      From: Jes Sorensen <jes@wildopensource.com>
      
         Tiny patch to make -mm3 compile on a NUMA box with NR_CPUS >
         BITS_PER_LONG.
      
      From: "Martin J. Bligh" <mbligh@aracnet.com>
      
         Fix a minor nit with the find_busiest_group code.  No functional change,
         but makes the code simpler and clearer.  This patch does two things ... 
         adds some more expansive comments, and removes this if clause:
      
            if (*imbalance < SCHED_LOAD_SCALE
                            && max_load - this_load > SCHED_LOAD_SCALE)
      		*imbalance = SCHED_LOAD_SCALE;
      
         If we remove the scaling factor, we're basically conditionally doing:
      
      	if (*imbalance < 1)
      		*imbalance = 1;
      
         Which is pointless, as the very next thing we do is to remove the
         scaling factor, rounding up to the nearest integer as we do:
      
      	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
      
         Thus the if statement is redundant, and only makes the code harder to
         read ;-)
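
         The redundancy argument can be checked mechanically with the sketch
         below.  The shift value of 7 is an assumption made for this example,
         and the helper names are hypothetical; load_gap stands for
         max_load - this_load.

```c
#include <assert.h>

#define SCHED_LOAD_SHIFT 7UL   /* assumed value, for illustration only */
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/* Convert a scaled imbalance to whole tasks, rounding up. */
static unsigned long scaled_to_tasks(unsigned long imbalance)
{
    return (imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
}

/* The removed if clause, followed by the same rounding step. */
static unsigned long scaled_to_tasks_with_clamp(unsigned long imbalance,
                                                unsigned long load_gap)
{
    if (imbalance < SCHED_LOAD_SCALE && load_gap > SCHED_LOAD_SCALE)
        imbalance = SCHED_LOAD_SCALE;
    return scaled_to_tasks(imbalance);
}

/* Compare both paths for every nonzero imbalance up to twice the scale. */
static void check_equivalence(void)
{
    unsigned long i;

    for (i = 1; i <= 2 * SCHED_LOAD_SCALE; i++)
        assert(scaled_to_tasks(i) ==
               scaled_to_tasks_with_clamp(i, 2 * SCHED_LOAD_SCALE));
}
```

         The sketch only compares nonzero inputs: in this model a scaled
         imbalance of exactly zero would be raised to one task by the clause
         but stays zero with the rounding alone.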
      
      From: Rick Lindsley <ricklind@us.ibm.com>
      
         In find_busiest_group(), after we exit the do/while, we select our
         imbalance.  But max_load, avg_load, and this_load are all unsigned, so
         min(x,y) will make a bad choice if max_load < avg_load < this_load (that
         is, a choice between two negative [very large] numbers).
      
         Unfortunately, there is a bug when max_load never gets changed from zero
         (look in the loop and think what happens if the only load on the machine is
         being created by cpu groups of which we are a member).  And you have a
         recipe for some really bogus values for imbalance.
      
         Even if you fix the max_load == 0 bug, there will still be times when
         avg_load - this_load will be negative (thus very large) and you'll make the
         decision to move stuff when you shouldn't have.
      
         This patch allows for this_load to set max_load, which if I understand
         the logic properly is correct.  With this patch applied, the algorithm is
         *much* more conservative ...  maybe *too* conservative but that's for
         another round of testing ...
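
         The unsigned wraparound described above can be reproduced in a few
         lines.  The helper names are hypothetical, and the guarded variant is
         only an illustrative alternative; it is not the fix in this patch,
         which lets this_load set max_load instead.

```c
#include <assert.h>
#include <limits.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Naive form: both differences are computed in unsigned arithmetic. */
static unsigned long imbalance_naive(unsigned long max_load,
                                     unsigned long avg_load,
                                     unsigned long this_load)
{
    return min_ul(max_load - avg_load, avg_load - this_load);
}

/* Guarded form: report no imbalance if a difference would be negative. */
static unsigned long imbalance_guarded(unsigned long max_load,
                                       unsigned long avg_load,
                                       unsigned long this_load)
{
    if (max_load <= avg_load || avg_load <= this_load)
        return 0;
    return min_ul(max_load - avg_load, avg_load - this_load);
}

/* max_load < avg_load < this_load: both "negative" results wrap around. */
static void check_wraparound(void)
{
    assert(imbalance_naive(10, 20, 30) == ULONG_MAX - 9);
    assert(imbalance_guarded(10, 20, 30) == 0);
}
```

         When the loads are ordered the expected way, for example 30, 20 and
         10, both forms return the same value of 10.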
      
      From: Ingo Molnar <mingo@elte.hu>
      
         sched-find-busiest-fix