1. 28 Jun, 2006 40 commits
    • Ingo Molnar's avatar
      [PATCH] pi-futex: rt mutex debug · e7eebaf6
      Ingo Molnar authored
      Runtime debugging functionality for rt-mutexes.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      e7eebaf6
    • Steven Rostedt's avatar
      [PATCH] pi-futex: rt mutex docs · a6537be9
      Steven Rostedt authored
      Add rt-mutex documentation.
      
      [rostedt@goodmis.org: Update rt-mutex-design.txt as per Randy Dunlap suggestions]
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      a6537be9
    • Ingo Molnar's avatar
      [PATCH] pi-futex: rt mutex core · 23f78d4a
      Ingo Molnar authored
      Core functions for the rt-mutex subsystem.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      23f78d4a
    • Ingo Molnar's avatar
      [PATCH] pi-futex: scheduler support for pi · b29739f9
      Ingo Molnar authored
      Add framework to boost/unboost the priority of RT tasks.
      
      This consists of:
      
       - caching the 'normal' priority in ->normal_prio
       - providing a functions to set/get the priority of the task
       - make sched_setscheduler() aware of boosting
      
      The effective_prio() cleanups also fix a priority-calculation bug pointed out
      by Andrey Gelman, in set_user_nice().
      
      has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Cc: Andrey Gelman <agelman@012.net.il>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b29739f9
    • Ingo Molnar's avatar
      [PATCH] pi-futex: add plist implementation · 77ba89c5
      Ingo Molnar authored
      Add the priority-sorted list (plist) implementation.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      77ba89c5
    • Ingo Molnar's avatar
      [PATCH] pi-futex: introduce WARN_ON_SMP · 8eb94f80
      Ingo Molnar authored
      Introduce a new WARN_ON variant: WARN_ON_SMP(cond).
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8eb94f80
    • Ingo Molnar's avatar
      [PATCH] pi-futex: introduce debug_check_no_locks_freed() · f9b8404c
      Ingo Molnar authored
      Add debug_check_no_locks_freed(), as a central inline to add
      bad-lock-free-debugging functionality to.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f9b8404c
    • Ingo Molnar's avatar
      [PATCH] pi-futex: robust futex docs fix · 6abdce76
      Ingo Molnar authored
      Fix typo in Documentation/robust-futexes.txt.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      6abdce76
    • Ingo Molnar's avatar
      [PATCH] pi-futex: futex code cleanups · e2970f2f
      Ingo Molnar authored
      We are pleased to announce "lightweight userspace priority inheritance" (PI)
      support for futexes.  The following patchset and glibc patch implements it,
      ontop of the robust-futexes patchset which is included in 2.6.16-mm1.
      
      We are calling it lightweight for 3 reasons:
      
       - in the user-space fastpath a PI-enabled futex involves no kernel work
         (or any other PI complexity) at all.  No registration, no extra kernel
         calls - just pure fast atomic ops in userspace.
      
       - in the slowpath (in the lock-contention case), the system call and
         scheduling pattern is in fact better than that of normal futexes, due to
         the 'integrated' nature of FUTEX_LOCK_PI.  [more about that further down]
      
       - the in-kernel PI implementation is streamlined around the mutex
         abstraction, with strict rules that keep the implementation relatively
         simple: only a single owner may own a lock (i.e.  no read-write lock
         support), only the owner may unlock a lock, no recursive locking, etc.
      
        Priority Inheritance - why, oh why???
        -------------------------------------
      
      Many of you heard the horror stories about the evil PI code circling Linux for
      years, which makes no real sense at all and is only used by buggy applications
      and which has horrible overhead.  Some of you have dreaded this very moment,
      when someone actually submits working PI code ;-)
      
      So why would we like to see PI support for futexes?
      
      We'd like to see it done purely for technological reasons.  We dont think it's
      a buggy concept, we think it's useful functionality to offer to applications,
      which functionality cannot be achieved in other ways.  We also think it's the
      right thing to do, and we think we've got the right arguments and the right
      numbers to prove that.  We also believe that we can address all the
      counter-arguments as well.  For these reasons (and the reasons outlined below)
      we are submitting this patch-set for upstream kernel inclusion.
      
      What are the benefits of PI?
      
        The short reply:
        ----------------
      
      User-space PI helps achieving/improving determinism for user-space
      applications.  In the best-case, it can help achieve determinism and
      well-bound latencies.  Even in the worst-case, PI will improve the statistical
      distribution of locking related application delays.
      
        The longer reply:
        -----------------
      
      Firstly, sharing locks between multiple tasks is a common programming
      technique that often cannot be replaced with lockless algorithms.  As we can
      see it in the kernel [which is a quite complex program in itself], lockless
      structures are rather the exception than the norm - the current ratio of
      lockless vs.  locky code for shared data structures is somewhere between 1:10
      and 1:100.  Lockless is hard, and the complexity of lockless algorithms often
      endangers to ability to do robust reviews of said code.  I.e.  critical RT
      apps often choose lock structures to protect critical data structures, instead
      of lockless algorithms.  Furthermore, there are cases (like shared hardware,
      or other resource limits) where lockless access is mathematically impossible.
      
      Media players (such as Jack) are an example of reasonable application design
      with multiple tasks (with multiple priority levels) sharing short-held locks:
      for example, a highprio audio playback thread is combined with medium-prio
      construct-audio-data threads and low-prio display-colory-stuff threads.  Add
      video and decoding to the mix and we've got even more priority levels.
      
      So once we accept that synchronization objects (locks) are an unavoidable fact
      of life, and once we accept that multi-task userspace apps have a very fair
      expectation of being able to use locks, we've got to think about how to offer
      the option of a deterministic locking implementation to user-space.
      
      Most of the technical counter-arguments against doing priority inheritance
      only apply to kernel-space locks.  But user-space locks are different, there
      we cannot disable interrupts or make the task non-preemptible in a critical
      section, so the 'use spinlocks' argument does not apply (user-space spinlocks
      have the same priority inversion problems as other user-space locking
      constructs).  Fact is, pretty much the only technique that currently enables
      good determinism for userspace locks (such as futex-based pthread mutexes) is
      priority inheritance:
      
      Currently (without PI), if a high-prio and a low-prio task shares a lock [this
      is a quite common scenario for most non-trivial RT applications], even if all
      critical sections are coded carefully to be deterministic (i.e.  all critical
      sections are short in duration and only execute a limited number of
      instructions), the kernel cannot guarantee any deterministic execution of the
      high-prio task: any medium-priority task could preempt the low-prio task while
      it holds the shared lock and executes the critical section, and could delay it
      indefinitely.
      
        Implementation:
        ---------------
      
      As mentioned before, the userspace fastpath of PI-enabled pthread mutexes
      involves no kernel work at all - they behave quite similarly to normal
      futex-based locks: a 0 value means unlocked, and a value==TID means locked.
      (This is the same method as used by list-based robust futexes.) Userspace uses
      atomic ops to lock/unlock these mutexes without entering the kernel.
      
      To handle the slowpath, we have added two new futex ops:
      
        FUTEX_LOCK_PI
        FUTEX_UNLOCK_PI
      
      If the lock-acquire fastpath fails, [i.e.  an atomic transition from 0 to TID
      fails], then FUTEX_LOCK_PI is called.  The kernel does all the remaining work:
      if there is no futex-queue attached to the futex address yet then the code
      looks up the task that owns the futex [it has put its own TID into the futex
      value], and attaches a 'PI state' structure to the futex-queue.  The pi_state
      includes an rt-mutex, which is a PI-aware, kernel-based synchronization
      object.  The 'other' task is made the owner of the rt-mutex, and the
      FUTEX_WAITERS bit is atomically set in the futex value.  Then this task tries
      to lock the rt-mutex, on which it blocks.  Once it returns, it has the mutex
      acquired, and it sets the futex value to its own TID and returns.  Userspace
      has no other work to perform - it now owns the lock, and futex value contains
      FUTEX_WAITERS|TID.
      
      If the unlock side fastpath succeeds, [i.e.  userspace manages to do a TID ->
      0 atomic transition of the futex value], then no kernel work is triggered.
      
      If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then
      FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of
      userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes
      up any potential waiters.
      
      Note that under this approach, contrary to other PI-futex approaches, there is
      no prior 'registration' of a PI-futex.  [which is not quite possible anyway,
      due to existing ABI properties of pthread mutexes.]
      
      Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties
      of futexes, and all four combinations are possible: futex, robust-futex,
      PI-futex, robust+PI-futex.
      
        glibc support:
        --------------
      
      Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes
      (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX
      mutexes.  (PTHREAD_PRIO_PROTECT support will be added later on too, no
      additional kernel changes are needed for that).  [NOTE: The glibc patch is
      obviously inofficial and unsupported without matching upstream kernel
      functionality.]
      
      the patch-queue and the glibc patch can also be downloaded from:
      
        http://redhat.com/~mingo/PI-futex-patches/
      
      Many thanks go to the people who helped us create this kernel feature: Steven
      Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan
      van de Ven, Oleg Nesterov and others.  Credits for related prior projects goes
      to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.
      
      Clean up the futex code, before adding more features to it:
      
       - use u32 as the futex field type - that's the ABI
       - use __user and pointers to u32 instead of unsigned long
       - code style / comment style cleanups
       - rename hash-bucket name from 'bh' to 'hb'.
      
      I checked the pre and post futex.o object files to make sure this
      patch has no code effects.
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      e2970f2f
    • Steven Rostedt's avatar
      [PATCH] BUG() if setscheduler is called from interrupt context · 66e5393a
      Steven Rostedt authored
      Thomas Gleixner is adding the call to a rtmutex function in setscheduler.
      This call grabs a spin_lock that is not always protected by interrupts
      disabled.  So this means that setscheduler cant be called from interrupt
      context.
      
      To prevent this from happening in the future, this patch adds a
      BUG_ON(in_interrupt()) in that function.  (Thanks to akpm <aka.  Andrew
      Morton> for this suggestion).
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      66e5393a
    • Oleg Nesterov's avatar
      [PATCH] sched: uninline task_rq_lock() · 9fea80e4
      Oleg Nesterov authored
      Saves 543 bytes from sched.o (gcc 3.3.3).
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Con Kolivas <kernel@kolivas.org>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9fea80e4
    • Siddha, Suresh B's avatar
      [PATCH] sched: mc/smt power savings sched policy · 5c45bf27
      Siddha, Suresh B authored
      sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
      /sys/devices/system/cpu/ control the MC/SMT power savings policy for the
      scheduler.
      
      Based on the values (1-enable, 0-disable) for these controls, sched groups
      cpu power will be determined for different domains.  When power savings
      policy is enabled and under light load conditions, scheduler will minimize
      the physical packages/cpu cores carrying the load and thus conserving
      power(with a perf impact based on the workload characteristics...  see OLS
      2005 CMP kernel scheduler paper for more details..)
      Signed-off-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Con Kolivas <kernel@kolivas.org>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      5c45bf27
    • Srivatsa Vaddagiri's avatar
      [PATCH] sched_domai: Allocate sched_group structures dynamically · 36938169
      Srivatsa Vaddagiri authored
      As explained here:
      	http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2
      
      there is a problem with sharing sched_group structures between two
      separate sched_group structures for different sched_domains.
      
      The patch has been tested and found to avoid the kernel lockup problem
      described in above URL.
      Signed-off-by: default avatarSrivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      36938169
    • Srivatsa Vaddagiri's avatar
      [PATCH] sched_domai: Use kmalloc_node · 15f0b676
      Srivatsa Vaddagiri authored
      The sched group structures used to represent various nodes need to be
      allocated from respective nodes (as suggested here also:
      
      	http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)
      Signed-off-by: default avatarSrivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      15f0b676
    • Srivatsa Vaddagiri's avatar
      [PATCH] sched_domai: Don't use GFP_ATOMIC · d3a5aa98
      Srivatsa Vaddagiri authored
      Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
      allocation.
      
      Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d3a5aa98
    • Srivatsa Vaddagiri's avatar
      [PATCH] sched_domain: handle kmalloc failure · 51888ca2
      Srivatsa Vaddagiri authored
      Try to handle mem allocation failures in build_sched_domains by bailing out
      and cleaning up thus-far allocated memory.  The patch has a direct consequence
      that we disable load balancing completely (even at sibling level) upon *any*
      memory allocation failure.
      
      [Lee.Schermerhorn@hp.com: bugfix]
      Signed-off-by: default avatarSrivatsa Vaddagir <vatsa@in.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      51888ca2
    • Peter Williams's avatar
      [PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks() · 615052dc
      Peter Williams authored
      Problem:
      
      To help distribute high priority tasks evenly across the available CPUs
      move_tasks() does not, under some circumstances, skip tasks whose load
      weight is bigger than the designated amount.  Because the highest priority
      task on the busiest queue may be on the expired array it may be moved as a
      result of this mechanism.  Apart from not being the most desirable way to
      redistribute the high priority tasks (we'd rather move the second highest
      priority task), there is a risk that this could set up a loop with this
      task bouncing backwards and forwards between the two queues.  (This latter
      possibility can be demonstrated by running a nice==-20 CPU bound task on an
      otherwise quiet 2 CPU system.)
      
      Solution:
      
      Modify the mechanism so that it does not override skip for the highest
      priority task on the CPU.  Of course, if there are more than one tasks at
      the highest priority then it will allow the override for one of them as
      this is a desirable redistribution of high priority tasks.
      Signed-off-by: default avatarPeter Williams <pwil3058@bigpond.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      615052dc
    • Peter Williams's avatar
      [PATCH] sched: modify move_tasks() to improve load balancing outcomes · 50ddd969
      Peter Williams authored
      Problem:
      
      The move_tasks() function is designed to move UP TO the amount of load it
      is asked to move and in doing this it skips over tasks looking for ones
      whose load weights are less than or equal to the remaining load to be
      moved.  This is (in general) a good thing but it has the unfortunate result
      of breaking one of the original load balancer's good points: namely, that
      (within the limits imposed by the active/expired array model and the fact
      the expired is processed first) it moves high priority tasks before low
      priority ones and this means there's a good chance (see active/expired
      problem for why it's only a chance) that the highest priority task on the
      queue but not actually on the CPU will be moved to the other CPU where (as
      a high priority task) it may preempt the current task.
      
      Solution:
      
      Modify move_tasks() so that high priority tasks are not skipped when moving
      them will make them the highest priority task on their new run queue.
      Signed-off-by: default avatarPeter Williams <pwil3058@bigpond.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      50ddd969
    • Peter Williams's avatar
      [PATCH] sched: implement smpnice · 2dd73a4f
      Peter Williams authored
      Problem:
      
      The introduction of separate run queues per CPU has brought with it "nice"
      enforcement problems that are best described by a simple example.
      
      For the sake of argument suppose that on a single CPU machine with a
      nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
      task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
      suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
      2 nice==0 hard spinners running.  The user of this system would be entitled
      to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
      tasks only get 5% each.  However, whether this expectation is met is pretty
      much down to luck as there are four equally likely distributions of the
      tasks to the CPUs that the load balancing code will consider to be balanced
      with loads of 2.0 for each CPU.  Two of these distributions involve one
      nice==0 and one nice==19 task per CPU and in these circumstances the users
      expectations will be met.  The other two distributions both involve both
      nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
      each task will get 50% of a CPU and the user's expectations will not be
      met.
      
      Solution:
      
      The solution to this problem that is implemented in the attached patch is
      to use weighted loads when determining if the system is balanced and, when
      an imbalance is detected, to move an amount of weighted load between run
      queues (as opposed to a number of tasks) to restore the balance.  Once
      again, the easiest way to explain why both of these measures are necessary
      is to use a simple example.  Suppose that (in a slight variation of the
      above example) that we have a two CPU system with 4 nice==0 and 4 nice=19
      hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
      the 4 nice==19 tasks are on the other CPU.  The weighted loads for the two
      CPUs would be 4.0 and 0.2 respectively and the load balancing code would
      move 2 tasks resulting in one CPU with a load of 2.0 and the other with
      load of 2.2.  If this was considered to be a big enough imbalance to
      justify moving a task and that task was moved using the current
      move_tasks() then it would move the highest priority task that it found and
      this would result in one CPU with a load of 3.0 and the other with a load
      of 1.2 which would result in the movement of a task in the opposite
      direction and so on -- infinite loop.  If, on the other hand, an amount of
      load to be moved is calculated from the imbalance (in this case 0.1) and
      move_tasks() skips tasks until it find ones whose contributions to the
      weighted load are less than this amount it would move two of the nice==19
      tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with
      loads of 2.1 for each CPU.
      
      One of the advantages of this mechanism is that on a system where all tasks
      have nice==0 the load balancing calculations would be mathematically
      identical to the current load balancing code.
      
      Notes:
      
      struct task_struct:
      
      has a new field load_weight which (in a trade off of space for speed)
      stores the contribution that this task makes to a CPU's weighted load when
      it is runnable.
      
      struct runqueue:
      
      has a new field raw_weighted_load which is the sum of the load_weight
      values for the currently runnable tasks on this run queue.  This field
      always needs to be updated when nr_running is updated so two new inline
      functions inc_nr_running() and dec_nr_running() have been created to make
      sure that this happens.  This also offers a convenient way to optimize away
      this part of the smpnice mechanism when CONFIG_SMP is not defined.
      
      int try_to_wake_up():
      
      in this function the value SCHED_LOAD_BALANCE is used to represent the load
      contribution of a single task in various calculations in the code that
      decides which CPU to put the waking task on.  While this would be a valid
      on a system where the nice values for the runnable tasks were distributed
      evenly around zero it will lead to anomalous load balancing if the
      distribution is skewed in either direction.  To overcome this problem
      SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
      or by the average load_weight per task for the queue in question (as
      appropriate).
      
      int move_tasks():
      
      The modifications to this function were complicated by the fact that
      active_load_balance() uses it to move exactly one task without checking
      whether an imbalance actually exists.  This precluded the simple
      overloading of max_nr_move with max_load_move and necessitated the addition
      of the latter as an extra argument to the function.  The internal
      implementation is then modified to move up to max_nr_move tasks and
      max_load_move of weighted load.  This slightly complicates the code where
      move_tasks() is called and if ever active_load_balance() is changed to not
      use move_tasks() the implementation of move_tasks() should be simplified
      accordingly.
      
      struct sched_group *find_busiest_group():
      
      Similar to try_to_wake_up(), there are places in this function where
      SCHED_LOAD_SCALE is used to represent the load contribution of a single
      task and the same issues are created.  A similar solution is adopted except
      that it is now the average per task contribution to a group's load (as
      opposed to a run queue) that is required.  As this value is not directly
      available from the group it is calculated on the fly as the queues in the
      groups are visited when determining the busiest group.
      
      A key change to this function is that it is no longer to scale down
      *imbalance on exit as move_tasks() uses the load in its scaled form.
      
      void set_user_nice():
      
      has been modified to update the task's load_weight field when it's nice
      value and also to ensure that its run queue's raw_weighted_load field is
      updated if it was runnable.
      
      From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      
      With smpnice, sched groups with highest priority tasks can mask the imbalance
      between the other sched groups with in the same domain.  This patch fixes some
      of the listed down scenarios by not considering the sched groups which are
      lightly loaded.
      
      a) on a simple 4-way MP system, if we have one high priority and 4 normal
         priority tasks, with smpnice we would like to see the high priority task
         scheduled on one cpu, two other cpus getting one normal task each and the
         fourth cpu getting the remaining two normal tasks.  but with current
         smpnice extra normal priority task keeps jumping from one cpu to another
         cpu having the normal priority task.  This is because of the
         busiest_has_loaded_cpus, nr_loaded_cpus logic..  We are not including the
         cpu with high priority task in max_load calculations but including that in
         total and avg_load calcuations..  leading to max_load < avg_load and load
         balance between cpus running normal priority tasks(2 Vs 1) will always show
         imbalanace as one normal priority and the extra normal priority task will
         keep moving from one cpu to another cpu having normal priority task..
      
      b) 4-way system with HT (8 logical processors).  Package-P0 T0 has a
         highest priority task, T1 is idle.  Package-P1 Both T0 and T1 have 1 normal
         priority task each..  P2 and P3 are idle.  With this patch, one of the
         normal priority tasks on P1 will be moved to P2 or P3..
      
      c) With the current weighted smp nice calculations, it doesn't always make
         sense to look at the highest weighted runqueue in the busy group..
         Consider a load balance scenario on a DP with HT system, with Package-0
         containing one high priority and one low priority, Package-1 containing one
         low priority(with other thread being idle)..  Package-1 thinks that it need
         to take the low priority thread from Package-0.  And find_busiest_queue()
         returns the cpu thread with highest priority task..  And ultimately(with
         help of active load balance) we move high priority task to Package-1.  And
         same continues with Package-0 now, moving high priority task from package-1
         to package-0..  Even without the presence of active load balance, load
         balance will fail to balance the above scenario..  Fix find_busiest_queue
         to use "imbalance" when it is lightly loaded.
      
      [kernel@kolivas.org: sched: store weighted load on up]
      [kernel@kolivas.org: sched: add discrete weighted cpu load function]
      [suresh.b.siddha@intel.com: sched: remove dead code]
      Signed-off-by: default avatarPeter Williams <pwil3058@bigpond.com.au>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarCon Kolivas <kernel@kolivas.org>
      Cc: John Hawkes <hawkes@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      2dd73a4f
    • Kirill Korotaev's avatar
      [PATCH] sched: CPU hotplug race vs. set_cpus_allowed() · efc30814
      Kirill Korotaev authored
      There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
      __migrate_task() doesn't report any err code, so task can be left on its
      runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a
      possible target.  Also, chaning cpus_allowed mask requires rq->lock being
      held.
      Signed-off-by: default avatarKirill Korotaev <dev@openvz.org>
      Acked-By: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      efc30814
    • Steven Rostedt's avatar
      [PATCH] unnecessary long index i in sched · cc94abfc
      Steven Rostedt authored
      Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
      as a long long here.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      cc94abfc
    • Con Kolivas's avatar
      [PATCH] sched: fix interactive ceiling code · 72d2854d
      Con Kolivas authored
      The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
      and not explicit enough.  The sleep boost is not supposed to be any larger
      than without this code and the comment is not clear enough about what
      exactly it does, just the reason it does it.  Fix it.
      
      There is a ceiling to the priority beyond which tasks that only ever sleep
      for very long periods cannot surpass.  Fix it.
      
      Prevent the on-runqueue bonus logic from defeating the idle sleep logic.
      
      Opportunity to micro-optimise.
      Signed-off-by: default avatarCon Kolivas <kernel@kolivas.org>
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Acked-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      72d2854d
    • Steven Rostedt's avatar
      d444886e
    • Chen, Kenneth W's avatar
      [PATCH] sched: fix smt nice lock contention and optimization · c96d145e
      Chen, Kenneth W authored
      Initial report and lock contention fix from Chris Mason:
      
      Recent benchmarks showed some performance regressions between 2.6.16 and
      2.6.5.  We tracked down one of the regressions to lock contention in
      schedule heavy workloads (~70,000 context switches per second)
      
      kernel/sched.c:dependent_sleeper() was responsible for most of the lock
      contention, hammering on the run queue locks.  The patch below is more of a
      discussion point than a suggested fix (although it does reduce lock
      contention significantly).  The dependent_sleeper code looks very expensive
      to me, especially for using a spinlock to bounce control between two
      different siblings in the same cpu.
      
      It is further optimized:
      
      * perform dependent_sleeper check after next task is determined
      * convert wake_sleeping_dependent to use trylock
      * skip smt runqueue check if trylock fails
      * optimize double_rq_lock now that smt nice is converted to trylock
      * early exit in searching first SD_SHARE_CPUPOWER domain
      * speedup fast path of dependent_sleeper
      
      [akpm@osdl.org: cleanup]
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Acked-by: default avatarIngo Molnar <mingo@elte.hu>
      Acked-by: default avatarCon Kolivas <kernel@kolivas.org>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarChris Mason <mason@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c96d145e
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: add proper Kconfig, Makefile entries · 7a8e2a5e
      Jim Cromie authored
      Replace the temp makefile hacks with proper CONFIG entries, which are also
      added to Kconfig.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      7a8e2a5e
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: display pin values in/out in gpio_dump · 23916a8e
      Jim Cromie authored
      Add current pin settings to gpio_dump() output.  This adds the last 'word' to
      the syslog lines, which displays the input and output values that the pin is
      set to.
      
        pc8736x_gpio.0: io00: 0x0044 TS OD PUE  EDGE LO DEBOUNCE        io:1/1
      
      The 2 values may differ for a number of reasons:
      1- the pin output circuitry is diaabled, (as the above 'TS' indicates)
      2- it needs a pullup resistor to drive the attached circuit,
      3- the external circuit needs a pullup so the open-drain has something
         to pull-down
      4- the pin is wired to Vcc or Ground
      
      It might be appropriate to add a WARN for 2,3,4, since they could
      damage the chip and/or circuit, esp if misconfig goes unnoticed.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      23916a8e
    • Jim Cromie's avatar
      [PATCH] gpio-patchset-fixups: include linux/io.h · ec312310
      Jim Cromie authored
      Hmm.  Im somewhat ambivalent about this patch, since with it, driver wont
      build for vanilla 17 or older.
      
      Its also only 1/2 of your suggestion - when I tried it, I was building against
      vanilla 17, and asm/uaccess.h cause compilation failure.  Looking back, Im
      perplexed as to why linux/io.h didnt cause same failure ?!?
      
      use linux/io.h rather than asm/io.h
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ec312310
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: replace spinlocks w mutexes · 8bcf6135
      Jim Cromie authored
      Replace spinlocks guarding gpio config ops with mutexes.  This is a me-too
      patch, and is justifiable insofar as mutexes have stricter semantics and
      better debugging support, so are preferred where they are applicable.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8bcf6135
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: fix gpio_current, use shadow regs · 6cad56fd
      Jim Cromie authored
      Add a working gpio_current() to pc8736x_gpio.c (the previous implementation
      just threw a dev_warn), and fix gpio_change() to use gpio_current() rather
      than the incorrect (and temporary) gpio_get().  Initialize shadow-regs so this
      all works.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      6cad56fd
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: use dev_dbg in common module · f31000e5
      Jim Cromie authored
      Use of dev_dbg() and friends is considered good practice.  dev_dbg() needs a
      struct device *devp, but nsc_gpio is only a helper module, so it doesnt
      have/need its own.  To provide devp to the user-modules (scx200 & pc8736x
      _gpio), we add it to the vtable, and set it during init.
      
      Also squeeze nsc_gpio_dump()'s format a little.
      
      [  199.259879]  pc8736x_gpio.0: io09: 0x0044 TS OD PUE  EDGE LO DEBOUNCE
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f31000e5
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: add platform_device for use w dev_dbg · 58b087cd
      Jim Cromie authored
      Adds platform-device to (just introduced) driver, and uses it to replace many
      printks with dev_dbg() etc.  This could trivially be merged into previous
      patch, but this way matches better with the corresponding patch that does the
      same change to scx200_gpio.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      58b087cd
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: add new pc8736x_gpio module · 681a3e7d
      Jim Cromie authored
      Add the brand new pc8736x_gpio driver.  This is mostly based upon
      scx200_gpio.c, but the platform_dev is treated separately, since its fairly
      big too.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      681a3e7d
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: migrate gpio_dump to common module · 0e41ef3c
      Jim Cromie authored
      Since the meaning of config-bits is the same for scx200 and pc8736x _gpios, we
      can share a function to deliver this to user.  Since it is called via the
      vtable, its also completely replaceable.  For now, we keep using printk...
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0e41ef3c
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: migrate file-ops to common module · 1a66fdf0
      Jim Cromie authored
      Now that the read(), write() file-ops are dispatching gpio-ops via the vtable,
      they are generic, and can be moved 'verbatim' to the nsc_gpio common-support
      module.  After the move, various symbols are renamed to update 'scx200_' to
      'nsc_', and headers are adjusted accordingly.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      1a66fdf0
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: add empty common-module · 1ca5df0a
      Jim Cromie authored
      Add the nsc_gpio common-support module as an empty shell.  Next patch starts
      the migration of the common gpio support routines.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      1ca5df0a
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: dispatch via vtable · c3dc8071
      Jim Cromie authored
      Now actually call the gpio operations thru the vtable.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c3dc8071
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: add gpio-ops vtable · fe3a168a
      Jim Cromie authored
      Abstract the gpio operations into a new nsc_gpio_ops vtable.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fe3a168a
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: refactor scx200_probe to better... · 9b170b8f
      Jim Cromie authored
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: refactor scx200_probe to better segregate _gpio initialization
      
      Pull shadow-reg initialization into separate function now, rather than doing
      it 2x later (scx200, pc8736x).  When we revisit 2nd drvr below, it will be to
      reimplement an init function, rather than another refactor.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9b170b8f
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: add 'v' command to device-file · 9550a339
      Jim Cromie authored
      Add a new driver command: 'v' which calls gpio_dump() on the pin.  The output
      goes to the log, like all other INFO messages in the original driver.  Giving
      the user control over the feedback they 'need' is construed to be a
      user-friendly feature, and allows us (later) to dial down many INFO messages
      to DEBUG log-level.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9550a339
    • Jim Cromie's avatar
      [PATCH] chardev: GPIO for SCx200 & PC-8736x: put gpio_dump on a diet · d424aa87
      Jim Cromie authored
      Shrink scx200_gpio_dump() to a single printk with ternary ops.  The function
      is still ifdef'd out, this is corrected in next patch, when it is actually
      used.
      
      The patch 'inadvertently' changed loglevel from DEBUG to INFO.  This is Good,
      because in next patch, its wired to a 'command' which the user can invoke when
      they want.  When they do so, its because they want INFO to support their
      developement effort, and we want to give it to them without compiling a DEBUG
      version of the driver.
      Signed-off-by: default avatarJim Cromie <jim.cromie@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d424aa87