1. 18 Jan, 2011 2 commits
  2. 07 Jan, 2011 3 commits
  3. 05 Jan, 2011 1 commit
    • sched: Change wait_for_completion_*_timeout() to return a signed long · 6bf41237
      NeilBrown authored
      
      wait_for_completion_*_timeout() can return:
      
         0: if the wait timed out
       -ve: if the wait was interrupted
       +ve: if the completion was completed.
      
      As they currently return an 'unsigned long', the last two cases
      are not easily distinguished, which can result in buggy code,
      as is the case for the recently added
      wait_for_completion_interruptible_timeout() call in
      net/sunrpc/cache.c.
      
      So change them both to return 'long'.  As MAX_SCHEDULE_TIMEOUT
      is LONG_MAX, a large +ve return value should never overflow.
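
      For illustration, a caller can now distinguish the three cases with
      a simple signed check (a minimal sketch, assuming a 'struct
      completion done' initialized elsewhere; not taken from the sunrpc
      code):

        long ret = wait_for_completion_interruptible_timeout(&done, HZ);

        if (ret == 0)
                pr_debug("timed out\n");
        else if (ret < 0)
                pr_debug("interrupted: %ld\n", ret);   /* e.g. -ERESTARTSYS */
        else
                pr_debug("completed, %ld jiffies left\n", ret);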
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110105125016.64ccab0e@notabene.brown>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6bf41237
  4. 04 Jan, 2011 1 commit
  5. 19 Dec, 2010 1 commit
  6. 16 Dec, 2010 3 commits
  7. 08 Dec, 2010 3 commits
  8. 30 Nov, 2010 3 commits
    • sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
      Mike Galbraith authored
      
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer, the default for all tasks pointing to the
      init_task_group.  When a task calls setsid(), a new task group
      is created, the process is moved into the new task group, and a
      reference to the previous task group is dropped.  Child
      processes inherit this task group thereafter, and increase its
      refcount.  When the last thread of a process exits, the
      process's reference is dropped, such that when the last process
      referencing an autogroup exits, the autogroup is destroyed.
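
      A rough sketch of the data structure being described (field names
      are approximate, not necessarily the exact kernel ones):

        struct autogroup {
                struct kref             kref;   /* last put destroys the group */
                struct task_group       *tg;    /* the per-session task group  */
        };

        /* signal_struct carries an inherited, refcounted pointer to this;
         * setsid() creates a fresh autogroup, moves the process into its
         * task group and drops the reference to the previous one. */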
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
      Autogroup bandwidth is controllable by setting its nice level
      through the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
      Sets the task group's shares to the weight of a nice <level> task.
      Setting the nice level is rate limited for !admin users due to the
      abuse risk of task group locking.
      
      The feature is enabled from boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:
      
        echo [01] > /proc/sys/kernel/sched_autogroup_enabled
      
      ... which will automatically move tasks to/from the root task group.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5091faa4
    • sched: Fix unregister_fair_sched_group() · 822bc180
      Paul Turner authored
      
      In the flipping and flopping between calling
      unregister_fair_sched_group() on a per-cpu versus per-group basis
      we ended up in a bad state.
      
      Remove from the list for the passed cpu as opposed to some
      arbitrary index.
      
      ( This fixes explosions w/ autogroup as well as a group
        creation/destruction stress test. )
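
      In essence, the fix removes the cfs_rq belonging to the cpu that was
      passed in; a simplified sketch (not the literal diff):

        static void unregister_fair_sched_group(struct task_group *tg, int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long flags;

                raw_spin_lock_irqsave(&rq->lock, flags);
                /* remove this cpu's cfs_rq, not one at an arbitrary index */
                list_del_rcu(&tg->cfs_rq[cpu]->leaf_cfs_rq_list);
                raw_spin_unlock_irqrestore(&rq->lock, flags);
        }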
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20101130005740.080828123@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      822bc180
    • rcu,cleanup: move synchronize_sched_expedited() out of sched.c · 7b27d547
      Lai Jiangshan authored
      
      The first version of synchronize_sched_expedited() used the migration
      code in the scheduler, and was therefore implemented in kernel/sched.c.
      However, the more recent version of this code no longer uses the
      migration code, so this commit moves it to the main RCU source files.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      7b27d547
  9. 26 Nov, 2010 2 commits
  10. 23 Nov, 2010 1 commit
  11. 18 Nov, 2010 7 commits
  12. 11 Nov, 2010 1 commit
  13. 10 Nov, 2010 1 commit
    • sched: Use group weight, idle cpu metrics to fix imbalances during idle · aae6d3dd
      Suresh Siddha authored
      
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imbalance_pct. As the number of cores and threads
      increases, current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on the other
      socket, leading to an idle HT cpu.

      b) On a hypothetical 2-socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on the other, leaving one core idle in one socket while a core
      in the other socket has both of its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing. During idle load balancing, also check whether there
      is an imbalance in the number of idle cpus between the busiest and this
      sched_group, or whether the busiest group has more tasks than its weight,
      which the idle cpu in this group can pull.
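
      A minimal sketch of that decision, with hypothetical names
      (this_idle_cpus, busiest_idle_cpus, busiest_nr_running,
      busiest_group_weight) standing in for whatever the balancer tracks:

        /* During idle balancing, force a pull when the busiest group has
         * clearly fewer idle cpus than this one, or runs more tasks than
         * it has cpus, even if the imbalance_pct test says "balanced". */
        if (idle == CPU_IDLE &&
            (this_idle_cpus > busiest_idle_cpus + 1 ||
             busiest_nr_running > busiest_group_weight))
                goto force_balance;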
      Reported-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      aae6d3dd
  14. 23 Oct, 2010 1 commit
  15. 22 Oct, 2010 1 commit
  16. 20 Oct, 2010 1 commit
  17. 18 Oct, 2010 8 commits
    • sched: Export account_system_vtime() · b7dadc38
      Ingo Molnar authored
      
      KVM uses it for example:
      
       ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
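
      The change itself is essentially just the export (illustrative):

        EXPORT_SYMBOL_GPL(account_system_vtime);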
      
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b7dadc38
    • sched: Call tick_check_idle before __irq_enter · d267f87f
      Venkatesh Pallipadi authored
      When the CPU is idle and takes its first interrupt, irq_enter calls
      tick_check_idle() to signal the exit from idle. But there is a problem
      if this call is done after __irq_enter, as all routines in __irq_enter
      may see stale time because tick_check_idle has not yet run.

      Specifically, this affects trace calls in __irq_enter that use the global
      clock, and also the account_system_vtime change in this patch, which wants
      to use sched_clock_cpu() to do proper irq timing.
      
      But, tick_check_idle was moved after __irq_enter intentionally to
      prevent the problem of unneeded ksoftirqd wakeups, by commit ee5f80a9:

          irq: call __irq_enter() before calling the tick_idle_check
          Impact: avoid spurious ksoftirqd wakeups
      
      Moving tick_check_idle() before __irq_enter and wrapping it with
      local_bh_enable/disable solves both problems.
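
      A simplified sketch of the resulting ordering (details may differ
      from the actual patch):

        void irq_enter(void)
        {
                int cpu = smp_processor_id();

                rcu_irq_enter();
                if (idle_cpu(cpu) && !in_interrupt()) {
                        /* keep raise_softirq from waking ksoftirqd here;
                           the softirq runs on irq exit anyway */
                        local_bh_disable();
                        tick_check_idle(cpu);
                        _local_bh_enable();
                }

                __irq_enter();
        }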
      Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d267f87f
    • sched: Remove irq time from available CPU power · aa483808
      Venkatesh Pallipadi authored
      The idea was suggested by Peter Zijlstra here:
      
        http://marc.info/?l=linux-kernel&m=127476934517534&w=2

      irq time is technically not available to the tasks running on the CPU.
      This patch removes irq time from CPU power piggybacking on
      sched_rt_avg_update().
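
      Conceptually (a sketch, with irq_delta standing for the irq time
      measured since the last update):

        /* fold measured irq time into the same average that already
         * discounts rt time from the cpu's available power */
        sched_rt_avg_update(rq, irq_delta);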
      
      Tested this by keeping CPU X busy with a network intensive task doing 75%
      of a single CPU's irq processing (hard+soft) on a 4-way system, and starting
      seven cycle soakers on the system. Without this change, there are two tasks
      on each CPU. With this change, there is a single task on the irq-busy CPU X
      and the remaining 7 tasks are spread among the other 3 CPUs.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      aa483808
    • sched: Do not account irq time to current task · 305e6835
      Venkatesh Pallipadi authored
      
      The scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means that if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized, as it gets shorter runtime than otherwise.

      Change sched task accounting to account only actual task time to the
      currently running task. Now update_curr() computes delta_exec based
      on rq->clock_task.
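
      Sketched, the accounting in update_curr() now looks roughly like:

        u64 now = rq_of(cfs_rq)->clock_task;  /* excludes hard/soft irq time */
        unsigned long delta_exec = (unsigned long)(now - curr->exec_start);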
      
      Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
      extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But that's
      for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network-heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2 spends
      75%+ of its time in irq processing and CPU 3 spends around 35% of its
      time running the nc task.

      Now, if I run another CPU-intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task, as the
      remaining time is actually spent on irq processing.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      305e6835
    • sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · b52bfee4
      Venkatesh Pallipadi authored
      
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
      fine-granularity accounting of user, system, hardirq, and softirq times.
      Adding that option on archs like x86 will be challenging, however, given the
      state of TSC reliability on various platforms and also the overhead it would
      add on syscall entry/exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
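
      A simplified sketch of the bookkeeping this adds (variable names
      approximate, not the exact kernel ones):

        static DEFINE_PER_CPU(u64, cpu_hardirq_time);
        static DEFINE_PER_CPU(u64, cpu_softirq_time);
        static DEFINE_PER_CPU(u64, irq_start_time);

        void account_system_vtime(struct task_struct *curr)
        {
                unsigned long flags;
                u64 now, delta;

                local_irq_save(flags);
                now = sched_clock_cpu(smp_processor_id());
                delta = now - __this_cpu_read(irq_start_time);
                __this_cpu_write(irq_start_time, now);

                if (hardirq_count())
                        __this_cpu_add(cpu_hardirq_time, delta);
                else if (in_serving_softirq())
                        __this_cpu_add(cpu_softirq_time, delta);
                local_irq_restore(flags);
        }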
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b52bfee4
    • sched: Fix softirq time accounting · 75e1056f
      Venkatesh Pallipadi authored
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

      The problem is, softirq processing uses local_bh_disable internally. There
      is no way, later in the flow, to differentiate between whether a softirq is
      being processed or bh has merely been disabled. So a hardirq arriving while
      bh is disabled results in time being wrongly accounted as softirq.

      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well, since account_system_time() in normal tick-based accounting also uses
      softirq_count, which will be set even when not in softirq but with bh disabled.
      
      Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count
      for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while a softirq is
      being processed. The patch below does that and adds an API in_serving_softirq()
      which returns whether we are currently processing a softirq or not.
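
      Sketched in preempt-count terms (simplified, following the offsets
      described above):

        /* local_bh_disable()/enable() move the count by 2*SOFTIRQ_OFFSET,
         * while actually running a softirq handler adds SOFTIRQ_OFFSET,
         * so the low softirq bit means "currently serving a softirq": */
        #define in_softirq()            (softirq_count())
        #define in_serving_softirq()    (softirq_count() & SOFTIRQ_OFFSET)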
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case-by-case basis.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      75e1056f
    • sched: Do not consider SCHED_IDLE tasks to be cache hot · ef8002f6
      Nikhil Rao authored
      
      This patch adds a check in task_hot to return if the task has SCHED_IDLE
      policy. SCHED_IDLE tasks have very low weight, and when run with regular
      workloads, are typically scheduled many milliseconds apart. There is no
      need to consider these tasks hot for load balancing.
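
      The check itself is tiny; roughly (a sketch of the added test):

        /* in task_hot(): SCHED_IDLE tasks are never considered cache hot */
        if (unlikely(p->policy == SCHED_IDLE))
                return 0;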
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ef8002f6
    • sched: Drop all load weight manipulation for RT tasks · 17bdcf94
      Linus Walleij authored
      
      Load weights are for CFS; they do not belong to RT tasks. This makes all
      RT scheduling classes leave the CFS weights alone.

      This fixes a real bug as well: I noticed the following phenomenon: a process
      elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
      indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
      inserted by set_load_weight() remains at 0, giving the task insignificant
      priority.
      
      With this fix, the weight is reset to what the task had before being elevated
      to SCHED_RR/SCHED_FIFO.
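
      A simplified sketch of the intended behaviour (not the literal diff):
      set_load_weight() no longer special-cases RT policies, so the CFS
      weight always follows static_prio:

        static void set_load_weight(struct task_struct *p)
        {
                int prio = p->static_prio - MAX_RT_PRIO;

                /* no RT special case that zeroes the weight; keep a sane
                 * CFS weight so the task behaves normally once it drops
                 * back to SCHED_OTHER */
                p->se.load.weight = prio_to_weight[prio];
                p->se.load.inv_weight = prio_to_wmult[prio];
        }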
      
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: stable@kernel.org
      Signed-off-by: Linus Walleij <linus.walleij@stericsson.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      17bdcf94