1. 02 Mar, 2011 13 commits
    • CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials · c8fd4409
      David Howells authored
      commit de09a977 upstream.
      
      It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
      credentials by incrementing their usage count after their replacement by the
      task being accessed.
      
      What happens is that get_task_cred() can race with commit_creds():
      
      	TASK_1			TASK_2			RCU_CLEANER
      	-->get_task_cred(TASK_2)
      	rcu_read_lock()
      	__cred = __task_cred(TASK_2)
      				-->commit_creds()
      				old_cred = TASK_2->real_cred
      				TASK_2->real_cred = ...
      				put_cred(old_cred)
      				  call_rcu(old_cred)
      		[__cred->usage == 0]
      	get_cred(__cred)
      		[__cred->usage == 1]
      	rcu_read_unlock()
      							-->put_cred_rcu()
      							[__cred->usage == 1]
      							panic()
      
      However, since a task's credentials are generally not changed very often,
      we can reasonably make use of a loop involving reading the creds pointer
      and using atomic_inc_not_zero() to attempt to increment it if it hasn't
      already hit zero.
      
      If successful, we can safely return the credentials in the knowledge that, even
      if the task we're accessing has released them, they haven't gone to the RCU
      cleanup code.
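
      A sketch of the retry loop this describes, following the shape of the
      upstream change (the stable backport may differ in detail):

      const struct cred *get_task_cred(struct task_struct *task)
      {
      	const struct cred *cred;

      	rcu_read_lock();

      	do {
      		cred = __task_cred(task);
      		BUG_ON(!cred);
      		/* A zero usage count means the creds are already on
      		 * their way to the RCU cleanup code; re-read and retry. */
      	} while (!atomic_inc_not_zero(&((struct cred *)cred)->usage));

      	rcu_read_unlock();
      	return cred;
      }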
      
      We then change task_state() in procfs to use get_task_cred() rather than
      calling get_cred() on the result of __task_cred(), as that suffers from the
      same problem.
      
      Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
      tripped when it is noticed that the usage count is not zero as it ought to be,
      for example:
      
      kernel BUG at kernel/cred.c:168!
      invalid opcode: 0000 [#1] SMP
      last sysfs file: /sys/kernel/mm/ksm/run
      CPU 0
      Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
      745
      RIP: 0010:[<ffffffff81069881>]  [<ffffffff81069881>] __put_cred+0xc/0x45
      RSP: 0018:ffff88019e7e9eb8  EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
      RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
      RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
      R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
      FS:  00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
      Stack:
       ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
      <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
      <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
      Call Trace:
       [<ffffffff810698cd>] put_cred+0x13/0x15
       [<ffffffff81069b45>] commit_creds+0x16b/0x175
       [<ffffffff8106aace>] set_current_groups+0x47/0x4e
       [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
       [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
      Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
      48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
      04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
      RIP  [<ffffffff81069881>] __put_cred+0xc/0x45
       RSP <ffff88019e7e9eb8>
      ---[ end trace df391256a100ebdd ]---
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • av7110: check for negative array offset · dd6a19a5
      Dan Carpenter authored
      commit cb26a24e upstream.
      
      info->num comes from the user.  It's of type int.  If the user passes in
      a negative value, that would cause memory corruption.
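
      A minimal sketch of the class of fix (the field comes from the driver;
      the check shown here is only illustrative of the idea):

      	/* info->num is a user-controlled int: reject negative values
      	 * before it is ever used as an array offset. */
      	if (info->num < 0)
      		return -EINVAL;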
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • x86/pvclock: Zero last_value on resume · 595b62a8
      Jeremy Fitzhardinge authored
      commit e7a3481c upstream.
      
      If the guest domain has been suspended/resumed or migrated, then the
      system clock backing the pvclock clocksource may revert to a smaller
      value (ie, it can be non-monotonic across the migration/save-restore).
      
      Make sure we zero last_value in that case so that the domain
      continues to see clock updates.
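
      A sketch of the resume hook, assuming the upstream shape of the change
      (last_value is the atomic that enforces monotonicity):

      void pvclock_resume(void)
      {
      	/* Forget the pre-suspend value so a smaller post-resume
      	 * clock is not mistaken for the clock going backwards. */
      	atomic64_set(&last_value, 0);
      }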
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • OHCI: work around for nVidia shutdown problem · 5f528de0
      Alan Stern authored
      commit 3df7169e upstream.
      
      This patch (as1417) fixes a problem affecting some (or all) nVidia
      chipsets.  When the computer is shut down, the OHCI controllers
      continue to power the USB buses and evidently they drive a Reset
      signal out all their ports.  This prevents attached devices from going
      to low power.  Mouse LEDs stay on, for example, which is disconcerting
      for users and a drain on laptop batteries.
      
      The fix involves leaving each OHCI controller in the OPERATIONAL state
      during system shutdown rather than putting it in the RESET state.
      Although this nominally means the controller is running, in fact it's
      not doing very much since all the schedules are disabled.  However,
      there is ongoing DMA to the Host Controller Communications Area, so
      the patch also disables the bus-master capability of all PCI USB
      controllers after the shutdown routine runs.
      
      The fix is applied only to nVidia-based PCI OHCI controllers, so it
      shouldn't cause problems on systems using other hardware.  As an added
      safety measure, in case the kernel encounters one of these running
      controllers during boot, the patch changes quirk_usb_handoff_ohci()
      (which runs early on during PCI discovery) to reset the controller
      before anything bad can happen.
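
      A sketch of the shutdown-path idea, assuming a per-device quirk flag
      (names follow the mainline patch but are not guaranteed to match the
      backport exactly):

      static void ohci_shutdown(struct usb_hcd *hcd)
      {
      	struct ohci_hcd *ohci = hcd_to_ohci(hcd);

      	ohci_writel(ohci, OHCI_INTR_MIE, &ohci->regs->intrdisable);
      	ohci->hc_control = ohci_readl(ohci, &ohci->regs->control);

      	/* On the quirky controllers, keep the functional state
      	 * (OPERATIONAL) instead of dropping into RESET. */
      	if (ohci->flags & OHCI_QUIRK_SHUTDOWN)
      		ohci->hc_control &= OHCI_CTRL_RWC | OHCI_CTRL_HCFS;
      	else
      		ohci->hc_control &= OHCI_CTRL_RWC;
      	ohci_writel(ohci, ohci->hc_control, &ohci->regs->control);
      }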
      Reported-by: Pali Rohár <pali.rohar@gmail.com>
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      CC: David Brownell <david-b@pacbell.net>
      Tested-by: Pali Rohár <pali.rohar@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • x86, hpet: Disable per-cpu hpet timer if ARAT is supported · bf8c4fb7
      Shaohua Li authored
      commit 39fe05e5 upstream.
      
      If the CPU supports an always-running local APIC timer, the per-cpu
      hpet timer is useless and wasteful and can be disabled.  Let's leave
      those timers to others.

      The effect is that we reserve fewer timers.
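
      The check amounts to something like the following, at the top of the
      per-cpu HPET timer reservation paths:

      	/* An always-running APIC timer (ARAT) survives deep C-states,
      	 * so the per-cpu HPET workaround timers are pointless. */
      	if (boot_cpu_has(X86_FEATURE_ARAT))
      		return;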
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: venkatesh.pallipadi@intel.com
      LKML-Reference: <20090812031612.GA10062@sli10-desk.sh.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Renninger <trenn@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • x25: decrement netdev reference counts on unload · cfa3f57b
      Apollon Oikonomopoulos authored
      commit 171995e5 upstream.
      
      x25 does not decrement the network device reference counts on module unload.
      Thus unregistering any pre-existing interface after unloading the x25 module
      hangs and results in
      
       unregister_netdevice: waiting for tap0 to become free. Usage count = 1
      
      This patch decrements the reference counts of all interfaces in x25_link_free,
      the way it is already done in x25_link_device_down for NETDEV_DOWN events.
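
      A sketch of x25_link_free() with the missing dev_put() added (structure
      per the description above; details may differ in the backport):

      static void __exit x25_link_free(void)
      {
      	struct x25_neigh *nb;
      	struct list_head *entry, *tmp;

      	write_lock_bh(&x25_neigh_list_lock);
      	list_for_each_safe(entry, tmp, &x25_neigh_list) {
      		struct net_device *dev;

      		nb = list_entry(entry, struct x25_neigh, node);
      		dev = nb->dev;
      		__x25_remove_neigh(nb);
      		/* Drop the reference taken when the link came up. */
      		dev_put(dev);
      	}
      	write_unlock_bh(&x25_neigh_list_lock);
      }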
      Signed-off-by: Apollon Oikonomopoulos <apollon@noc.grnet.gr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • filter: make sure filters dont read uninitialized memory · f37c091b
      David S. Miller authored
      commit 57fe93b3 upstream.
      
      There is a possibility that malicious users can get limited information
      about the uninitialized stack mem[] array. Even if the sk_run_filter()
      result is bounded by the packet length (0 .. 65535), we can imagine this
      being used by a hostile user.

      Initializing the mem[] array, as Dan Rosenberg suggested in his patch,
      is expensive since most filters don't even use this array.

      It's hard to do the validation in sk_chk_filter() because of
      the jumps. This might be done later.

      In this patch, I use a bitmap (a single long var) so that only filters
      using mem[] loads/stores pay the price of the added security checks.

      For other filters, the additional cost is a single instruction.
      
      [ Since we access fentry->k a lot now, cache it in a local variable
        and mark filter entry pointer as const. -DaveM ]
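
      A sketch of the bitmap scheme inside sk_run_filter() (an excerpt, with
      unrelated opcodes elided):

      	unsigned long memvalid = 0;

      	BUILD_BUG_ON(BPF_MEMWORDS > BITS_PER_LONG);
      	[...]
      	case BPF_S_LD_MEM:
      		/* Loads from never-written slots read as 0,
      		 * not as uninitialized stack memory. */
      		A = (memvalid & (1UL << fentry->k)) ?
      			mem[fentry->k] : 0;
      		continue;
      	case BPF_S_ST:
      		memvalid |= 1UL << fentry->k;
      		mem[fentry->k] = A;
      		continue;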
      Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      [Backported by dann frazier <dannf@debian.org>]
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Fix pktcdvd ioctl dev_minor range check · 12d83a21
      Dan Rosenberg authored
      commit 252a52aa upstream.
      
      The PKT_CTRL_CMD_STATUS device ioctl retrieves a pointer to a
      pktcdvd_device from the global pkt_devs array.  The index into this
      array is provided directly by the user and is a signed integer, so the
      comparison to ensure that it falls within the bounds of this array will
      fail when provided with a negative index.
      
      This can be used to read arbitrary kernel memory or cause a crash due to
      an invalid pointer dereference.  This can be exploited by users with
      permission to open /dev/pktcdvd/control (on many distributions, this is
      readable by group "cdrom").
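
      A sketch of the fix (matching the note below: make the lookup take an
      unsigned type rather than adding a cast):

      static struct pktcdvd_device *pkt_find_dev_from_minor(unsigned int dev_minor)
      {
      	/* A negative user value wraps to a huge unsigned number,
      	 * so the bounds check below now rejects it. */
      	if (dev_minor >= MAX_WRITERS)
      		return NULL;
      	return pkt_devs[dev_minor];
      }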
      Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
      [ Rather than add a cast, just make the function take the right type -Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • ocfs2_connection_find() returns pointer to bad structure · 965f6e05
      dann frazier authored
      commit 226291aa upstream.
      
      If ocfs2_live_connection_list is empty, ocfs2_connection_find() will return
      a pointer to the LIST_HEAD, cast as an ocfs2_live_connection. This can cause
      an oops when ocfs2_control_send_down() dereferences c->oc_conn:
      
      Call Trace:
        [<ffffffffa00c2a3c>] ocfs2_control_message+0x28c/0x2b0 [ocfs2_stack_user]
        [<ffffffffa00c2a95>] ocfs2_control_write+0x35/0xb0 [ocfs2_stack_user]
        [<ffffffff81143a88>] vfs_write+0xb8/0x1a0
        [<ffffffff8155cc13>] ? do_page_fault+0x153/0x3b0
        [<ffffffff811442f1>] sys_write+0x51/0x80
        [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
      
      Fix by explicitly returning NULL if no match is found.
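
      The fixed lookup, roughly (following the mainline patch):

      static struct ocfs2_live_connection *ocfs2_connection_find(const char *name)
      {
      	size_t len = strlen(name);
      	struct ocfs2_live_connection *c = NULL;

      	BUG_ON(!mutex_is_locked(&ocfs2_control_lock));

      	list_for_each_entry(c, &ocfs2_live_connection_list, oc_list) {
      		if ((c->oc_conn->cc_namelen == len) &&
      		    !strncmp(c->oc_conn->cc_name, name, len))
      			return c;
      	}

      	/* Fell off the end of the list: no match, not a bogus cursor. */
      	return NULL;
      }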
      Signed-off-by: dann frazier <dann.frazier@canonical.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sctp: Fix out-of-bounds reading in sctp_asoc_get_hmac() · 1209e7ab
      Dan Rosenberg authored
      commit 51e97a12 upstream.
      
      The sctp_asoc_get_hmac() function iterates through a peer's hmac_ids
      array and attempts to ensure that only a supported hmac entry is
      returned.  The current code fails to do this properly - if the last id
      in the array is out of range (greater than SCTP_AUTH_HMAC_ID_MAX), the
      id integer remains set after exiting the loop, and the address of an
      out-of-bounds entry will be returned and subsequently used in the parent
      function, causing potentially ugly memory corruption.  This patch resets
      the id integer to 0 on encountering an invalid id so that NULL will be
      returned after finishing the loop if no valid ids are found.
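
      A sketch of the loop with the reset in place (condensed from the
      description above; the real code also rejects unimplemented ids):

      	for (i = 0; i < n_elt; i++) {
      		id = ntohs(hmacs->hmac_ids[i]);

      		/* An out-of-range id must not survive the loop as
      		 * the "found" id. */
      		if (id > SCTP_AUTH_HMAC_ID_MAX) {
      			id = 0;
      			continue;
      		}
      		break;
      	}

      	if (id == 0)
      		return NULL;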
      Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
      Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • mptfusion: Fix Incorrect return value in mptscsih_dev_reset · bcb8164b
      Kashyap, Desai authored
      commit bcfe42e9 upstream.
      
      There's a branch at the end of this function that
      is supposed to normalize the return value with what
      the mid-layer expects. In this one case, we get it wrong.
      
      Also increase the verbosity of the INFO-level printk
      at the end of mptscsih_abort to include the actual return value
      and the scmd->serial_number.  The reason is that success
      or failure is actually determined by the state of
      the internal tag list when a TMF is issued, not by the
      return value of the TMF cmd.  The serial_number is also
      used in this decision, so it's useful to know for debugging
      purposes.
      Reported-by: Peter M. Petrakis <peter.petrakis@canonical.com>
      Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com>
      Signed-off-by: James Bottomley <James.Bottomley@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • mptfusion: mptctl_release is required in mptctl.c · 6140386a
      Kashyap, Desai authored
      commit 84857c8b upstream.
      
      Added the missing release callback for the file_operations mptctl_fops.
      Without a release callback, the entry is never freed: it remains on
      mptctl's event list even after the file is closed and released.

      The relevant RHEL bugzilla is 660871.
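
      A sketch of the added callback (async_queue being the driver's fasync
      list):

      static int mptctl_release(struct inode *inode, struct file *filep)
      {
      	/* Drop this file's entry from the async queue on close. */
      	fasync_helper(-1, filep, 0, &async_queue);
      	return 0;
      }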
      Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com>
      Signed-off-by: James Bottomley <James.Bottomley@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • NFSD: memory corruption due to writing beyond the stat array · e2b71389
      Konstantin Khorenko authored
      commit 3aa6e0aa upstream.
      
      If nfsd fails to find a file exported via NFS in the readahead cache, it
      should increment the corresponding nfsdstats counter (ra_depth[10]), but
      due to a bug it may instead write to ra_depth[11], corrupting the
      following field.
      
      In a kernel with NFSDv4 compiled in the corruption takes the form of an
      increment of a counter of the number of NFSv4 operation 0's received; since
      there is no operation 0, this is harmless.
      
      In a kernel with NFSDv4 disabled it corrupts whatever happens to be in the
      memory beyond nfsdstats.
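
      Illustratively (the index arithmetic is simplified here; the real code
      derives the depth from the readahead cache state), the miss path must
      stay within the 11-slot array:

      	/* ra_depth[] has slots 0..10; a cache miss used to compute
      	 * index 11 and write past the end. Count misses in the
      	 * deepest bucket instead. */
      	if (depth > 10)
      		depth = 10;
      	nfsdstats.ra_depth[depth]++;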
      Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  2. 18 Feb, 2011 1 commit
  3. 17 Feb, 2011 26 commits
    • kernel/user.c: add lock release annotation on free_user() · d02522a9
      Namhyung Kim authored
      commit 571428be upstream.
      
      free_user() releases uidhash_lock but was missing the annotation.  Add it.
      This removes the following sparse warnings:
      
       include/linux/spinlock.h:339:9: warning: context imbalance in 'free_user' - unexpected unlock
       kernel/user.c:120:6: warning: context imbalance in 'free_uid' - wrong count at exit
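
      The annotation in question, roughly:

      static void free_user(struct user_struct *up, unsigned long flags)
      	__releases(&uidhash_lock)
      {
      	[...]
      }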
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dhaval Giani <dhaval.giani@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Remove some dead code · c3d5f1e8
      Dan Carpenter authored
      commit 618765801ebc271fe0ba3eca99fcfd62a1f786e1 upstream.
      
      This was left over from "7c941438 sched: Remove USER_SCHED"
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      LKML-Reference: <20100315082148.GD18181@bicker>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Fix wake_affine() vs RT tasks · 97fc6c0d
      Peter Zijlstra authored
      Commit: e51fd5e2 upstream
      
      Mike reports that since e9e9250b (sched: Scale down cpu_power due to RT
      tasks), wake_affine() goes funny on RT tasks due to them still having a
      !0 weight and wake_affine() still subtracts that from the rq weight.
      
      Since nobody should be using se->weight for RT tasks, set the value to
      zero. Also, since we now use ->cpu_power to normalize rq weights to
      account for RT cpu usage, add that factor into the imbalance computation.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1275316109.27810.22969.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Fix idle balancing · 354f5613
      Nikhil Rao authored
      Commit: d5ad140b upstream
      
      An earlier commit reverts idle balancing throttling reset to fix a 30%
      regression in volanomark throughput. We still need to reset idle_stamp
      when we pull a task in newidle balance.
      Reported-by: Alex Shi <alex.shi@intel.com>
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1290022924-3548-1-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Fix volanomark performance regression · 82dd2a0c
      Alex Shi authored
      Commit: b5482cfa upstream
      
      Commit fab47622 triggers excessive idle balancing, causing a ~30% loss in
      volanomark throughput. Remove idle balancing throttle reset.
      Originally-by: Alex Shi <alex.shi@intel.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1289928732.5169.211.camel@maggy.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Fix cross-sched-class wakeup preemption · 134f7fee
      Peter Zijlstra authored
      Commit: 1e5a7405 upstream
      
      Instead of dealing with sched classes inside each check_preempt_curr()
      implementation, pull out this logic into the generic wakeup preemption
      path.
      
      This fixes a hang in KVM (and others) where we are waiting for the
      stop machine thread to run ...
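
      A sketch of the generic path (close to the upstream shape): reschedule
      when the waking task's class outranks the running one's, and only call
      the class hook when the classes match:

      void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
      {
      	const struct sched_class *class;

      	if (p->sched_class == rq->curr->sched_class) {
      		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
      	} else {
      		/* Iteration order is highest class first. */
      		for_each_class(class) {
      			if (class == rq->curr->sched_class)
      				break;
      			if (class == p->sched_class) {
      				resched_task(rq->curr);
      				break;
      			}
      		}
      	}
      }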
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1288891946.2039.31.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Use group weight, idle cpu metrics to fix imbalances during idle · aa68c032
      Suresh Siddha authored
      Commit: aae6d3dd upstream
      
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imbalance_pct. As the number of cores and threads
      increases, the current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
      socket. Leading to an idle HT cpu.
      
      b) On a hypothetical 2-socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on another socket, leaving one core in a socket idle
      whereas in the other socket a core has both its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing. During idle load balancing, check whether there
      is an imbalance in the number of idle cpus between the busiest and this
      sched_group, or whether the busiest group has more tasks than its weight
      that the idle cpu in this_group can pull.
      Reported-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched, cgroup: Fixup broken cgroup movement · 2bdf3dc4
      Peter Zijlstra authored
      Commit: b2b5ce02 upstream
      
      Dima noticed that we fail to correct the ->vruntime of sleeping tasks
      when we move them between cgroups.
      Reported-by: Dima Zavin <dima@android.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <1287150604.29097.1513.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Export account_system_vtime() · ea63ff2b
      Ingo Molnar authored
      Commit: b7dadc38 upstream
      
      KVM uses it for example:
      
       ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
      
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Call tick_check_idle before __irq_enter · 19d3e3cb
      Venkatesh Pallipadi authored
      Commit: d267f87f upstream
      
      When the CPU is idle, irq_enter on the first interrupt calls
      tick_check_idle() to signal the exit from idle. But there is a problem
      if this call is made after __irq_enter, as all routines in __irq_enter
      may then see stale time because tick_check_idle has not yet run.

      Specifically, this affects trace calls in __irq_enter that use the global
      clock, and also the account_system_vtime change in this patch, which wants
      to use sched_clock_cpu() to do proper irq timing.
      
      But, tick_check_idle was moved after __irq_enter intentionally to
      prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a9:
      
          irq: call __irq_enter() before calling the tick_idle_check
          Impact: avoid spurious ksoftirqd wakeups
      
      Moving tick_check_idle() before __irq_enter and wrapping it with
      local_bh_disable/enable solves both problems.
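
      A sketch of the reordered irq_enter() (per the upstream patch; the
      bh disable/enable pair is what avoids the spurious ksoftirqd wakeups):

      void irq_enter(void)
      {
      	int cpu = smp_processor_id();

      	rcu_irq_enter();
      	if (idle_cpu(cpu) && !in_interrupt()) {
      		/* Prevent raise_softirq from needlessly waking up
      		 * ksoftirqd: softirqs will be serviced on return
      		 * from this interrupt anyway. */
      		local_bh_disable();
      		tick_check_idle(cpu);
      		_local_bh_enable();
      	}

      	__irq_enter();
      }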
      Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Remove irq time from available CPU power · c8c88559
      Venkatesh Pallipadi authored
      Commit: aa483808 upstream
      
      The idea was suggested by Peter Zijlstra here:
      
        http://marc.info/?l=linux-kernel&m=127476934517534&w=2
      
      irq time is technically not available to the tasks running on the CPU.
      This patch removes irq time from CPU power piggybacking on
      sched_rt_avg_update().
      
      Tested this by keeping CPU X busy with a network-intensive task that has
      irq processing (hard+soft) occupying 75% of a single CPU, on a 4-way
      system, and starting seven cycle soakers on the system. Without this
      change, there will be two tasks on each CPU. With this change, there is
      a single task on the irq-busy CPU X and the remaining 7 tasks are spread
      around among the other 3 CPUs.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Do not account irq time to current task · 3a69989d
      Venkatesh Pallipadi authored
      Commit: 305e6835 upstream
      
      The scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means that if the interrupt processing was
      for some other task in the system, the current task ends up being
      penalized, as it gets a shorter runtime than otherwise.

      Change sched task accounting to account only actual task time to the
      currently running task. update_curr() now computes delta_exec based
      on rq->clock_task.
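
      Conceptually, as a fragment of update_curr() (not the complete
      function):

      	/* rq->clock_task advances like rq->clock minus the time spent
      	 * in hard/soft irq processing, so irq time is never billed to
      	 * whichever task happened to be interrupted. */
      	u64 now = rq->clock_task;
      	unsigned long delta_exec = (unsigned long)(now - curr->exec_start);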
      
      Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We
      can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but
      that's for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
      spending 75%+ of its time in irq processing. CPU 3 spending around 35%
      time running nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • x86: Add IRQ_TIME_ACCOUNTING · 3b7d4d54
      Venkatesh Pallipadi authored
      Commit: e82b8e4e upstream
      
      This patch adds the IRQ_TIME_ACCOUNTING option on x86 and runtime enables
      it when TSC is enabled.

      This change just enables fine grained irq time accounting and isn't used
      yet. Following patches use it for different purposes.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · 5e7ce6ec
      Venkatesh Pallipadi authored
      Commit: b52bfee4 upstream
      
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging however, given the
      state of TSC reliability on various platforms and also the overhead it
      would add to syscall entry/exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
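
      A sketch of the core accounting hook (per-cpu variable names follow
      mainline; the backport may differ):

      void account_system_vtime(struct task_struct *curr)
      {
      	unsigned long flags;
      	int cpu;
      	u64 now, delta;

      	if (!sched_clock_irqtime)
      		return;

      	local_irq_save(flags);

      	cpu = smp_processor_id();
      	now = sched_clock_cpu(cpu);
      	delta = now - per_cpu(irq_start_time, cpu);
      	per_cpu(irq_start_time, cpu) = now;

      	/* Charge the delta to hardirq or softirq time depending on
      	 * the context being left; softirq time in ksoftirqd stays
      	 * billed to the ksoftirqd thread itself. */
      	if (hardirq_count())
      		per_cpu(cpu_hardirq_time, cpu) += delta;
      	else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
      		per_cpu(cpu_softirq_time, cpu) += delta;

      	local_irq_restore(flags);
      }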
      
      This patch just adds the core logic and does not enable this logic yet.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Add a PF flag for ksoftirqd identification · 9b511401
      Venkatesh Pallipadi authored
      Commit: 6cdd5199 upstream
      
      To account softirq time cleanly in the scheduler, we need to identify
      whether a softirq is invoked in ksoftirqd context or at the tail of a
      hardirq. Add PF_KSOFTIRQD for that purpose.
      
      As all PF flag bits are currently taken, create space by moving one of the
      infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
      with some other state fields.
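
      The flag is set by the per-cpu softirq thread itself, roughly:

      static int run_ksoftirqd(void *__bind_cpu)
      {
      	/* Lets irq time accounting tell "softirq serviced by
      	 * ksoftirqd" apart from "softirq on a hardirq tail". */
      	current->flags |= PF_KSOFTIRQD;
      	[...]
      }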
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Remove unused PF_ALIGNWARN flag · 82f7e90e
      Dave Young authored
      Commit: 637bbdc5 upstream
      
      PF_ALIGNWARN is not implemented; as its comment notes, it was meant
      for the 486.

      It is not likely that someone will implement this flag feature.
      So remove this flag here and leave the valuable 0x00000001 free for
      future use.
      Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20100913121903.GB22238@darkstar>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Consolidate account_system_vtime extern declaration · 95824433
      Venkatesh Pallipadi authored
      Commit: e1e10a26 upstream
      
      Just a minor cleanup patch that makes things easier for the following
      patches. There is no functionality change in this patch.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Fix softirq time accounting · 49c6f4a2
      Venkatesh Pallipadi authored
      Commit: 75e1056f upstream
      
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is that softirq processing uses local_bh_disable internally.
      There is no way, later in the flow, to differentiate between softirq
      actually being processed and bh merely having been disabled. So a hardirq
      arriving while bh is disabled results in time being wrongly accounted as
      softirq.

      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well: account_system_time() in normal tick-based accounting also uses
      softirq_count, which will be set even when we are not in softirq but bh
      is disabled.
      
      Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq
      count for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while softirq
      processing. The patch below does that and adds the API in_serving_softirq(),
      which returns whether we are currently processing a softirq.
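
      The resulting encoding, roughly as upstream:

      #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)

      /* Are we currently servicing a softirq, as opposed to merely
       * having bottom halves disabled? */
      #define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)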
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Drop group_capacity to 1 only if local group has extra capacity · 1d3d2371
      Nikhil Rao authored
      Commit: 75dd321d upstream
      
      When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
      only if the local group has extra capacity. The extra check prevents the case
      where you always pull from the heaviest group when it is already under-utilized
      (possible when a large-weight task outweighs the tasks on the system).
      
      For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
      scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
      and each task is running on one core. In this case, we observe the following
      events when balancing at the NUMA domain:
      
      - find_busiest_group() will always pick the sched group containing the niced
        task to be the busiest group.
      - find_busiest_queue() will then always pick one of the cpus running the
        nice0 task (never picks the cpu with the nice -15 task since
        weighted_cpuload > imbalance).
      - The load balancer fails to migrate the task since it is the running task
        and increments sd->nr_balance_failed.
      - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
        at which point it kicks off the active load balancer, wakes up the migration
        thread and kicks the nice 0 task off the cpu.
      
      The load balancer doesn't stop until we kick out all nice 0 tasks from
      the sched group, leaving you with 3 idle cpus and one cpu running the
      nice -15 task.
      
      When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
      domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
      not relevant because the niced task has a very large weight.
      
      In this patch, we add an extra condition to the "if(prefer_sibling)" check in
      update_sd_lb_stats(). We drop the capacity of a group only if the local group
      has extra capacity, ie. nr_running < group_capacity. This patch preserves the
      original intent of the prefer_siblings check (to spread tasks across the system
      in low utilization scenarios) and fixes the case above.
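
      The amended check, roughly:

      	/* Only cap the group's capacity if this (local) group could
      	 * actually absorb the task. */
      	if (prefer_sibling && !local_group && sds->this_has_capacity)
      		sgs.group_capacity = min(sgs.group_capacity, 1UL);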
      
      It helps in the following ways:
      - In low utilization cases (where nr_tasks << nr_cpus), we still drop
        group_capacity down to 1 if we prefer siblings.
      - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
        likely be > sgs.group_capacity.
      - When balancing large weight tasks, if the local group does not have extra
        capacity, we do not pick the group with the niced task as the busiest group.
        This prevents failed balances, active migration and the under-utilization
        described above.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Force balancing on newidle balance if local group has capacity · 703482e7
      Nikhil Rao authored
      Commit: fab47622 upstream
      
      This patch forces a load balance on a newly idle cpu when the local group has
      extra capacity and the busiest group does not have any. It improves system
      utilization when balancing tasks with a large weight differential.
      
      Under certain situations, such as a niced down task (i.e. nice = -15) in the
      presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
      kicks away other tasks because of its large weight. This leads to sub-optimal
      utilization of the machine. Even though the sched group has capacity, it does
      not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
      
      With this patch, if the local group has extra capacity, we shortcut the checks
      in f_b_g() and try to pull a task over. A sched group has extra capacity if the
      group capacity is greater than the number of running tasks in that group.
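
      The shortcut in find_busiest_group(), roughly:

      	/* Newly idle, we have room, and the busiest group is over
      	 * capacity: balance regardless of the usual load checks. */
      	if (idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
      	    !sds.busiest_has_capacity)
      		goto force_balance;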
      
      Thanks to Mike Galbraith for discussions leading to this patch and for the
      insight to reuse SD_NEWIDLE_BALANCE.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Set group_imb only if a task can be pulled from the busiest cpu · 6e1d0fe9
      Nikhil Rao authored
      Commit: 2582f0eb upstream
      
      When cycling through sched groups to determine the busiest group, set
      group_imb only if the busiest cpu has more than 1 runnable task. This patch
      fixes the case where two cpus in a group have one runnable task each, but there
      is a large weight differential between these two tasks. The load balancer was
      unable to migrate any task from this group, and hence did not consider this
      group to be imbalanced.
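
      A sketch of the change in update_sg_lb_stats() (an excerpt):

      	if (load > max_cpu_load) {
      		max_cpu_load = load;
      		max_nr_running = rq->nr_running;
      	}
      	[...]
      	/* Only call the group imbalanced if the most loaded cpu
      	 * actually has a task that could be pulled. */
      	if ((max_cpu_load - min_cpu_load) > 2 * avg_load_per_task &&
      	    max_nr_running > 1)
      		sgs->group_imb = 1;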
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
      [ small code readability edits ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Do not consider SCHED_IDLE tasks to be cache hot · 215856a4
      Nikhil Rao authored
      Commit: ef8002f6 upstream
      
      This patch adds a check in task_hot to return if the task has SCHED_IDLE
      policy. SCHED_IDLE tasks have very low weight, and when run with regular
      workloads, are typically scheduled many milliseconds apart. There is no
      need to consider these tasks hot for load balancing.
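
      The check added to task_hot(), roughly:

      	/* SCHED_IDLE tasks run far apart in time;
      	 * never treat them as cache hot. */
      	if (unlikely(p->policy == SCHED_IDLE))
      		return 0;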
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: fix RCU lockdep splat from task_group() · acb2c6dc
      Peter Zijlstra authored
      Commit: 6506cf6c upstream
      
      This addresses the following RCU lockdep splat:
      
      [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
      [0.052999] lockdep: fixing up alternatives.
      [0.054105]
      [0.054106] ===================================================
      [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
      [0.054999] ---------------------------------------------------
      [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
      [0.054999]
      [0.054999] other info that might help us debug this:
      [0.054999]
      [0.054999]
      [0.054999] rcu_scheduler_active = 1, debug_locks = 1
      [0.054999] 3 locks held by swapper/1:
      [0.054999]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
      [0.054999]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
      [0.054999]  #2:  (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
      [0.054999]
      [0.054999] stack backtrace:
      [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
      [0.054999] Call Trace:
      [0.054999]  [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
      [0.054999]  [<ffffffff810325c3>] task_group+0x7b/0x8a
      [0.054999]  [<ffffffff810325e5>] set_task_rq+0x13/0x40
      [0.054999]  [<ffffffff814be39a>] init_idle+0xd2/0x113
      [0.054999]  [<ffffffff814be78a>] fork_idle+0xb8/0xc7
      [0.054999]  [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
      [0.054999]  [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
      [0.054999]  [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
      [0.054999]  [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
      [0.054999]  [<ffffffff814be876>] _cpu_up+0xac/0x127
      [0.054999]  [<ffffffff814be946>] cpu_up+0x55/0x6a
      [0.054999]  [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
      [0.054999]  [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
      [0.054999]  [<ffffffff814c353c>] ? restore_args+0x0/0x30
      [0.054999]  [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
      [0.054999]  [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
      [0.056074] Booting Node   0, Processors  #1lockdep: fixing up alternatives.
      [0.130045]  #2lockdep: fixing up alternatives.
      [0.203089]  #3 Ok.
      [0.275286] Brought up 4 CPUs
      [0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
      
      The cgroup_subsys_state structures referenced by idle tasks are never
      freed, because the idle tasks should be part of the root cgroup,
      which is not removable.
      
      The problem is that while we do in-fact hold rq->lock, the newly spawned
      idle thread's cpu is not yet set to the correct cpu so the lockdep check
      in task_group():
      
        lockdep_is_held(&task_rq(p)->lock)
      
      will fail.
      
      But this is a chicken and egg problem.  Setting the CPU's runqueue requires
      that the CPU's runqueue already be set.  ;-)
      
      So insert an RCU read-side critical section to avoid the complaint.
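
      The suppression amounts to, in init_idle():

      	/* Silence the rcu_dereference_check() in task_group(): the
      	 * idle task cannot go away, but lockdep cannot yet see that
      	 * the right rq->lock is held. */
      	rcu_read_lock();
      	__set_task_cpu(idle, cpu);
      	rcu_read_unlock();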
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: suppress RCU lockdep splat in task_fork_fair · f4de371f
      Paul E. McKenney authored
      Commit: b0a0f667 upstream
      
      > ===================================================
      > [ INFO: suspicious rcu_dereference_check() usage. ]
      > ---------------------------------------------------
      > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
      >
      > other info that might help us debug this:
      >
      > rcu_scheduler_active = 1, debug_locks = 1
      > 1 lock held by ifup/23517:
      >   #0:  (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108
      >
      > stack backtrace:
      > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
      > Call Trace:
      >   [<c075e219>] ? printk+0xf/0x16
      >   [<c0455842>] lockdep_rcu_dereference+0x74/0x7d
      >   [<c0426854>] task_group+0x6d/0x79
      >   [<c042686e>] set_task_rq+0xe/0x57
      >   [<c042f79e>] task_fork_fair+0x57/0x108
      >   [<c042e965>] sched_fork+0x82/0xf9
      >   [<c04334b3>] copy_process+0x569/0xe8e
      >   [<c0433ef0>] do_fork+0x118/0x262
      >   [<c076302f>] ? do_page_fault+0x16a/0x2cf
      >   [<c044b80c>] ? up_read+0x16/0x2a
      >   [<c04085ae>] sys_clone+0x1b/0x20
      >   [<c04030a5>] ptregs_clone+0x15/0x30
      >   [<c0402f1c>] ? sysenter_do_call+0x12/0x38
      
      Here a newly created task is having its runqueue assigned.  The new task
      is not yet on the tasklist, so it cannot go away.  This is therefore a
      false positive; suppress it with an RCU read-side critical section.
      
      Reported-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Give CPU bound RT tasks preference · f1d70344
      stable-bot for Steven Rostedt authored
      From: Steven Rostedt <srostedt@redhat.com>
      
      Commit: b3bc211c upstream
      
      If a high priority task is waking up on a CPU that is running a
      lower priority task that is bound to a CPU, see if we can move the
      high prio RT task to another CPU first. Note that if all other CPUs
      are running higher priority tasks than the CPU-bound current task,
      then it will be preempted regardless.
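
      A sketch of the wakeup-placement condition in select_task_rq_rt()
      (condition per the mainline patch; surrounding code elided):

      	/* curr is an RT task that is effectively pinned (or outranks
      	 * p), while p itself can move: look for another cpu. */
      	if (unlikely(rt_task(rq->curr)) &&
      	    (rq->curr->rt.nr_cpus_allowed < 2 ||
      	     rq->curr->prio < p->prio) &&
      	    (p->rt.nr_cpus_allowed > 1)) {
      		int cpu = find_lowest_rq(p);

      		return (cpu == -1) ? task_cpu(p) : cpu;
      	}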
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.888922071@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: Try not to migrate higher priority RT tasks · f266611e
      Steven Rostedt authored
      Commit: 43fa5460 upstream
      
      When first working on the RT scheduler design, we concentrated on
      keeping all CPUs running RT tasks instead of having multiple RT
      tasks on a single CPU waiting for the migration thread to move
      them. Instead we take a more proactive stance and push or pull RT
      tasks from one CPU to another on wakeup or scheduling.
      
      When an RT task wakes up on a CPU that is running another RT task,
      instead of preempting it and killing the cache of the running RT
      task, we look to see if we can migrate the RT task that is waking
      up, even if the RT task waking up is of higher priority.
      
      This may sound a bit odd, but RT tasks should be limited in
      migration by the user anyway. But in practice, people do not do
      this, which causes high prio RT tasks to bounce around the CPUs.
      This becomes even worse when we have priority inheritance, because
      a high prio task can block on a lower prio task and boost its
      priority. When the lower prio task wakes up the high prio task, if
      it happens to be on the same CPU it will migrate off of it.
      
      But in reality the above does not happen much either, because when the
      lower prio task (which has already been boosted) wakes up on the same
      CPU as the higher prio task, it would then migrate off of it. But
      anyway, we do not want to migrate them either.
      
      To examine the scheduling, I created a test program and examined it
      under kernelshark. The test program created CPU * 2 threads, where
      each thread had a different priority. The program takes different
      options. The options used in this change log was to have priority
      inheritance mutexes or not.
      
      All threads did the following loop:
      
      static void grab_lock(long id, int iter, int l)
      {
      	ftrace_write("thread %ld iter %d, taking lock %d\n",
      		     id, iter, l);
      	pthread_mutex_lock(&locks[l]);
      	ftrace_write("thread %ld iter %d, took lock %d\n",
      		     id, iter, l);
      	busy_loop(nr_tasks - id);
      	ftrace_write("thread %ld iter %d, unlock lock %d\n",
      		     id, iter, l);
      	pthread_mutex_unlock(&locks[l]);
      }
      
      void *start_task(void *id)
      {
      	[...]
      	while (!done) {
      		for (l = 0; l < nr_locks; l++) {
      			grab_lock(id, i, l);
      			ftrace_write("thread %ld iter %d sleeping\n",
      				     id, i);
      			ms_sleep(id);
      		}
      		i++;
      	}
      	[...]
      }
      
      The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
      ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
      to the ftrace buffer to help analyze via ftrace.
      
      The higher the id, the higher the prio, the shorter the busy loop it
      does, and the longer it sleeps. This is usually the case with
      RT tasks: the lower priority tasks usually run longer than higher
      priority tasks.
      
      At the end of the test, it records the number of loops each thread
      took, as well as the number of voluntary preemptions, non-voluntary
      preemptions, and number of migrations each thread took, taking the
      information from /proc/$$/sched and /proc/$$/status.
      
      Running this on a 4 CPU processor, the results without changes to
      the kernel looked like this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         53      3220       1470             98
        1:        562       773        724             98
        2:        752       933       1375             98
        3:        749        39        697             98
        4:        758         5        515             98
        5:        764         2        679             99
        6:        761         2        535             99
        7:        757         3        346             99
      
      total:     5156       4977      6341            787
      
      Each thread, regardless of priority, migrated a few hundred times.
      The higher priority tasks were a little better off but still took
      quite an impact.
      
      By letting higher priority tasks bump the lower prio task from the
      CPU, things changed a bit:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         37      2835       1937             98
        1:        666      1821       1865             98
        2:        654      1003       1385             98
        3:        664       635        973             99
        4:        698       197        352             99
        5:        703       101        159             99
        6:        708         1         75             99
        7:        713         1          2             99
      
      total:     4843       6594      6748            789
      
      The total # of migrations did not change (several runs showed the
      difference all within the noise). But we now see a dramatic
      improvement to the higher priority tasks. (kernelshark showed that
      the watchdog timer bumped the highest priority task to give it the
      2 count. This was actually consistent with every run).
      
      Notice that the # of iterations did not change either.
      
      The above was with priority inheritance mutexes. That is, when the
      higher priority task blocked on a lower priority task, the lower
      priority task would inherit the priority of the higher priority task
      (which shows why task 6 was bumped so many times). When not using
      priority inheritance mutexes, the current kernel shows this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         56      3101       1892             95
        1:        594       713        937             95
        2:        625       188        618             95
        3:        628         4        491             96
        4:        640         7        468             96
        5:        631         2        501             96
        6:        641         1        466             96
        7:        643         2        497             96
      
      total:     4458       4018      5870            765
      
      Not much changed with or without priority inheritance mutexes. But
      if we let the high priority task bump lower priority tasks on
      wakeup we see:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:        115      3439       2782             98
        1:        633      1354       1583             99
        2:        652       919       1218             99
        3:        645       713        934             99
        4:        690         3          3             99
        5:        694         1          4             99
        6:        720         3          4             99
        7:        747         0          1            100
      
      This shows an even bigger change. The big difference between task 3
      and task 4 is because we have only 4 CPUs on the machine, causing
      the 4 highest prio tasks to always have preference.
      
      Although I did not measure cache misses, and I'm sure there would
      be little to measure since the test was not data intensive, I could
      imagine large improvements for higher priority tasks when dealing
      with lower priority tasks. Thus, I'm satisfied with making the
      change and agreeing with what Gregory Haskins argued a few years
      ago when we first had this discussion.
      
      One final note. All tasks in the above tests were RT tasks. Any RT
      task will always preempt a non RT task that is running on the CPU
      the RT task wants to run on.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.605460343@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>