1. 10 May, 2004 13 commits
    • [PATCH] sched: add local load metrics · 1ec43096
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This patch removes the per runqueue array of NR_CPU arrays.  Each time we
      want to check a remote CPU's load we check nr_running as well anyway, so
      introduce a cpu_load which is the load of the local runqueue and is kept
      updated in the timer tick.  Put them in the same cacheline.
      
      This has additional benefits of having the cpu_load consistent across all
      CPUs and more up to date.  It is sampled better too, being updated once per
      timer tick.
      
      This shouldn't make much difference in scheduling behaviour, but all
      benchmarks are either as good or better on the 16-way NUMAQ: hackbench,
      reaim, volanomark are about the same, tbench and dbench are maybe a bit
      better.  kernbench is about one percent better.
      
      John reckons it isn't a big deal, but it does save 4K per CPU or 2MB total
      on his big systems, so I figure it must be a bit kinder on the caches.  I
      think it is just nicer in general anyway.
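
      A minimal sketch of the idea, not the patch itself: each runqueue keeps
      its own cpu_load next to nr_running and refreshes it once per timer tick.
      The struct layout, the function name update_cpu_load() and the scale of
      128 are assumptions made for illustration.

```c
#define SCHED_LOAD_SCALE 128UL	/* assumed fixed-point scale */

/* Hypothetical, trimmed-down runqueue: only the two fields discussed above. */
struct runqueue {
	unsigned long nr_running;	/* tasks currently runnable on this CPU */
	unsigned long cpu_load;		/* smoothed load, refreshed once per tick */
};

/* Called from the local timer tick: blend the current load into cpu_load. */
static void update_cpu_load(struct runqueue *rq)
{
	unsigned long now = rq->nr_running * SCHED_LOAD_SCALE;
	unsigned long old = rq->cpu_load;

	/* Round up only while rising, so the average can reach `now` exactly. */
	if (now > old)
		old++;
	rq->cpu_load = (old + now) / 2;
}
```

      A remote CPU can then read rq->cpu_load and rq->nr_running together
      instead of indexing a per-runqueue array of NR_CPUS entries.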
    • [PATCH] sched: SMT niceness handling · 47ad0fce
      Andrew Morton authored
      From: Con Kolivas <kernel@kolivas.org>
      
      This patch provides full per-package priority support for SMT processors
      (aka pentium4 hyperthreading) when combined with CONFIG_SCHED_SMT.
      
      It maintains cpu percentage distribution within each physical cpu package
      by limiting the time a lower priority task can run on a sibling cpu
      concurrently with a higher priority task.
      
      It introduces a new flag into the scheduler domain:
      unsigned int per_cpu_gain;	/* CPU % gained by adding domain cpus */
      
      This is empirically set to 15% for pentium4 at the moment and can be
      modified to support different values dynamically as newer processors come
      out with improved SMT performance.  It should not matter how many siblings
      there are.
      
      It works by comparing tasks running on sibling cpus: when a lower static
      priority task is running, it is delayed until:
      high_priority_timeslice * (100 - per_cpu_gain) / 100 <= low_prio_timeslice
      
      e.g. a nice 19 task timeslice is 10ms and a nice 0 timeslice is 102ms.  On
      vanilla the nice 0 task runs on one logical cpu while the nice 19 task runs
      unabated on the other logical cpu.  With smtnice the nice 0 runs on one
      logical cpu for 102ms and the nice 19 sleeps till the nice 0 task has 12ms
      remaining and then will schedule.
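
      The inequality above can be checked with a small sketch.  It assumes the
      left-hand side is the time remaining in the higher priority task's
      timeslice, that times are in milliseconds, and that the division is
      integer division.  The function name low_prio_may_run() is made up for
      this example.

```c
/*
 * Returns 1 when the lower priority task may run on the sibling cpu,
 * 0 when it should keep waiting.  All times are in milliseconds.
 */
static int low_prio_may_run(unsigned int high_remaining_ms,
			    unsigned int low_timeslice_ms,
			    unsigned int per_cpu_gain)
{
	/* Integer arithmetic: 12 * 85 / 100 evaluates to 10. */
	return high_remaining_ms * (100 - per_cpu_gain) / 100
		<= low_timeslice_ms;
}
```

      With per_cpu_gain at 15, low_prio_may_run(13, 10, 15) is 0 and
      low_prio_may_run(12, 10, 15) is 1, which agrees with the 12ms figure in
      the example.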
      
      Real time tasks and kernel threads are not altered by this code, and kernel
      threads do not delay lower priority user tasks.
      
      With lots of thanks to Zwane Mwaikambo and Nick Piggin for help with the
      coding of this version.
      
      If this is merged, it is probably best to delay pushing this upstream in
      mainline till sched_domains gets tested for at least one major release.
    • [PATCH] sched_domains: use cpu_possible_map · a5f39fd8
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This changes sched domains to contain all possible CPUs, and check for
      online as needed.  It's in order to play nicely with CPU hotplug.
    • [PATCH] sched-group-power · 482b9933
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      The following patch implements a cpu_power member to struct sched_group.
      
      This allows special casing to be removed for SMT groups in the balancing
      code.  It does not take CPU hotplug into account yet, but that shouldn't be
      too hard.
      
      I have tested it on the NUMAQ by pretending it has SMT.  Works as expected.
      Active balances across nodes.
    • [PATCH] sched_balance_exec(): don't fiddle with the cpus_allowed mask · 3de8a6b4
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>,
            Nick Piggin <piggin@cyberone.com.au>
      
      The current sched_balance_exec() sets the task's cpus_allowed mask
      temporarily to move it to a different CPU.  This has several issues,
      including the fact that a task will see its affinity at a bogus value.
      
      So we change the migration_req_t to explicitly specify a destination CPU,
      rather than the migration thread deriving it from cpus_allowed.  If the
      requested CPU is no longer valid (racing with another set_cpus_allowed,
      say), it can be ignored: if the task is not allowed on this CPU, there will
      be another migration request pending.
      
      This change allows sched_balance_exec() to tell the migration thread what
      to do without changing the cpus_allowed mask.
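
      A rough userspace sketch of a request that names its destination CPU.
      The types below are stand-ins, not the kernel's migration_req_t or
      task_struct, and the 64-bit affinity mask and handle_migration_req()
      are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical task: an affinity bitmask and the CPU it currently runs on. */
struct task {
	uint64_t cpus_allowed;	/* bit n set => task may run on CPU n */
	int cpu;
};

/* Hypothetical request: the requester names the destination explicitly. */
struct migration_req {
	struct task *task;
	int dest_cpu;
};

/* Migration thread side: move only if the destination is still permitted. */
static int handle_migration_req(const struct migration_req *req)
{
	struct task *p = req->task;

	if (req->dest_cpu < 0 || req->dest_cpu >= 64)
		return 0;
	if (!(p->cpus_allowed & (UINT64_C(1) << req->dest_cpu)))
		return 0;	/* stale request: ignore it, affinity untouched */
	p->cpu = req->dest_cpu;
	return 1;
}
```

      The point of the sketch is that cpus_allowed is only read, never
      rewritten, so the task never observes a bogus affinity.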
      
      So we rename __set_cpus_allowed() to move_task(), as the cpus_allowed mask
      is now set by the caller.  And move_task_away(), which the migration thread
      uses to actually perform the move, is renamed __move_task().
      
      I also ignore offline CPUs in sched_best_cpu(), so sched_migrate_task()
      doesn't need to check for offline CPUs.
      
      Ulterior motive: this approach also plays well with CPU Hotplug.
      Previously that patch might have seen a task with cpus_allowed only
      containing the dying CPU (temporarily due to sched_balance_exec) and
      forcibly reset it to all cpus, which might be wrong.  The other approach is
      to hold the cpucontrol sem around sched_balance_exec(), which is too much
      of a bottleneck.
    • [PATCH] sched: handle inter-CPU jiffies skew · db05a192
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      John Hawkes described this problem to me:
      
      There *is* a small problem in this area, though, that SuSE avoids.
      "jiffies" gets updated by cpu0.  The other CPUs may, over time, get out of
      sync (and they're initialized on ia64 to start out being out of sync), so
      it's no guarantee that every CPU will wake up from its timer interrupt and
      see a "jiffies" value that is guaranteed to be last_jiffies+1.  Sometimes
      the jiffies value may be unchanged since the last wakeup.  Sometimes the
      jiffies value may have incremented by 2 (or more, especially if cpu0's
      interrupts are disabled for long stretches of time).  So an algorithm that
      says, "I'll call load_balance() only when jiffies is *exactly* N" is going
      to fail on occasion, either by calling load_balance() too often or not
      often enough.  ***
      
      I fixed this by adding a last_balance field to struct sched_domain, and
      working off that.
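
      A sketch of the difference, under assumptions: the struct below keeps
      only the two fields needed here, and due_exact() and due_since_last()
      are names invented for the example.

```c
/* Hypothetical, trimmed-down domain: only the balancing timestamps. */
struct sched_domain {
	unsigned long last_balance;	/* jiffies value at the last balance */
	unsigned long balance_interval;	/* minimum jiffies between balances */
};

/* Fragile: misses its slot if this CPU never sees that exact value. */
static int due_exact(unsigned long jiffies, unsigned long interval)
{
	return jiffies % interval == 0;
}

/* Robust: fires once at least one interval has elapsed. */
static int due_since_last(struct sched_domain *sd, unsigned long jiffies)
{
	/* Unsigned subtraction stays correct across counter wraparound. */
	if (jiffies - sd->last_balance >= sd->balance_interval) {
		sd->last_balance += sd->balance_interval;
		return 1;
	}
	return 0;
}
```

      If a CPU sees jiffies go 7, 9 with an interval of 8, due_exact() never
      fires, while due_since_last() fires at 9 and then stays quiet until
      another interval has passed.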
    • [PATCH] sched: implement domains for i386 HT · e18e19ad
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      The following patch builds a scheduling description for the i386
      architecture using cpu_sibling_map to set up SMT if CONFIG_SCHED_SMT is
      set.
      
      It could be made more fancy and collapse degenerate domains at runtime (ie.
      1 sibling per CPU, or 1 NUMA node in the computer).
      
      
      From: Zwane Mwaikambo <zwane@arm.linux.org.uk>
      
         This fixes an oops due to cpu_sibling_map being uninitialised when a
         system with no MP table (most UP boxen) boots a CONFIG_SMT kernel.  What
         also happens is that the cpu_group lists end up not being terminated
         properly, but this oops kills it first.  Patch tested on UP w/o MP table,
         2x P2 and UP Xeon w/ no siblings.
      
      From: "Martin J. Bligh" <mbligh@aracnet.com>,
            Nick Piggin <piggin@cyberone.com.au>
      
         Change arch_init_sched_domains to use cpu_online_map
      
      From: Anton Blanchard <anton@samba.org>
      
         Fix build with NR_CPUS > BITS_PER_LONG
    • [PATCH] sched: cpu_sibling_map to cpu_mask · 7a1dc0ea
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This is a (somewhat) trivial patch which converts cpu_sibling_map from an
      array of CPUs to an array of cpumasks.  Needed for >2 siblings per package,
      but it actually can simplify code as it allows the cpu_sibling_map to be
      set up even when there is 1 sibling per package.  Intel want this, I use it
      in the next patch to build scheduling domains for the P4 HT.
      
      From: Thomas Schlichter <thomas.schlichter@web.de>
      
         Build fix
      
      From: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
      
         Fix to handle more than 2 siblings per package.
    • [PATCH] scheduler domain balancing improvements · 3dfa303d
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This patch gets the sched_domain scheduler working better WRT balancing.
      It's been tested on the NUMAQ.  Among other things it changes the way SMT
      load calculation works so as not to active load balance when it shouldn't.
      
      It still has a problem with SMT and NUMA: it will put a task on each
      sibling in a node before moving tasks to another node.  It should probably
      start moving tasks after each *physical* CPU is filled.
      
      To fix, you need "how much CPU power in this domain?" At the moment we
      approximate # runqueues == CPU power, and hack around it at the CPU
      physical domain by counting all sibling runqueues as 1.
      
      It isn't hard to correctly work the CPU power out, but once CPU hotplug is
      in the equation it becomes much more hotplug events.  If anyone is actually
      interested in getting this fixed, that is.
    • [PATCH] sched_domain debugging · b45bb339
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      Anton was attempting to make a sched domain topology for his POWER5 and was
      having some trouble.
      
      This patch only includes code which is ifdefed out, but hopefully it will
      be of some use to implementors.
    • [PATCH] sched: scheduler domain support · 8c136f71
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      This is the core sched domains patch.  It can handle any number of levels
      in a scheduling hierarchy, and allows architectures to easily customize how
      the scheduler behaves.  It also provides progressive balancing backoff
      needed by SGI on their large systems (although they have not yet tested
      it).
      
      It is built on top of (well, uses ideas from) my previous SMP/NUMA work, and
      gets results very similar to them when using the default scheduling
      description.
      
      Benchmarks
      ==========
      
      Martin was seeing I think 10-20% better system times in kernbench on the 32
      way.  I was seeing improvements in dbench, tbench, kernbench, reaim,
      hackbench on a 16-way NUMAQ.  Hackbench in fact had a non linear element
      which is all but eliminated.  Large improvements in volanomark.
      
      Cross node task migration was decreased in all above benchmarks, sometimes by
      a factor of 100!!  Cross CPU migration was also generally decreased.  See
      this post:
      http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2
      
      Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues
      patch (which is a big improvement).
      
      Some examples on the 16-way NUMAQ (this is slightly older sched domain code):
      
       http://www.kerneltrap.org/~npiggin/w26/hbench.png
       http://www.kerneltrap.org/~npiggin/w26/vmark.html
      
      From: Jes Sorensen <jes@wildopensource.com>
      
         Tiny patch to make -mm3 compile on a NUMA box with NR_CPUS >
         BITS_PER_LONG.
      
      From: "Martin J. Bligh" <mbligh@aracnet.com>
      
         Fix a minor nit with the find_busiest_group code.  No functional change,
         but makes the code simpler and clearer.  This patch does two things ... 
         adds some more expansive comments, and removes this if clause:
      
            if (*imbalance < SCHED_LOAD_SCALE
                            && max_load - this_load > SCHED_LOAD_SCALE)
                    *imbalance = SCHED_LOAD_SCALE;
      
         If we remove the scaling factor, we're basically conditionally doing:
      
      	if (*imbalance < 1)
      		*imbalance = 1;
      
         Which is pointless, as the very next thing we do is to remove the
         scaling factor, rounding up to the nearest integer as we do:
      
      	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
      
         Thus the if statement is redundant, and only makes the code harder to
         read ;-)
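
         The argument can be checked with a sketch.  The shift value of 7 is
         an assumption, and the two helper names are invented for the example.

```c
#define SCHED_LOAD_SHIFT 7	/* assumed value */
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/* Remove the scaling factor, rounding up to the nearest integer. */
static unsigned long unscale_round_up(unsigned long imbalance)
{
	return (imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
}

/* The removed if clause, followed by the same rounding step. */
static unsigned long unscale_with_clamp(unsigned long imbalance,
					unsigned long max_load,
					unsigned long this_load)
{
	if (imbalance < SCHED_LOAD_SCALE
			&& max_load - this_load > SCHED_LOAD_SCALE)
		imbalance = SCHED_LOAD_SCALE;
	return unscale_round_up(imbalance);
}
```

         For every nonzero imbalance below SCHED_LOAD_SCALE both functions
         return 1.  An imbalance of exactly 0 is the one input where the two
         can differ, a case this sketch does not try to settle.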
      
      From: Rick Lindsley <ricklind@us.ibm.com>
      
         In find_busiest_group(), after we exit the do/while, we select our
         imbalance.  But max_load, avg_load, and this_load are all unsigned, so
         min(x,y) will make a bad choice if max_load < avg_load < this_load (that
         is, a choice between two negative [very large] numbers).
      
         Unfortunately, there is a bug when max_load never gets changed from zero
         (look in the loop and think what happens if the only load on the machine is
         being created by cpu groups of which we are a member).  And you have a
         recipe for some really bogus values for imbalance.
      
         Even if you fix the max_load == 0 bug, there will still be times when
         avg_load - this_load will be negative (thus very large) and you'll make the
         decision to move stuff when you shouldn't have.
      
         This patch allows for this_load to set max_load, which if I understand
         the logic properly is correct.  With this patch applied, the algorithm is
         *much* more conservative ...  maybe *too* conservative but that's for
         another round of testing ...
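
         A sketch of the unsigned problem described above.  The guarded
         variant is only one possible way to avoid the wraparound and is not
         what this patch does (the patch lets this_load set max_load); the
         function names are invented for the example.

```c
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Naive: subtract unsigned loads directly.  A "negative" result wraps. */
static unsigned long imbalance_naive(unsigned long max_load,
				     unsigned long avg_load,
				     unsigned long this_load)
{
	return min_ul(max_load - avg_load, avg_load - this_load);
}

/* Guarded: treat a difference that would be negative as no imbalance. */
static unsigned long imbalance_guarded(unsigned long max_load,
				       unsigned long avg_load,
				       unsigned long this_load)
{
	if (max_load <= avg_load || avg_load <= this_load)
		return 0;
	return min_ul(max_load - avg_load, avg_load - this_load);
}
```

         With max_load = 0, avg_load = 128 and this_load = 256, the naive
         version returns a value close to ULONG_MAX, while the guarded one
         returns 0.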
      
      From: Ingo Molnar <mingo@elte.hu>
      
         sched-find-busiest-fix
    • [PATCH] sched: improved resolution in find_busiest_node · 067e0480
      Andrew Morton authored
      From: Nick Piggin <piggin@cyberone.com.au>
      
      From: Frank Cornelis <frank.cornelis@elis.ugent.be>
      
      In order to get the best possible resolution we need to use NR_CPUS instead
      of the constant value 10.  load is an int, so no need to worry about
      overflows...
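
      A sketch of why the multiplier matters.  It assumes the node load is a
      task count scaled by a constant and then divided by the number of CPUs
      in the node using integer division.  node_load() and the figures in the
      note are hypothetical.

```c
/* Per-CPU load of a node, scaled by `scale` before the integer division. */
static int node_load(int nr_running, int cpus_in_node, int scale)
{
	return scale * nr_running / cpus_in_node;
}
```

      On a 16-CPU node, a scale of 10 maps both 2 and 3 running tasks to a
      load of 1, so the two nodes look equally busy.  With a scale of 32
      (standing in for NR_CPUS) they come out as 4 and 6.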
    • [PATCH] small scheduler cleanup · 4f20771c
      Andrew Morton authored
      From: Ingo Molnar <mingo@elte.hu>
      
      From: Nick Piggin <piggin@cyberone.com.au> wrote:
      
      It removes the last place where we mess with run_list open coded.
  2. 09 May, 2004 13 commits
    • Linux 2.6.6 · 3dc567d8
      Linus Torvalds authored
    • Mark the ACPI CPU throttle and timer IO regions busy. · 7e941d4d
      Linus Torvalds authored
      This should help some laptops where the generic PCI
      code might otherwise believe that this range is unused.
      The ACPI IO range is usually not visible as a standard
      BAR.
    • [PATCH] Fix x86-64 compilation without iommu for 2.6.6rc3 · c99ae253
      Andi Kleen authored
      Various people hit this in earlier kernels. The x86-64 kernel did not compile 
      without CONFIG_IOMMU_GART in various configurations. Just add the missing symbol 
      and export it. Also export iommu_merge while I am at it.
    • [PATCH] Fix machine check handler on x86-64 · 57bfa2c5
      Andi Kleen authored
      This fixes a bug in the new machine check handler on x86-64.
      
      One nasty part was that when you got an MCE during boot up
      then it would not always print it on the screen, but still
      panic because it attempted to kill the idle task.
      
      This patch does:
       - Always use KERN_EMERG when printing MCEs
       - Always panic and print on screen before killing idle loop
         or init.
    • Merge bk://bk.arm.linux.org.uk/linux-2.6-rmk · b81346bc
      Linus Torvalds authored
      into ppc970.osdl.org:/home/torvalds/v2.6/linux
    • Merge flint.arm.linux.org.uk:/usr/src/bk/linux-2.6-sharp · 1d89057b
      Russell King authored
      into flint.arm.linux.org.uk:/usr/src/bk/linux-2.6-rmk
    • [ARM PATCH] 1818/1: lh7a40x #2 (3/7) doc · f42083cc
      Marc Singer authored
      Patch from Marc Singer
      
      Documentation for the Sharp-LH machines.
    • [ARM PATCH] 1817/1: lh7a40x #2 (2/7) core-include · a7c57d4a
      Marc Singer authored
      Patch from Marc Singer
      
      Include files for this updated lh7a40x patch set.  The changes in this
      set from the previous are mostly cosmetic.  The memory macros were
      reworked in order to be more similar to the other ARM versions.  The
      previous versions produced the same results, but the forms are
      slightly different.
    • [ARM PATCH] 1816/1: lh7a40x #2 (1/7) core · 1c0c2783
      Marc Singer authored
      Patch from Marc Singer
      
      Updated change set for the 2.6.5 kernel *and* for the April 8th arm
      patch.  Also included are changes suggested by Russell that merge
      several of the files in the mach- directory.  I have also endeavored
      to remove all unnecessary whitespace additions.
      
      Note that since I've found the cause of an annoying user-space crash,
      I believe that this patch is OK.  The crash appears to have nothing to
      do with the system setup.
    • [ARM PATCH] 1847/1: OMAP update 2/2: include files · a263e250
      Tony Lindgren authored
      Patch from Tony Lindgren
      
      This patch syncs the mainline kernel with the linux-omap tree. The
      patch contains following updates:
      - Move virtual IO area to 0xfefb0000 from 0xfffb0000 to fix parts of
        IO area overlapping with ARM Linux reserved memory area
      - Add support for OMAP-730, OMAP-5912, and OMAP-1710 processors
      - Reorganize board support
      - Add OMAP core detection
      This patch requires ARM Linux patch 1844/1 be applied to compile
      OMAP-730 and OMAP-5912
    • [ARM PATCH] 1846/1: OMAP update 1/2: arch files · 62b2119f
      Tony Lindgren authored
      Patch from Tony Lindgren
      
      This patch syncs the mainline kernel with the linux-omap tree. The
      patch contains following updates:
      - Move virtual IO area to 0xfefb0000 from 0xfffb0000 to fix parts of
        IO area overlapping with ARM Linux reserved memory area
      - Add support for OMAP-730, OMAP-5912, and OMAP-1710 processors
      - Reorganize board support
      - Add OMAP core detection
      This patch requires ARM Linux patch 1844/1 be applied to compile
      OMAP-730 and OMAP-5912
    • [ARM PATCH] 1844/1: Allow OMAP-730 and OMAP-5910 to use ARM926 in mm/Kconfig · 78c4d584
      Tony Lindgren authored
      Patch from Tony Lindgren
      
      Adds OMAP-730 and OMAP-5910 support
    • [PATCH] ISDN Eicon driver: fix idi cleanup deadlock · 8f555e6d
      Armin Schindler authored
         On IDI module cleanup, the freed card must be removed from list.  
         Use list_empty() instead of list_for_each() loop. Thanks Linus.
  3. 08 May, 2004 14 commits
    • Waste less memory in dentries. · 293889f5
      Linus Torvalds authored
      We don't bother aligning them on a cacheline boundary, since
      that is totally excessive in some configurations (especially
      P4's with 128-byte cachelines).
      
      Instead, we make the minimum inline string size a bit longer,
      and re-order a few fields that allow for better packing on
      64-bit architectures, for better memory utilization.
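
      A sketch of the packing effect only.  The two structs are hypothetical
      and bear no relation to the real struct dentry layout.

```c
#include <stdint.h>

/* Each 32-bit field is followed by a pointer: padding on 64-bit targets. */
struct loose {
	uint32_t count;
	void *parent;
	uint32_t flags;
	void *inode;
};

/* Same fields, with the two 32-bit members placed next to each other. */
struct tight {
	uint32_t count;
	uint32_t flags;
	void *parent;
	void *inode;
};
```

      On a typical LP64 target, sizeof(struct loose) is 32 and
      sizeof(struct tight) is 24.  On 32-bit targets the two are normally
      the same size.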
    • Merge bk://kernel.bkbits.net/davem/sparc-2.6 · 82f1671a
      Linus Torvalds authored
      into ppc970.osdl.org:/home/torvalds/v2.6/linux
    • Merge bk://kernel.bkbits.net/davem/net-2.6 · 0ef8ced2
      Linus Torvalds authored
      into ppc970.osdl.org:/home/torvalds/v2.6/linux
    • [PATCH] run populate_rootfs() before initcalls · 25714ddf
      Andrew Morton authored
      I moved this a little too late - we need to run populate_rootfs() before
      running initcalls because some driver initcalls need to open files for
      firmware.
      
      The populate_rootfs() call is still coming after init_idle(), so it won't
      knock the scheduler over.
    • David S. Miller's avatar
      2b308273
    • David S. Miller's avatar
    • [TCP]: BIC TCP for Linux 2.6.6 · 54d05783
      Stephen Hemminger authored
      This is a version of Binary Increase Control (BIC) TCP
      developed by NCSU.   It is yet another TCP congestion control
      algorithm for handling big fat pipes. For normal size congestion
      windows it behaves the same as existing TCP Reno, but when window
      is large it uses additive increase to ensure fairness and when
      window is small it uses binary search increase.
      
      For more details see the BIC TCP web page
       http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/
      
      The original code was for web100 (2.4); this version is pretty
      much the same but targeted for 2.6 with fewer sysctl parameters
      and more constants.
      
      I don't have a real high speed long haul network to test, but
      when running over 1G links with delays, the performance is more stable
      (ie tests are repeatable) and as fast as existing Reno.
    • [SCTP]: Fix multihomed connection failures on 64-bit systems. · fbb3aa0d
      Sridhar Samudrala authored
      Avoid the use of sizeof() and pointer arithmetic to get to the end
      of sctp_cookie structure. Instead use the last element peer_init which
      is a zero-sized array as the offset.
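
      A sketch of the difference between sizeof() and the offset of a
      trailing array.  The struct is hypothetical, not the real sctp_cookie,
      and it uses a C99 flexible array member where the kernel code uses a
      zero-sized array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cookie: fixed fields followed by variable-length data. */
struct cookie {
	uint64_t stamp;
	uint32_t tag;
	uint8_t peer_init[];	/* variable-length data starts here */
};

/* Bytes taken by the fixed part, i.e. where peer_init begins. */
static size_t cookie_fixed_len(void)
{
	return offsetof(struct cookie, peer_init);
}
```

      Here peer_init starts at offset 12, while sizeof(struct cookie) is
      typically 16 on 64-bit targets because of tail padding.  Pointer
      arithmetic based on sizeof() would then skip 4 bytes of real data.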
    • David Stevens's avatar
      6619be03
    • [NET]: Add sock_create_lite() · 398b3c44
      James Morris authored
      The purpose of this is to allow sockets created by the kernel in this way
      to be passed through the LSM socket creation hooks and be labeled and
      mediated in the same manner as other sockets.
      
      This patch addresses a class of potential issues with LSMs, where such
      sockets will not be labeled correctly (if at all), or mediated during
      creation.  Under SELinux, it fixes a specific bug where RPC sockets
      created by the kernel during TCP NFS serving are unlabeled.
    • [NET]: Add sock_create_kern() · e2943dca
      James Morris authored
      Under SELinux, and potentially other LSMs, we need to be able to
      distinguish between user sockets and kernel sockets.  For SELinux
      specifically, kernel sockets need to be specially labeled during creation,
      then bypass access control checks (they are controlled by the kernel
      itself and not subject to SELinux mediation).
      
      This addresses a class of potential issues in SELinux where, for example, 
      a TCP NFS session times out, then the kernel re-establishes an RPC 
      connection upon further user activity.  We do not want such kernel 
      created sockets to be labeled with user security contexts.
      
      sock_create() and sock_create_kern() are wrapper functions, which seems 
      semantically clearer to me than e.g. adding a flag to sock_create().  If 
      you prefer the latter, then let me know.
      
      The patch also adds an argument to the LSM socket creation functions
      indicating whether the socket being created is a kernel socket or not.
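
      The wrapper arrangement can be sketched in userspace.  Every name
      below is invented for the example and none is a kernel signature.
      The shared helper only records its arguments where real code would
      allocate a socket and call the LSM hook.

```c
/* Hypothetical record of what the shared creation path was asked to do. */
struct sock_request {
	int family;
	int type;
	int protocol;
	int kern;	/* 1 for a kernel socket, 0 for a user socket */
};

/* Shared helper: the only place that sees the kern flag. */
static int demo_create_common(int family, int type, int protocol,
			      int kern, struct sock_request *out)
{
	out->family = family;
	out->type = type;
	out->protocol = protocol;
	out->kern = kern;	/* a security hook would receive this flag */
	return 0;
}

/* Wrapper for sockets created on behalf of user space. */
static int demo_sock_create(int family, int type, int protocol,
			    struct sock_request *out)
{
	return demo_create_common(family, type, protocol, 0, out);
}

/* Wrapper for sockets created by the kernel itself. */
static int demo_sock_create_kern(int family, int type, int protocol,
				 struct sock_request *out)
{
	return demo_create_common(family, type, protocol, 1, out);
}
```

      Callers pick the wrapper that matches their intent and never pass the
      flag themselves, which is the readability argument made above.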
    • Merge nuts.davemloft.net:/disk1/BK/network-2.6 · 49a1f4d4
      David S. Miller authored
      into nuts.davemloft.net:/disk1/BK/net-2.6
    • [SPARC64]: Use $(CC) in NEW_GCC checks. · 812b724d
      Joshua Kwan authored
    • Benjamin Herrenschmidt's avatar