1. 03 Jan, 2005 40 commits
    • [PATCH] Kprobes: wrapper to define jprobe.entry · 3cb8ef47
      Ananth N. Mavinakayanahalli authored
      Here is a patch that adds a wrapper for defining jprobe.entry to make
      it easy to handle the three dword function descriptors defined by the
      PowerPC ELF ABI.
      
      x86, ppc64 and x86_64 are also updated.
      Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3cb8ef47
    • [PATCH] ppc64: tweaks to ppc64 cpu sysfs information · 64ec66bb
      David Gibson authored
      Currently the ppc64 sysfs code registers an entry for each possible cpu in
      sysfs, rather than just online cpus.  That makes sense, since the sysfs
      entries are needed to control onlining of the cpus.  However, this is done
      even if CONFIG_HOTPLUG_CPU is not set, or if it is not a hotplug capable
      (DLPAR) machine, which is a bit misleading.  Secondly it also registers all
      the other sysfs entries (mostly performance monitoring controls) on all
      possible cpus, although they are quite meaningless on non-online cpus.
      
      This patch alters the code to only register sysfs directories at boot for
      cpus which are either online or could be onlined (cpu is possible, and
      CONFIG_HOTPLUG_CPU and an lpar machine).  Furthermore, the entries apart
      from 'online' itself and 'physical_id' are only registered for online CPUs
      (and deregistered again if a cpu goes offline).
      
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      64ec66bb
    • [PATCH] ppc64: kprobes implementation · 53a50435
      Ananth N. Mavinakayanahalli authored
      Kprobes (Kernel dynamic probes) is a lightweight mechanism for kernel
      modules to insert probes into a running kernel, without the need to modify
      the underlying source.  The probe handlers can then be coded to log
      relevant data at the probe point.  More information on kprobes can be found
      at:
      
      http://www-124.ibm.com/developerworks/oss/linux/projects/kprobes/
      
      Jprobes (or jumper probes) is a small infrastructure to access function
      arguments.  It can be used by defining a small stub with the same template
      as the routine in kernel, within which the required parameters can be
      logged.
      Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      53a50435
    • [PATCH] ppc32: Resurrect Documentation/powerpc/cpu_features.txt · 36055b52
      Arthur Othieno authored
      Documentation/powerpc/cpu_features.txt mysteriously disappeared sometime
      when 2.5 forked off.
      
      Searching through BK logs on linux.bkbits.net didn't reveal anything,
      unfortunately.  The only reference I could pick up from searching the
      available lkml archives is the 2.4.20-pre11 ChangeLog where this was first
      merged.
      
      Thus far, nothing indicates it was intentionally removed, and AFAICS it is
      still up to date with the current code.
      Signed-off-by: Arthur Othieno <a.othieno@bluewin.ch>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      36055b52
    • [PATCH] ppc32: fix io_remap_page_range for 36-bit phys platforms · bbf53507
      Matt Porter authored
      Fixes io_remap_page_range() to use the 32-bit address translator similar to
      ioremap().  Someday u64 start/end resources should make this unnecessary.
      Fixes set_pte() to handle a long long pte_t properly.
      Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bbf53507
    • [PATCH] ppc32: add uImage to default targets · 34b7c669
      Matt Porter authored
      We'd like to get a uImage when just using 'make' on many targets.  After
      some discussion, it made sense to simply add uImage to the default targets
      since it adds minimal build overhead and will work on all platforms.  Also,
      fix a dependency in the boot stuff.
      Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      34b7c669
    • [PATCH] PPC debug setcontext syscall implementation. · a784ab71
      Corey Minyard authored
      Add a debugging interface for PowerPC that allows signal handlers (or any
      jump to a context, really) to perform debug functions.  It allows a
      user program to turn on single-stepping, for instance, and the thread will
      get a trap after executing the next instruction.  It can also (on supported
      PPC processors) turn on branch tracing and get a trap after the next branch
      instruction is executed.  This is useful for in-application debugging.
      
      Note that you can enable single-stepping on x86 processors directly from
      signal handlers.  Newer x86 processors have the equivalent of a
      branch-trace bit in the IA32_DEBUGCTL MSR and could have similar function
      to this syscall.  Most other processors could benefit from a similar
      interface, except for ARM which is extraordinarily broken for debugging.
      
      Future uses of this could be adding the ability to set the hardware
      breakpoint registers from a signal handler.
      Signed-off-by: Corey Minyard <minyard@mvista.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a784ab71
    • [PATCH] ppc32: remove bogus SPRN_CPC0_GPIO define · a47ac38f
      Matt Porter authored
      This trivial patch removes a long-standing typo in ibm44x.h.  In fact, we
      already have the correct DCRN_CPC0_GPIO define later in the same file.
      Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
      Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a47ac38f
    • [PATCH] ppc32: fix ebony.c warnings · 19a8907d
      Matt Porter authored
      This patch removes annoying warnings in ebony.c.  Fix is similar to one I
      made in ocotea.c before.
      Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
      Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      19a8907d
    • [PATCH] Fix prototypes & externs in e500 oprofile support · 3aa29948
      Kumar Gala authored
      Remove prototypes and externs from the .c files.
      Signed-off-by: Andy Fleming <afleming@freescale.com>
      Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3aa29948
    • [PATCH] ppc32: performance Monitor/Oprofile support for e500 · 6c4fe420
      Kumar Gala authored
      Adds oprofile support for the e500 PowerPC core.
      Signed-off-by: Andy Fleming <afleming@freescale.com>
      Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6c4fe420
    • [PATCH] ppc32: PPC4xx PIC rewrite/cleanup · f481178e
      Matt Porter authored
      Patch from Eugene to do some cleanup of the PPC4xx PIC code.  Separates the
      interrupts that can have polarity/triggering modified for platform
      modification if necessary.  Between the two of us, it's tested on most of
      the affected platforms.
      Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
      Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f481178e
    • [PATCH] ppc32: add Support for IBM 750FX and 750GX Eval Boards · ad47c00f
      Randy Vinson authored
      I've added support for the IBM 750FX and 750GX Eval Boards
      (Chestnut/Buckeye).
      Signed-off-by: Randy Vinson <rvinson@mvista.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ad47c00f
    • [PATCH] ppc32: support for Artesyn Katana cPCI boards · e1b2de6e
      Mark A. Greer authored
      This patch adds support for the Artesyn Katana 750i, 752i, and 3750.
      Signed-off-by: Mark A. Greer <mgreer@mvista.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e1b2de6e
    • [PATCH] ppc32: support for Force CPCI-690 board · c7033ab5
      Mark A. Greer authored
      This patch adds support for the Force CPCI-690 cPCI board.
      Signed-off-by: Mark A. Greer <mgreer@mvista.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c7033ab5
    • [PATCH] ppc32: support for Marvell EV-64260[ab]-BP eval platform · 216df828
      Mark A. Greer authored
      This patch adds support for a line of evaluation platforms from Marvell
      that use the Marvell GT64260[ab] host bridges.
      
      This patch depends on the Marvell host bridge support patch (mv64x60).
      Signed-off-by: Mark A. Greer <mgreer@mvista.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      216df828
    • [PATCH] ppc32-marvell-host-bridge-support-mv64x60 review fixes · fe7c9be8
      Mark A. Greer authored
      Here is an incremental patch [hopefully] with your concerns addressed.
      Note that the arch/ppc/boot code is not kernel code and only exists for a
      short period of time before execution jumps to the kernel.
      Signed-off-by: Mark A. Greer <mgreer@mvista.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fe7c9be8
    • [PATCH] ppc32: Marvell host bridge support (mv64x60) · 8594ca60
      Mark A. Greer authored
      This patch adds core support for a line of host bridges from Marvell
      (formerly Galileo).  This code has been tested with a GT64260a, GT64260b,
      MV64360, and MV64460.  Patches for platforms that use these bridges will be
      sent separately.
      
      The patch is rather large so a link is provided.
      Signed-off-by: Mark A. Greer <mgreer@mvista.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8594ca60
    • [PATCH] ppc32: Switch to KBUILD_DEFCONFIG · b595953f
      Tom Rini authored
      The following patch switches ppc32 from using arch/ppc/defconfig to
      arch/ppc/configs/common_defconfig as a defconfig.  These files are supposed
      to be identical, but always end up out of sync.  This also updates the
      common_defconfig with current options.
      Signed-off-by: Tom Rini <trini@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b595953f
    • [PATCH] ppc32: refactor common book-e exception code · 221df77a
      Kumar Gala authored
      Moves the InstructionStorage, Alignment, Program, and Decrementer
      exception handlers shared by Book-E processors (44x & e500) into
      common code.
      Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      221df77a
    • [PATCH] ppc32: freescale Book-E MMU cleanup · 277fe7d9
      Kumar Gala authored
      Updates the Freescale Book-E MMU usage to match the architecture spec.
      This is mainly growing the widths of fields in various registers to match
      the architecture spec instead of the implementation.
      Signed-off-by: Becky Gill <becky.gill@freescale.com>
      Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      277fe7d9
    • [PATCH] Fix broken RST handling in ip_conntrack · d68bbf1d
      Martin Josefsson authored
      Here's a patch that fixes a pretty serious bug introduced by a recent
      "bugfix".  The problem is that RST packets are ignored if they follow an
      ACK packet.  This means that the timeout of the connection isn't decreased,
      so we get lots of old connections lingering around until the timeout
      expires.  The default timeout for state ESTABLISHED is 5 days.
      
      This needs to go into -bk as soon as possible.  The bug is present in
      2.6.10 as well.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d68bbf1d
    • [PATCH] netfilter: Fix cleanup in ipt_recent should ipt_register_match error · 287b7862
      Rusty Russell authored
      When ipt_register_match() fails, ipt_recent doesn't remove its proc
      entry.  Found by nfsim.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      287b7862
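
The cleanup-on-failure pattern this fix describes can be sketched in plain C. This is only an illustration under stated assumptions: every `demo_*` name below is hypothetical and stands in for the real proc-entry and match-registration calls, which are not reproduced here.

```c
/* Hypothetical stand-ins for the proc entry and the match registration. */
static int demo_proc_present;     /* 1 while the proc entry exists */
static int demo_register_result;  /* 0 = success, negative = error */

static int demo_create_proc(void)
{
    demo_proc_present = 1;
    return 0;
}

static void demo_remove_proc(void)
{
    demo_proc_present = 0;
}

static int demo_register_match(void)
{
    return demo_register_result;
}

/* Module init: if registration fails, undo the proc entry before returning. */
static int demo_init(void)
{
    int err = demo_create_proc();

    if (err)
        return err;

    err = demo_register_match();
    if (err)
        demo_remove_proc();   /* the cleanup step that was missing */

    return err;
}
```

With `demo_register_result` set to a negative value, `demo_init()` returns that error and leaves no proc entry behind; with 0 it succeeds and the entry stays.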
    • [PATCH] netfilter: Remove copy_to_user Warnings in Netfilter · be4bae19
      Rusty Russell authored
      After changing firewall rules, we try to return the counters to userspace.  We
      didn't fail at that point if the copy failed, but it doesn't really matter.
      Someone added a warn_unused_result attribute to copy_to_user, so we get bogus
      warnings.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      be4bae19
    • [PATCH] netfilter: Remove IPCHAINS and IPFWADM compatibility · f631723a
      Rusty Russell authored
      We've been threatening to do this for ages: remove the backwards compatibility
      code.  We can now combine ip_conntrack_core.c and ip_conntrack_standalone.c,
      likewise for the NAT code, but that will come later.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f631723a
    • [PATCH] netfilter: Add comment above remove_expectations in destroy_conntrack() · 6dd1537e
      Rusty Russell authored
      I removed this code in a previous patch, and Patrick McHardy explained
      what was wrong.  Add a comment.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6dd1537e
    • [PATCH] netfilter: Fix ip_ct_selective_cleanup(), and rename it to ip_ct_iterate_cleanup() · 4759d4d9
      Rusty Russell authored
      Several places use ip_ct_selective_cleanup() as a general iterator, which it
      was not intended for (it takes a const ip_conntrack *).  So rename it, and
      make it take a non-const argument.
      
      Also, it missed unconfirmed connections, which aren't in the hash table.  This
      introduces a potential problem for users which expect to iterate all
      connections (such as the helper deletion code).  So keep a linked list of
      unconfirmed connections as well.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4759d4d9
    • [PATCH] netfilter: Fix ip_conntrack_proto_sctp exit on sysctl fail · 5ea39dfb
      Rusty Russell authored
      On failure from register_sysctl_table, we return with exit 0.  Oops.  init and
      fini should also be static.  nfsim found these.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5ea39dfb
    • [PATCH] netfilter: fix return values of ipt_recent checkentry · a9dcd00e
      Rusty Russell authored
      Peejix's nfsim test for ipt_recent, written two days ago, revealed these bugs
      with ipt_recent: checkentry() returns true or false, not an error.  (Maybe it
      should, but that's a much larger change).  Also, make hash_func() static.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a9dcd00e
    • [PATCH] TCP hashes: NUMA interleaving · 0e4e73f8
      Brent Casavant authored
      Modifies the TCP ehash and TCP bhash to enable the use of vmalloc to
      alleviate boottime memory allocation imbalances on NUMA systems, utilizing
      flags to the alloc_large_system_hash routine in order to centralize the
      enabling of this behavior.
      Signed-off-by: Brent Casavant <bcasavan@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0e4e73f8
    • [PATCH] filesystem hashes: NUMA interleaving · e330572f
      Brent Casavant authored
      The following patch modifies the dentry cache and inode cache to enable the
      use of vmalloc to alleviate boottime memory allocation imbalances on NUMA
      systems, utilizing flags to the alloc_large_system_hash routine in order to
      centralize the enabling of this behavior.
      
      In general, for each hash, we check at the early allocation point whether
      hash distribution is enabled, and if so we defer allocation.  At the late
      allocation point we perform the allocation if it was deferred earlier.
      These late allocation points are the same points utilized prior to the
      addition of alloc_large_system_hash to the kernel.
      Signed-off-by: Brent Casavant <bcasavan@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e330572f
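
The early/late allocation scheme described in this commit can be modelled roughly as follows. This is a user-space sketch, not the kernel code: `hashdist_enabled` stands in for the "hashdist" boot parameter, `malloc` stands in for both alloc_bootmem and vmalloc, and all `demo_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the "hashdist" boot parameter. */
static int hashdist_enabled;

struct demo_hash {
    void  *table;      /* NULL until the table has been allocated */
    size_t bytes;      /* requested table size */
    int    from_late;  /* 1 if the late point did the allocation */
};

/* Early allocation point: allocate now unless distribution is enabled. */
static void demo_hash_init_early(struct demo_hash *h, size_t bytes)
{
    h->table = NULL;
    h->bytes = bytes;
    h->from_late = 0;

    if (hashdist_enabled)
        return;                   /* defer to the late point */

    h->table = malloc(bytes);     /* stands in for alloc_bootmem */
}

/* Late allocation point: allocate only if the early point deferred. */
static void demo_hash_init_late(struct demo_hash *h)
{
    if (h->table)
        return;                   /* already allocated early */

    h->table = malloc(h->bytes);  /* stands in for vmalloc */
    h->from_late = (h->table != NULL);
}
```

Calling both functions in order always leaves exactly one allocation; only the point at which it happens depends on the flag.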
    • [PATCH] alloc_large_system_hash: NUMA interleaving · dcee73c4
      Brent Casavant authored
      NUMA systems running current Linux kernels suffer from substantial inequities
      in the amount of memory allocated from each NUMA node during boot.  In
      particular, several large hashes are allocated using alloc_bootmem, and as
      such are allocated contiguously from a single node each.
      
      This becomes a problem for certain workloads that are relatively common on
      big-iron HPC NUMA systems.  In particular, a number of MPI and OpenMP
      applications which require nearly all available processors in the system and
      nearly all the memory on each node run into difficulties.  Due to the uneven
      memory distribution onto a few nodes, any thread on those nodes will require a
      portion of its memory be allocated from remote nodes.  Any access to those
      memory locations will be slower than local accesses, and thereby slows down
      the effective computation rate for the affected CPUs/threads.  This problem is
      further amplified if the application is tightly synchronized between threads
      (as is often the case), as the entire job can run only at the speed of the
      slowest thread.
      
      Additionally, since these hashes are usually accessed by all CPUs in the
      system, the NUMA network link on the node which hosts the hash experiences
      disproportionate traffic levels, thereby reducing the memory bandwidth
      available to that node's CPUs, and further penalizing performance of the
      threads executed thereupon.
      
      As such, it is desired to find a way to distribute these large hash
      allocations more evenly across NUMA nodes.  Fortunately current kernels do
      perform allocation interleaving for vmalloc() during boot, which provides a
      stepping stone to a solution.
      
      This series of patches enables (but does not require) the kernel to allocate
      several boot time hashes using vmalloc rather than alloc_bootmem, thereby
      causing the hashes to be interleaved amongst NUMA nodes.  In particular the
      dentry cache, inode cache, TCP ehash, and TCP bhash have been changed to be
      allocated in this manner.  Due to the limited vmalloc space on architectures
      such as i386, this behavior is turned on by default only for IA64 NUMA systems
      (though there is no reason other interested architectures could not enable it
      if desired).  Non-IA64 and non-NUMA systems continue to use the existing
      alloc_bootmem() allocation mechanism.  A boot line parameter "hashdist" can be
      set to override the default behavior.
      
      The following two sets of example output show the uneven distribution just
      after boot, using init=/bin/sh to eliminate as much non-kernel allocation as
      possible.
      
      Without the boot hash distribution patches:
      
       Nid  MemTotal   MemFree   MemUsed      (in kB)
         0   3870656   3697696    172960
         1   3882992   3866656     16336
         2   3883008   3866784     16224
         3   3882992   3866464     16528
         4   3883008   3866592     16416
         5   3883008   3866720     16288
         6   3882992   3342176    540816
         7   3883008   3865440     17568
         8   3882992   3866560     16432
         9   3883008   3866400     16608
        10   3882992   3866592     16400
        11   3883008   3866400     16608
        12   3882992   3866400     16592
        13   3883008   3866432     16576
        14   3883008   3866528     16480
        15   3864768   3848256     16512
       ToT  62097440  61152096    945344
      
      Notice that nodes 0 and 6 have a substantially larger memory utilization
      than all other nodes.
      
      With the boot hash distribution patch:
      
       Nid  MemTotal   MemFree   MemUsed      (in kB)
         0   3870656   3789792     80864
         1   3882992   3843776     39216
         2   3883008   3843808     39200
         3   3882992   3843904     39088
         4   3883008   3827488     55520
         5   3883008   3843712     39296
         6   3882992   3843936     39056
         7   3883008   3844096     38912
         8   3882992   3843712     39280
         9   3883008   3844000     39008
        10   3882992   3843872     39120
        11   3883008   3843872     39136
        12   3882992   3843808     39184
        13   3883008   3843936     39072
        14   3883008   3843712     39296
        15   3864768   3825760     39008
       ToT  62097440  61413184    684256
      
      While not perfectly even, we can see that there is a substantial improvement
      in the spread of memory allocated by the kernel during boot.  The remaining
      unevenness may be due in part to further boot time allocations that could be
      addressed in a similar manner, but some difference is due to the somewhat
      special nature of node 0 during boot.  However the unevenness has fallen to a
      much more acceptable level (at least to a level that SGI isn't concerned
      about).
      
      The astute reader will also notice that in this example, with this patch
      approximately 256 MB less memory was allocated during boot.  This is due to
      the size limits of a single vmalloc.  More specifically, this is because the
      automatically computed size of the TCP ehash exceeds the maximum size which a
      single vmalloc can accommodate.  However this is of little practical concern as
      the vmalloc size limit simply reduces one ridiculously large allocation
      (512MB) to a slightly less ridiculously large allocation (256MB).  In practice
      machines with large memory configurations are using the thash_entries setting
      to limit the size of the TCP ehash _much_ lower than either of the
      automatically computed values.  Illustrative of the exceedingly large nature
      of the automatically computed size, SGI currently recommends that customers
      boot with thash_entries=2097152, which works out to a 32MB allocation.  In any
      case, setting hashdist=0 will allow for allocations in excess of vmalloc
      limits, if so desired.
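
The figures in this paragraph can be sanity-checked with a little arithmetic. The sketch below assumes a 16-byte ehash bucket, which is an assumption about the 64-bit layout and is not stated in the commit message; the kB totals are the MemUsed totals from the two tables above.

```c
#include <stdint.h>

/* Assumed size of one TCP ehash bucket on a 64-bit kernel (hypothetical;
 * the commit message does not give this number). */
#define ASSUMED_BUCKET_BYTES 16u

/* Size in MiB of a hash table with the given number of entries. */
static uint64_t hash_size_mib(uint64_t entries, uint64_t bucket_bytes)
{
    return entries * bucket_bytes / (1024u * 1024u);
}

/* Difference between two kB totals, rounded down to whole MiB. */
static uint64_t kb_diff_mib(uint64_t before_kb, uint64_t after_kb)
{
    return (before_kb - after_kb) / 1024u;
}
```

Under that assumption, `hash_size_mib(2097152, ASSUMED_BUCKET_BYTES)` gives 32, matching the quoted 32MB, and `kb_diff_mib(945344, 684256)` gives 254, consistent with "approximately 256 MB less".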
      
      Other than the vmalloc limit, great care was taken to ensure that the size of
      TCP hash allocations was not altered by this patch.  Due to slightly different
      computation techniques between the existing TCP code and
      alloc_large_system_hash (which is now utilized), some of the magic constants
      in the TCP hash allocation code were changed.  On all sizes of system (128MB
      through 64GB) that I had access to, the patched code preserves the previous
      hash size, as long as the vmalloc limit (256MB on IA64) is not encountered.
      
      There was concern that changing the TCP-related hashes to use vmalloc space
      may adversely impact network performance.  To this end the netperf set of
      benchmarks was run.  Some individual tests seemed to benefit slightly, some
      seemed to be harmed slightly, but in all cases the average difference with and
      without these patches was well within the variability I would see from run to
      run.
      
      The following are the overall netperf averages (30 10-second runs each) against
      an older kernel with these same patches.  These tests were run over loopback
      as GigE results were so inconsistent run to run both with and without these
      patches that they provided no meaningful comparison that I could discern.  I
      used the same kernel (IA64 generic) for each run, simply varying the new
      "hashdist" boot parameter to turn on or off the new allocation behavior.  In
      all cases the thash_entries value was manually specified as discussed
      previously to eliminate any variability that might result from that size
      difference.
      
      HP ZX1, hashdist=0
      ==================
      TCP_RR = 19389
      TCP_MAERTS = 6561 
      TCP_STREAM = 6590 
      TCP_CC = 9483
      TCP_CRR = 8633 
      
      HP ZX1, hashdist=1
      ==================
      TCP_RR = 19411
      TCP_MAERTS = 6559 
      TCP_STREAM = 6584 
      TCP_CC = 9454
      TCP_CRR = 8626 
      
      SGI Altix, hashdist=0
      =====================
      TCP_RR = 16871
      TCP_MAERTS = 3925 
      TCP_STREAM = 4055 
      TCP_CC = 8438
      TCP_CRR = 7750 
      
      SGI Altix, hashdist=1
      =====================
      TCP_RR = 17040
      TCP_MAERTS = 3913 
      TCP_STREAM = 4044 
      TCP_CC = 8367
      TCP_CRR = 7538 
      
      I believe the TCP_CC and TCP_CRR are the tests most sensitive to this
      particular change.  But again, I want to emphasize that even the differences
      you see above are _well_ within the variability I saw from run to run of any
      given test.
      
      In addition, Jose Santos at IBM has run specSFS, which has been particularly
      sensitive to TLB issues, against these patches and saw no performance
      degradation (differences down in the noise).
      
      
      
      This patch:
      
      Modifies alloc_large_system_hash to enable the use of vmalloc to alleviate
      boottime allocation imbalances on NUMA systems.
      
      Due to limited vmalloc space on some architectures (e.g. x86), the use of
      vmalloc is enabled by default only on NUMA IA64 kernels.  There should be
      no problem enabling this change for any other interested NUMA architecture.
      Signed-off-by: Brent Casavant <bcasavan@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      dcee73c4
    • [PATCH] collect page_states only from online cpus · d841f01f
      Alex Williamson authored
      I noticed the function __read_page_state() curiously high in a q-tools
      profile of a write to a software raid0 device.  Seems this is because we're
      checking page_states for all possible cpus and we have NR_CPUS possible
      when CONFIG_HOTPLUG_CPU=y.  The default config for ia64 is now NR_CPUS=512,
      so on a little 8-way box, this is a significant waste of time.  The patch
      below updates __read_page_state() and __get_page_state() to only count
      page_state info for online cpus.  To keep the stats consistent, the
      page_alloc notifier is updated to move page_states off of the cpu going
      offline.  On my profile, this dropped __read_page_state() back into the
      noise and boosted block write performance by 5% (as measured by spew -
      http://spew.berlios.de).
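
      The idea can be sketched in userspace C as follows.  This is a minimal
      illustration, not the kernel code: page_state_sketch, per_cpu_state,
      cpu_is_online, read_nr_dirty() and cpu_going_offline() are hypothetical
      stand-ins, and a plain flag array replaces the kernel's cpu bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 512   /* compile-time maximum, as in the ia64 default config */

/* Hypothetical stand-ins for the per-cpu page_state and the online map. */
struct page_state_sketch { unsigned long nr_dirty; };
static struct page_state_sketch per_cpu_state[NR_CPUS];
static bool cpu_is_online[NR_CPUS];

/* Sum one counter over online CPUs only, skipping the state of every
 * slot that is merely possible. */
static unsigned long read_nr_dirty(void)
{
    unsigned long total = 0;

    for (size_t cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!cpu_is_online[cpu])
            continue;
        total += per_cpu_state[cpu].nr_dirty;
    }
    return total;
}

/* Hotplug path: fold the departing CPU's counts into a CPU that stays
 * online, so that the totals read above remain consistent. */
static void cpu_going_offline(size_t dead, size_t survivor)
{
    assert(dead != survivor && cpu_is_online[survivor]);
    per_cpu_state[survivor].nr_dirty += per_cpu_state[dead].nr_dirty;
    per_cpu_state[dead].nr_dirty = 0;
    cpu_is_online[dead] = false;
}
```

      Without the fold in cpu_going_offline(), counts left on an offlined cpu
      would silently drop out of the sum.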
      Signed-off-by: Alex Williamson <alex.williamson@hp.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d841f01f
    • [PATCH] slab: Add more arch overrides to control object alignment · d32d6f8a
      Manfred Spraul authored
      Add ARCH_SLAB_MINALIGN and document ARCH_KMALLOC_MINALIGN: The flags allow
      the arch code to override the default minimum object alignment
      (BYTES_PER_WORD).
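
      A rough sketch of how such an override can work, assuming the usual
      #ifndef fallback pattern.  effective_align() and align_up() are
      hypothetical helpers for illustration and are not taken from the slab
      code.

```c
#include <assert.h>
#include <stddef.h>

#define BYTES_PER_WORD sizeof(void *)

/* An architecture header may define ARCH_SLAB_MINALIGN before this point;
 * otherwise the word-size default applies. */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN BYTES_PER_WORD
#endif

/* Hypothetical helper: the caller's requested alignment, but never below
 * the architecture minimum. */
static size_t effective_align(size_t requested)
{
    return requested > ARCH_SLAB_MINALIGN ? requested : ARCH_SLAB_MINALIGN;
}

/* Hypothetical helper: round an object size up to a power-of-two alignment. */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

      Building with, for example, -DARCH_SLAB_MINALIGN=16 plays the role of
      the arch header in this sketch.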
      Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d32d6f8a
    • [PATCH] do_anonymous_page() use SetPageReferenced · a161d268
      Andrew Morton authored
      mark_page_accessed() is more heavyweight than we need: the page is already
      headed for the active list, so setting the software-referenced bit is
      equivalent.
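
      The equivalence can be illustrated with a small model of the two page
      bits.  This is a sketch under the assumption that the page is treated
      as already active; lru_page, mark_page_accessed_sketch() and
      set_page_referenced_sketch() are hypothetical userspace stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two bits that matter here. */
struct lru_page { bool active; bool referenced; };

/* Two-step promotion:
 *   inactive,unreferenced -> inactive,referenced
 *   inactive,referenced   -> active,unreferenced
 *   active,unreferenced   -> active,referenced
 */
static void mark_page_accessed_sketch(struct lru_page *p)
{
    assert(p != 0);
    if (!p->active && p->referenced) {
        p->active = true;
        p->referenced = false;
    } else if (!p->referenced) {
        p->referenced = true;
    }
}

/* The cheaper operation: just set the software-referenced bit. */
static void set_page_referenced_sketch(struct lru_page *p)
{
    assert(p != 0);
    p->referenced = true;
}
```

      For a page that is already active, both functions leave it active and
      referenced; they only differ for pages on the inactive list.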
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a161d268
    • [PATCH] mark_page_accessed() for read()s on non-page boundaries · 21adf7ac
      Miquel van Smoorenburg authored
      When reading a (partial) page from disk using read(), the kernel only marks
      the page as "accessed" if the read started at a page boundary.  This means
      that files that are accessed randomly at non-page boundaries (usually
      database style files) will not be cached properly.
      
      The patch below uses the readahead state instead.  If a page is read(), it
      is marked as "accessed" if the previous read() was for a different page,
      whatever the offset in the page.
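
      The two rules can be contrasted in a small userspace sketch.  It
      assumes 4 KB pages; ra_sketch, old_should_mark() and new_should_mark()
      are hypothetical names, and the "valid" flag for the very first read is
      an assumption of this sketch rather than something the patch describes.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT_SKETCH 12   /* assumes 4 KB pages */

/* Hypothetical slice of the readahead state: the page index of the
 * previous read(), if there was one. */
struct ra_sketch { unsigned long prev_page; bool valid; };

/* Old rule: mark only when the read starts exactly on a page boundary. */
static bool old_should_mark(unsigned long long pos)
{
    return (pos & ((1ULL << PAGE_SHIFT_SKETCH) - 1)) == 0;
}

/* New rule: mark whenever this read touches a different page than the
 * previous one, whatever the offset within the page. */
static bool new_should_mark(struct ra_sketch *ra, unsigned long long pos)
{
    assert(ra != 0);
    unsigned long index = (unsigned long)(pos >> PAGE_SHIFT_SKETCH);
    bool mark = !ra->valid || ra->prev_page != index;

    ra->prev_page = index;
    ra->valid = true;
    return mark;
}
```

      A 128-byte read at a random offset almost never starts on a page
      boundary, so the old rule rarely fires for the workload tested below,
      while the new rule fires on nearly every read.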
      
      Testing results:
      
      
      - Boot kernel with mem=128M
      
      - create a testfile of size 8 MB on a partition. Unmount/mount.
      
      - then generate about 10 MB/sec streaming writes
      
      	for i in `seq 1 1000`
      	do
      		dd if=/dev/zero of=junkfile.$i bs=1M count=10
      		sync
      		cat junkfile.$i > /dev/null
      		sleep 1
      	done
      
      - use an application that reads 128 bytes 64000 times from a
        random offset in the 8 MB testfile.
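
      The source of the rr test program is not included here.  A reader
      wanting to reproduce the setup could start from something like the
      following sketch, which uses plain lseek() and read(); random_reads()
      and its parameters are hypothetical, not the author's code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Read `chunk` bytes `count` times from random offsets in `path`.
 * Returns the number of complete reads, or -1 if the file cannot be
 * opened or is smaller than one chunk. */
static long random_reads(const char *path, size_t chunk, long count,
                         unsigned seed)
{
    char buf[4096];
    struct stat st;
    long done = 0;

    assert(chunk > 0 && chunk <= sizeof(buf));

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)chunk) {
        close(fd);
        return -1;
    }

    srand(seed);
    /* Number of valid start offsets, so every read fits inside the file. */
    double span = (double)(st.st_size - (off_t)chunk + 1);
    for (long i = 0; i < count; i++) {
        off_t off = (off_t)((double)rand() / ((double)RAND_MAX + 1.0) * span);

        if (lseek(fd, off, SEEK_SET) == (off_t)-1)
            break;
        if (read(fd, buf, chunk) == (ssize_t)chunk)
            done++;
    }
    close(fd);
    return done;
}
```

      Calling random_reads("testfile", 128, 64000, 1) would correspond to the
      "Read 128 bytes 64000 times" runs shown below.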
      
      1. Linux 2.6.10-rc3 vanilla, no streaming writes:
      
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.22s system 5% cpu 4.456 total
      
      2. Linux 2.6.10-rc3 vanilla, streaming writes:
      
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.16s system 2% cpu 7.667 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.37s system 1% cpu 23.294 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.02s user 0.99s system 1% cpu 1:11.52 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.21s system 2% cpu 10.273 total
      
      3. Linux 2.6.10-rc3 with read-page-access.patch , streaming writes:
      
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.02s user 0.21s system 3% cpu 7.634 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.04s user 0.22s system 2% cpu 9.588 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.02s user 0.12s system 24% cpu 0.563 total
      # time ~/rr testfile
      Read 128 bytes 64000 times
      ~/rr testfile  0.03s user 0.13s system 98% cpu 0.163 total
      
      As expected, with the read-page-access.patch the kernel keeps the 8 MB
      testfile cached, while without it, it doesn't.
      
      So this is useful for workloads where one smallish (wrt RAM) file is read
      randomly over and over again (like heavily used database indexes), while
      other I/O is going on.  Plain 2.6 caches those files poorly, if the app
      uses plain read().
      Signed-Off-By: Miquel van Smoorenburg <miquels@cistron.nl>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      21adf7ac
    • [PATCH] make sure ioremap only tests valid addresses · bbd4c45d
      Dave Hansen authored
      When CONFIG_HIGHMEM=y, but ZONE_NORMAL isn't quite full, there is, of
      course, no actual memory at *high_memory.  This isn't a problem with normal
      virt<->phys translations because it's never dereferenced, but
      CONFIG_NONLINEAR is a bit more finicky.  So, don't do virt_to_phys() to
      non-existent addresses.
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bbd4c45d
    • [PATCH] kill off highmem_start_page · 422e43d4
      Dave Hansen authored
      People love to do comparisons with highmem_start_page.  However, where
      CONFIG_HIGHMEM=y and there is no actual highmem, there's no real page at
      *highmem_start_page.
      
      That's usually not a problem, but CONFIG_NONLINEAR is a bit more strict and
      catches the bogus address translations.
      
      There are about a gillion different ways to find out if a 'struct page' is
      highmem or not.  Why not just check page_flags?  Just use PageHighMem()
      wherever there used to be a highmem_start_page comparison.  Then, kill off
      highmem_start_page.
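
      The difference between the two styles of check can be sketched as
      below.  hm_page, PG_HIGHMEM_SKETCH, page_high_mem() and
      above_highmem_start() are hypothetical userspace stand-ins; the bit
      position in particular is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PG_HIGHMEM_SKETCH (1UL << 0)   /* hypothetical bit position */

struct hm_page { unsigned long flags; };

/* Flag-based check: needs only the page itself. */
static bool page_high_mem(const struct hm_page *page)
{
    assert(page != NULL);
    return (page->flags & PG_HIGHMEM_SKETCH) != 0;
}

/* Old-style check, for contrast: needs a boundary pointer, which refers
 * to no real page when there is no highmem at all. */
static bool above_highmem_start(const struct hm_page *page,
                                const struct hm_page *highmem_start)
{
    return page >= highmem_start;
}
```

      When highmem exists the two checks agree; when it does not, only the
      flag test avoids having to name a page that is not there.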
      
      This removes more code than it adds, and gets rid of some nasty
      #ifdefs in .c files.
      Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      422e43d4
    • [PATCH] mm: overcommit updates · ea86630e
      Andries E. Brouwer authored
      Alan made overcommit mode 2 and it doesn't work at all.  A process passing
      the limit often does so at a moment of stack extension, and is killed by a
      segfault, which is no better than being OOM-killed.
      
      Another problem is that close to the edge no other processes can be
      started, so that a sysadmin has problems logging in and investigating.
      
      Below is a patch that does 3 things:
      
      (1) It reserves a reasonable amount of virtual stack space (amount
          randomly chosen, no guarantees given) when the process is started, so
          that the common utilities will not be killed by segfault on stack
          extension.
      
      (2) It reserves a reasonable amount of virtual memory for root, so that
          root can do things when the system is out-of-memory.
      
      (3) It limits a single process to 97% of what is left, so that also an
          ordinary user is able to use getty, login, bash, ps, kill and similar
          things when one of her processes got out of control.
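
      The three rules could be modelled roughly as follows.  This is only a
      sketch: the text gives the 97% figure but not the sizes of the stack
      and root reserves, so STACK_RESERVE_PAGES and ROOT_RESERVE_PERCENT are
      placeholder values, exec_commit_charge() and may_commit() are
      hypothetical, and a single request stands in for the growth of one
      process.

```c
#include <assert.h>
#include <stdbool.h>

#define STACK_RESERVE_PAGES   20UL  /* placeholder, not from the patch */
#define ROOT_RESERVE_PERCENT   3UL  /* placeholder, not from the patch */
#define PROCESS_SHARE_PERCENT 97UL  /* the 97% named in rule (3) */

/* (1) Charge some stack space up front when a process is started. */
static unsigned long exec_commit_charge(unsigned long image_pages)
{
    return image_pages + STACK_RESERVE_PAGES;
}

/* (2) and (3): decide whether `request` pages may be committed, given
 * the commit limit and what is already committed (all in pages). */
static bool may_commit(unsigned long limit, unsigned long committed,
                       unsigned long request, bool is_root)
{
    unsigned long allowed = limit;

    if (!is_root)
        allowed -= limit * ROOT_RESERVE_PERCENT / 100;  /* (2) slice kept for root */
    if (committed >= allowed)
        return false;

    unsigned long left = allowed - committed;

    /* (3) at most 97% of what is left may go to this one request */
    return request <= left * PROCESS_SHARE_PERCENT / 100;
}
```

      With a limit of 1000 pages and nothing committed, an ordinary user in
      this model can commit at most 940 pages in one go, leaving room for a
      login shell and a few utilities.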
      
      Since the current overcommit mode 2 is not really useful, I did not give
      this a new number.
      
      The patch is just for playing, not to be applied by Linus.  But, Andrew, I
      hope that you would be willing to put this in -mm so that people can
      experiment.  Of course it only does something if one sets overcommit mode
      to 2.
      
      The past month I have pressured people asking for feedback, and now have
      about a dozen reports, mostly positive, one very positive.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ea86630e
    • [PATCH] mempolicy optimisation · 182e0eba
      Andrea Arcangeli authored
      Some optimizations in mempolicy.c: avoid rebalancing the tree while
      destroying it, break out of loops early, and skip checks for invariant
      conditions in the replace operation.
      Signed-off-by: Andrea Arcangeli <andrea@novell.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      182e0eba