Commits · 53a504350bf8503deb383966c55b2f1c4494137f · Kirill Smelkov / linux

03 Jan, 2005 40 commits

[PATCH] ppc64: kprobes implementation · 53a50435

Ananth N. Mavinakayanahalli authored Jan 03, 2005

Kprobes (Kernel dynamic probes) is a lightweight mechanism for kernel
modules to insert probes into a running kernel, without the need to modify
the underlying source.  The probe handlers can then be coded to log
relevant data at the probe point.  More information on kprobes can be found
at:

http://www-124.ibm.com/developerworks/oss/linux/projects/kprobes/

Jprobes (or jumper probes) is a small infrastructure to access function
arguments.  It can be used by defining a small stub with the same template
as the routine in kernel, within which the required parameters can be
logged.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: Resurrect Documentation/powerpc/cpu_features.txt · 36055b52

Arthur Othieno authored Jan 03, 2005

Documentation/powerpc/cpu_features.txt mysteriously disappeared sometime
when 2.5 forked off.

Searching through BK logs on linux.bkbits.net didn't reveal anything,
unfortunately.  The only reference I could pick up from searching the
available lkml archives is the 2.4.20-pre11 ChangeLog where this was first
merged.

Thus far, nothing indicates it was intentionally removed, and AFAICS, is
still up to date with the current code.
Signed-off-by: Arthur Othieno <a.othieno@bluewin.ch>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: fix io_remap_page_range for 36-bit phys platforms · bbf53507

Matt Porter authored Jan 03, 2005

Fixes io_remap_page_range() to use the 32-bit address translator similar to
ioremap().  Someday u64 start/end resources should make this unnecessary.
Fixes set_pte() to handle a long long pte_t properly.
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: add uImage to default targets · 34b7c669

Matt Porter authored Jan 03, 2005

We'd like to get a uImage when just using 'make' on many targets.  After
some discussion, it made sense to simply add uImage to the default targets
since it adds minimal build overhead and will work on all platforms.  Also,
fix a dependency in the boot stuff.
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] PPC debug setcontext syscall implementation. · a784ab71

Corey Minyard authored Jan 03, 2005

Add a debugging interface for PowerPC that allows signal handlers (or any
jump to a context, really) to perform debug functions. It allows a
user program to turn on single-stepping, for instance, and the thread will
get a trap after executing the next instruction. It can also (on supported
PPC processors) turn on branch tracing and get a trap after the next branch
instruction is executed. This is useful for in-application debugging.

Note that you can enable single-stepping on x86 processors directly from
signal handlers. Newer x86 processors have the equivalent of a
branch-trace bit in the IA32_DEBUGCTL MSR and could have similar function
to this syscall. Most other processors could benefit from a similar
interface, except for ARM which is extraordinarily broken for debugging.

Future uses of this could be adding the ability to set the hardware
breakpoint registers from a signal handler.
Signed-off-by: Corey Minyard <minyard@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: remove bogus SPRN_CPC0_GPIO define · a47ac38f

Matt Porter authored Jan 03, 2005

This trivial patch removes a long-standing typo in ibm44x.h.  In fact, we
already have the correct DCRN_CPC0_GPIO define later in the same file.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: fix ebony.c warnings · 19a8907d

Matt Porter authored Jan 03, 2005

This patch removes annoying warnings in ebony.c.  Fix is similar to one I
made in ocotea.c before.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Fix prototypes & externs in e500 oprofile support · 3aa29948

Kumar Gala authored Jan 03, 2005

Remove prototypes and externs from the .c files.
Signed-off-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: performance Monitor/Oprofile support for e500 · 6c4fe420

Kumar Gala authored Jan 03, 2005

Adds oprofile support for the e500 PowerPC core.
Signed-off-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: PPC4xx PIC rewrite/cleanup · f481178e

Matt Porter authored Jan 03, 2005

Patch from Eugene to do some cleanup of the PPC4xx PIC code.  Separates the
interrupts that can have polarity/triggering modified for platform
modification if necessary.  Between the two of us, it's tested on most of
the affected platforms.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: add Support for IBM 750FX and 750GX Eval Boards · ad47c00f

Randy Vinson authored Jan 03, 2005

I've added support for the IBM 750FX and 750GX Eval Boards
(Chestnut/Buckeye).
Signed-off-by: Randy Vinson <rvinson@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: support for Artesyn Katana cPCI boards · e1b2de6e

Mark A. Greer authored Jan 03, 2005

This patch adds support for the Artesyn Katana 750i, 752i, and 3750.
Signed-off-by: Mark A. Greer <mgreer@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: support for Force CPCI-690 board · c7033ab5

Mark A. Greer authored Jan 03, 2005

This patch adds support for the Force CPCI-690 cPCI board.
Signed-off-by: Mark A. Greer <mgreer@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: support for Marvell EV-64260[ab]-BP eval platform · 216df828

Mark A. Greer authored Jan 03, 2005

This patch adds support for a line of evaluation platforms from Marvell
that use the Marvell GT64260[ab] host bridges.

This patch depends on the Marvell host bridge support patch (mv64x60).
Signed-off-by: Mark A. Greer <mgreer@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32-marvell-host-bridge-support-mv64x60 review fixes · fe7c9be8

Mark A. Greer authored Jan 03, 2005

Here is an incremental patch [hopefully] with your concerns addressed.
Note that the arch/ppc/boot code is not kernel code and only exists for a
short period of time before execution jumps to the kernel.
Signed-off-by: Mark A. Greer <mgreer@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: Marvell host bridge support (mv64x60) · 8594ca60

Mark A. Greer authored Jan 03, 2005

This patch adds core support for a line of host bridges from Marvell
(formerly Galileo).  This code has been tested with a GT64260a, GT64260b,
MV64360, and MV64460.  Patches for platforms that use these bridges will be
sent separately.

The patch is rather large so a link is provided.
Signed-off-by: Mark A. Greer <mgreer@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: Switch to KBUILD_DEFCONFIG · b595953f

Tom Rini authored Jan 03, 2005

The following patch switches ppc32 from using arch/ppc/defconfig to
arch/ppc/configs/common_defconfig as a defconfig.  These files are supposed
to be identical, but always end up out of sync.  This also updates the
common_defconfig with current options.
Signed-off-by: Tom Rini <trini@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: refactor common book-e exception code · 221df77a

Kumar Gala authored Jan 03, 2005

Moves common handling of InstructionStorage, Alignment, Program, and
Decrementer exceptions handlers for Book-E processors (44x & e500) into
common code.
Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] ppc32: freescale Book-E MMU cleanup · 277fe7d9

Kumar Gala authored Jan 03, 2005

Updates the Freescale Book-E MMU usage to match the architecture spec.
This is mainly growing the widths of fields in various registers to match
the architecture spec instead of the implementation.
Signed-off-by: Becky Gill <becky.gill@freescale.com>
Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Fix broken RST handling in ip_conntrack · d68bbf1d

Martin Josefsson authored Jan 03, 2005

Here's a patch that fixes a pretty serious bug introduced by a recent
"bugfix".  The problem is that RST packets are ignored if they follow an
ACK packet.  This means that the timeout of the connection isn't decreased,
so we get lots of old connections lingering around until the timeout
expires; the default timeout for state ESTABLISHED is 5 days.

This needs to go into -bk as soon as possible.  The bug is present in
2.6.10 as well.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] netfilter: Fix cleanup in ipt_recent should ipt_register_match error · 287b7862

Rusty Russell authored Jan 03, 2005

When ipt_register_match() fails, ipt_recent doesn't remove its proc
entry.  Found by nfsim.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] netfilter: Remove copy_to_user Warnings in Netfilter · be4bae19

Rusty Russell authored Jan 03, 2005

After changing firewall rules, we try to return the counters to userspace.  We
didn't fail at that point if the copy failed, but it doesn't really matter.
Someone added a warn_unused_result attribute to copy_to_user, so we get bogus
warnings.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
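
The warning described above can be reproduced in user space.  The sketch
below is not the netfilter code: fake_copy_to_user() and return_counters()
are hypothetical stand-ins, and checking the result (returning -EFAULT) is
just one plausible way to satisfy a warn_unused_result attribute.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): returns the number of bytes
 * that could NOT be copied.  The attribute makes GCC/Clang warn whenever
 * a caller ignores the return value.
 */
__attribute__((warn_unused_result))
static unsigned long fake_copy_to_user(void *dst, const void *src,
                                       unsigned long n, int simulate_fault)
{
    if (simulate_fault)
        return n;               /* nothing was copied */
    memcpy(dst, src, n);
    return 0;
}

/*
 * Hand the counters back to the caller.  Consuming the return value
 * silences the warning and reports a failed copy.
 */
static int return_counters(void *udst, const unsigned long *counters,
                           size_t n, int simulate_fault)
{
    assert(udst != NULL && counters != NULL);
    if (fake_copy_to_user(udst, counters, n * sizeof(*counters),
                          simulate_fault) != 0)
        return -EFAULT;
    return 0;
}
```

Calling fake_copy_to_user() as a bare statement would trigger the same
kind of "ignoring return value" diagnostic the message mentions.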

[PATCH] netfilter: Remove IPCHAINS and IPFWADM compatibility · f631723a

Rusty Russell authored Jan 03, 2005

We've been threatening to do this for ages: remove the backwards compatibility
code.  We can now combine ip_conntrack_core.c and ip_conntrack_standalone.c,
likewise for the NAT code, but that will come later.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] netfilter: Add comment above remove_expectations in destroy_conntrack() · 6dd1537e

Rusty Russell authored Jan 03, 2005

I removed this code in a previous patch, and Patrick McHardy explained
what was wrong.  Add a comment.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] netfilter: Fix ip_ct_selective_cleanup(), and rename ip_ct_iterate_cleanup() · 4759d4d9

Rusty Russell authored Jan 03, 2005

Several places use ip_ct_selective_cleanup() as a general iterator, which it
was not intended for (it takes a const ip_conntrack *).  So rename it, and
make it take a non-const argument.

Also, it missed unconfirmed connections, which aren't in the hash table.  This
introduces a potential problem for users which expect to iterate all
connections (such as the helper deletion code).  So keep a linked list of
unconfirmed connections as well.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
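
A rough user-space sketch of the idea follows.  The types and names here
(struct conn, struct conn_table, iterate_cleanup) are illustrative
assumptions, not the kernel's; a single linked list stands in for the hash
table.  The point is that the iterator takes a non-const object and walks
both the hashed and the unconfirmed connections.

```c
#include <assert.h>
#include <stddef.h>

struct conn {
    int id;
    int dead;               /* set once the iterator has removed it */
    struct conn *next;
};

struct conn_table {
    struct conn *hashed;        /* stand-in for the hash table */
    struct conn *unconfirmed;   /* connections not yet in the hash */
};

/* Predicate gets a non-const connection, so callers may modify it. */
typedef int (*conn_pred)(struct conn *c, void *data);

/* Unlink every connection on one list for which pred() returns true. */
static int kill_matching(struct conn **head, conn_pred pred, void *data)
{
    int killed = 0;

    while (*head) {
        struct conn *c = *head;

        if (pred(c, data)) {
            *head = c->next;    /* unlink from the list */
            c->next = NULL;
            c->dead = 1;
            killed++;
        } else {
            head = &c->next;
        }
    }
    return killed;
}

/* Visit both lists so unconfirmed connections are not missed. */
static int iterate_cleanup(struct conn_table *t, conn_pred pred, void *data)
{
    assert(t != NULL && pred != NULL);
    return kill_matching(&t->hashed, pred, data) +
           kill_matching(&t->unconfirmed, pred, data);
}

/* Example predicate: match connections with an even id. */
static int id_is_even(struct conn *c, void *data)
{
    (void)data;
    return c->id % 2 == 0;
}
```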

[PATCH] netfilter: Fix ip_conntrack_proto_sctp exit on sysctl fail · 5ea39dfb

Rusty Russell authored Jan 03, 2005

On failure from register_sysctl_table, we return with exit 0.  Oops.  init and
fini should also be static.  nfsim found these.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
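
The class of bug is easy to show in isolation.  In this sketch,
fake_register_table() is a hypothetical stand-in for the sysctl
registration call (assumed to return NULL on failure), and -ENOMEM is an
assumed choice of error code; the original patch may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical registration helper: NULL signals failure. */
static void *fake_register_table(int should_fail)
{
    static int table;

    return should_fail ? NULL : (void *)&table;
}

/*
 * Module-style init function.  It is static, and a failed registration
 * is reported with a negative errno instead of falling through to 0.
 */
static int proto_init(int sysctl_should_fail)
{
    void *header = fake_register_table(sysctl_should_fail);

    if (header == NULL)
        return -ENOMEM;     /* returning 0 here would hide the failure */
    return 0;
}
```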

[PATCH] netfilter: fix return values of ipt_recent checkentry · a9dcd00e

Rusty Russell authored Jan 03, 2005

Peejix's nfsim test for ipt_recent, written two days ago, revealed this bug
with ipt_recent: checkentry() returns true or false, not an error. (Maybe it
should, but that's a much larger change). Also, make hash_func() static.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] TCP hashes: NUMA interleaving · 0e4e73f8

Brent Casavant authored Jan 03, 2005

Modifies the TCP ehash and TCP bhash to enable the use of vmalloc to
alleviate boottime memory allocation imbalances on NUMA systems, utilizing
flags to the alloc_large_system_hash routine in order to centralize the
enabling of this behavior.
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] filesystem hashes: NUMA interleaving · e330572f

Brent Casavant authored Jan 03, 2005

The following patch modifies the dentry cache and inode cache to enable the
use of vmalloc to alleviate boottime memory allocation imbalances on NUMA
systems, utilizing flags to the alloc_large_system_hash routine in order to
centralize the enabling of this behavior.

In general, for each hash, we check at the early allocation point whether
hash distribution is enabled, and if so we defer allocation. At the late
allocation point we perform the allocation if it was deferred earlier.
These late allocation points are the same points utilized prior to the
addition of alloc_large_system_hash to the kernel.
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
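
The early/late split described above can be sketched in user space as
follows.  This is only a model: struct boot_hash and the two init
functions are made-up names, the hashdist flag is passed as a parameter,
and calloc() stands in for both alloc_bootmem and vmalloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum hash_stage { HASH_NONE, HASH_EARLY, HASH_LATE };

struct boot_hash {
    void *table;
    enum hash_stage stage;  /* records where the allocation happened */
};

/* Early point: allocate now unless distribution asks us to defer. */
static void hash_init_early(struct boot_hash *h, size_t bytes, int hashdist)
{
    assert(h != NULL);
    if (hashdist)
        return;                     /* deferred to the late point */
    h->table = calloc(1, bytes);    /* stands in for alloc_bootmem */
    h->stage = HASH_EARLY;
}

/* Late point: allocate only if the early point deferred. */
static void hash_init_late(struct boot_hash *h, size_t bytes, int hashdist)
{
    assert(h != NULL);
    if (!hashdist)
        return;                     /* already allocated early */
    h->table = calloc(1, bytes);    /* stands in for vmalloc */
    h->stage = HASH_LATE;
}
```

Both functions are always called in order; the flag alone decides which of
them does the work, which keeps the policy in one place.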

[PATCH] alloc_large_system_hash: NUMA interleaving · dcee73c4

Brent Casavant authored Jan 03, 2005

NUMA systems running current Linux kernels suffer from substantial inequities
in the amount of memory allocated from each NUMA node during boot.  In
particular, several large hashes are allocated using alloc_bootmem, and as
such are allocated contiguously from a single node each.

This becomes a problem for certain workloads that are relatively common on
big-iron HPC NUMA systems.  In particular, a number of MPI and OpenMP
applications which require nearly all available processors in the system and
nearly all the memory on each node run into difficulties.  Due to the uneven
memory distribution onto a few nodes, any thread on those nodes will require a
portion of its memory be allocated from remote nodes.  Any access to those
memory locations will be slower than local accesses, and thereby slows down
the effective computation rate for the affected CPUs/threads.  This problem is
further amplified if the application is tightly synchronized between threads
(as is often the case), as the entire job can run only at the speed of the
slowest thread.

Additionally since these hashes are usually accessed by all CPUs in the
system, the NUMA network link on the node which hosts the hash experiences
disproportionate traffic levels, thereby reducing the memory bandwidth
available to that node's CPUs, and further penalizing performance of the
threads executed thereupon.

As such, it is desired to find a way to distribute these large hash
allocations more evenly across NUMA nodes.  Fortunately current kernels do
perform allocation interleaving for vmalloc() during boot, which provides a
stepping stone to a solution.

This series of patches enables (but does not require) the kernel to allocate
several boot time hashes using vmalloc rather than alloc_bootmem, thereby
causing the hashes to be interleaved amongst NUMA nodes.  In particular the
dentry cache, inode cache, TCP ehash, and TCP bhash have been changed to be
allocated in this manner.  Due to the limited vmalloc space on architectures
such as i386, this behavior is turned on by default only for IA64 NUMA systems
(though there is no reason other interested architectures could not enable it
if desired).  Non-IA64 and non-NUMA systems continue to use the existing
alloc_bootmem() allocation mechanism.  A boot line parameter "hashdist" can be
set to override the default behavior.

The following two sets of example output show the uneven distribution just
after boot, using init=/bin/sh to eliminate as much non-kernel allocation as
possible.

Without the boot hash distribution patches:

 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3870656   3697696    172960
   1   3882992   3866656     16336
   2   3883008   3866784     16224
   3   3882992   3866464     16528
   4   3883008   3866592     16416
   5   3883008   3866720     16288
   6   3882992   3342176    540816
   7   3883008   3865440     17568
   8   3882992   3866560     16432
   9   3883008   3866400     16608
  10   3882992   3866592     16400
  11   3883008   3866400     16608
  12   3882992   3866400     16592
  13   3883008   3866432     16576
  14   3883008   3866528     16480
  15   3864768   3848256     16512
 ToT  62097440  61152096    945344

Notice that nodes 0 and 6 have a substantially larger memory utilization
than all other nodes.

With the boot hash distribution patch:

 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   3870656   3789792     80864
   1   3882992   3843776     39216
   2   3883008   3843808     39200
   3   3882992   3843904     39088
   4   3883008   3827488     55520
   5   3883008   3843712     39296
   6   3882992   3843936     39056
   7   3883008   3844096     38912
   8   3882992   3843712     39280
   9   3883008   3844000     39008
  10   3882992   3843872     39120
  11   3883008   3843872     39136
  12   3882992   3843808     39184
  13   3883008   3843936     39072
  14   3883008   3843712     39296
  15   3864768   3825760     39008
 ToT  62097440  61413184    684256

While not perfectly even, we can see that there is a substantial improvement
in the spread of memory allocated by the kernel during boot.  The remaining
unevenness may be due in part to further boot time allocations that could be
addressed in a similar manner, but some difference is due to the somewhat
special nature of node 0 during boot.  However the unevenness has fallen to a
much more acceptable level (at least to a level that SGI isn't concerned
about).

The astute reader will also notice that in this example, with this patch
approximately 256 MB less memory was allocated during boot.  This is due to
the size limits of a single vmalloc.  More specifically, this is because the
automatically computed size of the TCP ehash exceeds the maximum size which a
single vmalloc can accommodate.  However this is of little practical concern as
the vmalloc size limit simply reduces one ridiculously large allocation
(512MB) to a slightly less ridiculously large allocation (256MB).  In practice
machines with large memory configurations are using the thash_entries setting
to limit the size of the TCP ehash _much_ lower than either of the
automatically computed values.  Illustrative of the exceedingly large nature
of the automatically computed size, SGI currently recommends that customers
boot with thash_entries=2097152, which works out to a 32MB allocation.  In any
case, setting hashdist=0 will allow for allocations in excess of vmalloc
limits, if so desired.
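
The figures in the paragraph above can be checked with a little
arithmetic.  In this sketch the 16 bytes per bucket is not stated in the
message; it is inferred by dividing the quoted 32MB by the quoted
2097152 entries.  The helper names are made up for illustration.

```c
#include <assert.h>

#define MB (1024ULL * 1024ULL)

/* Size in bytes of a hash with `entries` buckets of `bucket_bytes` each. */
static unsigned long long hash_bytes(unsigned long long entries,
                                     unsigned long long bucket_bytes)
{
    assert(bucket_bytes > 0);
    return entries * bucket_bytes;
}

/* Cap an allocation at the largest size a single vmalloc can satisfy. */
static unsigned long long cap_to_vmalloc(unsigned long long bytes,
                                         unsigned long long limit)
{
    return bytes > limit ? limit : bytes;
}
```

With these helpers, hash_bytes(2097152, 16) is 32MB, and capping a 512MB
request at a 256MB limit yields 256MB.  The two tables agree with this:
945344 kB - 684256 kB = 261088 kB, or roughly 255 MB less memory used.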

Other than the vmalloc limit, great care was taken to ensure that the size of
TCP hash allocations was not altered by this patch.  Due to slightly different
computation techniques between the existing TCP code and
alloc_large_system_hash (which is now utilized), some of the magic constants
in the TCP hash allocation code were changed.  On all sizes of system (128MB
through 64GB) that I had access to, the patched code preserves the previous
hash size, as long as the vmalloc limit (256MB on IA64) is not encountered.

There was concern that changing the TCP-related hashes to use vmalloc space
may adversely impact network performance.  To this end the netperf set of
benchmarks was run.  Some individual tests seemed to benefit slightly, some
seemed to be harmed slightly, but in all cases the average difference with and
without these patches was well within the variability I would see from run to
run.

The following are the overall netperf averages (30 10-second runs each) against
an older kernel with these same patches.  These tests were run over loopback
as GigE results were so inconsistent run to run both with and without these
patches that they provided no meaningful comparison that I could discern.  I
used the same kernel (IA64 generic) for each run, simply varying the new
"hashdist" boot parameter to turn on or off the new allocation behavior.  In
all cases the thash_entries value was manually specified as discussed
previously to eliminate any variability that might result from that size
difference.

HP ZX1, hashdist=0
==================
TCP_RR = 19389
TCP_MAERTS = 6561 
TCP_STREAM = 6590 
TCP_CC = 9483
TCP_CRR = 8633 

HP ZX1, hashdist=1
==================
TCP_RR = 19411
TCP_MAERTS = 6559 
TCP_STREAM = 6584 
TCP_CC = 9454
TCP_CRR = 8626 

SGI Altix, hashdist=0
=====================
TCP_RR = 16871
TCP_MAERTS = 3925 
TCP_STREAM = 4055 
TCP_CC = 8438
TCP_CRR = 7750 

SGI Altix, hashdist=1
=====================
TCP_RR = 17040
TCP_MAERTS = 3913 
TCP_STREAM = 4044 
TCP_CC = 8367
TCP_CRR = 7538 

I believe the TCP_CC and TCP_CRR are the tests most sensitive to this
particular change.  But again, I want to emphasize that even the differences
you see above are _well_ within the variability I saw from run to run of any
given test.

In addition, Jose Santos at IBM has run specSFS, which has been particularly
sensitive to TLB issues, against these patches and saw no performance
degradation (differences down in the noise).



This patch:

Modifies alloc_large_system_hash to enable the use of vmalloc to alleviate
boottime allocation imbalances on NUMA systems.

Due to limited vmalloc space on some architectures (e.g. x86), the use of
vmalloc is enabled by default only on NUMA IA64 kernels.  There should be
no problem enabling this change for any other interested NUMA architecture.
Signed-off-by: Brent Casavant <bcasavan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] collect page_states only from online cpus · d841f01f

Alex Williamson authored Jan 03, 2005

I noticed the function __read_page_state() curiously high in a q-tools
profile of a write to a software raid0 device. Seems this is because we're
checking page_states for all possible cpus and we have NR_CPUS possible
when CONFIG_HOTPLUG_CPU=y. The default config for ia64 is now NR_CPUS=512,
so on a little 8-way box, this is a significant waste of time. The patch
below updates __read_page_state() and __get_page_state() to only count
page_state info for online cpus. To keep the stats consistent, the
page_alloc notifier is updated to move page_states off of the cpu going
offline. On my profile, this dropped __read_page_state() back into the
noise and boosted block write performance by 5% (as measured by spew -
http://spew.berlios.de).
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
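
A simplified model of the change is sketched below.  The structure and
function names are invented for illustration, NR_CPUS_SIM is an arbitrary
small constant, and a plain array replaces the per-cpu data; the real
notifier's choice of destination CPU is not modelled.

```c
#include <assert.h>

#define NR_CPUS_SIM 8

struct page_state_sim {
    unsigned long counts[NR_CPUS_SIM];  /* one counter per possible CPU */
    unsigned char online[NR_CPUS_SIM];  /* 1 if the CPU is online */
};

/* Sum the counter over online CPUs only, skipping merely possible ones. */
static unsigned long read_page_state_sim(const struct page_state_sim *s)
{
    unsigned long total = 0;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS_SIM; cpu++)
        if (s->online[cpu])
            total += s->counts[cpu];
    return total;
}

/*
 * When a CPU goes offline, fold its counts into a surviving CPU so the
 * online-only sum stays consistent.
 */
static void cpu_going_offline(struct page_state_sim *s, int dead, int target)
{
    assert(dead != target && s->online[target]);
    s->counts[target] += s->counts[dead];
    s->counts[dead] = 0;
    s->online[dead] = 0;
}
```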

[PATCH] slab: Add more arch overrides to control object alignment · d32d6f8a

Manfred Spraul authored Jan 03, 2005

Add ARCH_SLAB_MINALIGN and document ARCH_KMALLOC_MINALIGN: The flags allow
the arch code to override the default minimum object alignment
(BYTES_PER_WORD).
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
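
The override pattern looks roughly like this in plain C.  The value 16
plays the role of a hypothetical architecture header, the fallback value
and the helper names are assumptions, and the real slab code is more
involved.

```c
#include <assert.h>
#include <stddef.h>

#define BYTES_PER_WORD sizeof(void *)

/* A hypothetical arch header could define this before the allocator. */
#define ARCH_SLAB_MINALIGN 16

/* Fallback used only when the architecture did not override it. */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN BYTES_PER_WORD
#endif

/* Effective alignment: the caller's request, but never below the minimum. */
static size_t sketch_object_align(size_t requested)
{
    size_t align = ARCH_SLAB_MINALIGN;

    if (requested > align)
        align = requested;
    assert((align & (align - 1)) == 0);     /* must be a power of two */
    return align;
}

/* Round an object size up to the effective alignment. */
static size_t sketch_object_size(size_t size, size_t requested_align)
{
    size_t align = sketch_object_align(requested_align);

    return (size + align - 1) & ~(align - 1);
}
```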

[PATCH] do_anonymous_page() use SetPageReferenced · a161d268

Andrew Morton authored Jan 03, 2005

mark_page_accessed() is more heavyweight than we need: the page is already
headed for the active list, so setting the software-referenced bit is
equivalent.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] mark_page_accessed() for read()s on non-page boundaries · 21adf7ac

Miquel van Smoorenburg authored Jan 03, 2005

When reading a (partial) page from disk using read(), the kernel only marks
the page as "accessed" if the read started at a page boundary.  This means
that files that are accessed randomly at non-page boundaries (usually
database style files) will not be cached properly.

The patch below uses the readahead state instead.  If a page is read(), it
is marked as "accessed" if the previous read() was for a different page,
whatever the offset in the page.
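
The difference between the two rules can be modelled as below.  This is a
sketch only: struct ra_state_sim, the 4096-byte page size and both
function names are assumptions, and the real code works on the kernel's
readahead state rather than on this toy structure.

```c
#include <assert.h>

#define PAGE_SHIFT_SIM 12UL
#define PAGE_SIZE_SIM  (1UL << PAGE_SHIFT_SIM)
#define NO_PREV_PAGE   (~0UL)

struct ra_state_sim {
    unsigned long prev_page;    /* index of the page read last time */
};

/* Old rule: mark accessed only when the read starts on a page boundary. */
static int old_marks_accessed(unsigned long pos)
{
    return (pos & (PAGE_SIZE_SIM - 1)) == 0;
}

/* New rule: mark accessed when the page differs from the previous read. */
static int new_marks_accessed(struct ra_state_sim *ra, unsigned long pos)
{
    unsigned long index = pos >> PAGE_SHIFT_SIM;
    int mark;

    assert(ra != 0);
    mark = (index != ra->prev_page);
    ra->prev_page = index;
    return mark;
}
```

A random 128-byte read at offset 4196 is ignored by the old rule but
counted by the new one, while a second read in the same page is not
counted twice.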

Testing results:


- Boot kernel with mem=128M

- create a testfile of size 8 MB on a partition. Unmount/mount.

- then generate about 10 MB/sec streaming writes

	for i in `seq 1 1000`
	do
		dd if=/dev/zero of=junkfile.$i bs=1M count=10
		sync
		cat junkfile.$i > /dev/null
		sleep 1
	done

- use an application that reads 128 bytes 64000 times from a
  random offset in the 64 MB testfile.

1. Linux 2.6.10-rc3 vanilla, no streaming writes:

# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.03s user 0.22s system 5% cpu 4.456 total

2. Linux 2.6.10-rc3 vanilla, streaming writes:

# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.03s user 0.16s system 2% cpu 7.667 total
# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.03s user 0.37s system 1% cpu 23.294 total
# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.02s user 0.99s system 1% cpu 1:11.52 total
# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.03s user 0.21s system 2% cpu 10.273 total

3. Linux 2.6.10-rc3 with read-page-access.patch , streaming writes:

# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.02s user 0.21s system 3% cpu 7.634 total
# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.04s user 0.22s system 2% cpu 9.588 total
# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.02s user 0.12s system 24% cpu 0.563 total
# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.03s user 0.13s system 98% cpu 0.163 total

With the read-page-access.patch, the kernel keeps the 8 MB testfile cached
as expected, while without it, it doesn't.

So this is useful for workloads where one smallish (wrt RAM) file is read
randomly over and over again (like heavily used database indexes), while
other I/O is going on.  Plain 2.6 caches those files poorly, if the app
uses plain read().
Signed-Off-By: Miquel van Smoorenburg <miquels@cistron.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] make sure ioremap only tests valid addresses · bbd4c45d

Dave Hansen authored Jan 03, 2005

When CONFIG_HIGHMEM=y, but ZONE_NORMAL isn't quite full, there is, of
course, no actual memory at *high_memory.  This isn't a problem with normal
virt<->phys translations because it's never dereferenced, but
CONFIG_NONLINEAR is a bit more finicky.  So, don't do virt_to_phys() to
non-existent addresses.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] kill off highmem_start_page · 422e43d4

Dave Hansen authored Jan 03, 2005

People love to do comparisons with highmem_start_page.  However, where
CONFIG_HIGHMEM=y and there is no actual highmem, there's no real page at
*highmem_start_page.

That's usually not a problem, but CONFIG_NONLINEAR is a bit more strict and
catches the bogus address translations.

There are about a gillion different ways to find out if a 'struct page' is
highmem or not.  Why not just check page_flags?  Just use PageHighMem()
wherever there used to be a highmem_start_page comparison.  Then, kill off
highmem_start_page.

This removes more code than it adds, and gets rid of some nasty
#ifdefs in .c files.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
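
The flag-based test can be illustrated with a toy page structure.  The
dedicated bit used here (PG_HIGHMEM_SIM) and the helper names are
assumptions made for the sketch; the kernel's PageHighMem() may derive
the answer from the page flags in a different way.

```c
#include <assert.h>

#define PG_HIGHMEM_SIM 5UL      /* hypothetical bit number */

struct page_sim {
    unsigned long flags;
};

/*
 * Ask the page itself whether it is highmem.  No pointer comparison
 * against a possibly non-existent "first highmem page" is needed.
 */
static int page_high_mem(const struct page_sim *page)
{
    assert(page != 0);
    return (int)((page->flags >> PG_HIGHMEM_SIM) & 1UL);
}

/* Typical caller: decide whether a temporary mapping is required. */
static int copy_needs_mapping(const struct page_sim *page)
{
    return page_high_mem(page);
}
```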

[PATCH] mm: overcommit updates · ea86630e

Andries E. Brouwer authored Jan 03, 2005

Alan made overcommit mode 2 and it doesn't work at all.  A process passing
the limit often does so at a moment of stack extension, and is killed by a
segfault, not better than being OOM-killed.

Another problem is that close to the edge no other processes can be
started, so that a sysadmin has problems logging in and investigating.

Below is a patch that does 3 things:

(1) It reserves a reasonable amount of virtual stack space (amount
    randomly chosen, no guarantees given) when the process is started, so
    that the common utilities will not be killed by segfault on stack
    extension.

(2) It reserves a reasonable amount of virtual memory for root, so that
    root can do things when the system is out-of-memory

(3) It limits a single process to 97% of what is left, so that also an
    ordinary user is able to use getty, login, bash, ps, kill and similar
    things when one of her processes got out of control.
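
Points (2) and (3) might be modelled along these lines.  Everything in
this sketch is an assumption made for illustration: the function name,
the way the root reserve is passed in, the order of the two deductions,
and the exact rounding of the 97% figure are not taken from the patch.

```c
#include <assert.h>

/*
 * How many more pages a single request may commit under strict
 * accounting.  All quantities are in pages.
 */
static unsigned long request_limit(unsigned long commit_limit,
                                   unsigned long committed,
                                   int is_root,
                                   unsigned long root_reserve)
{
    unsigned long left;

    /* Nothing is left once the limit has been reached. */
    left = commit_limit > committed ? commit_limit - committed : 0;

    /* Root may use everything that remains. */
    if (is_root)
        return left;

    /* Ordinary users cannot touch the part held back for root ... */
    left = left > root_reserve ? left - root_reserve : 0;

    /* ... and one process gets at most 97% of what is left. */
    return left - (left / 100) * 3;
}
```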

Since the current overcommit mode 2 is not really useful, I did not give
this a new number.

The patch is just for playing, not to be applied by Linus.  But, Andrew, I
hope that you would be willing to put this in -mm so that people can
experiment.  Of course it only does something if one sets overcommit mode
to 2.

The past month I have pressured people asking for feedback, and now have
about a dozen reports, mostly positive, one very positive.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] mempolicy optimisation · 182e0eba

Andrea Arcangeli authored Jan 03, 2005

Some optimizations in mempolicy.c: avoid rebalancing the tree while
destroying it, break loops early, and skip checks for invariant conditions
in the replace operation.
Signed-off-by: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Simplified readahead congestion control · 250c01d0

Ram Pai authored Jan 03, 2005

Reinstate the feature wherein readahead will be bypassed if the underlying
queue is read-congested.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Simplified readahead · 6f734a1a

Steven Pratt authored Jan 03, 2005

With Ram Pai <linuxram@us.ibm.com>

- request size is now passed into page_cache_readahead.  This allows the
  removal of the size averaging code in the current readahead logic.

- readahead rampup is now faster  (especially for larger request sizes)

- No longer "slow read path".  Readahead is turned off at first random access,
  turned back on at first sequential access.

- Code now handles thrashing, slowly reducing readahead window until
  thrashing stops, or min size reached.

- Returned to old behavior where first access is assumed sequential only if
  at offset 0.

- designed to handle larger (1M or above) window sizes efficiently
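
A toy state machine for some of the rules listed above is sketched here.
It is not the patch's algorithm: struct ra_sim, the doubling ramp-up and
the use of the request size as the initial window are assumptions, and
thrashing detection is left out entirely.

```c
#include <assert.h>

struct ra_sim {
    unsigned long next;     /* page expected if access is sequential */
    unsigned long window;   /* current readahead window, 0 means off */
    int started;            /* 0 until the first access is seen */
};

/*
 * Update the state for a read of `req` pages starting at `page` and
 * return the resulting window size, capped at `max`.
 */
static unsigned long ra_on_read(struct ra_sim *ra, unsigned long page,
                                unsigned long req, unsigned long max)
{
    int sequential;

    assert(ra != 0 && req > 0 && max > 0);

    /* The first access counts as sequential only at offset 0. */
    sequential = ra->started ? (page == ra->next) : (page == 0);
    ra->started = 1;
    ra->next = page + req;

    if (!sequential) {
        ra->window = 0;             /* random access: readahead off */
        return 0;
    }

    if (ra->window == 0)
        ra->window = req;           /* back on at sequential access */
    else
        ra->window *= 2;            /* assumed ramp-up rule */
    if (ra->window > max)
        ra->window = max;
    return ra->window;
}
```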


Benchmark results:

machine 1: 8 way pentiumIV 1GB memory, tests run to 36GB SCSI disk
(Similar results were seen on a 1 way 866MHz box with IDE disk.)

TioBench:

tiobench.pl --dir /mnt/tmp --block 4096 --size 4000 --numruns 2 --threads 1(4,16,64)

4k request size sequential read results in MB/sec

  Threads         2.6.9    w/patches    %diff         diff
