Commits · adaa6aad36dd8d00d5239ea820633e83ce19171c · Kirill Smelkov / linux

24 Jun, 2004 40 commits

[PATCH] mips: SGI A2 audio rewrite and 2.6 fixes · adaa6aad

Andrew Morton authored Jun 23, 2004

From: Ralf Baechle <ralf@linux-mips.org>

Fix HAL2 audio driver for the SGI A2 audio subsystem and rewrite large
parts of it to finally work.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

adaa6aad

[PATCH] make total_swap_pages a long · e62d1b67

Andrew Morton authored Jun 23, 2004

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

e62d1b67

[PATCH] Make nr_swap_pages a long · a5b5323b

Andrew Morton authored Jun 23, 2004

From: Anton Blanchard <anton@samba.org>

../include/linux/swap.h:extern int nr_swap_pages;       /* XXX: shouldn't this be ulong? --hch */

Sounds like it should be to me.  Some of the code checks for nr_swap_pages
< 0 so I made it a long instead.  I had to fix up the ppc64 show_mem() (I'm
guessing there will be other trivial changes required in other 64bit archs,
I can find and fix those if you want).

I also noticed that the ppc64 show_mem() used ints to store page counts.
We can overflow that, so make them unsigned long.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

a5b5323b

[PATCH] nr_pagecache can go negative · 6a948bc8

Andrew Morton authored Jun 23, 2004

We use per-cpu counters for the system-wide pagecache accounting.  The
counters spill into the global nr_pagecache atomic_t when they underflow or
overflow.

Hence it is possible, under weird circumstances, for nr_pagecache to go
negative.  Anton says he has hit this.
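
A minimal userspace sketch of how the global counter can dip below zero.  The names (struct pagecache_counter, acct, read_clamped), the three CPUs and the threshold of 4 are illustrative assumptions, not the kernel's code.  Clamping the read-out at zero is one plausible remedy and is also an assumption, since the message above does not spell out the fix.

```c
#define NCPUS     3
#define THRESHOLD 4	/* assumed spill threshold for this sketch */

struct pagecache_counter {
	long global;		/* stands in for the nr_pagecache atomic_t */
	long local[NCPUS];	/* per-cpu deltas not yet folded into global */
};

/* Account delta pages on one cpu; spill into global past the threshold. */
static void acct(struct pagecache_counter *c, int cpu, long delta)
{
	c->local[cpu] += delta;
	if (c->local[cpu] > THRESHOLD || c->local[cpu] < -THRESHOLD) {
		c->global += c->local[cpu];
		c->local[cpu] = 0;
	}
}

/* True page count: global plus every unspilled per-cpu delta. */
static long exact_total(const struct pagecache_counter *c)
{
	long sum = c->global;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}

/* Reader that only looks at global and never reports a negative count. */
static unsigned long read_clamped(const struct pagecache_counter *c)
{
	return c->global < 0 ? 0 : (unsigned long)c->global;
}
```

If cpu 0 and cpu 1 each add 4 pages (both stay below the spill threshold) and cpu 2 then removes 5, the global value is -5 although 3 pages really exist.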
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

6a948bc8

[PATCH] help text for FB_RIVA_I2C · 26b193e6

Andrew Morton authored Jun 23, 2004

From: "Antonino A. Daplas" <adaplas@hotpop.com>
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

26b193e6

[PATCH] Support NetMOS based PCI cards providing serial and parallel ports · 296d3c78

Andrew Morton authored Jun 23, 2004

From: Christoph Lameter <christoph@graphe.net>

Attached a patch to support a variety of PCI based serial and parallel port
I/O ports (typically labeled 222N-2 or 9835).

I think this should go into 2.6.0 since it has been out there for a long
time and is just some additional driver support that somehow fell through
the cracks in 2.4.X. Tim Waugh submitted it in the 2.4.X series.

See also http://winterwolf.co.uk/pciio
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

296d3c78

[PATCH] abs() fixes · 8dac3248

Andrew Morton authored Jun 23, 2004

OK, the pending abs() disaster has hit:

drivers/usb/class/audio.c:404: warning: static declaration of 'abs' follows non-static declaration

This is due to the declaration in kernel.h.  AFAIK there's not even a matching
definition for that.

The patch implements abs() as a macro in kernel.h and kills off various
private implementations.
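
A sketch of what an abs() macro of this kind can look like.  It relies on the GCC statement-expression extension so that the argument is evaluated only once.  The name abs_macro is a stand-in chosen here to avoid clashing with abs() from <stdlib.h> in userspace, and distance() is a hypothetical caller; neither is taken from the patch.

```c
/*
 * Evaluate x exactly once, then return its magnitude.
 * ({ ... }) is a GCC/Clang statement expression, not ISO C.
 */
#define abs_macro(x) ({				\
		int x_ = (x);			\
		(x_ < 0) ? -x_ : x_;		\
	})

/* Hypothetical caller: no private abs() helper is needed any more. */
static int distance(int a, int b)
{
	return abs_macro(a - b);
}
```

Because the argument is copied into a local first, an expression with side effects such as abs_macro(i++) increments i only once.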
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

8dac3248

[PATCH] reduce function inlining in slab.c · f875aa02

Andrew Morton authored Jun 23, 2004

From: Manfred Spraul <manfred@colorfullife.com>

slab.c contains too many inline functions:

- some functions that are not performance critical were inlined.  Waste
  of text size.

- The debug code relies on __builtin_return_address(0) to keep track of
  the callers.  According to rmk, gcc didn't inline some functions as
  expected and that resulted in useless debug output.  This was probably
  caused by the large debug-only inline functions.

The attached patch removes most inline functions:

- the empty on release/huge on debug inline functions were replaced with
  empty macros on release/normal functions on debug.

- spurious inline statements were removed.
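
The "empty macro on release, normal function on debug" pattern can be sketched as below.  CACHE_DEBUG, check_object() and cache_get() are hypothetical names used only for illustration.  Keeping the debug variant out of line is what lets __builtin_return_address(0) in the real debug code see a meaningful caller.

```c
#ifndef CACHE_DEBUG
#define CACHE_DEBUG 0	/* hypothetical switch; build with -DCACHE_DEBUG=1 for checks */
#endif

static int checks_run;	/* number of debug checks that actually executed */

#if CACHE_DEBUG
/* Debug build: an ordinary, non-inline function. */
static void check_object(const void *obj)
{
	if (obj)
		checks_run++;
}
#else
/* Release build: the check compiles away to nothing. */
#define check_object(obj) do { } while (0)
#endif

/* The caller looks the same in both configurations. */
static void *cache_get(void *obj)
{
	check_object(obj);
	return obj;
}
```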

The code is down to 6 inline functions: three one-liners for struct
abstractions, one for a might_sleep_if test and two for the performance
critical __cache_alloc / __cache_free functions.

Note: If an embedded arch wants to save a few bytes by uninlining
__cache_{free,alloc}: The right way to do that is to fold the functions
into kmem_cache_xy and then replace kmalloc with
kmem_cache_alloc(kmem_find_general_cachep(),).

Signed-Off: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

f875aa02

[PATCH] hwcache align kmalloc caches · b167eef8

Andrew Morton authored Jun 23, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Reversing the patches that made all caches hw cacheline aligned had an
unintended side effect on the kmalloc caches: Before they had the
SLAB_HWCACHE_ALIGN flag set, now it's clear.  This breaks one sgi driver -
it expects aligned caches.  Additionally I think it's the right thing to
do: It costs virtually nothing (the caches are power-of-two sized) and
could reduce false sharing.

Additionally, the patch adds back the documentation for the
SLAB_HWCACHE_ALIGN flag.

Signed-Off: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

b167eef8

[PATCH] tweak the buddy allocator for better I/O merging · c75b81a5

Andrew Morton authored Jun 23, 2004

From: William Lee Irwin III <wli@holomorphy.com>

Based on Arjan van de Ven's idea, with guidance and testing from James
Bottomley.

The physical ordering of pages delivered to the IO subsystem is strongly
related to the order in which fragments are subdivided from larger blocks
of memory tracked by the page allocator.

Consider a single MAX_ORDER block of memory in isolation acted on by a
sequence of order 0 allocations in an otherwise empty buddy system.
Subdividing the block beginning at the highest addresses will yield all the
pages of the block in reverse, and subdividing the block beginning at the
lowest addresses will yield all the pages of the block in physical address
order.
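
The thought experiment above can be reproduced with a toy model.  This is a sketch under simplifying assumptions: a single block of order 3 (8 pages), one free slot per order (enough here, since a lower order is only refilled when it is empty), and hypothetical names throughout.  It is not the kernel's page allocator.

```c
#define MAX_ORDER 3	/* toy value: one block of 8 pages */

enum split_policy { KEEP_LOW_HALF, KEEP_HIGH_HALF };

struct buddy {
	int free_base[MAX_ORDER + 1];	/* first page of the free block, or -1 */
};

static void buddy_init(struct buddy *b)
{
	int order;

	for (order = 0; order <= MAX_ORDER; order++)
		b->free_base[order] = -1;
	b->free_base[MAX_ORDER] = 0;	/* one MAX_ORDER block starting at page 0 */
}

/* Allocate one order-0 page; returns its page number or -1 when empty. */
static int alloc_page(struct buddy *b, enum split_policy policy)
{
	int order = 0, base;

	while (order <= MAX_ORDER && b->free_base[order] < 0)
		order++;
	if (order > MAX_ORDER)
		return -1;

	base = b->free_base[order];
	b->free_base[order] = -1;

	/* Split down to order 0, putting the unused half back each time. */
	while (order > 0) {
		int half;

		order--;
		half = 1 << order;
		if (policy == KEEP_LOW_HALF) {
			b->free_base[order] = base + half;	/* upper half stays free */
		} else {
			b->free_base[order] = base;		/* lower half stays free */
			base += half;
		}
	}
	return base;
}
```

With KEEP_LOW_HALF, eight successive allocations return pages 0 through 7 in physical order; with KEEP_HIGH_HALF they come back as 7 down to 0.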

Empirical tests demonstrate this ordering is preserved, and that changing
the order of subdivision so that the lowest page is split off first
resolves the sglist merging difficulties encountered by driver authors at
Adaptec and others in James Bottomley's testing.

James found that before this patch, there were 40 merges out of about 32K
segments.  Afterward, there were 24007 merges out of 19513 segments, for a
merge rate of about 55%.  Merges of 128 segments, the maximum allowed, were
observed afterward, where beforehand they never occurred.  It also improves
dbench on my workstation and works fine there.
Signed-off-by: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

c75b81a5

[PATCH] Use fancy wakeups in wait.h · 758e48e4

Andrew Morton authored Jun 23, 2004

Use the more SMP-friendly prepare_to_wait()/finish_wait() in wait_event() and
friends.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

758e48e4

[PATCH] dnotify.c: use inode->i_lock in place of dn_lock · 0ac04ac1

Andrew Morton authored Jun 23, 2004

From: "Adam J. Richter" <adam@yggdrasil.com>

Replace the use of a global spinlock with the per-inode ->i_lock.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

0ac04ac1

[PATCH] vm: vfs shrinkage tuning · a4411519

Andrew Morton authored Jun 23, 2004

Some people want the dentry and inode caches shrunk harder, others want them
shrunk more reluctantly.

The patch adds /proc/sys/vm/vfs_cache_pressure, which tunes the vfs cache
versus pagecache scanning pressure.

- at vfs_cache_pressure=0 we don't shrink dcache and icache at all.

- at vfs_cache_pressure=100 there is no change in behaviour.

- at vfs_cache_pressure > 100 we reclaim dentries and inodes harder.
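
One way to obtain the three behaviours listed above is to scale the number of unused objects reported to the shrinker by vfs_cache_pressure/100.  The helper below is a sketch of that idea; the name vfs_scaled_count() and the exact arithmetic are assumptions, not quoted from the patch.

```c
/*
 * Scale the count of unused dentries or inodes offered for reclaim.
 * pressure == 0   -> nothing is offered, so the caches are never shrunk
 * pressure == 100 -> the count is passed through unchanged
 * pressure > 100  -> the caches look bigger than they are and shrink harder
 */
static unsigned long vfs_scaled_count(unsigned long nr_unused,
				      unsigned int vfs_cache_pressure)
{
	/* Divide first so that large counts cannot overflow. */
	return (nr_unused / 100) * vfs_cache_pressure;
}
```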


The number of kilobytes of slab left after a slocate.cron on my 256MB test
box:

vfs_cache_pressure=100000   33480
vfs_cache_pressure=10000    61996
vfs_cache_pressure=1000     104056
vfs_cache_pressure=200      166340
vfs_cache_pressure=100      190200
vfs_cache_pressure=50       206168

Of course, this just left more directory and inode pagecache behind instead of
vfs cache.  Interestingly, on this machine the entire slocate run fits into
pagecache, but not into VFS caches.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

a4411519

[PATCH] vmscan.c: dont reclaim too many pages · 42b8d994

Andrew Morton authored Jun 23, 2004

The shrink_zone() logic can, under some circumstances, cause far too many
pages to be reclaimed. Say, we're scanning at high priority and suddenly hit
a large number of reclaimable pages on the LRU.

Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
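
A sketch of the bail-out, using a byte array in place of the LRU.  The value 32 for SWAP_CLUSTER_MAX, the chunked loop and the names struct scan_result and shrink_sketch() are assumptions made for illustration.

```c
#define SWAP_CLUSTER_MAX 32UL	/* value assumed for this sketch */

struct scan_result {
	unsigned long scanned;
	unsigned long reclaimed;
};

/*
 * reclaimable[i] != 0 marks page i as freeable.  Pages are processed in
 * chunks of SWAP_CLUSTER_MAX, and scanning stops once at least
 * SWAP_CLUSTER_MAX pages have been reclaimed, even if more were requested.
 */
static struct scan_result shrink_sketch(const unsigned char *reclaimable,
					unsigned long nr_to_scan)
{
	struct scan_result r = { 0, 0 };

	while (r.scanned < nr_to_scan) {
		unsigned long chunk = nr_to_scan - r.scanned;
		unsigned long i;

		if (chunk > SWAP_CLUSTER_MAX)
			chunk = SWAP_CLUSTER_MAX;
		for (i = 0; i < chunk; i++)
			if (reclaimable[r.scanned + i])
				r.reclaimed++;
		r.scanned += chunk;

		if (r.reclaimed >= SWAP_CLUSTER_MAX)
			break;	/* enough progress: do not over-reclaim */
	}
	return r;
}
```

When every page is reclaimable, a request to scan 1000 pages stops after the first 32 instead of freeing all 1000.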
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

42b8d994

[PATCH] vmscan.c scan rate fixes · 2332dc78

Andrew Morton authored Jun 23, 2004

We've been futzing with the scan rates of the inactive and active lists far
too much, and it's still not right (Anton reports interrupt-off times of over
a second).

- We have this logic in there from 2.4.early (at least) which tries to keep
  the inactive list 1/3rd the size of the active list.  Or something.

  I really cannot see any logic behind this, so toss it out and change the
  arithmetic in there so that all pages on both lists have equal scan rates.

- Chunk the work up so we never hold interrupts off for more than 32 pages
  worth of scanning.

- Make the per-zone scan-count accumulators unsigned long rather than
  atomic_t.

  Mainly because atomic_t's could conceivably overflow, but also because
  access to these counters is racy-by-design anyway.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

2332dc78

[PATCH] vmscan.c: shuffle things around · acba6041

Andrew Morton authored Jun 23, 2004

Move all the data structure declarations, macros and variable definitions to
less surprising places.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

acba6041

[PATCH] Fix and Reenable MSI Support on x86_64 · 0342e162

Andrew Morton authored Jun 23, 2004

From: long <tlnguyen@snoqualmie.dp.intel.com>

MSI support for x86_64 is currently disabled in the 2.6.x kernel.  Below is
the patch, which provides a fix and reenables it.

In addition, the patch prints an info message during kernel boot when
vector-base indexing is configured.

Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

0342e162

[PATCH] make irqaction use a cpu mask · 8c05319f

Andrew Morton authored Jun 23, 2004

From: William Lee Irwin III <wli@holomorphy.com>

The following patch makes irqaction's ->mask a cpumask as it was intended
to be and wraps up the rest of the sweep.  Only struct irqaction is
usefully greppable, so there may be some assignments to ->mask missing
still.  This removes more code than it adds.

From: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

8c05319f

[PATCH] alpha: cpumask fixups · 4320cbbd

Andrew Morton authored Jun 23, 2004

From: William Lee Irwin III <wli@holomorphy.com>

The cpumask patches broke alpha's build, even without the irqaction
patch, largely centering around cpu_possible_map.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

4320cbbd

[PATCH] clean up cpumask_t temporaries · a3dcb7f4

Andrew Morton authored Jun 23, 2004

From: Rusty Russell <rusty@rustcorp.com.au>

Paul Jackson's cpumask tour-de-force allows us to get rid of those stupid
temporaries which we used to hold CPU_MASK_ALL to hand them to functions.
This used to break NR_CPUS > BITS_PER_LONG.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

a3dcb7f4

[PATCH] cpumask: comment, spacing tweaks · 02d7effd

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

Tweak cpumask.h comments, spacing:

- Add comments for cpu_present_map macros: num_present_cpus() and
  cpu_present()

- Remove comments for obsolete macros: cpu_set_online(),
  cpu_set_offline()

- Reorder a few comment lines, to match the code and confuse readers of
  this patch

- Tabify one chunk of code
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

02d7effd

[PATCH] cpumask: optimize various uses of new cpumasks · 4b81e400

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

Make use of for_each_cpu_mask() macro to simplify and optimize a couple of
sparc64 per-CPU loops.

Optimize a bit of cpumask code for asm-i386/mach-es7000

Convert physids_complement() to use both args in the files
include/asm-i386/mpspec.h, include/asm-x86_64/mpspec.h.

Remove cpumask hack from asm-x86_64/topology.h routine pcibus_to_cpumask().

Clarify and slightly optimize several cpumask manipulations in kernel/sched.c
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

4b81e400

[PATCH] cpumask: Remove no longer used obsolete macro emulation · 5ffa67fc

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

Now that the emulation of the obsolete cpumask macros is no longer needed,
remove it from cpumask.h
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

5ffa67fc

[PATCH] ppc64: cpu_online fix · ea72b241

Andrew Morton authored Jun 23, 2004

include/asm/smp.h:55:1: warning: "cpu_possible" redefined
include/asm/smp.h:54:1: warning: "cpu_online" redefined
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

ea72b241

[PATCH] x86_64: cpu_online fix · b8a02d07

Andrew Morton authored Jun 23, 2004

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

b8a02d07

[PATCH] cpumask: remove obsolete cpumask macro uses - other archs · 7f1c9f57

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

Remove by recoding other uses of the obsolete cpumask const, coerce and
promote macros.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

7f1c9f57

[PATCH] cpumask: remove obsolete cpumask macro uses - i386 arch · 9eb0dcc1

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

Remove by recoding i386 uses of the obsolete cpumask const, coerce and promote
macros.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

9eb0dcc1

[PATCH] cpumask: remove 26 no longer used cpumask*.h files · ed880528

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

With the cpumask rewrite in the previous patch, these various
include/asm-*/cpumask*.h headers are no longer used.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

ed880528

[PATCH] cpumask: rewrite cpumask.h - single bitmap based implementation · f3344dc3

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

Major rewrite of cpumask to use a single implementation, as a struct-wrapped
bitmap.

This patch leaves some 26 include/asm-*/cpumask*.h header files orphaned - to
be removed next patch.

Some nine cpumask macros for const variants and to coerce and promote between
an unsigned long and a cpumask are obsolete.  Simple emulation wrappers are
provided in this patch for these obsolete macros, which can be removed once
each of the 3 archs (i386, ppc64, x86_64) using them are recoded in follow-on
patches to not need them.

The CPU_MASK_ALL macro now avoids leaving possible garbage one bits in any
unused portion of the high word.
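
A userspace sketch of a struct-wrapped bitmap whose "all CPUs" value keeps the unused tail of the high word clear.  NR_CPUS = 70 is an arbitrary value chosen so that the last word is only partly used, and mask_all(), mask_test() and mask_weight() are hypothetical helper names, not the operators from cpumask.h.

```c
#include <limits.h>
#include <stddef.h>

#define NR_CPUS		70	/* arbitrary: forces a partly used last word */
#define BITS_PER_LONG	(CHAR_BIT * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* The mask is a bitmap wrapped in a struct so it can be passed by value. */
typedef struct {
	unsigned long bits[BITS_TO_LONGS(NR_CPUS)];
} cpumask_t;

/* Every valid cpu bit set, no stray one bits above NR_CPUS - 1. */
static cpumask_t mask_all(void)
{
	cpumask_t m;
	size_t nlongs = BITS_TO_LONGS(NR_CPUS);
	size_t tail = NR_CPUS % BITS_PER_LONG;
	size_t i;

	for (i = 0; i < nlongs; i++)
		m.bits[i] = ~0UL;
	if (tail)
		m.bits[nlongs - 1] = (1UL << tail) - 1;
	return m;
}

static int mask_test(const cpumask_t *m, unsigned int cpu)
{
	return (m->bits[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1UL;
}

/* Count set bits; a dirty tail would make this exceed NR_CPUS. */
static int mask_weight(const cpumask_t *m)
{
	int w = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < BITS_TO_LONGS(NR_CPUS) * BITS_PER_LONG; cpu++)
		w += mask_test(m, cpu);
	return w;
}
```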

An improved comment lists all available operators, for convenient browsing.

From: Mikael Pettersson <mikpe@csd.uu.se>

  2.6.7-rc3-mm1 changed CPU_MASK_NONE into something that isn't a valid
  rvalue (it only works inside struct initializers).  This caused compile-time
  errors in perfctr in UP x86 builds.

From: Arnd Bergmann <arnd@arndb.de>

  cpumask-5-10-rewrite-cpumaskh-single-bitmap-based from 2.6.7-rc3-mm1
  causes include2/asm/smp.h:54:1: warning: "cpu_online" redefined
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

f3344dc3

[PATCH] cpumask: bitmap inlining and optimizations · d6cf71d3

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

These bitmap improvements make it a suitable basis for fully supporting
cpumask_t and nodemask_t.  Inline macros with compile-time checks enable
generating tight code on both small and large systems (large meaning cpumask_t
requires more than one unsigned long's worth of bits).

The existing bitmap_<op> macros in lib/bitmap.c are renamed to __bitmap_<op>,
and wrappers for each bitmap_<op> are exposed in include/linux/bitmap.h

This patch _includes_ Bill Irwin's rewrite of the bitmap_shift operators to not
require a fixed length intermediate bitmap.

Improved comments list each available operator for easy browsing.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

d6cf71d3

[PATCH] cpumask: bitmap cleanup preparation for cpumask overhaul · ea0c1929

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

Document the bitmap bit model and handling of unused bits.

Tighten up bitmap so it does not generate nonzero bits in the unused tail if
it is not given any on input.

Add intersects, subset, xor and andnot operators.  Change bitmap_complement to
take two operands.

Add a couple of missing 'const' qualifiers on bitops test_bit and bitmap_equal
args.
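
The operators named above can be sketched in userspace as follows.  The bm_ prefix marks these as illustrative stand-ins rather than the kernel's bitmap functions, and the sketch assumes that input bitmaps carry no stray bits in their unused tail.

```c
#include <limits.h>
#include <stddef.h>

#define BM_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define BM_LONGS(n) (((n) + BM_BITS_PER_LONG - 1) / BM_BITS_PER_LONG)

/* Mask selecting the valid bits in the last word of an nbits-wide bitmap. */
static unsigned long bm_last_word_mask(unsigned int nbits)
{
	unsigned int tail = nbits % BM_BITS_PER_LONG;

	return tail ? (1UL << tail) - 1 : ~0UL;
}

/* Two-operand complement: dst = ~src, with the unused tail kept at zero. */
static void bm_complement(unsigned long *dst, const unsigned long *src,
			  unsigned int nbits)
{
	size_t i, n = BM_LONGS(nbits);

	for (i = 0; i < n; i++)
		dst[i] = ~src[i];
	if (n)
		dst[n - 1] &= bm_last_word_mask(nbits);
}

/* dst = a & ~b */
static void bm_andnot(unsigned long *dst, const unsigned long *a,
		      const unsigned long *b, unsigned int nbits)
{
	size_t i;

	for (i = 0; i < BM_LONGS(nbits); i++)
		dst[i] = a[i] & ~b[i];
}

/* Nonzero if a and b share at least one set bit. */
static int bm_intersects(const unsigned long *a, const unsigned long *b,
			 unsigned int nbits)
{
	size_t i;

	for (i = 0; i < BM_LONGS(nbits); i++)
		if (a[i] & b[i])
			return 1;
	return 0;
}

/* Nonzero if every bit set in a is also set in b. */
static int bm_subset(const unsigned long *a, const unsigned long *b,
		     unsigned int nbits)
{
	size_t i;

	for (i = 0; i < BM_LONGS(nbits); i++)
		if (a[i] & ~b[i])
			return 0;
	return 1;
}
```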
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

ea0c1929

[PATCH] cpumask: make cpu_present_map real even on non-smp · d2cec97b

Andrew Morton authored Jun 23, 2004

From: Paul Jackson <pj@sgi.com>

This patch makes cpu_present_map a real map for all configurations, instead of
a constant for non-SMP.  It also moves the definition of cpu_present_map out
of kernel/cpu.c into kernel/sched.c, because cpu.c isn't compiled into non-SMP
kernels.

The pattern is that each of the possible, present and online cpu maps is an
actual kernel global cpumask_t variable, for all configurations.  They are
documented in include/linux/cpumask.h.  Some of the UP (NR_CPUS=1) code
cheats, and hardcodes the assumption that the single bit position of these
maps is always set, as an optimization.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

d2cec97b

[PATCH] rcu: avoid passing an argument to the callback function · 8c1ce9d6

Andrew Morton authored Jun 23, 2004

From: Dipankar Sarma <dipankar@in.ibm.com>

This patch changes the call_rcu() API and avoids passing an argument to the
callback function as suggested by Rusty.  Instead, it is assumed that the
user has embedded the rcu head into a structure that is useful in the
callback and the rcu_head pointer is passed to the callback.  The callback
can use container_of() to get the pointer to its structure and work with
it.  Together with the rcu-singly-link patch, it reduces the rcu_head size
by 50%.  Considering that we use these in things like struct dentry and
struct dst_entry, this is good savings in space.

An example :

struct my_struct {
	struct rcu_head rcu;
	int x;
	int y;
};

void my_rcu_callback(struct rcu_head *head)
{
	struct my_struct *p = container_of(head, struct my_struct, rcu);
	kfree(p);
}

void my_delete(struct my_struct *p)
{
	...
	call_rcu(&p->rcu, my_rcu_callback);
	...
}
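
The same idea as a self-contained userspace sketch, which also shows where the 50% saving comes from.  The old_rcu_head layout (a doubly linked list node, a callback and a separate argument) is an assumption about the previous structure, call_rcu_now() is a hypothetical stand-in that invokes the callback immediately instead of after a grace period, and plain free() replaces the kernel allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head { struct list_head *next, *prev; };

/* Assumed previous layout: four pointer-sized words. */
struct old_rcu_head {
	struct list_head list;
	void (*func)(void *obj);
	void *arg;
};

/* Layout with a singly linked list and no argument: two words. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct my_struct {
	struct rcu_head rcu;
	int x;
	int y;
};

static int freed_x;	/* records which object the callback saw */

static void my_rcu_callback(struct rcu_head *head)
{
	/* Recover the enclosing object from the embedded rcu_head. */
	struct my_struct *p = container_of(head, struct my_struct, rcu);

	freed_x = p->x;
	free(p);
}

/* Hypothetical stand-in for call_rcu(): no grace period, runs at once. */
static void call_rcu_now(struct rcu_head *head,
			 void (*func)(struct rcu_head *head))
{
	head->next = NULL;
	head->func = func;
	head->func(head);
}
```

On common ABIs, where data and function pointers have the same size, sizeof(struct rcu_head) is half of sizeof(struct old_rcu_head).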
Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

8c1ce9d6

[PATCH] reduce rcu_head size - core · b659a6fb

Andrew Morton authored Jun 23, 2004

From: Dipankar Sarma <dipankar@in.ibm.com>

This reduces the RCU head size by using a singly linked list to maintain them.
The ordering of the callbacks is still maintained as before by using a tail
pointer for the next list.
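
A sketch of a singly linked list that still appends in FIFO order by keeping a pointer to the last next field.  The names struct cb, struct cb_queue, queue_init() and queue_add() are illustrative, not the fields used in rcupdate.c.

```c
#include <stddef.h>

struct cb {
	struct cb *next;
	int id;
};

struct cb_queue {
	struct cb *head;	/* first queued callback, or NULL */
	struct cb **tail;	/* address of the last next pointer */
};

static void queue_init(struct cb_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;	/* empty list: the tail points at head itself */
}

/* Append in O(1) without a prev pointer; the order of insertion is kept. */
static void queue_add(struct cb_queue *q, struct cb *n)
{
	n->next = NULL;
	*q->tail = n;
	q->tail = &n->next;
}
```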

Signed-Off-By : Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

b659a6fb

[PATCH] rcu lock update: Code move & cleanup · 72914d30

Andrew Morton authored Jun 23, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Step three for reducing cacheline thrashing within rcupdate.c:

Cleanup and code move from <linux/rcupdate.h> to kernel/rcupdate.c: Remove
internal details from the header file.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

72914d30

[PATCH] rcu lock update: Use a sequence lock for starting batches · 720e8a63

Andrew Morton authored Jun 23, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Step two for reducing cacheline thrashing within rcupdate.c:

rcu_process_callbacks always acquires rcu_ctrlblk.state.mutex and calls
rcu_start_batch, even if the batch is already running or already scheduled to
run.

This can be avoided with a sequence lock: a sequence lock makes it possible to
read the current batch number and next_pending atomically.  If next_pending is
already set, then there is no need to acquire the global mutex.
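
The read side of such a scheme can be sketched with C11 atomics.  This is a simplified userspace model, not the kernel's seqlock: it assumes writers are already serialised by some other lock, uses sequentially consistent atomics throughout for simplicity, and all names (struct batch_state, batch_write(), batch_read()) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct batch_state {
	atomic_uint seq;		/* odd while a writer is mid-update */
	atomic_long cur;		/* current batch number */
	atomic_bool next_pending;	/* another batch requested? */
};

static void batch_init(struct batch_state *s)
{
	atomic_init(&s->seq, 0);
	atomic_init(&s->cur, 0);
	atomic_init(&s->next_pending, false);
}

/* Writer: callers must already exclude each other (assumed). */
static void batch_write(struct batch_state *s, long cur, bool next_pending)
{
	atomic_fetch_add(&s->seq, 1);	/* sequence becomes odd */
	atomic_store(&s->cur, cur);
	atomic_store(&s->next_pending, next_pending);
	atomic_fetch_add(&s->seq, 1);	/* sequence becomes even again */
}

/* Reader: takes no lock, retries if a writer was active meanwhile. */
static void batch_read(struct batch_state *s, long *cur, bool *next_pending)
{
	unsigned int start;

	do {
		start = atomic_load(&s->seq);
		*cur = atomic_load(&s->cur);
		*next_pending = atomic_load(&s->next_pending);
	} while ((start & 1) || atomic_load(&s->seq) != start);
}
```

A reader that finds next_pending already set has a consistent snapshot and can return without touching the global mutex.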

This means that for each grace period, there will be

- one write access to the rcu_ctrlblk.batch cacheline

- lots of read accesses to rcu_ctrlblk.batch (3-10*cpus_online()).  Behavior
  similar to the jiffies cacheline, shouldn't be a problem.

- cpus_online()+1 write accesses to rcu_ctrlblk.state, all of them starting
  with spin_lock(&rcu_ctrlblk.state.mutex).

  For large enough cpus_online() this will be a problem, but all except two
  of the spin_lock calls only protect the rcu_cpu_mask bitmap, thus a
  hierarchical bitmap would allow splitting the write accesses across
  multiple cachelines.

Tested on an 8-way with reaim.  Unfortunately it probably won't help with Jack
Steiner's 'ls' test since in this test only one cpu generates rcu entries.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

720e8a63

[PATCH] rcu lock update: Add per-cpu batch counter · 5c60169a

Andrew Morton authored Jun 23, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Below is one of the patches from my rcu lock update.  Jack Steiner tested
the first one on a 512p and it resolved the rcu cache line thrashing.  All were
tested on osdl with STP.

Step one for reducing cacheline thrashing within rcupdate.c:

The current code uses the rcu_cpu_mask bitmap both for keeping track of the
cpus that haven't gone through a quiescent state and for checking if a cpu
should look for quiescent states.  The bitmap is frequently changed and the
check is done by polling - together this causes cache line thrashing.

If it's cheaper to access a (mostly) read-only cacheline than a cacheline that
is frequently dirtied, then it's possible to reduce the thrashing by splitting
the rcu_cpu_mask bitmap into two cachelines:

The patch adds a generation counter and moves it into a separate cacheline.
This makes it possible to remove all accesses to rcu_cpu_mask (in the read-write
cacheline) from rcu_pending and at least 50% of the accesses from
rcu_check_quiescent_state.  rcu_pending and all but one call per cpu to
rcu_check_quiescent_state access the read-only cacheline.  Probably not enough
for 512p, but it's a start, for just 128 bytes more memory use, without slowing
down rcu grace periods.  Obviously the read-only cacheline is not really
read-only: it's written once per grace period to indicate that a new grace
period is running.

Tests on an 8-way Pentium III with reaim showed some improvement:

oprofile hits:

Reference: http://khack.osdl.org/stp/293075/
Hits      %       Function
23741     0.0994  rcu_pending
19057     0.0798  rcu_check_quiescent_state
6530      0.0273  rcu_check_callbacks

Patched: http://khack.osdl.org/stp/293076/
Hits      %       Function
8291      0.0579  rcu_pending
5475      0.0382  rcu_check_quiescent_state
3604      0.0252  rcu_check_callbacks

The total runtime differs between both runs, thus the % number must
be compared: Around 50% faster. I've uninlined rcu_pending for the
test.

Tested with reaim and kernbench.

Description:

- per-cpu quiescbatch and qs_pending fields introduced: quiescbatch contains
  the number of the last quiescent period that the cpu has seen and qs_pending
  is set if the cpu has not yet reported the quiescent state for the current
  period.  With these two fields a cpu can test if it should report a
  quiescent state without having to look at the frequently written
  rcu_cpu_mask bitmap.

- curbatch split into two fields: rcu_ctrlblk.batch.completed and
  rcu_ctrlblk.batch.cur.  This makes it possible to figure out if a grace
  period is running (completed != cur) without accessing the rcu_cpu_mask
  bitmap.

- rcu_ctrlblk.maxbatch removed and replaced with a true/false next_pending
  flag: next_pending=1 means that another grace period should be started
  immediately after the end of the current period.  Previously, this was
  achieved by maxbatch: curbatch==maxbatch means don't start, curbatch!=
  maxbatch means start.  A flag improves the readability: The only possible
  values for maxbatch were curbatch and curbatch+1.

- rcu_ctrlblk split into two cachelines for better performance.

- common code from rcu_offline_cpu and rcu_check_quiescent_state merged into
  cpu_quiet.

- rcu_offline_cpu: replace spin_lock_irq with spin_lock_bh, there are no
  accesses from irq context (and there are accesses to the spinlock with
  enabled interrupts from tasklet context).

- rcu_restart_cpu introduced, s390 should call it after changing nohz:
  Theoretically the global batch counter could wrap around and end up at
  RCU_quiescbatch(cpu).  Then the cpu would not look for a quiescent state and
  rcu would lock up.
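
The first two items of the description can be sketched as follows.  The structure and function names are simplified stand-ins (the kernel keeps these fields per cpu and inside rcu_ctrlblk), so this is an illustration of the bookkeeping rather than the patched code.

```c
struct rcu_batch {
	long cur;		/* number of the current grace period */
	long completed;		/* number of the last finished grace period */
};

struct rcu_cpu {
	long quiescbatch;	/* last grace period this cpu has noticed */
	int qs_pending;		/* still owes a quiescent state for it? */
};

/* A grace period is running exactly when completed lags behind cur. */
static int grace_period_running(const struct rcu_batch *b)
{
	return b->completed != b->cur;
}

/*
 * Decide whether this cpu still has to report a quiescent state, looking
 * only at the mostly read-only batch numbers and its own fields, never at
 * the frequently written cpu bitmap.
 */
static int cpu_should_report(struct rcu_cpu *c, const struct rcu_batch *b)
{
	if (c->quiescbatch != b->cur) {
		/* A new grace period started since we last looked. */
		c->quiescbatch = b->cur;
		c->qs_pending = 1;
	}
	return c->qs_pending;
}

/* Called once the cpu has passed through a quiescent state. */
static void cpu_reported(struct rcu_cpu *c)
{
	c->qs_pending = 0;
}
```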
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

5c60169a

[PATCH] Move saved_command_line to init/main.c · b884e838

Andrew Morton authored Jun 23, 2004

From: Rusty Russell <rusty@rustcorp.com.au>

Currently every arch declares its own char saved_command_line[].  Make sure
every arch defines COMMAND_LINE_SIZE in asm/setup.h, and declare
saved_command_line in linux/init.h (init/main.c contains the definition).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

b884e838

[PATCH] jbd needs to wait for locked buffers · 4d4f4cc4

Andrew Morton authored Jun 23, 2004

From: Chris Mason <mason@suse.com>

jbd needs to wait for any io to complete on the buffer before changing the
end_io function.  Using set_buffer_locked means that it can change the
end_io function while the page is in the middle of writeback, and the
writeback bit on the page will never get cleared.

Since we set the buffer dirty earlier on, if the page was previously dirty,
pdflush or memory pressure might trigger a writepage call, which will race
with jbd's set_buffer_locked.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

4d4f4cc4

[PATCH] Allow i386 to reenable interrupts on lock contention · 36f9f209

Andrew Morton authored Jun 23, 2004

From: Zwane Mwaikambo <zwane@linuxpower.ca>

Following up on Keith's code, I adapted the i386 code to allow enabling
interrupts during contested locks, depending on the previous interrupt
enable status.  Obviously there will be a text increase (only for the non
CONFIG_SPINLINE case), although it doesn't seem so bad.  There will also be
an increased exit latency when we attempt a lock acquisition after spinning,
due to the extra instructions.  How much this will affect performance I'm
not sure yet, as I haven't had time to micro bench.

   text    data     bss     dec     hex filename
2628024  921731       0 3549755  362a3b vmlinux-after
2621369  921731       0 3543100  36103c vmlinux-before
2618313  919222       0 3537535  35fa7f vmlinux-spinline

The code has been stress tested on a 16x NUMAQ (courtesy OSDL).
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

36f9f209