Commits · 9b34eb708314dd0786b8a904f5c223be55123ff2 · nexedi / linux

26 Sep, 2002 1 commit

Ingo Molnar authored Sep 25, 2002

From Andrew Morton.

There are a couple of places where we would enable interrupts while
write-holding the tasklist_lock ...  nasty.

9b34eb70

25 Sep, 2002 24 commits

Merge bk://ldm.bkbits.net/linux-2.5 · 2ce067b0
Linus Torvalds authored Sep 25, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
2ce067b0

[PATCH] tighter locking in pdflush · 4ab1a3e6

Andrew Morton authored Sep 25, 2002

Had a weird oops from Bill Irwin - the pdflush_list was corrupt.

The only thing I can think of is that something sprayed out a wakeup
when it shouldn't.  So tighten things up against that, and add some
printks to catch it if it happens again.

4ab1a3e6

[PATCH] speed up sys_sync() · 57eb0613

Andrew Morton authored Sep 25, 2002

Well it's a one-liner.  sys_sync() only syncs one queue at a time, and
can be slow if you have a lot of disks.  So poke pdflush, which knows
how to write all the queues in parallel.

57eb0613

[PATCH] increase traffic on linux-kernel · 4f3e8109

Andrew Morton authored Sep 25, 2002

[This has four scalps already.  Thomas Molina has agreed
 to track things as they are identified ]

Infrastructure to detect sleep-inside-spinlock bugs.  Really only
useful if compiled with CONFIG_PREEMPT=y.  It prints out a whiny
message and a stack backtrace if someone calls a function which might
sleep from within an atomic region.
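
The check described above can be modelled in userspace roughly as follows. This is a minimal sketch, not the patch itself: `atomic_depth`, `fake_spin_lock()`, `MIGHT_SLEEP()` and `may_block()` are hypothetical names, the counter stands in for the kernel's preempt count, and the stack backtrace is omitted.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-task preempt count:
 * non-zero means "inside an atomic region". */
static int atomic_depth;

/* Number of warnings issued so far, so callers can observe them. */
static int sleep_warnings;

static void fake_spin_lock(void)
{
	atomic_depth++;
}

static void fake_spin_unlock(void)
{
	assert(atomic_depth > 0);
	atomic_depth--;
}

/* Called at the top of any function that may sleep. */
static void check_might_sleep(const char *file, int line)
{
	if (atomic_depth > 0) {
		sleep_warnings++;
		fprintf(stderr,
			"sleeping function called from atomic region at %s:%d\n",
			file, line);
	}
}

#define MIGHT_SLEEP() check_might_sleep(__FILE__, __LINE__)

/* Example of a function that may block, e.g. a sleeping allocation. */
static void may_block(void)
{
	MIGHT_SLEEP();
	/* ... the real function would be allowed to sleep here ... */
}
```

Calling `may_block()` between `fake_spin_lock()` and `fake_spin_unlock()` produces one warning; calling it outside the lock produces none.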

This patch generates a storm of output at boot, due to
drivers/ide/ide-probe.c:init_irq() calling lots of things which it
shouldn't under ide_lock.

It'll find other bugs too.

4f3e8109

[PATCH] slab reclaim balancing · b65bbded

Andrew Morton authored Sep 25, 2002

A patch from Ed Tomlinson which improves the way in which the kernel
reclaims slab objects.

The theory is: a cached object's usefulness is measured in terms of the
number of disk seeks which it saves.  Furthermore, we assume that one
dentry or inode saves as many seeks as one pagecache page.

So we reap slab objects at the same rate as we reclaim pages.  For each
1% of reclaimed pagecache we reclaim 1% of slab.  (Actually, we _scan_
1% of slab for each 1% of scanned pages).

Furthermore we assume that one swapout costs twice as many seeks as one
pagecache page, and twice as many seeks as one slab object.  So we
double the pressure on slab when anonymous pages are being considered
for eviction.
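
The proportionality rule above can be written down as a small calculation. This is a sketch of the arithmetic only, under the assumption that the scan target is computed as a simple fraction; `slab_scan_count()` and its parameters are hypothetical and are not taken from shrink_caches().

```c
#include <assert.h>

/*
 * Number of slab objects to scan, given how many pages were scanned.
 * Scanning x% of the pages leads to scanning x% of the slab objects,
 * and the pressure is doubled when anonymous pages are being
 * considered for eviction. The result is capped at the total.
 */
static unsigned long slab_scan_count(unsigned long slab_objects,
				     unsigned long pages_scanned,
				     unsigned long total_pages,
				     int anon_considered)
{
	unsigned long long n;

	assert(total_pages > 0);

	/* Widen before multiplying to avoid overflow on 32-bit longs. */
	n = (unsigned long long)slab_objects * pages_scanned / total_pages;
	if (anon_considered)
		n *= 2;
	if (n > slab_objects)
		n = slab_objects;
	return (unsigned long)n;
}
```

With 50000 slab objects, scanning 1000 of 100000 pages (1%) gives a target of 500 objects, or 1000 when anonymous pages are being considered.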

The code works nicely, and smoothly.  Possibly it does not shrink slab
hard enough, but that is now very easy to tune up and down.  It is just:

	ratio *= 3;

in shrink_caches().

Slab caches no longer hold onto completely empty pages.  Instead, pages
are freed as soon as they have zero objects.  This is possibly a
performance hit for slabs which have constructors, but it's doubtful.
Most allocations after a batch of frees are satisfied from inside
internally-fragmented pages and by the time slab gets back onto using
the wholly-empty pages they'll be cache-cold.  slab would be better off
going and requesting a new, cache-warm page and reconstructing the
objects therein.  (Once we have the per-cpu hot-page allocator in
place.  It's happening).

As a consequence of the above, kmem_cache_shrink() is now unused.  No
great loss there - the serialising effect of kmem_cache_shrink and its
semaphore in front of page reclaim was measurably bad.

Still todo:

- batch up the shrinking so we don't call into prune_dcache and
  friends at high frequency asking for a tiny number of objects.

- Maybe expose the shrink ratio via a tunable.

- clean up slab.c

- highmem page reclaim in prune_icache: highmem pages can pin
  inodes.

b65bbded

[PATCH] use prepare_to_wait in VM/VFS · dfdacf59

Andrew Morton authored Sep 25, 2002

This uses the new wakeup machinery in some hot parts of the VFS and
block layers.

wait_on_buffer(), wait_on_page(), lock_page(), blk_congestion_wait().
Also in get_request_wait(), although the benefit for exclusive wakeups
will be lower.

dfdacf59

[PATCH] prepare_to_wait/finish_wait sleep/wakeup API · 3da08d6c

Andrew Morton authored Sep 25, 2002

This is worth a whopping 2% on specweb on an 8-way.  Which is faintly
surprising because __wake_up and other wait/wakeup functions are not
apparent in the specweb profiles which I've seen.


The main objective of this is to reduce the CPU cost of the wait/wakeup
operation.  When a task is woken up, its waitqueue is removed from the
waitqueue_head by the waker (ie: immediately), rather than by the woken
process.

This means that a subsequent wakeup does not need to revisit the
just-woken task.  It also means that the just-woken task does not need
to take the waitqueue_head's lock, which may well reside in another
CPU's cache.

I have no decent measurements on the effect of this change - possibly a
20-30% drop in __wake_up cost in Badari's 40-dds-to-40-disks test (it
was the most expensive function), but it's inconclusive.  And no
quantitative testing of which I am aware has been performed by
networking people.

The API is very simple to use (Linus thought it up):

void my_func(wait_queue_head_t *wqh)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	if (!some_test)
		schedule();
	finish_wait(wqh, &wait);
}

or:

	DEFINE_WAIT(wait);

	while (!some_test_1) {
		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
		if (!some_test_2)
			schedule();
		...
	}
	finish_wait(wqh, &wait);

You need to bear in mind that once prepare_to_wait has been performed,
your task could be removed from the waitqueue_head and placed into
TASK_RUNNING at any time.  You don't know whether or not you're still
on the waitqueue_head.

Running prepare_to_wait() when you're already on the waitqueue_head is
fine - it will do the right thing.

Running finish_wait() when you're actually not on the waitqueue_head is
fine.

Running finish_wait() when you've _never_ been on the waitqueue_head is
fine, as long as the DEFINE_WAIT() macro was used to initialise the
waitqueue.

You don't need to fiddle with current->state.  prepare_to_wait() and
finish_wait() will do that.  finish_wait() will always return in state
TASK_RUNNING.
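
The rules above can be illustrated with a single-threaded userspace model. This is a sketch under stated assumptions, not the kernel code: every `sim_*` and `node_*` name is hypothetical, the "lock" is only a counter, and a wait entry that links to itself is taken to mean "not on the queue".

```c
#include <assert.h>

enum sim_state { SIM_RUNNING, SIM_UNINTERRUPTIBLE };

struct sim_node { struct sim_node *prev, *next; };

struct sim_wait_head {
	struct sim_node list;	/* sentinel of a circular list */
	int lock_taken;		/* counts simulated lock acquisitions */
};

struct sim_wait {
	struct sim_node node;	/* must stay first; self-linked when not queued */
	enum sim_state *task_state;
};

static void node_init(struct sim_node *n) { n->prev = n->next = n; }
static int node_unlinked(const struct sim_node *n) { return n->next == n; }

static void node_add(struct sim_node *n, struct sim_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void node_del_init(struct sim_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);		/* self-link so later checks see "not queued" */
}

static void sim_init_head(struct sim_wait_head *h)
{
	node_init(&h->list);
	h->lock_taken = 0;
}

/* Plays the role of DEFINE_WAIT(): the entry starts out self-linked. */
static void sim_init_wait(struct sim_wait *w, enum sim_state *state)
{
	node_init(&w->node);
	w->task_state = state;
}

/* Safe to call when already queued: the entry is added at most once. */
static void sim_prepare_to_wait(struct sim_wait_head *h, struct sim_wait *w,
				enum sim_state state)
{
	h->lock_taken++;
	if (node_unlinked(&w->node))
		node_add(&w->node, &h->list);
	*w->task_state = state;
}

/* The waker removes each entry itself, so the sleeper need not. */
static void sim_wake_up(struct sim_wait_head *h)
{
	struct sim_node *n, *next;

	h->lock_taken++;
	for (n = h->list.next; n != &h->list; n = next) {
		struct sim_wait *w = (struct sim_wait *)n; /* node is first member */

		next = n->next;
		*w->task_state = SIM_RUNNING;
		node_del_init(n);
	}
}

/* Takes the lock only if the entry is still on the queue. */
static void sim_finish_wait(struct sim_wait_head *h, struct sim_wait *w)
{
	*w->task_state = SIM_RUNNING;
	if (!node_unlinked(&w->node)) {
		h->lock_taken++;
		node_del_init(&w->node);
	}
}
```

In this model, a task that has already been woken finds its entry self-linked in `sim_finish_wait()` and returns without touching the head's lock, which is the saving the commit message describes.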

There are plenty of usage examples in vm-wakeups.patch and
tcp-wakeups.patch.

3da08d6c

[PATCH] mprotect_fixup fix · 02b1783c

Andrew Morton authored Sep 25, 2002

From David M-T.

When this function successfully merges the new range into an existing
VMA, it forgets to extend the new protection mode into the just-merged
pages.

02b1783c

[PATCH] hugetlb fix · 5538fdaa

Andrew Morton authored Sep 25, 2002

Patch from Rohit Seth

It fixes the problem which Andrea noted in his initial review of the
hugetlb code:

"In short doing "addr = vma->vm_end" and then checking if vm_end + len
 is below vm_next->vm_start is broken, because there's no guarantee
 that "addr" will be a largepage aligned address.  the LPAGE_ALIGN in
 found_addr should be dropped because moving the addr ahead without
 checking that addr+len doesn't then fall into a vma, will generate
 do_munmaps and in turn userspace mem corruption."

5538fdaa

[PATCH] NUMA-Q fixes · bce5aeb5

Martin J. Bligh authored Sep 25, 2002

 - Remove the const that someone incorrectly stuck in there; it causes a type conflict.
   Alan has a better plan for fixing this long term, but this fixes the compile
   warning for now.

 - Move the printk of the xquad_portio setup *after* we put something in the variable
   so it actually prints something useful, not 0 ;-)

 - To derive the size of the xquad_portio area, multiply the number of nodes by the
   size of each node, not the size of two nodes (and remove define). Doh!

bce5aeb5

Remove busy-wait for short RT nanosleeps. It's a random special case · 98ae8e2b
Linus Torvalds authored Sep 25, 2002
```
and does the wrong thing for higher HZ values anyway.
```
98ae8e2b
Merge bk://ldm@bkbits.net/linux-2.5 · 56d8b39d
Patrick Mochel authored Sep 25, 2002
```
into osdl.org:/home/mochel/src/kernel/devel/linux-2.5
```
56d8b39d
add disk device class · abe2e064
Patrick Mochel authored Sep 25, 2002

abe2e064
Merge bk://linus.bkbits.net/linux-2.5 · 4a99b33d
Patrick Mochel authored Sep 25, 2002
```
into hostme.bitkeeper.com:/ua/repos/l/ldm/linux-2.5
```
4a99b33d

[PATCH] exit-fix-2.5.38-E3 · 5dd6a6e5

Ingo Molnar authored Sep 24, 2002

This fixes a number of bugs in the thread-release code:

 - notify parents only if the group leader is a zombie,
   and if it's not a detached thread.

 - do not reparent children to zombie tasks.

 - introduce the TASK_DEAD state for tasks, to serialize the task-release
   path. (to some it might be confusing that tasks are zombies first, then
   dead :-)

 - simplify tasklist_lock usage in release_task().

the effect of the above bugs ranged from unkillable hung zombies to kernel
crashes. None of those happens with the patch applied.

5dd6a6e5

[PATCH] remove elevator_linus · 2684cd69
Jens Axboe authored Sep 24, 2002
```
Patch killing off elevator_linus for good. Sniffle.
```
2684cd69

[PATCH] deadline scheduler · 85b2148a

Jens Axboe authored Sep 24, 2002

This introduces the deadline-ioscheduler, making it the default.  2nd
patch coming that deletes elevator_linus in a minute.

This one has read_expire at 500ms, and writes_starved at 2.

85b2148a

[PATCH] PnP BIOS ESCD sanity check · 650e56ee
Thomas Hood authored Sep 24, 2002
```
Sanity check the ESCD size. From 2.4.
```
650e56ee

[PATCH] ALi and Cypress IDE fixes · 26b90050

Ivan Kokshaysky authored Sep 24, 2002

These two chipsets are most common on alpha.
- cy82c693: allow the generic IDE setup code to work correctly
  with broken PCI registers layout of this chip. This fixes
  quite a few problems with secondary channel, plus some hacks in
  arch code can go away.
- ALi M5229: enable DMA.

26b90050

[PATCH] 3ware driver update for 2.5.35 · 92f2c52c
Adam Radford authored Sep 24, 2002

92f2c52c

[PATCH] pidhash-2.5.38-A0 · 5191a147

Ingo Molnar authored Sep 24, 2002

This removes the cmpxchg from the PID allocator and replaces it with a
spinlock.  This spinlock is hit only a couple of times per bootup, so
it's not a performance issue.

5191a147

[PATCH] thread-flock-2.5.38-A3 · a16435af

Ingo Molnar authored Sep 24, 2002

Ulrich found another small detail wrt. POSIX requirements for threads -
this time it's the recursion features (read-held lock being write-locked
means an upgrade if the same 'process' is the owner, means a deadlock if a
different 'process').

this requirement even makes some sense - the group of threads who own a
lock really own all rights to the lock as well.

These changes fix this, all testcases pass now.  (inter-process
testcases as well, which are not affected by this patch.)

(SIGURG and SIGIO semantics should also continue to work - there's some
more stuff we can optimize with the new pidhash in this area, but that's
for later.)

a16435af

[PATCH] loop device broken in 2.5.38 · 86b18ae3

Theodore Y. Ts'o authored Sep 24, 2002

The loop device driver was broken in 2.5.38 when it was converted over
to use gendisk.  I discovered this while doing final regression testing
on the ext3 htree code.

The problem is that figure_loop_size() is setting the capacity of the
loop device in kilobytes (because that's what compute_loop_size()
returns), but set_capacity() expects the size in 512 byte sectors.
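
The unit mismatch is a factor of two, since one kilobyte is two 512-byte sectors. A minimal sketch of the conversion follows; the helper names are hypothetical and are not the functions used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One sector is 512 bytes, i.e. 1 << 9. */
#define SKETCH_SECTOR_SHIFT 9

/* Convert a size in kilobytes (1024 bytes) to 512-byte sectors. */
static uint64_t kib_to_sectors(uint64_t kib)
{
	return kib << (10 - SKETCH_SECTOR_SHIFT);	/* multiply by 2 */
}

/* Convert a size in bytes to whole 512-byte sectors, rounding down. */
static uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SKETCH_SECTOR_SHIFT;
}
```

A 1 MiB backing file is 1024 kilobytes but 2048 sectors, so passing the kilobyte count where sectors are expected makes the device appear half its real size.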

I've enclosed a patch which fixes the problem, as well as simplifying
the code by eliminating compute_loop_size(), since it is a static
function that is only used once by figure_loop_size().

86b18ae3

Merge jfs@jfs.bkbits.net:linux-2.5 · 0de4d503
Dave Kleikamp authored Sep 24, 2002
```
into kleikamp.austin.ibm.com:/home/shaggy/bk/jfs-2.5
```
0de4d503

24 Sep, 2002 12 commits

[PATCH] fix null dereference in sys_mprotect · 0cd9efe3

Paul Mackerras authored Sep 24, 2002

As it is at the moment, sys_mprotect will dereference a null pointer
if you use it on a region that is contained within the first vma. I
have a little program that demonstrates this (I'll post it if anyone
is interested). What happens then is that the process hangs in
do_page_fault at the down_read on the mm->mmap_sem, since sys_mprotect
has done a down_write on mm->mmap_sem.

The problem is that mprotect_fixup isn't updating prev properly. Thus
we can finish the main loop in sys_mprotect with prev == NULL. This
has been the case since Christoph's cleanups went in. Prior to that,
mprotect_fixup always set prev to something non-NULL. I suspect that
not updating prev could also cause vmas to get dropped completely if
the region being mprotected spans more than one vma.

The patch below fixes the problem by making mprotect_fixup set prev to
a reasonable value in all circumstances.

0cd9efe3

Avoid possibly busy-looping in mouse read. · efae82c0
Linus Torvalds authored Sep 24, 2002

efae82c0
Merge osdl.org:/home/mochel/src/kernel/devel/linux-2.5-virgin · cd585d2f
Patrick Mochel authored Sep 24, 2002
```
into osdl.org:/home/mochel/src/kernel/devel/linux-2.5
```
cd585d2f

[PATCH] per-cpu data preempt-safing · c6e70088

Robert Love authored Sep 24, 2002

This fixes unsafe access to per-CPU data via reordering of instructions or use
of "get_cpu()".

Before anyone balks at the brlock.h fix, note this was in the
alternative version of the code which is not used by default.

c6e70088

[PATCH] remove preempt workaround in slab.c · 7f644d00

Robert Love authored Sep 24, 2002

Before the irqs_disabled() check in preempt_schedule(), we worked around
some locking issues in slab.c.  Now that we will never preempt with
interrupts disabled, we can remove those and clean things up.

This is courtesy of Manfred Spraul.

7f644d00

[PATCH] s/preempt_count()/in_atomic() in do_exit() · 5d671309

Robert Love authored Sep 24, 2002

This converts the debugging check in do_exit from a check on
preempt_count() to in_atomic().

The main benefit to this is we will stop warning over the BKL and now
use the standard mechanism for such checks.

5d671309

[PATCH] flock_lock_file livelock fix · 0adfb15a

Matthew Wilcox authored Sep 23, 2002

Looks like I dropped a hunk from my patchset, sorry.

We never set FL_SLEEP in the flock case, so if we should block, we'll
livelock instead.

0adfb15a

Simplify elevator algorithm, make it prefer reads heavily. · a9ee74e7

Linus Torvalds authored Sep 23, 2002

This is needed for reasonable read latency with the new VM
behaviour. 

NOTE! This is way too unfair, Andrew and Jens are working on
alternatives.

a9ee74e7

[PATCH] another alpha update · 7f012496

Ivan Kokshaysky authored Sep 23, 2002

 - Makefile cleanups and fixes
 - a bunch of syscalls added
 - removed crap from asm/ide.h (it's not needed anymore)
 - __down_read_trylock fix

7f012496

Merge with DRI CVS tree · 76f92de7
Linus Torvalds authored Sep 23, 2002

76f92de7

[PATCH] ide io scheduler thing · 60abdcb3

Jens Axboe authored Sep 23, 2002

IDE must use blk_queue_empty() and not do a list_empty() on the
(potentially only) dispatch queue.  This took quite a while to find
while debugging a new io scheduler...

60abdcb3

[PATCH] pgrp-fix-2.5.38-A2 · 872aa4a8

Ingo Molnar authored Sep 23, 2002

This fixes the emacs bug reported by Andries.  It should probably also
fix other, terminal handling related weirdnesses introduced by the new
PID handling code in 2.5.38.

The bug was in the session_of_pgrp() function, if no proper session is
found in the process group then we must take the session ID from the
process that has pgrp PID (which does not necessarily have to be part of
the pgrp).  The fallback code is only triggered when no process in the
process group has a valid session - besides being faster, this also
matches the old implementation.
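
The lookup order described above can be sketched over a flat process table. This is an illustration only: `struct sim_proc` and `sim_session_of_pgrp()` are hypothetical, and treating a session ID greater than zero as "valid" is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified process record. */
struct sim_proc {
	int pid;
	int pgrp;
	int session;	/* assumed valid only when > 0 */
};

/*
 * Return the session of process group `pgrp`.
 * First look for any member of the group with a valid session.
 * Only if none is found, fall back to the process whose PID equals
 * `pgrp`, which need not itself be a member of the group.
 * Returns -1 when neither lookup succeeds.
 */
static int sim_session_of_pgrp(const struct sim_proc *procs, size_t n, int pgrp)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (procs[i].pgrp == pgrp && procs[i].session > 0)
			return procs[i].session;

	for (i = 0; i < n; i++)
		if (procs[i].pid == pgrp)
			return procs[i].session;

	return -1;
}
```

The second loop runs only when the first finds nothing, which mirrors the statement that the fallback is triggered only when no process in the group has a valid session.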

[ hey, who needs a POSIX conformance testsuite when we have emacs! ;) ]

872aa4a8

23 Sep, 2002 3 commits

driver model: add better platform device support. · f6bec0e6

Patrick Mochel authored Sep 23, 2002

Platform devices are devices commonly found on the motherboard of systems. This
includes legacy devices (serial ports, floppy controllers, parallel ports, etc)
and host bridges to peripheral buses. 

We already had a platform bus type, which gives a way to group platform devices
and drivers, and allow each to be bound to each other dynamically. Though before,
it didn't do anything. It still doesn't do much, but we now have:

- struct platform_device, which generically describes platform devices. This only
  includes a name and id in addition to a struct device, but more may be added later.

- implement platform_device_register() and platform_device_unregister() to handle
  adding and removing these devices. 

- Create legacy_bus - a default parent device for legacy devices. 

- Change the floppy driver to define a platform_device (instead of a sys_device). 
  In driverfs, this gives us now:

# tree -d /sys/bus/platform/
/sys/bus/platform/
|-- devices
|   `-- floppy0 -> ../../../root/legacy/floppy0
`-- drivers

and

# tree -d /sys/root/legacy/
/sys/root/legacy/
`-- floppy0

f6bec0e6

Merge osdl.org:/home/mochel/src/kernel/devel/linux-2.5-virgin · d668723c
Patrick Mochel authored Sep 23, 2002
```
into osdl.org:/home/mochel/src/kernel/devel/linux-2.5
```
d668723c

driver model: add support for multi-board systems. · aeb14ea3

Patrick Mochel authored Sep 23, 2002

- define struct sys_root for describing the individual boards of a multi-board
  system.

- allow for registration of alternate device roots.

- check if struct sys_device::root is set on registration, and add it as a child of
  an alternative root, if it's set.

aeb14ea3