Commits · 1c03b1a99c4e402215b9fe40dba14cee62e279a8 · Kirill Smelkov / linux

31 Oct, 2002 40 commits

[PATCH] fix sys_lookup_dcookie prototype · 1c03b1a9

John Levon authored Oct 31, 2002

We need to use u64 because the future 64-bit ports can theoretically
return the same value for two different dentries, as pointed out by
Ulrich Weigand.

The patch also changes the return value of the syscall to give the length
of the data copied, which is needed for valgrind support (this bit is by
Philippe Elie).

Note this is not a complete fix for mixed 32/64: userspace needs to
figure out the kernel pointer size when reading from the buffer. But
that's another fix...
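
A minimal sketch of the collision the first paragraph describes, assuming the cookie is derived from the dentry's kernel address (an assumption for illustration; the helper names and addresses below are hypothetical and not the kernel's code). Truncating a 64-bit address to 32 bits keeps only the low half, so two distinct dentries can yield the same value:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a 32-bit cookie keeps only the low 32 address bits. */
static uint32_t cookie32(uint64_t dentry_addr)
{
    return (uint32_t)dentry_addr;
}

/* Hypothetical helper: a 64-bit cookie preserves the whole address. */
static uint64_t cookie64(uint64_t dentry_addr)
{
    return dentry_addr;
}

/*
 * Returns 1 when two distinct addresses are indistinguishable as 32-bit
 * cookies but still distinct as 64-bit cookies, 0 otherwise.
 */
static int collides_only_in_32bit(uint64_t a, uint64_t b)
{
    assert(a != b);
    return cookie32(a) == cookie32(b) && cookie64(a) != cookie64(b);
}
```

Any two addresses that differ only above bit 31 trigger the collision, which is why the prototype moves to u64.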

NOTE! Any oprofile users will need to upgrade after this goes in, and
the user-space equivalent is checked into CVS.  Sorry for the inconvenience.

1c03b1a9

[PATCH] additional arch support for per-cpu kernel_stat · 97679f9c
Andrew Morton authored Oct 31, 2002
```
Companion to the previous patch: all the support needed for non-ia32
architectures.
```
97679f9c

[PATCH] make kernel_stat use per-cpu infrastructure · fd3e6205

Andrew Morton authored Oct 31, 2002

Patch from Ravikiran G Thirumalai <kiran@in.ibm.com>

1. Break out disk stats from kernel_stat and move disk stat to blkdev.h

2. Group cpu stat in kernel_stat and make them "per_cpu" instead of
   the NR_CPUS array

3. Remove EXPORT_SYMBOL(kstat) from ksyms.c (as I noticed that no module is
   using kstat)
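
The idea behind item 2 can be sketched in plain C, assuming a fixed CPU count and an ordinary array standing in for the kernel's per-cpu areas (the names `cpu_stat_sketch`, `account_user_tick` and `total_user_ticks` are hypothetical and are not taken from the patch). Each CPU updates only its own grouped statistics, and a reader sums the slots on demand:

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4   /* hypothetical CPU count for this sketch */

/* CPU statistics grouped into one struct, one instance per CPU. */
struct cpu_stat_sketch {
    unsigned long user;
    unsigned long system;
    unsigned long idle;
};

/* Stand-in for per-cpu storage; the kernel uses separate per-CPU areas. */
static struct cpu_stat_sketch per_cpu_stat[SKETCH_NR_CPUS];

/* Writer side: a CPU only ever touches its own slot. */
static void account_user_tick(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    per_cpu_stat[cpu].user++;
}

/* Reader side: a global figure is computed by summing every CPU's slot. */
static unsigned long total_user_ticks(void)
{
    unsigned long sum = 0;

    for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        sum += per_cpu_stat[cpu].user;
    return sum;
}
```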

fd3e6205

[PATCH] uninlining in ipc/* · 8f2215c6

Andrew Morton authored Oct 31, 2002

Uninlines some large functions in the ipc code.

Before:
   text    data     bss     dec     hex filename
  30226     224     192   30642    77b2 ipc/built-in.o

After:
   text    data     bss     dec     hex filename
  20274     224     192   20690    50d2 ipc/built-in.o

8f2215c6

[PATCH] use RCU for IPC locking · bb468c02

Andrew Morton authored Oct 31, 2002

Patch from Mingming, Rusty, Hugh, Dipankar, me:

- It greatly reduces the lock contention by having one lock per id.
  The global spinlock is removed and a spinlock is added in
  kern_ipc_perm structure.

- Uses ReadCopyUpdate in grow_ary() for lock-free resizing.

- In the places where ipc_rmid() is called, delay calling ipc_free()
  to RCU callbacks.  This is to prevent ipc_lock() returning an invalid
  pointer after ipc_rmid().  In addition, use the workqueue to enable
  RCU freeing vmalloced entries.

Also some other changes:

- Remove redundant ipc_lockall/ipc_unlockall

- Now ipc_unlock() directly takes the IPC ID pointer as its argument,
  avoiding an extra lookup in the array.
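
The copy-then-publish pattern behind the grow_ary() item can be sketched in userspace C, assuming C11 atomics stand in for the kernel's RCU primitives and that writers are serialised by the caller. All names here (`id_array`, `grow_ary_sketch`, `current_ary`) are hypothetical, and handing the old array back to the caller is only a stand-in for deferring the free to an RCU callback:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct id_array {
    int size;
    void *slots[];                 /* one slot per IPC id */
};

/* Readers load this pointer and never take a lock to do so. */
static _Atomic(struct id_array *) current_ary;

/*
 * Grow the array to new_size entries and return the resulting size.
 * *retired receives the replaced array (or NULL); it may only be freed
 * once no reader can still hold a pointer to it.
 */
static int grow_ary_sketch(int new_size, struct id_array **retired)
{
    struct id_array *old = atomic_load(&current_ary);
    int old_size = old ? old->size : 0;
    struct id_array *bigger;

    *retired = NULL;
    if (new_size <= old_size)
        return old_size;           /* already large enough */

    bigger = calloc(1, sizeof(*bigger) + (size_t)new_size * sizeof(void *));
    if (!bigger)
        return old_size;           /* allocation failed, keep the old array */

    bigger->size = new_size;
    if (old)
        memcpy(bigger->slots, old->slots, (size_t)old_size * sizeof(void *));

    atomic_store(&current_ary, bigger);   /* publish the fully built copy */
    *retired = old;
    return new_size;
}
```

Readers that loaded the old pointer keep seeing a consistent, if smaller, array, which is why the old copy must outlive them.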

The changes are made based on the input from Hugh Dickins, Manfred
Spraul and Dipankar Sarma.  In addition, Cliff White has run OSDL's
dbt1 test on a 2-way against the earlier version of this patch.
Results show about 2-6% improvement on the average number of
transactions per second.  Here is the summary of his tests:

                        2.5.42-mm2      2.5.42-mm2-ipclock
                        ----------      ------------------
Average over 5 runs     85.0 BT         89.8 BT
Std Deviation 5 runs     7.4 BT          1.0 BT

Average over 4 best     88.15 BT        90.2 BT
Std Deviation 4 best     2.8 BT          0.5 BT


Also, another test today from Bill Hartner:

I tested Mingming's RCU ipc lock patch using a *new* microbenchmark - semopbench.
semopbench was written to test the performance of Mingming's patch.
I also ran a 3 hour stress and it completed successfully.

Explanation of the microbenchmark is below the results.
Here is a link to the microbenchmark source.

http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/semopbench.c

SUT : 8-way 700 MHz PIII

I tested 2.5.44-mm2 and 2.5.44-mm2 + RCU ipc patch

>semopbench -g 64 -s 16 -n 16384 -r > sem.results.out
>readprofile -m /boot/System.map | sort -n +0 -r > sem.profile.out

The metric is seconds per repetition.  Lower is better.

kernel              run 1     run 2
                    seconds   seconds
==================  =======   =======
2.5.44-mm2            515.1     515.4
2.5.44-mm2+rcu-ipc     46.7      46.7

With Mingming's patch, the test completes 10X faster.

bb468c02

[PATCH] tmpfs support for remap_file_pages · 0a4b1945

Andrew Morton authored Oct 31, 2002

From Hugh

Instate Ingo's shmem_populate on top of the previous patches, now using
shmem_getpage(,,,SGP_QUICK) for the nonblocking case (its find_lock_page
may block, but rarely for long). Note install_page will need redefining
if PAGE_CACHE_SIZE departs from PAGE_SIZE; note pgoff to populate must
be in terms of PAGE_SIZE; note page_cache_release if install_page fails.

filemap_populate similarly needs page_cache_release when install_page
fails, but filemap.c not included in this patch since we started out
from 2.5.43 rather than 2.5.43-mm2: whereas patches 1-8 could go
directly to 2.5.43, this 9/9 belongs with Ingo's population work.

0a4b1945

[PATCH] sys_remap_file_pages · d16dc20c

Andrew Morton authored Oct 31, 2002

Ingo's remap_file_pages patch.  Supported on ia32, x86-64, sparc
and sparc64.  Others will need to update mman.h and the syscall
tables.

d16dc20c

[PATCH] strip pagecache from to-be-reaped inodes · f9a316fa

Andrew Morton authored Oct 31, 2002

With large highmem machines and many small cached files it is possible
to encounter ZONE_NORMAL allocation failures.  This can be demonstrated
with a large number of one-byte files on a 7G machine.

All lowmem is filled with icache and all those inodes have a small
amount of highmem pagecache which makes them unfreeable.

The patch strips the pagecache from inodes as they come off the tail of
the inode_unused list.

I play tricks in there peeking at the head of the inode_unused list to
pick up the inode again after running iput().  The alternatives seemed
to involve more widespread changes.

Or running invalidate_inode_pages() under inode_lock which would be a
bad thing from a scheduling latency and lock contention point of view.

f9a316fa

[PATCH] exempt swapcache pages from "use once" handling · 1bbb1949

Andrew Morton authored Oct 31, 2002

The kernel will presently reclaim swapcache pages as they come off the
tail of the inactive list even if they are referenced. That's the
"use-once" pagecache path and shouldn't be applied to swapcache pages.

This affects very few pages in practice because all those pages tend to
be mapped into pagetables anyway.

1bbb1949

[PATCH] empty the deferred lru-addition buffers in swapin_readahead · e550cf78

Andrew Morton authored Oct 31, 2002

If we're about to return to userspace after performing some swap
readahead, the pages in the deferred-addition LRU queues could stay
there for some time. So drain them after performing readahead.

e550cf78

[PATCH] start anon pages on the active list (properly this time) · 33709b5c

Andrew Morton authored Oct 31, 2002

Use lru_cache_add_active() to ensure that pages which are, or will be,
mapped into pagetables are started out on the active list.

33709b5c

[PATCH] lru_add_active(): for starting pages on the active list · 228c3d15

Andrew Morton authored Oct 31, 2002

This is the first in a series of patches which tune up the 2.5
performance under heavy swap loads.

Throughput on stupid swapstormy tests is increased by 1.5x to 3x.
Still about 20% behind 2.4 with multithreaded tests.  That is not
easily fixable - the virtual scan tends to apply a form of load
control: particular processes are heavily swapped out so the others can
get ahead.  With 2.5 all processes make very even progress and much
more swapping is needed.  It's on par with 2.4 for single-process
swapstorms.


In this patch:

The code which tries to start mapped pages out on the active list
doesn't work very well.  It uses an "is it mapped into pagetables"
test.  Which doesn't work for, say, swap readahead pages.  They are not
mapped into pagetables when they are spilled onto the LRU.

So create a new `lru_cache_add_active()' function for deferred addition
of pages to their active list.
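
A rough sketch of deferred addition, assuming a small fixed-size batch per target list and a counter standing in for the LRU lock (the types and function names below are hypothetical and only mirror the idea, not the kernel implementation). Pages are queued cheaply and moved onto their list in one locked pass when the batch fills or is drained explicitly:

```c
#include <assert.h>

#define BATCH 16   /* hypothetical batch size */

struct page_sketch {
    int on_lru;    /* 1 once the page has reached an LRU list */
    int active;    /* 1 if it was placed on the active list */
};

struct lru_batch {
    int nr;
    struct page_sketch *pages[BATCH];
};

static struct lru_batch inactive_batch, active_batch;
static int lru_lock_acquisitions;   /* how often the imaginary lock was taken */

/* Move a whole batch onto its list under a single lock round trip. */
static void drain(struct lru_batch *b, int active)
{
    if (b->nr == 0)
        return;
    lru_lock_acquisitions++;
    for (int i = 0; i < b->nr; i++) {
        b->pages[i]->active = active;
        b->pages[i]->on_lru = 1;
    }
    b->nr = 0;
}

/* Queue a page for the inactive list. */
static void lru_cache_add_sketch(struct page_sketch *page)
{
    inactive_batch.pages[inactive_batch.nr++] = page;
    if (inactive_batch.nr == BATCH)
        drain(&inactive_batch, 0);
}

/* Queue a page for the active list, regardless of whether it is mapped yet. */
static void lru_cache_add_active_sketch(struct page_sketch *page)
{
    active_batch.pages[active_batch.nr++] = page;
    if (active_batch.nr == BATCH)
        drain(&active_batch, 1);
}

/* Flush both batches, for example before returning to userspace. */
static void lru_add_drain_sketch(void)
{
    drain(&inactive_batch, 0);
    drain(&active_batch, 1);
}
```

Until a batch is drained, its pages are on no list at all, which is why code that tries to move such a page between lists cannot work.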

Also move mark_page_accessed() from filemap.c to swap.c where all
similar functions live.  And teach it to not try to move pages which
are in the deferred-addition list onto the active list.  That won't
work, and it's bogusly clearing PageReferenced in that case.

The deferred-addition lists are a pest.  But lru_cache_add used to be
really expensive in some workloads on some machines.  Must persist.

228c3d15

[PATCH] flush_dcache_page in get_user_pages() · e735f278

Andrew Morton authored Oct 31, 2002

Davem said:

"Ho hum, it is tricky :-)))

 At bio_map_user() you need to see the user's most recent write to the
 page if you are going "user --> device".  So if "user --> device"
 bio_map_user() must flush_dcache_page().

 I find the write_to_vm condition confusing which is probably why I am
 sitting here spelling this out :-)

 At bio_unmap_user(), if we are going "device --> user" you have to
 flush_dcache_page().  And actually, this flush could just as
 legitimately occur at bio_map_user() time.

 Therefore, the easiest thing to do is always flush_dcache_page() at
 bio_map_user().

 All the other cases are going to be like this, so we might as well
 cut to the chase and flush_dcache_page() for all the pages inside of
 get_user_pages()."

e735f278

[PATCH] uninline some things in mm/*.c · 79425084

Andrew Morton authored Oct 31, 2002

Tuned for gcc-2.95.3:

	filemap.c:	10815 -> 10046
	highmem.c:	3392 -> 3104
	mmap.c:		5998 -> 5854
	mremap.c:	3058 -> 2802
	msync.c:	1521 -> 1489
	page_alloc.c:	8487 -> 8167

79425084

[PATCH] speedup heuristic for get_unmapped_area · 631709da

Andrew Morton authored Oct 31, 2002

[I was going to send shared pagetables today, but it failed in
my testing under X :( ]

the first one is an mmap inefficiency that was reported by Saurabh Desai.
The test_str02 NPTL test-utility does the following: it tests the maximum
number of threads by creating a new thread, which thread creates a new
thread itself, etc. It basically creates thousands of parallel threads,
which means thousands of thread stacks.

NPTL uses mmap() to allocate new default thread stacks - and POSIX
requires us to install a 'guard page' as well, which is done via
mprotect(PROT_NONE) on the first page of the stack. This means that tons
of NPTL threads means 2* tons of vmas per MM, all allocated in a forward
fashion starting at the virtual address of 1 GB (TASK_UNMAPPED_BASE).
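
The allocation pattern described here can be reproduced with a few lines of POSIX C. This is a sketch, not NPTL's code: the helper name `alloc_stack_with_guard` is hypothetical, and the guard is placed on the lowest page of the mapping. The mprotect() call is what splits one anonymous mapping into two vmas:

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map stack_pages usable pages plus one guard page at the lowest address.
 * Returns the base of the mapping, or NULL on failure.
 */
static void *alloc_stack_with_guard(size_t stack_pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = (stack_pages + 1) * page;
    void *base;

    base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Different protection on the first page forces a second vma. */
    if (mprotect(base, page, PROT_NONE) != 0) {
        munmap(base, len);
        return NULL;
    }
    return base;
}
```

Leaving the guard out keeps every stack at the same protection, which is what lets adjacent anonymous mappings merge into one vma.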

Saurabh reported a slowdown after the first couple of thousands of
threads, which i can reproduce as well. The reason for this slowdown is
the get_unmapped_area() implementation, which tries to achieve the most
compact virtual memory allocation, by searching for the vma at
TASK_UNMAPPED_BASE, and then linearly searching for a hole. With thousands
of linearly allocated vmas this is an increasingly painful thing to do ...

obviously, high-performance threaded applications will create stacks
without the guard page, which triggers the anon-vma merging code so we end
up with one large vma, not tons of small vmas.

it's also possible for userspace to be smarter by setting aside a stack
space and keeping a bitmap of allocated stacks and using MAP_FIXED (this
also enables it to do the guard page not via mprotect() but by keeping the
stacks apart by 1 page - ie. half the number of vmas) - but this also
decreases flexibility.

So i think that the default behavior nevertheless makes sense as well, so
IMO we should optimize it in the kernel.

there are various solutions to this problem, none of which solve the
problem in a 100% sufficient way, so i went for the simplest approach: i
added code to cache the 'last known hole' address in mm->free_area_cache,
which is used as a hint to get_unmapped_area().
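
A simplified model of the hint, assuming a sorted array of vmas in place of the kernel's data structures and a `steps` counter that records how many vmas the linear hole search visits (every name ending in `_sketch` is hypothetical). With `use_cache` set, the search starts at the cached address instead of the 1 GB base:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_UNMAPPED_BASE 0x40000000UL   /* 1 GB, as in the text */
#define MAX_VMAS 1024

struct vma_sketch { unsigned long start, end; };   /* covers [start, end) */

struct mm_sketch {
    struct vma_sketch vmas[MAX_VMAS];   /* sorted, non-overlapping */
    size_t nr;
    unsigned long free_area_cache;      /* start at SKETCH_UNMAPPED_BASE */
    unsigned long steps;                /* vmas visited by the linear scan */
};

/* Index of the first vma whose end lies above addr (binary search). */
static size_t find_vma_index(const struct mm_sketch *mm, unsigned long addr)
{
    size_t lo = 0, hi = mm->nr;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (mm->vmas[mid].end > addr)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}

/* First-fit search for a hole of len bytes, optionally starting at the hint. */
static unsigned long get_unmapped_area_sketch(struct mm_sketch *mm,
                                              unsigned long len, int use_cache)
{
    unsigned long addr = use_cache ? mm->free_area_cache : SKETCH_UNMAPPED_BASE;

    for (size_t i = find_vma_index(mm, addr); i < mm->nr; i++) {
        if (addr + len <= mm->vmas[i].start)
            break;                      /* the hole before vma i fits */
        mm->steps++;
        addr = mm->vmas[i].end;         /* try just past this vma */
    }
    mm->free_area_cache = addr + len;   /* remember where the next hole may be */
    return addr;
}

/* Find a hole and record the new vma, keeping the array sorted. */
static unsigned long mmap_sketch(struct mm_sketch *mm, unsigned long len,
                                 int use_cache)
{
    unsigned long addr;
    size_t pos;

    assert(mm->nr < MAX_VMAS);
    addr = get_unmapped_area_sketch(mm, len, use_cache);
    pos = find_vma_index(mm, addr);
    for (size_t j = mm->nr; j > pos; j--)
        mm->vmas[j] = mm->vmas[j - 1];
    mm->vmas[pos] = (struct vma_sketch){ addr, addr + len };
    mm->nr++;
    return addr;
}
```

In a forward-allocation pattern like the thread-stack test, the uncached search walks every existing vma on each call, while the cached one lands directly on the hole. Unmapping is omitted; to stay compact, a fuller model would also have to lower the hint whenever an area below it is freed.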

this fixed the test_str02 testcase wonderfully, thread creation
performance for this testcase is O(1) again, but this simpler solution
obviously has a number of weak spots, and the (unlikely but possible)
worst-case is quite close to the current situation. In any case, this
approach does not sacrifice the perfect VM compactness our mmap()
implementation achieves, so it's a performance optimization with no
externally visible consequences.

The most generic and still perfectly-compact VM allocation solution would
be to have a vma tree for the 'inverse virtual memory space', ie. a tree
of free virtual memory ranges, which could be searched and iterated like
the space of allocated vmas. I think we could do this by extending vmas,
but the drawback is larger vmas. This does not save us from having to scan
vmas linearly still, because the size constraint is still present, but at
least most of the anon-mmap activities are constant sized. (both malloc()
and the thread-stack allocator use mostly fixed sizes.)

This patch contains some fixes from Dave Miller - on some architectures
it is not possible to evaluate TASK_UNMAPPED_BASE at compile-time.

631709da

[PATCH] Orlov block allocator for ext2 · b2205dc0

Andrew Morton authored Oct 31, 2002

This is Al's implementation of the Orlov block allocator for ext2.

At least doubles the throughput for the traverse-a-kernel-tree
test and is well tested.

I still need to do the ext3 version.

No effort has been put into tuning it at this time, so more gains
are probably possible.

b2205dc0

Merge bk://ldm.bkbits.net/linux-2.5-kobject · 4856e09e
Linus Torvalds authored Oct 31, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
4856e09e

kobject: don't create directory for kobject/subsystem if name is NULL. · b053262f

Patrick Mochel authored Oct 31, 2002

This allows subsystems to exist in the hierarchy, but not be exported via
the filesystem. This fixes a minor flaw with partitions, as partition
objects are children of block devices, though they register with the
partition subsystem. Really, the partition subsystem shouldn't have a
presence in the tree at all, yet it should still exist.

b053262f

Merge http://gkernel.bkbits.net/alpha-2.5 · 1baa95c5
Linus Torvalds authored Oct 31, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
1baa95c5
Merge bk://ldm.bkbits.net/linux-2.5-kobject · 6dc1ec37
Linus Torvalds authored Oct 31, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
6dc1ec37
Fix alpha build. · f32abcc0
Jeff Garzik authored Oct 31, 2002

f32abcc0
turn off kobject debugging by default. · 24555ac2
Patrick Mochel authored Oct 30, 2002

24555ac2
driverfs: die die die · 808897cf
Patrick Mochel authored Oct 30, 2002

808897cf
convert edd to use kobjects and sysfs. · 60211581
Patrick Mochel authored Oct 30, 2002

60211581
driver model: remove few remaining references to driverfs. · 293c14d9
Patrick Mochel authored Oct 30, 2002

293c14d9
make sure block device_init() is called before part_init(). · 69a304e3
Patrick Mochel authored Oct 30, 2002

69a304e3

acpi: convert to use kobjects and sysfs. · 8bebafe7

Patrick Mochel authored Oct 30, 2002

- replace driver_dir_entry in acpi_device with struct kobject.
- register acpi with firmware subsystem on startup.
- register sub-subsystem.
- put namespace hierarchy under that.

8bebafe7

create firmware subsystem and register it on startup. · c408284c
Patrick Mochel authored Oct 30, 2002

c408284c
Merge bk://extfs.bkbits.net/extfs-2.5-update · 6e6e099b
Linus Torvalds authored Oct 30, 2002
```
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
```
6e6e099b