- 03 Jan, 2005 40 commits
-
-
Martin J. Bligh authored
Badari hit a problem when configuring PAE off (ie CONFIG_4G) where the pkmap area could end up overlapping the fixmap area. For some reason, PKMAP_BASE was defined statically, which seems rather pointless, and asking for trouble. Patch below definines it dynamically, under the fixmap area. The ordering of the VMALLOC_RESERVE space is: FIXADDR_TOP fixed_addresses FIXADDR_START temp fixed addresses FIXADDR_BOOT_START Persistent kmap area PKMAP_BASE VMALLOC_END Vmalloc area VMALLOC_START high_memory Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
David Howells authored
The attached patch fixes the IDE driver to initialise correctly in the case that IDE_ARCH_OBSOLETE_INIT is not defined. Not defining this macro would seem to be the correct thing to do since it includes the word "obsolete" in its name. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
David Howells authored
The attached patch creates a generic set of termio userspace access functions with proper error handling. None of the current archs check for errors in this case. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Ananth N. Mavinakayanahalli authored
Here is a patch that adds a wrapper for defining jprobe.entry to make t easy to handle the three dword function descriptors defined by the PowerPC ELF ABI. x86, ppc64 and x86_64 are also updated. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
David Gibson authored
Currently the ppc64 sysfs code registers an entry for each possible cpu in sysfs, rather than just online cpus. That makes sense, since the sysfs entries are needed to control onlining of the cpus. However, this is done even if CONFIG_HOTPLUG_CPU is not set, or if it is not a hotplug capable (DLPAR) machine, which is a bit misleading. Secondly it also registers all the other sysfs entries (mostly performance monitoring controls) on all possible cpus, although they are quite meaningless on non-online cpus. This patch alters the code to only register sysfs directories at boot for cpus which are either online or could be onlined (cpu is possible, and CONFIG_HOTPLUG_CPU and an lpar machine). Furthermore, the entries apart from 'online' itself and 'physical_id' are only registered for online CPUs (and deregistered again if a cpu goes offline). Currently the ppc64 sysfs code registers an entry for each possible cpu in sysfs, rather than just online cpus. That makes sense, since the sysfs entries are needed to control onlining of the cpus. However, this is done even if CONFIG_HOTPLUG_CPU is not set, or if it is not a hotplug capable (DLPAR) machine, which is a bit misleading. Secondly it also registers all the other sysfs entries (mostly performance monitoring controls) on all possible cpus, although they are quite meaningless on non-online cpus. This patch alters the code to only register sysfs directories at boot for cpus which are either online or could be onlined (cpu is possible, and CONFIG_HOTPLUG_CPU and an lpar machine). Furthermore, the entries apart from 'online' itself and 'physical_id' are only registered for online CPUs (and deregistered again if a cpu goes offline). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Ananth N. Mavinakayanahalli authored
Kprobes (Kernel dynamic probes) is a lightweight mechanism for kernel modules to insert probes into a running kernel, without the need to modify the underlying source. The probe handlers can then be coded to log relevent data at the probe point. More information on kprobes can be found at: http://www-124.ibm.com/developerworks/oss/linux/projects/kprobes/ Jprobes (or jumper probes) is a small infrastructure to access function arguments. It can be used by defining a small stub with the same template as the routine in kernel, within which the required parameters can be logged. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Arthur Othieno authored
Documentation/powerpc/cpu_features.txt mysteriously disappeared sometime when 2.5 forked off. Searching through BK logs on linux.bkbits.net didn't reveal anything, unfortunately. The only reference I could pick up from searching the available lkml archives is the 2.4.20-pre11 ChangeLog where this was first merged. Thus far, nothing indicates it was intentionally removed, and AFAICS, is still up to date with the current code. Signed-off-by: Arthur Othieno <a.othieno@bluewin.ch> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Matt Porter authored
Fixes io_remap_page_range() to use the 32-bit address translator similar to ioremap(). Someday u64 start/end resources should make this unnecessary. Fixes set_pte() to handle a long long pte_t properly. Signed-off-by: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Matt Porter authored
We'd like to get a uImage when just using 'make' on many targets. After some discussion, it made sense to simply add uImage to the default targets since it adds minimal build overhead and will work on all platforms. Also, fix a dependency in the boot stuff. Signed-off-by: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Corey Minyard authored
Add a debugging interface for PowerPC that allows signal handlers (or any jump to a context, really) to perform debug functions. It allows the a user program to turn on single-stepping, for instance, and the thread will get a trap after executing the next instruction. It can also (on supported PPC processors) turn on branch tracing and get a trap after the next branch instruction is executed. This is useful for in-application debugging. Note that you can enable single-stepping on x86 processors directly from signal handlers. Newer x86 processors have the equivalent of a branch-trace bit in the IA32_DEBUGCTL MSR and could have similar function to this syscall. Most other processors could benefit from a similar interface, except for ARM which is extraordinarily broken for debugging. Future uses of this could be adding the ability to set the hardware breakpoint registers from a signal handler. Signed-off-by: Corey Minyard <minyard@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Matt Porter authored
This trivial patch removes long-standing typo in ibm44x.h. In fact, we already have correct DCRN_CPC0_GPIO define later in the same file. Signed-off-by: Eugene Surovegin <ebs@ebshome.net> Signed-off-by: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Matt Porter authored
This patch removes annoying warnings in ebony.c. Fix is similar to one I made in ocotea.c before. Signed-off-by: Eugene Surovegin <ebs@ebshome.net> Signed-off-by: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Kumar Gala authored
Remove prototypes and externs out of the .c files Signed-off-by: Andy Fleming <afleming@freescale.com> Signed-off-by: Kumar Gala <kumar.gala@freescale.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Kumar Gala authored
Adds oprofile support for the e500 PowerPC core. Signed-off-by: Andy Fleming <afleming@freescale.com> Signed-off-by: Kumar Gala <kumar.gala@freescale.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Matt Porter authored
Patch from Eugene to do some cleanup of the PPC4xx PIC code. Separates the interrupts that can have polarity/triggering modified for platform modification if necessary. Between the two of us, it's tested on most of the affected platforms. Signed-off-by: Eugene Surovegin <ebs@ebshome.net> Signed-off-by: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Randy Vinson authored
I've added support for the IBM 750FX and 750GX Eval Boards (Chestnut/Buckeye). Signed-off-by: Randy Vinson <rvinson@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Mark A. Greer authored
This patch adds support for the Artesyn Katana 750i, 752i, and 3750. Signed-off-by: Mark A. Greer <mgreer@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Mark A. Greer authored
This patch adds support for the Force CPCI-690 cPCI board. Signed-off-by: Mark A. Greer <mgreer@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Mark A. Greer authored
This patch adds support for a line of evaluation platforms from Marvell that use the Marvell GT64260[ab] host bridges. This patch depends on the Marvell host bridge support patch (mv64x60). Signed-off-by: Mark A. Greer <mgreer@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Mark A. Greer authored
Here is an incremental patch [hopefully] with your concerns addressed. Note that the arch/ppc/boot code is not kernel code and only exists for a short period of time before execution jumps to the kernel. Signed-off-by: Mark A. Greer <mgreer@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Mark A. Greer authored
This patch adds core support for a line of host bridges from Marvell (formerly Galileo). This code has been tested with a GT64260a, GT64260b, MV64360, and MV64460. Patches for platforms that use these bridges will be sent separately. The patch is rather large so a link is provided. Signed-off-by: Mark A. Greer <mgreer@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Tom Rini authored
The following patch switches ppc32 from using arch/ppc/defconfig to arch/ppc/configs/common_defconfig as a defconfig. These files are supposed to be identical, but always end up out of sync. This also updates the common_defconfig with current options. Signed-off-by: Tom Rini <trini@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Kumar Gala authored
Moves common handling of InstructionStorage, Alignment, Program, and Decrementer exceptions handlers for Book-E processors (44x & e500) into common code. Signed-off-by: Kumar Gala <kumar.gala@freescale.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Kumar Gala authored
Updates the Freescale Book-E MMU usage to match the architecture spec. This is mainly growing the widths of fields in various registers to match the architecture spec instead of the implementation. Signed-off-by: Becky Gill <becky.gill@freescale.com> Signed-off-by: Kumar Gala <kumar.gala@freescale.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Martin Josefsson authored
Here's a patch that fixes a pretty serious bug introduced by a recent "bugfix". The problem is that RST packets are ignored if they follow an ACK packet, this means that the timeout of the connection isn't decreased, so we get lots of old connections lingering around until the timeout expires, the default timeout for state ESTABLISHED is 5 days. This needs to go into -bk as soon as possible. The bug is present in 2.6.10 as well. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Rusty Russell authored
When ipt_registrater_match() fails, ipt_recent doesn't remove its proc entry. Found by nfsim. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Rusty Russell authored
After changing firewall rules, we try to return the counters to userspace. We didn't fail at that point if the copy failed, but it doesn't really matter. Someone added a warn_unused_result attribute to copy_to_user, so we get bogus warnings. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Rusty Russell authored
We've been threatening to do this for ages: remove the backwards compatibility code. We can now combine ip_conntrack_core.c and ip_conntrack_standalone.c, likewise for the NAT code, but that will come later. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Rusty Russell authored
I removed this code in a previous patch, and Patrick McHardy explained what was wrong. Add a comment. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Rusty Russell authored
Several places use ip_ct_selective_cleanup() as a general iterator, which it was not intended for (it takes a const ip_conntrack *). So rename it, and make it take a non-const argument. Also, it missed unconfirmed connections, which aren't in the hash table. This introduces a potential problem for users which expect to iterate all connections (such as the helper deletion code). So keep a linked list of unconfirmed connections as well. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Rusty Russell authored
On failure from register_sysctl_table, we return with exit 0. Oops. init and fini should also be static. nfsim found these. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Rusty Russell authored
Peejix's nfsim test for ipt_recent, written two days ago, revealed this bugs with ipt_recent: checkentry() returns true or false, not an error. (Maybe it should, but that's a much larger change). Also, make hash_func() static. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Brent Casavant authored
Modifies the TCP ehash and TCP bhash to enable the use of vmalloc to alleviate boottime memory allocation imbalances on NUMA systems, utilizing flags to the alloc_large_system_hash routine in order to centralize the enabling of this behavior. Signed-off-by: Brent Casavant <bcasavan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Brent Casavant authored
The following patch modifies the dentry cache and inode cache to enable the use of vmalloc to alleviate boottime memory allocation imbalances on NUMA systems, utilizing flags to the alloc_large_system_hash routine in order to centralize the enabling of this behavior. In general, for each hash, we check at the early allocation point whether hash distribution is enabled, and if so we defer allocation. At the late allocation point we perform the allocation if it was not earlier deferred. These late allocation points are the same points utilized prior to the addition of alloc_large_system_hash to the kernel. Signed-off-by: Brent Casavant <bcasavan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Brent Casavant authored
NUMA systems running current Linux kernels suffer from substantial inequities in the amount of memory allocated from each NUMA node during boot. In particular, several large hashes are allocated using alloc_bootmem, and as such are allocated contiguously from a single node each. This becomes a problem for certain workloads that are relatively common on big-iron HPC NUMA systems. In particular, a number of MPI and OpenMP applications which require nearly all available processors in the system and nearly all the memory on each node run into difficulties. Due to the uneven memory distribution onto a few nodes, any thread on those nodes will require a portion of its memory be allocated from remote nodes. Any access to those memory locations will be slower than local accesses, and thereby slows down the effective computation rate for the affected CPUs/threads. This problem is further amplified if the application is tightly synchronized between threads (as is often the case), as they entire job can run only at the speed of the slowest thread. Additionally since these hashes are usually accessed by all CPUS in the system, the NUMA network link on the node which hosts the hash experiences disproportionate traffic levels, thereby reducing the memory bandwidth available to that node's CPUs, and further penalizing performance of the threads executed thereupon. As such, it is desired to find a way to distribute these large hash allocations more evenly across NUMA nodes. Fortunately current kernels do perform allocation interleaving for vmalloc() during boot, which provides a stepping stone to a solution. This series of patches enables (but does not require) the kernel to allocate several boot time hashes using vmalloc rather than alloc_bootmem, thereby causing the hashes to be interleaved amongst NUMA nodes. In particular the dentry cache, inode cache, TCP ehash, and TCP bhash have been changed to be allocated in this manner. Due to the limited vmalloc space on architectures such as i386, this behavior is turned on by default only for IA64 NUMA systems (though there is no reason other interested architectures could not enable it if desired). Non-IA64 and non-NUMA systems continue to use the existing alloc_bootmem() allocation mechanism. A boot line parameter "hashdist" can be set to override the default behavior. The following two sets of example output show the uneven distribution just after boot, using init=/bin/sh to eliminate as much non-kernel allocation as possible. Without the boot hash distribution patches: Nid MemTotal MemFree MemUsed (in kB) 0 3870656 3697696 172960 1 3882992 3866656 16336 2 3883008 3866784 16224 3 3882992 3866464 16528 4 3883008 3866592 16416 5 3883008 3866720 16288 6 3882992 3342176 540816 7 3883008 3865440 17568 8 3882992 3866560 16432 9 3883008 3866400 16608 10 3882992 3866592 16400 11 3883008 3866400 16608 12 3882992 3866400 16592 13 3883008 3866432 16576 14 3883008 3866528 16480 15 3864768 3848256 16512 ToT 62097440 61152096 945344 Notice that nodes 0 and 6 have a substantially larger memory utilization than all other nodes. With the boot hash distribution patch: Nid MemTotal MemFree MemUsed (in kB) 0 3870656 3789792 80864 1 3882992 3843776 39216 2 3883008 3843808 39200 3 3882992 3843904 39088 4 3883008 3827488 55520 5 3883008 3843712 39296 6 3882992 3843936 39056 7 3883008 3844096 38912 8 3882992 3843712 39280 9 3883008 3844000 39008 10 3882992 3843872 39120 11 3883008 3843872 39136 12 3882992 3843808 39184 13 3883008 3843936 39072 14 3883008 3843712 39296 15 3864768 3825760 39008 ToT 62097440 61413184 684256 While not perfectly even, we can see that there is a substantial improvement in the spread of memory allocated by the kernel during boot. The remaining uneveness may be due in part to further boot time allocations that could be addressed in a similar manner, but some difference is due to the somewhat special nature of node 0 during boot. However the uneveness has fallen to a much more acceptable level (at least to a level that SGI isn't concerned about). The astute reader will also notice that in this example, with this patch approximately 256 MB less memory was allocated during boot. This is due to the size limits of a single vmalloc. More specifically, this is because the automatically computed size of the TCP ehash exceeds the maximum size which a single vmalloc can accomodate. However this is of little practical concern as the vmalloc size limit simply reduces one ridiculously large allocation (512MB) to a slightly less ridiculously large allocation (256MB). In practice machines with large memory configurations are using the thash_entries setting to limit the size of the TCP ehash _much_ lower than either of the automatically computed values. Illustrative of the exceedingly large nature of the automatically computed size, SGI currently recommends that customers boot with thash_entries=2097152, which works out to a 32MB allocation. In any case, setting hashdist=0 will allow for allocations in excess of vmalloc limits, if so desired. Other than the vmalloc limit, great care was taken to ensure that the size of TCP hash allocations was not altered by this patch. Due to slightly different computation techniques between the existing TCP code and alloc_large_system_hash (which is now utilized), some of the magic constants in the TCP hash allocation code were changed. On all sizes of system (128MB through 64GB) that I had access to, the patched code preserves the previous hash size, as long as the vmalloc limit (256MB on IA64) is not encountered. There was concern that changing the TCP-related hashes to use vmalloc space may adversely impact network performance. To this end the netperf set of benchmarks was run. Some individual tests seemed to benefit slightly, some seemed to be harmed slightly, but in all cases the average difference with and without these patches was well within the variabilty I would see from run to run. The following is the overall netperf averages (30 10 second runs each) against an older kernel with these same patches. These tests were run over loopback as GigE results were so inconsistent run to run both with and without these patches that they provided no meaningful comparison that I could discern. I used the same kernel (IA64 generic) for each run, simply varying the new "hashdist" boot parameter to turn on or off the new allocation behavior. In all cases the thash_entries value was manually specified as discussed previously to eliminate any variability that might result from that size difference. HP ZX1, hashdist=0 ================== TCP_RR = 19389 TCP_MAERTS = 6561 TCP_STREAM = 6590 TCP_CC = 9483 TCP_CRR = 8633 HP ZX1, hashdist=1 ================== TCP_RR = 19411 TCP_MAERTS = 6559 TCP_STREAM = 6584 TCP_CC = 9454 TCP_CRR = 8626 SGI Altix, hashdist=0 ===================== TCP_RR = 16871 TCP_MAERTS = 3925 TCP_STREAM = 4055 TCP_CC = 8438 TCP_CRR = 7750 SGI Altix, hashdist=1 ===================== TCP_RR = 17040 TCP_MAERTS = 3913 TCP_STREAM = 4044 TCP_CC = 8367 TCP_CRR = 7538 I believe the TCP_CC and TCP_CRR are the tests most sensitive to this particular change. But again, I want to emphasize that even the differences you see above are _well_ within the variability I saw from run to run of any given test. In addition, Jose Santos at IBM has run specSFS, which has been particularly sensitive to TLB issues, against these patches and saw no performance degredation (differences down in the noise). This patch: Modifies alloc_large_system_hash to enable the use of vmalloc to alleviate boottime allocation imbalances on NUMA systems. Due to limited vmalloc space on some architectures (i.e. x86), the use of vmalloc is enabled by default only on NUMA IA64 kernels. There should be no problem enabling this change for any other interested NUMA architecture. Signed-off-by: Brent Casavant <bcasavan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Alex Williamson authored
I noticed the function __read_page_state() curiously high in a q-tools profile of a write to a software raid0 device. Seems this is because we're checking page_states for all possible cpus and we have NR_CPUS possible when CONFIG_HOTPLUG_CPU=y. The default config for ia64 is now NR_CPUS=512, so on a little 8-way box, this is a significant waste of time. The patch below updates __read_page_state() and __get_page_state() to only count page_state info for online cpus. To keep the stats consistent, the page_alloc notifier is updated to move page_states off of the cpu going offline. On my profile, this dropped __read_page_state() back into the noise and boosted block write performance by 5% (as measured by spew - http://spew.berlios.de). Signed-off-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Manfred Spraul authored
Add ARCH_SLAB_MINALIGN and document ARCH_KMALLOC_MINALIGN: The flags allow the arch code to override the default minimum object aligment (BYTES_PER_WORD). Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Andrew Morton authored
mark_page_accessed() is more heavyweight than we need: the page is already headed for the active list, so setting the software-referenced bit is equivalent. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Miquel van Smoorenburg authored
When reading a (partial) page from disk using read(), the kernel only marks the page as "accessed" if the read started at a page boundary. This means that files that are accessed randomly at non-page boundaries (usually database style files) will not be cached properly. The patch below uses the readahead state instead. If a page is read(), it is marked as "accessed" if the previous read() was for a different page, whatever the offset in the page. Testing results: - Boot kernel with mem=128M - create a testfile of size 8 MB on a partition. Unmount/mount. - then generate about 10 MB/sec streaming writes for i in `seq 1 1000` do dd if=/dev/zero of=junkfile.$i bs=1M count=10 sync cat junkfile.$i > /dev/null sleep 1 done - use an application that reads 128 bytes 64000 times from a random offset in the 64 MB testfile. 1. Linux 2.6.10-rc3 vanilla, no streaming writes: # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.03s user 0.22s system 5% cpu 4.456 total 2. Linux 2.6.10-rc3 vanilla, streaming writes: # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.03s user 0.16s system 2% cpu 7.667 total # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.03s user 0.37s system 1% cpu 23.294 total # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.02s user 0.99s system 1% cpu 1:11.52 total # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.03s user 0.21s system 2% cpu 10.273 total 3. Linux 2.6.10-rc3 with read-page-access.patch , streaming writes: # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.02s user 0.21s system 3% cpu 7.634 total # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.04s user 0.22s system 2% cpu 9.588 total # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.02s user 0.12s system 24% cpu 0.563 total # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.03s user 0.13s system 98% cpu 0.163 total As expected, with the read-page-access.patch, the kernel keeps the 8 MB testfile cached as expected, while without it, it doesn't. So this is useful for workloads where one smallish (wrt RAM) file is read randomly over and over again (like heavily used database indexes), while other I/O is going on. Plain 2.6 caches those files poorly, if the app uses plain read(). Signed-Off-By: Miquel van Smoorenburg <miquels@cistron.nl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-
Dave Hansen authored
When CONFIG_HIGHMEM=y, but ZONE_NORMAL isn't quite full, there is, of course, no actual memory at *high_memory. This isn't a problem with normal virt<->phys translations because it's never dereferenced, but CONFIG_NONLINEAR is a bit more finicky. So, don't do virt_to_phys() to non-existent addresses. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
-