- 12 Apr, 2004 40 commits
-
-
Andrew Morton authored
From: Gerd Knorr <kraxel@bytesex.org> The VIDIOC_CROPCAP ioctl had wrong R/W bits, this patch fixes it.
-
Andrew Morton authored
From: Andreas Gruenbacher <agruen@suse.de> Return EOPNOTSUPP rather than EINVAL when we discover an ACL version mismatch.
-
Andrew Morton authored
From: Olaf Kirch <okir@suse.de> Some NFS clients respond badly to a TCP connection being reset immediately after it has been accepted so: - Accept more connections before starting to drop them - Always drop the oldest connection - Random Early Drop doesn't really help here, and can hurt - ratelimit the friendly warnings.
-
Andrew Morton authored
From: Andrea Arcangeli <andrea@suse.de> We need a barrier before checking for kthread_should_stop in do_stop.
-
Andrew Morton authored
From: Davide Libenzi <davidel@xmailserver.org> When I split evenpoll_release() in an inline fast path plus an eventpoll_release_file() slow path, I forgot to change comments.
-
Andrew Morton authored
From: Romain Francoise <romain@orebokech.com> arch/sparc64/mm/hugetlbpage.c does not include linux/module.h so EXPORT_SYMBOL prints out warnings, and since sparc64 Makefiles have -Werror, the build fails.
-
Andrew Morton authored
From: Olof Johansson <olof@austin.ibm.com> As discussed on the ppc64 list yesterday and today: On some ppc64 systems, Open Firmware will give memory device nodes that are only 16MB in size, instead of the 256MB that our NUMA code currently expects (see MEMORY_INCREMENT in mmzone.h). Just changing the defines from 256MB to 16MB makes the table blow up from 32KB to 512KB, so this patch also makes it dynamically allocated based on actual memory size. Since all this is done before (well, during) bootmem init so we need to use lmb_alloc(). Finally, there's no need to use a full int for node ID. Current max is 16 nodes, so a signed char still leaves plenty of room to grow.
-
Andrew Morton authored
From: Ingo Molnar <mingo@elte.hu> Dave reported that /proc/*/status sometimes shows 101% as LoadAVG, which makes no sense. the reason of the bug is slightly incorrect scaling of the load_avg value. The patch below fixes this.
-
Andrew Morton authored
From: Arjan van de Ven <arjanv@redhat.com> Below is a patch to enable 4Kb stacks for x86. The goal of this is to 1) Reduce footprint per thread so that systems can run many more threads (for the java people) 2) Reduce the pressure on the VM for order > 0 allocations. We see real life workloads (granted with 2.4 but the fundamental fragmentation issue isn't solved in 2.6 and isn't solvable in theory) where this can be a problem. In addition order > 0 allocations can make the VM "stutter" and give more latency due to having to do much much more work trying to defragment The first 2 bits of the patch actually affect compiler options in a generic way: I propose to disable the -funit-at-a-time feature from gcc. With this enabled (and it's default with -O2), gcc will very agressively inline functions, which is nice and all for userspace, but for the kernel this makes us suffer a gcc deficiency more: gcc is extremely bad at sharing stackslots, for example a situation like this: if (some_condition) function_A(); else function_B(); with -funit-at-a-time, both function_A() and _B() might get inlined, however the stack usage of both functions of the parent function grows the stack usage of both functions COMBINED instead of the maximum of the two. Even with the normal 8Kb stacks this is a danger since we see some functions grow 3Kb to 4Kb of stack use this way. With 4Kb stacks, 4Kb of stack usage growth obviously is deadly ;-( but even with 8Kb stacks it's pure lottery. Disabling -funit-at-a-time also exposes another thing in the -mm tree; the attribute always_inline is considered harmful by gcc folks in that when gcc makes a decision to NOT inline a function marked this way, it throws an error. Disabling -funit-at-a-time disables some of the agressive inlining (eg of large functions that come later in the .c file) so this would make your tree not compile. The 4k stackness of the kernel is included in modversions, so people don't load 4k-stack modules into 8k-stack kernels. At present 4k stacks are selectable in config. When the feature has settled in we should remove the 8k option. This will break the nvidia modules. But Fedora uses 4k stacks so a new nvidia driver is expected soon.
-
Andrew Morton authored
drivers/acpi/events/evmisc.c: In function `acpi_ev_queue_notify_request': drivers/acpi/events/evmisc.c:143: warning: too many arguments for format
-
Andrew Morton authored
The filempa_nopage() logic will go into a tight loop if do_page_cache_readahead() doesn't actually start I/O against the target page. This can happen if the disk is read-congested, or if the filesystem doesn't want to read that part of the file for some reason. We will accidentally break out of the loop because (ra->mmap_miss > ra->mmap_hit + MMAP_LOTSAMISS) will eventually become true. Fix that up.
-
Andrew Morton authored
Remove the hardwired pagefault readaround distance in filemap_nopage() and use the per-file readahead setting. The main reason for this is in fact laptop-mode. If you want to prevent the disk from spinning up then you want all of your application's pages to be pulled into memory in one hit. Otherwise the disk will spin up each time you use a new part of whatever application(s) you are running.
-
Andrew Morton authored
From: Bart Samwel <bart@samwel.tk> Add support for the value "0" to ext3's "commit" option. When this value is given, ext3 substitutes it by the default commit interval. Introduce a constant JBD_DEFAULT_MAX_COMMIT_AGE for this.
-
Andrew Morton authored
From: Bart Samwel <bart@samwel.tk> Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop". In this mode the kernel will attempt to avoid spinning disks up. Algorithm: the idea is to hold dirty data in memory for a long time, but to flush everything which has been accumulated if the disk happens to spin up for other reasons. - Whenever a disk request completes (read or write), schedule a timer a few seconds hence. If the timer was already pending, reset it to a few seconds hence. - When the timer expires, write back the whole world. We use sync_filesystems() for this because it will force ext3 journal commits as well. - In balance_dirty_pages(), kick off background writeback when we hit the high threshold (dirty_ratio), not when we hit the low threshold. This has the effect of causing "lumpy" writeback which is something I spent a year fixing, but in laptop mode, it is desirable. - In try_to_free_pages(), only kick pdflush if the VM is getting into distress: we want to keep scanning for clean pages, deferring writeback. - In page reclaim, avoid writing back the odd random dirty page off the LRU: only start I/O if the scanning is working harder. The effect is to perform a sync() a few seconds after all I/O has ceased. The value which was written into /proc/sys/vm/laptop-mode determines, in seconds, the delay between the final I/O and the flush. Additionally, the patch adds tools which help answer the question "why the heck does my disk spin up all the time?". The user may set /proc/sys/vm/block_dump to a non-zero value and the kernel will print out information which will identify the process which is performing disk reads or which is dirtying pagecache. The user should probably disable syslogd before setting block-dump.
-
Andrew Morton authored
This is always equal to constant zero.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> rmap's try_to_unmap_one comments on find_vma failure, that a page may temporarily be absent from a vma during mremap: no longer, though it is still possible for this find_vma to fail, while unmap_vmas drops page_table_lock (but that is no problem for file truncation).
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> mremap's move_vma should think ahead to lessen the chance of failure during its rewind on failure: running out of memory always possible, but it's silly for it to embark when it's near the map_count limit.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Subtle point from Rajesh Venkatasubramanian: when mremap's move_vma fails and so rewinds, before moving the file-based ptes back, we must move new_vma before old vma in the i_mmap or i_mmap_shared list, so that when racing against vmtruncate we cannot propagate pages to be truncated back from new_vma into the just cleaned old_vma.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Partial rewrite of mremap's move_vma. Rajesh Venkatasubramanian has pointed out that vmtruncate could miss ptes, leaving orphaned pages, because move_vma only made the new vma visible after filling it. We see no good reason for that, and time to make move_vma more robust. Removed all its vma merging decisions, leave them to mmap.c's vma_merge, with copy_vma added. Removed duplicated is_mergeable_vma test from vma_merge, and duplicated validate_mm from insert_vm_struct. move_vma move from old to new then unmap old; but on error move back from new to old and unmap new. Don't unwind within move_page_tables, let move_vma call it explicitly to unwind, with the right source vma. Get the VM_ACCOUNTing right even when the final do_munmap fails.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Clean up mremap move's copy_one_pte: - get_one_pte_map_nested already weeded out the pte_none case, now don't even call copy_one_pte if it has nothing to do. - check pfn_valid before passing page to page_remove_rmap.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> First of six patches against 2.6.5-rc3, cleaning up mremap's move_vma, and fixing truncation orphan issues raised by Rajesh Venkatasubramanian. Originally done as part of the anonymous objrmap work on mremap move, but useful fixes now extracted for mainline. The mremap changes need some exposure in the -mm tree first, but the first (fork one-liner) is safe enough to go straight into 2.6.5. From: Rajesh Venkatasubramanian. Despite the comment that child vma should be inserted just after parent vma, 2.5.6 did exactly the reverse: thus a racing vmtruncate may free the child's ptes, then advance to the parent, and meanwhile copy_page_range has propagated more ptes from the parent to the child, leaving file pages still mapped after truncation.
-
Andrew Morton authored
The compound page logic is a little fragile - it relies on additional metadata in the pageframes which some other kernel code likes to stomp on (xfs was doing this). Also, because we're treating all higher-order pages as compound pages it is no longer possible to free individual lower-order pages from the middle of higher-order pages. At least one ARM driver insists on doing this. We only really need the compound page logic for higher-order pages which can be mapped into user pagetables and placed under direct-io. This covers hugetlb pages and, conceivably, soundcard DMA buffers which were allcoated with a higher-order allocation but which weren't marked PageReserved. The patch arranges for the hugetlb implications to allocate their pages with compound page metadata, and all other higher-order allocations go back to the old way. (Andrea supplied the GFP_LEVEL_MASK fix)
-
Andrew Morton authored
Rework the code layout a bit. No logic change.
-
Andrew Morton authored
From: Jens Axboe <axboe@suse.de> Takashi did some nice latency testing of the current kernel (with -mm writeback changes), and the biggest offender in general core is mpage_writepages().
-
Andrew Morton authored
The radix-tree walk for writeback has a couple of problems: a) It always scans a file from its first dirty page, so if someone is repeatedly dirtying the front part of a file, pages near the end may be starved of writeout. (Well, not completely: the `kupdate' function will write an entire file once the file's dirty timestamp has expired). b) When the disk queues are huge (10000 requests), there can be a very large number of locked pages. Scanning past these in writeback consumes quite some CPU time. So in each address_space we record the index at which the last batch of writeout terminated and start the next batch of writeback from that point.
-
Andrew Morton authored
If pdflush hits a locked-and-clean buffer in __block_write_full_page() it will just pass over the buffer. Typically the buffer is an ext3 data=ordered buffer which is being written by kjournald, but a similar thing can happen with blockdev buffers and ll_rw_block(). This is bad because the buffer is still under I/O and a subsequent fsync's fdatawait() needs to know about it. It is not practical to tag the page for writeback - only the submitter of the I/O can do that, because the submitter has control of the end_io handler. So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on the underway I/O. There is a risk that pdflush::background_writeout() will lock up, repeatedly trying and failing to write the same page. This is prevented by ensuring that background_writeout() always throttles when it made no progress.
-
Andrew Morton authored
fdatasync can fail to wait on some pages due to a race. If some task (eg pdflush) is flushing the same mapping it can remove a page's dirty tag but not then mark that page as being under writeback, because pdflush hit a locked buffer in __block_write_full_page(). This will happen because kjournald is writing the buffer. In this situation __block_write_full_page() will redirty the page so that fsync notices it, but there is a window where the page eludes the radix tree dirty page walk. Consequently a concurrent fsync will fail to notice the page when walking the radix tree's dirty pages. The approach taken by this patch is to leave the page marked as dirty in the radix tree while ->writepage is working out what to do with it. This ensures that a concurrent write-for-sync will successfully locate the page and will then block in lock_page() until the non-write-for-sync code has finished altering the page state.
-
Andrew Morton authored
Remove the now-unneeded page.list field.
-
Andrew Morton authored
Switch the m68k pointer-table code over to page->lru.
-
Andrew Morton authored
Switch the ARM `small_page' code over to page->lru.
-
Andrew Morton authored
The compound page logic is using page->lru, and these get will scribbled on in various places so switch the Compound page logic over to using ->mapping and ->private.
-
Andrew Morton authored
The address_space.readapges() function currently takes a list of pages, strung together via page->list. Switch it to using page->lru. This changes the API into filesystems.
-
Andrew Morton authored
Switch it to ->lru
-
Andrew Morton authored
Switch them over to page.lru
-
Andrew Morton authored
Switch the page allocator over to using page.lru for the buddy lists.
-
Andrew Morton authored
slab.c is using page->list. Switch it over to using page->lru so we can remove page.list.
-
Andrew Morton authored
This code is playing with page->lru from pages which came from slab. But to remove page->list we need to convert slab over to using page->lru. So we cannot allow the i386 pagetable code to go scribbling on the ->lru field of active slab pages. This optimisation was pretty thin, and it is more important to shrink the pageframe (on all architectures).
-
Andrew Morton authored
Remove remaining references to address_space.clean_pages.
-
Andrew Morton authored
Instead, use a radix-tree walk of the pages which are tagged as being under writeback. The new function wait_on_page_writeback_range() was generalised out of filemap_fdatawait(). We can later use this to provide concurrent fsync of just a section of a file.
-
Andrew Morton authored
Now remove address_space.io_pages.
-