- 12 Mar, 2004 40 commits
-
-
Andrew Morton authored
As kswapd is now scanning zones in the highmem->normal->dma direction it can get into competition with the page allocator: kswapd keeps on trying to free pages from highmem, then kswapd moves onto lowmem. By the time kswapd has done proportional scanning in lowmem, someone has come in and allocated a few pages from highmem. So kswapd goes back and frees some highmem, then some lowmem again. But nobody has allocated any lowmem yet. So we keep on and on scanning lowmem in response to highmem page allocations.

With a simple `dd' on a 1G box we get:

     r  b swpd   free  buff  cache  si  so  bi    bo   in   cs us sy wa id
     0  3    0  59340  4628 922348   0   0   4 28188 1072  808  0 10 46 44
     0  3    0  29932  4660 951760   0   0   0 30752 1078  441  1  6 30 64
     0  3    0  57568  4556 924052   0   0   0 30748 1075  478  0  8 43 49
     0  3    0  29664  4584 952176   0   0   0 30752 1075  472  0  6 34 60
     0  3    0   5304  4620 976280   0   0   4 40484 1073  456  1  7 52 41
     0  3    0 104856  4508 877112   0   0   0 18452 1074   97  0  7 67 26
     0  3    0  70768  4540 911488   0   0   0 35876 1078  746  0  7 34 59
     1  2    0  42544  4568 939680   0   0   0 21524 1073  556  0  5 43 51
     0  3    0   5520  4608 976428   0   0   4 37924 1076  836  0  7 41 51
     0  2    0   4848  4632 976812   0   0  32 12308 1092   94  0  1 33 66

Simple fix: go back to scanning the zones in the dma->normal->highmem direction so we meet the page allocator in the middle somewhere.

     r  b swpd   free  buff  cache  si  so  bi    bo   in   cs us sy wa id
     1  3    0   5152  3468 976548   0   0   4 37924 1071  650  0  8 64 28
     1  2    0   4888  3496 976588   0   0   0 23576 1075  726  0  6 66 27
     0  3    0   5336  3532 976348   0   0   0 31264 1072  708  0  8 60 32
     0  3    0   6168  3560 975504   0   0   0 40992 1072  683  0  6 63 31
     0  3    0   4560  3580 976844   0   0   0 18448 1073  233  0  4 59 37
     0  3    0   5840  3624 975712   0   0   4 26660 1072  800  1  8 46 45
     0  3    0   4816  3648 976640   0   0   0 40992 1073  526  0  6 47 47
     0  3    0   5456  3672 976072   0   0   0 19984 1070  320  0  5 60 35
-
Andrew Morton authored
Currently kswapd walks across all zones in dma->normal->highmem order, performing proportional scanning until all zones are OK. This means that pressure against ZONE_NORMAL causes unnecessary reclaim of ZONE_HIGHMEM. To fix that up we change kswapd so that it walks the zones in the high->normal->dma direction, skipping zones which are OK. Once it encounters a zone which needs some reclaim kswapd will perform proportional scanning against that zone as well as all the succeeding lower zones. We scan the lower zones even if they have sufficient free pages. This is because
a) the lower zone may be above pages_high, but because of the incremental min, the lower zone may still not be eligible for allocations. That's bad because cache in that lower zone will then not be scanned at the correct rate.
b) pages in this lower zone are usable for allocations against the higher zone. So we do want to scan all the relevant zones at an equal rate.
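The walk described above can be sketched roughly as follows. This is a minimal illustration, not the kernel's code: `struct zone_info` and `highest_zone_to_scan()` are hypothetical names, and the "zone is OK" test is assumed to be a simple comparison of free pages against the `pages_high` watermark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's per-zone state. */
struct zone_info {
    unsigned long free_pages;
    unsigned long pages_high;   /* watermark above which the zone is "OK" */
};

/*
 * zones[] is ordered dma, normal, highmem (index 0 is the lowest zone).
 * Walk from the highest zone downwards, skipping zones which are OK, and
 * return the index of the first zone that needs reclaim. That zone and
 * every lower zone (indices 0..result) are then scanned proportionally.
 * Returns -1 when every zone is OK and nothing needs scanning.
 */
int highest_zone_to_scan(const struct zone_info *zones, int nr_zones)
{
    assert(zones != NULL && nr_zones > 0);
    for (int i = nr_zones - 1; i >= 0; i--) {
        if (zones[i].free_pages <= zones[i].pages_high)
            return i;
    }
    return -1;
}
```

With this shape, pressure on the normal zone alone yields index 1, so highmem (index 2) is left untouched while dma and normal are both scanned.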
-
Andrew Morton authored
- If max_scan evaluates to zero due to a very small inactive list and high `priority' numbers, we don't want to throttle yet.
- In balance_pgdat(), we may end up not scanning any pages because all zones happened to be above pages_high. Avoid throttling in this case too.
-
Andrew Morton authored
When page reclaim is working out how many pages to scan in a zone (max_scan) it presently rounds that number up if it looks too small - for work batching. Problem is, this can result in excessive scanning against small zones which have few inactive pages. So remove it. Note that it is possible for max_scan to be zero. That's OK - it'll become non-zero as the priority increases.
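To see why max_scan can be zero, here is a small sketch. It assumes the scan target is the inactive list length shifted right by the numeric priority value (so a numerically smaller value means more urgent reclaim). The helper name `max_scan_for()` is made up for illustration.

```c
#include <assert.h>

/*
 * Assumed model: the number of pages to scan is the size of the
 * inactive list shifted right by the numeric priority value.
 * No rounding up is applied, so small zones can yield zero.
 */
unsigned long max_scan_for(unsigned long nr_inactive, int priority)
{
    assert(priority >= 0 && priority < 32);
    return nr_inactive >> priority;
}
```

Under that assumption a zone with 1000 inactive pages gets no scanning at a priority value of 12, one page at 9, and the full list at 0.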
-
Andrew Morton authored
Page reclaim is currently a bit schizo: sometimes we say "go and scan this many pages and tell me how many pages were freed" and at other times we say "go and scan this many pages, but stop if you freed this many". It makes the logic harder to control and to understand. This patch converts everything into the "go and scan this many pages and tell me how many pages were freed" model. It doesn't seem to affect performance much either way.
-
Andrew Morton authored
We currently have a problem with the balancing of reclaim between zones: much more reclaim happens against highmem than against lowmem. This patch partially fixes this by changing the direct reclaim path so it does not bail out of the zone walk after having reclaimed sufficient pages from highmem: go on to reclaim from lowmem regardless of how many pages we reclaimed from highmem.
-
Andrew Morton authored
The patch which went in six months or so back which said "only reclaim slab if we're scanning lowmem pagecache" was wrong. I must have been asleep at the time. We do need to scan slab in response to highmem page reclaim as well. Because all the math is based around the total amount of memory in the machine, and we know that if we're performing highmem page reclaim then the lower zones have no free memory.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> The logic which calculates the number of pages which were scanned is mucked up. Fix.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> In try_to_free_pages(), put even pressure on the slab even if we have reclaimed enough pages from the LRU.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> In shrink_slab(), do the multiply before the divide to avoid losing precision.
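The precision point can be illustrated with plain integer arithmetic. The function and parameter names below are hypothetical and this is not the actual shrink_slab() formula. It only shows the effect of the ordering when scaling an object count by a scanned/total ratio.

```c
#include <assert.h>
#include <stdint.h>

/* Divide first: the ratio truncates to zero whenever scanned < total. */
uint64_t scale_divide_first(uint64_t scanned, uint64_t total, uint64_t objects)
{
    assert(total != 0);
    return (scanned / total) * objects;
}

/*
 * Multiply first: the fraction survives until the final division.
 * A 64-bit intermediate is used because the product can be large.
 */
uint64_t scale_multiply_first(uint64_t scanned, uint64_t total, uint64_t objects)
{
    assert(total != 0);
    return (scanned * objects) / total;
}
```

For scanned = 32, total = 1000 and objects = 500, dividing first gives 0 while multiplying first gives 16.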
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> If refill_inactive_zone() is running in its dont-reclaim-mapped-memory mode we are tossing away the referenced information on active mapped pages. So put that info back if we're not going to deactivate the page.
-
Andrew Morton authored
The logic in balance_pgdat() is all bollixed up.

- the incoming arg `nr_pages' should be used to determine if we're being asked to free a specific number of pages, not `to_free'.
- local variable `to_free' is not appropriate for the determination of whether we failed to bring all zones to appropriate free pages levels.

Fix this by correctly calculating `all_zones_ok' and then use all_zones_ok to determine whether we need to throttle kswapd.

So the logic now is:

    for (increasing priority) {
        all_zones_ok = 1;
        for (all zones) {
            to_reclaim = number of pages to try to reclaim from this zone;
            max_scan = number of pages to scan in this pass
                       (gets larger as `priority' decreases)
            /*
             * set `reclaimed' to the number of pages which were
             * actually freed up
             */
            reclaimed = scan(max_scan pages);
            reclaimed += shrink_slab();

            to_free -= reclaimed;    /* for the `nr_pages>0' case */

            /*
             * If this scan failed to reclaim `to_reclaim' or more
             * pages, we're getting into trouble.  Need to scan
             * some more, and throttle kswapd.  Note that this
             * zone may now have sufficient free pages due to
             * freeing activity by some other process.  That's
             * OK - we'll pick that info up on the next pass
             * through the loop.
             */
            if (reclaimed < to_reclaim)
                all_zones_ok = 0;
        }
        if (to_free > 0)
            continue;    /* swsusp: need to do more work */
        if (all_zones_ok)
            break;       /* kswapd is done */
        /*
         * OK, kswapd is getting into trouble.  Take a nap, then take
         * another pass across the zones.
         */
        blk_congestion_wait();
    }
-
Andrew Morton authored
From: Nikita Danilov <Nikita@Namesys.COM> Now that the decision to reclaim mapped memory is taken on the basis of zone->prev_priority, the priority argument is no longer needed.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> The addition of the smp_mb and the other change is to try to close the window for races a bit. Obviously they can still happen, it's a racy interface and it doesn't matter much.
-
Andrew Morton authored
Teach blk_congestion_wait() to return the number of jiffies remaining. This is for debug, but it is also nicely consistent.
-
Andrew Morton authored
To check on zone balancing, split the /proc/vmstat:pgsteal, pgreclaim, pgalloc and pgscan stats into per-zone counters. Additionally, split the pgscan stats into pgscan_direct and pgscan_kswapd to see who's doing how much scanning. And add a metric for the number of slab objects which were scanned.
-
Andrew Morton authored
From: Paul Fulghum <paulkf@microgate.com> * track driver API changes * remove cast (kernel janitor)
-
Andrew Morton authored
From: Paul Fulghum <paulkf@microgate.com> * Track driver API changes * Remove cast (kernel janitor)
-
Andrew Morton authored
From: Paul Fulghum <paulkf@microgate.com> Patch for synclinkmp.c * Track driver API changes * Remove cast (kernel janitor) * Replace page_free call with kfree (to match kmalloc allocation)
-
Andrew Morton authored
From: Chris Mason <mason@suse.com> mempool_alloc() and mempool_free() check pool->curr_nr without any locks held. This can lead to skipping a wakeup when there are people waiting, and sleeping when there are free elements in the pool. I can't trigger this reliably, but sooner or later someone on ppc is probably going to hit it.
-
Andrew Morton authored
From: Geert Uytterhoeven <geert@linux-m68k.org> M68k interrupt management: rename routines to not confuse them with syscalls - sys_{request,free}_irq() -> cpu_{request,free}_irq() - q40_sys_default_handler[] -> q40_default_handler - sys_default_handler() -> default_handler()
-
Andrew Morton authored
From: Geert Uytterhoeven <geert@linux-m68k.org> Mac IDE: Make sure the core IDE driver doesn't try to request the MMIO ports a second time, since this will fail.
-
Andrew Morton authored
From: Geert Uytterhoeven <geert@linux-m68k.org> Apollo fb: Add sysfs support (from James Simmons)
-
Andrew Morton authored
From: Geert Uytterhoeven <geert@linux-m68k.org> Amiga Framemaster II fb: Add sysfs support (from James Simmons)
-
Andrew Morton authored
From: Geert Uytterhoeven <geert@linux-m68k.org> Add missing implementation for non-atomic __test_and_set_bit()
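For readers unfamiliar with the non-atomic variant, a generic portable sketch looks like this. It is not the m68k implementation (which may use different bit numbering or inline assembly), and the function name is chosen here to avoid clashing with the kernel's.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Non-atomic test-and-set: a plain read-modify-write on the word that
 * holds bit `nr'. The caller must provide its own serialisation.
 * Returns the previous value of the bit (0 or 1).
 */
int test_and_set_bit_nonatomic(unsigned long nr, unsigned long *addr)
{
    assert(addr != NULL);
    unsigned long mask = 1UL << (nr % BITS_PER_ULONG);
    unsigned long *word = addr + nr / BITS_PER_ULONG;
    int old = (*word & mask) != 0;

    *word |= mask;
    return old;
}
```

The first call on a clear bit returns 0 and sets it. A second call on the same bit returns 1 and leaves the word unchanged.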
-
Andrew Morton authored
From: James Simmons <jsimmons@infradead.org>, Kronos <kronos@kronoz.cjb.net> Various fixes and enhancements to the monitor hardware detection code. The only driver that uses it is the radeon driver. Old EDID parsing code was very verbose, half of the patch addresses this (i.e. print lots of stuff iff DEBUG). The other big change is the FB_MODE_IS_* stuff: we really need a way to know the origin of a video mode. In this way we can select a video mode that comes from EDID instead of VESA or GTF. Drivers other than radeonfb won't be affected because they cannot (yet) get EDID from the monitor and don't use EDID related code.
-
Andrew Morton authored
From: Michel Marti <michel.marti@objectxp.com> The blkmtd driver oopses in add_device(). The following trivial patch fixes this.
-
Andrew Morton authored
From: Arjan van de Ven <arjanv@redhat.com> Readahead of raid0 was suboptimal; it read only 1 stride ahead. The problem with this is that while it will keep all spindles busy, it will not actually manage to make larger IO's, e.g. each disk would just do the chunk size IO. Doing at least 2 chunks is more than appropriate so that each spindle will get a chance to merge IO's. (Neil fixed raid5 and raid6 too)
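The arithmetic behind "at least 2 chunks" can be sketched as below. The helper and its parameters are hypothetical and only model the sizing. It is not the md driver's code.

```c
#include <assert.h>

/*
 * Readahead window, in pages, that covers `chunks_per_disk' chunks on
 * every member disk of a striped array. One stripe is one chunk per disk.
 */
unsigned long readahead_pages(unsigned int nr_disks,
                              unsigned long chunk_bytes,
                              unsigned long page_bytes,
                              unsigned int chunks_per_disk)
{
    assert(nr_disks > 0 && page_bytes > 0);
    unsigned long stripe_pages = nr_disks * chunk_bytes / page_bytes;

    return chunks_per_disk * stripe_pages;
}
```

For a 4-disk array with 64 KiB chunks and 4 KiB pages, one stripe is 64 pages. Reading two stripes ahead (128 pages) gives each disk two adjacent chunks that it has a chance to merge into one larger request.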
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> Someone added __attribute_used__ throughout module.h, but didn't remove the ", unused". Looks like some arch/gcc combos still consider it unused, and discard the fn.
-
Andrew Morton authored
From: Geert Uytterhoeven <geert@linux-m68k.org> Make CONFIG_NVRAM depend on the prerequisites that are explicitly checked for in drivers/char/nvram.c, or on CONFIG_GENERIC_NVRAM (for PPC).
-
Andrew Morton authored
From: Geert Uytterhoeven <geert@linux-m68k.org> Add missing include (needed for struct inode)
-
Andrew Morton authored
From: Marc-Christian Petersen <m.c.p@wolk-project.de> The attached patch is needed to stop showing us "Macintosh device drivers" for all architectures via menuconfig || xconfig || gconfig. It's only necessary for PPC and/or MAC. ACKed by benh.
-
Andrew Morton authored
From: Arjan van de Ven <arjanv@redhat.com> Several of the pte_chain_alloc() allocators that use GFP_ATOMIC have a fallback for failure that sleeps; they thus need to not warn on failure. Seen during a big fork on a busy system.
-
Andrew Morton authored
From: "Randy.Dunlap" <rddunlap@osdl.org> cciss_scsi_detect() can be called after init (for TAPE support).
-
Andrew Morton authored
From: Matt Domsch <Matt_Domsch@dell.com> Patch below from Patrick J. LoPresti and myself. Patrick describes:

Why this patch? The problem is that the legacy BIOS interface (INT13/AH=08) for querying the disk geometry returns different values than the extended INT13 interface which the EDD code currently uses. This is because the legacy interface only provides a 10-bit cylinder field, so modern BIOSes "lie" about the head/sector counts in order to make more of the disk visible within the first 1024 cylinders. Many non-Linux applications, including the stock Windows boot loader, DOS fdisk, etc., rely upon the legacy interface and geometry. So it is useful to be able to obtain the legacy values from a running Linux kernel.

What this patch does is to add new entries under /sys/firmware/edd/int13_devXX named "legacy_cylinders", "legacy_heads", and "legacy_sectors". These provide the geometry given by the legacy INT13/AH=08 BIOS interface, just like the current "default_cylinders" etc. provide the geometry given by the INT13/AH=48 interface. Without this patch, I cannot use Linux to partition a drive and install Windows, which happens to be my application. - Pat http://unattended.sourceforge.net/

In addition, this adds two buggy BIOS workarounds in the EDD int13 calls as suggested by Ralf Brown's interrupt list. I'm also interested in moving this code out of arch/i386/kernel/edd.c and include/asm-i386/edd.h, as I believe it is applicable on x86-64 as well. However, there's no good place under drivers/ to put edd.c when it's not tied to a bus, but to several CPU architectures and their firmwares... Maybe a new directory drivers/firmware?
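The 10-bit cylinder limit comes from how the legacy call packs its results into registers. As a rough sketch of that layout (not the patch's code; the struct and function names are invented here), INT13/AH=08 returns the low eight bits of the maximum cylinder number in CH, the top two bits in bits 7-6 of CL, the maximum sector number in bits 5-0 of CL, and the maximum head number in DH:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded legacy geometry (as counts). */
struct legacy_geometry {
    unsigned int cylinders;
    unsigned int heads;
    unsigned int sectors;
};

/*
 * Decode the CH, CL and DH register values returned by the legacy
 * INT13/AH=08 call. Cylinder and head numbers are zero-based maxima,
 * so the counts are one larger. Sector numbers start at 1, so the
 * maximum sector number is already the sectors-per-track count.
 */
struct legacy_geometry decode_int13_08(uint8_t ch, uint8_t cl, uint8_t dh)
{
    struct legacy_geometry g;
    unsigned int max_cyl = ch | ((unsigned int)(cl & 0xC0) << 2);

    g.cylinders = max_cyl + 1;
    g.heads = dh + 1u;
    g.sectors = cl & 0x3F;
    assert(g.cylinders <= 1024);   /* only 10 bits are available */
    return g;
}
```

With CH = 0xFF, CL = 0xFF and DH = 0xFE this decodes to 1024 cylinders, 255 heads and 63 sectors, the largest cylinder and sector counts the fields can express.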
-
Andrew Morton authored
sound/oss/wavfront.c: In function `wavefront_download_firmware':
sound/oss/wavfront.c:2524: warning: implicit declaration of function `sys_open'
sound/oss/wavfront.c:2533: warning: implicit declaration of function `sys_read'
sound/oss/wavfront.c:2582: warning: implicit declaration of function `sys_close'
-
Andrew Morton authored
From: Chris Mason <mason@suse.com> This patch fixes a problem we're hitting on ia64 with page sizes > 4k. When the page size is greater than the block size, and parts of the page fall past the end of the device, readpage will fail because blkdev_get_block returns -EIO for blocks past i_size. The attached patch changes blkdev_get_block to return holes when reading past the end of the device, which allows us to read that last valid 4k block and then fill the rest of the page with zeros. Writes will still fail with -EIO.
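The read-side behaviour can be modelled in user space as follows. This is only an illustration of "blocks past the end read as holes": the device is a byte array and `read_page_with_holes()` is a made-up helper, not blkdev_get_block() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill one page from a device image, block by block. Blocks that lie
 * wholly inside the device are copied. Blocks past the end are treated
 * as holes and zero-filled instead of failing the whole page.
 * Returns the number of bytes that came from the device.
 */
size_t read_page_with_holes(const unsigned char *dev, size_t dev_size,
                            size_t page_off, unsigned char *page,
                            size_t page_size, size_t block_size)
{
    assert(block_size > 0 && page_size % block_size == 0);
    size_t copied = 0;

    for (size_t off = 0; off < page_size; off += block_size) {
        size_t pos = page_off + off;

        if (pos + block_size <= dev_size) {
            memcpy(page + off, dev + pos, block_size);
            copied += block_size;
        } else {
            memset(page + off, 0, block_size);   /* hole */
        }
    }
    return copied;
}
```

With a device of three blocks and a page of four, the first three blocks are read normally and the fourth comes back as zeros, which mirrors the large-page ia64 case the message describes.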
-
Andrew Morton authored
From: vda <vda@port.imtp.ilyichevsk.odessa.ua> Add a missing test for the "root=/dev/ram" kernel boot option. It's just an alias for /dev/ram0, but it worked in 2.4...
-
Andrew Morton authored
From: Srivatsa Vaddagiri <vatsa@in.ibm.com> current_is_keventd() doesn't need to search across all the CPUs to identify itself.
-
Andrew Morton authored
From: Andi Kleen <ak@muc.de> I was debugging some code that corrupted the vma rb lists and for that I fixed validate_mm to not be recursive and do some more checks. It's slower now, but that shouldn't be a problem. Also make it non-static to allow easier checks elsewhere.
-