Commit a93be803 authored by Linus Torvalds

Linux 2.1.127pre2

I just found a case that could certainly result in endless page faults,
and an endless stream of __get_free_page() calls. It's been there forever,
and I basically thought it could never happen, but thinking about it some
more it can happen a lot more easily than I thought.

The problem is that the page fault handling code will give up if it cannot
allocate a page table entry. We have code in place to handle the final
page allocation failure, but the "mid-way" failures (running out of memory
while allocating the intermediate page-table levels) just returned silently,
and caused the page fault to be taken over and over again.

More importantly, this could happen from kernel mode when a system call
was trying to fill in a user page, in which case it wouldn't even be
interruptible.
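
To make the failure mode concrete, here is a minimal userspace sketch of the
retry semantics (hypothetical names, plain C, not the kernel API): the CPU
restarts a faulting instruction after the fault handler runs, so a handler
that can fail but has no way to say so just gets invoked again forever.

	#include <stdio.h>

	static int free_pages = 2;	/* pretend the allocator runs dry */

	/* Patched contract: 1 = fault handled, 0 = mid-way allocation failed. */
	static int handle_fault(void)
	{
		return free_pages-- > 0;
	}

	int main(void)
	{
		/* The CPU re-executes the faulting instruction, so an
		 * unhandled fault immediately faults again: model the
		 * restart as a loop.  Nothing here sleeps or checks
		 * signals, which is why a kernel-mode fault in this
		 * state wouldn't even be interruptible. */
		for (;;) {
			if (handle_fault()) {
				puts("fault handled, instruction restarts");
				continue;
			}
			/* A void handler gives the caller no way to see this
			 * case, and the loop never terminates.  With a return
			 * value the caller can give up (goto bad_area). */
			puts("allocation failed: caller can now bail out");
			return 1;
		}
	}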

It's really unlikely to happen (because the page tables tend to be set up
already), but I suspect it can be triggered by execve'ing a new process
which is not going to have any existing page tables. Even then we're
likely to have old pages available (the ones we free'd from the previous
process), but at least it doesn't sound impossible that this could be a
problem.

I've not seen this behaviour myself, but it could have caused Andrea's
problems, especially the harder-to-find ones. Andrea, can you check this
patch (against clean 2.1.126) out and see if it makes any difference to
your testing?

(Right now it uses the wrong error code: it will cause a SIGSEGV instead
of a SIGBUS when we run out of memory, but that's a small detail).
Essentially, instead of trying to call "oom()" and sending a signal (which
doesn't work for kernel level accesses anyway), the code returns the
proper return value from handle_mm_fault(), which allows the caller to do
the right thing (which can include following the exception tables). That
way we can handle the case of running out of memory from a kernel mode
access too..

(This is also why the fault gets the wrong signal - I didn't bother to fix
up the x86 fault handler all that much ;)
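
For completeness, the "right thing" in the arch fault handler would look
roughly like this; a hedged sketch with stubbed-out types (fault_info and
fault_disposition are made-up names for illustration, not the 2.1.x code):

	#include <signal.h>

	struct fault_info {		/* stand-in for what do_page_fault() knows */
		int handled;		/* what handle_mm_fault() returned */
		int user_mode;		/* did the fault come from user space? */
	};

	/*
	 * Returns the signal to deliver, or 0 if no signal is needed:
	 * either the fault was handled, or it was a kernel-mode access
	 * (e.g. a system call filling in a user page) that should be
	 * fixed up via the exception tables instead of a signal.
	 */
	static int fault_disposition(const struct fault_info *f)
	{
		if (f->handled)
			return 0;	/* fixed up; just restart the instruction */
		if (!f->user_mode)
			return 0;	/* kernel mode: follow the exception tables */
		return SIGBUS;		/* out of memory: SIGBUS, not SIGSEGV */
	}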
Btw, the reason I'm sending out these patches in emails instead of just
putting them on ftp.kernel.org is that the machine has had disk problems
for the last week, and finally gave up completely last Friday or so. So
ftp.kernel.org is down until we have a new raid array or the old one
magically recovers.  Sorry about the spamming.

                Linus
parent d7cc008e
@@ -151,7 +151,7 @@ int main(int argc, char ** argv)
 	if (setup_sectors < SETUP_SECTS)
 		setup_sectors = SETUP_SECTS;
 	fprintf(stderr, "Setup is %d bytes.\n", i);
-	memset(buf, sizeof(buf), 0);
+	memset(buf, 0, sizeof(buf));
 	while (i < setup_sectors * 512) {
 		c = setup_sectors * 512 - i;
 		if (c > sizeof(buf))
......
@@ -156,7 +156,14 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
 			goto bad_area;
 	}
-	handle_mm_fault(tsk, vma, address, write);
+	/*
+	 * If for any reason at all we couldn't handle the fault,
+	 * make sure we exit gracefully rather than endlessly redo
+	 * the fault.
+	 */
+	if (!handle_mm_fault(tsk, vma, address, write))
+		goto bad_area;
 
 	/*
 	 * Did it hit the DOS screen memory VA from vm86 mode?
......
@@ -5,7 +5,6 @@
  * Copyright (C) 1997 Ralf Baechle (ralf@gnu.org),
  * derived from r4xx0.c by David S. Miller (dm@engr.sgi.com).
  */
-#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
......
@@ -5,6 +5,7 @@
  * Copyright (C) 1996 David S. Miller (dm@engr.sgi.com)
  * Copyright (C) 1997, 1998 Ralf Baechle (ralf@gnu.org)
  */
+#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/kbd_ll.h>
 #include <linux/kernel.h>
......
......@@ -27,7 +27,6 @@
#include <linux/module.h>
#include <linux/config.h>
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
......
@@ -7,6 +7,8 @@
  *	Andrea Arcangeli
  *
  * based on work by Grant Guenther <grant@torque.net> and Phil Blundell.
+ *
+ * Cleaned up include files - Russell King <linux@arm.uk.linux.org>
  */
 
 /* This driver should work with any hardware that is broadly compatible
......@@ -557,12 +559,12 @@ static int programmable_irq_support(struct parport *pb)
static int irq_probe_ECP(struct parport *pb)
{
int irqs, i;
sti();
irqs = probe_irq_on();
parport_pc_write_econtrol(pb, 0x00); /* Reset FIFO */
parport_pc_write_econtrol(pb, 0xd0); /* TEST FIFO + nErrIntrEn */
parport_pc_write_econtrol(pb, 0x00); /* Reset FIFO */
parport_pc_write_econtrol(pb, 0xd0); /* TEST FIFO + nErrIntrEn */
/* If Full FIFO sure that WriteIntrThresold is generated */
for (i=0; i < 1024 && !(parport_pc_read_econtrol(pb) & 0x02) ; i++)
......
@@ -8,16 +8,11 @@
  *
  * based on work by Grant Guenther <grant@torque.net>
  *          and Philip Blundell
+ *
+ * Cleaned up include files - Russell King <linux@arm.uk.linux.org>
  */
 
-#include <linux/stddef.h>
-#include <linux/tasks.h>
-#include <linux/ctype.h>
-#include <asm/ptrace.h>
-#include <asm/io.h>
-#include <asm/dma.h>
-#include <asm/irq.h>
 #include <linux/sched.h>
 #include <linux/delay.h>
 #include <linux/errno.h>
 #include <linux/interrupt.h>
@@ -26,6 +21,11 @@
 #include <linux/malloc.h>
 #include <linux/proc_fs.h>
 #include <linux/parport.h>
+#include <linux/ctype.h>
+#include <asm/io.h>
+#include <asm/dma.h>
+#include <asm/irq.h>
 
 struct proc_dir_entry *base = NULL;
......
@@ -105,7 +105,7 @@ struct parport *parport_register_port(unsigned long base, int irq, int dma,
 	tmp->ops = ops;
 	tmp->number = portnum;
 	memset (&tmp->probe_info, 0, sizeof (struct parport_device_info));
-	spin_lock_init(&tmp->cad_lock);
+	tmp->cad_lock = RW_LOCK_UNLOCKED;
 	spin_lock_init(&tmp->waitlist_lock);
 	spin_lock_init(&tmp->pardevice_lock);
......
@@ -34,6 +34,7 @@
 	v1.10 4/21/97 Fixed module code so that multiple cards may be detected,
 		other cleanups. -djb
 	Andrea Arcangeli: Upgraded to Donald Becker's version 1.12.
+	Rick Payne: Fixed SMP race condition
 */
 
 static char *version = "3c509.c:1.12 6/4/97 becker@cesdis.gsfc.nasa.gov\n";
@@ -59,6 +60,7 @@ static char *version = "3c509.c:1.12 6/4/97 becker@cesdis.gsfc.nasa.gov\n";
 #include <linux/skbuff.h>
 #include <linux/delay.h>	/* for udelay() */
 
+#include <asm/spinlock.h>
 #include <asm/bitops.h>
 #include <asm/io.h>
@@ -122,6 +124,7 @@ enum RxFilter {
 struct el3_private {
 	struct enet_statistics stats;
 	struct device *next_dev;
+	spinlock_t lock;
 	/* skb send-queue */
 	int head, size;
 	struct sk_buff *queue[SKB_QUEUE_SIZE];
@@ -401,6 +404,9 @@ el3_open(struct device *dev)
 	outw(RxReset, ioaddr + EL3_CMD);
 	outw(SetStatusEnb | 0x00, ioaddr + EL3_CMD);
 
+	/* Set the spinlock before grabbing IRQ! */
+	((struct el3_private *)dev->priv)->lock = (spinlock_t) SPIN_LOCK_UNLOCKED;
+
 	if (request_irq(dev->irq, &el3_interrupt, 0, "3c509", dev)) {
 		return -EAGAIN;
 	}
@@ -520,6 +526,11 @@ el3_start_xmit(struct sk_buff *skb, struct device *dev)
 	if (test_and_set_bit(0, (void*)&dev->tbusy) != 0)
 		printk("%s: Transmitter access conflict.\n", dev->name);
 	else {
+		unsigned long flags;
+
+		/* Spin on the lock, until we're clear of an IRQ */
+		spin_lock_irqsave(&lp->lock, flags);
+
 		/* Put out the doubleword header... */
 		outw(skb->len, ioaddr + TX_FIFO);
 		outw(0x00, ioaddr + TX_FIFO);
@@ -536,6 +547,8 @@ el3_start_xmit(struct sk_buff *skb, struct device *dev)
 		} else
 			/* Interrupt us when the FIFO has room for max-sized packet. */
 			outw(SetTxThreshold + 1536, ioaddr + EL3_CMD);
+
+		spin_unlock_irqrestore(&lp->lock, flags);
 	}
 
 	dev_kfree_skb (skb);
@@ -560,6 +573,7 @@ static void
 el3_interrupt(int irq, void *dev_id, struct pt_regs *regs)
 {
 	struct device *dev = (struct device *)dev_id;
+	struct el3_private *lp;
 	int ioaddr, status;
 	int i = INTR_WORK;
@@ -568,6 +582,9 @@ el3_interrupt(int irq, void *dev_id, struct pt_regs *regs)
 		return;
 	}
 
+	lp = (struct el3_private *)dev->priv;
+	spin_lock(&lp->lock);
+
 	if (dev->interrupt)
 		printk("%s: Re-entering the interrupt handler.\n", dev->name);
 	dev->interrupt = 1;
@@ -629,7 +646,7 @@ el3_interrupt(int irq, void *dev_id, struct pt_regs *regs)
 		printk("%s: exiting interrupt, status %4.4x.\n", dev->name,
 			   inw(ioaddr + EL3_STATUS));
 	}
+	spin_unlock(&lp->lock);
 	dev->interrupt = 0;
 	return;
 }
......
@@ -55,7 +55,6 @@ static const char *version =
 "ne2.c:v0.90 Oct 14 1998 David Weinehall <tao@acc.umu.se>\n";
 
 #include <linux/module.h>
 #include <linux/config.h>
-#include <linux/version.h>
 #include <linux/kernel.h>
......
Wed Oct 21 21:00 1998 Gerard Roudier (groudier@club-internet.fr)
* revision 3.1a
- Changes from Eddie Dost for Sparc and Alpha:
ioremap/iounmap support for Sparc.
pcivtophys changed to bus_dvma_to_phys.
- Add the 53c876 description to the chip table. This is only useful
for printing the right name of the controller.
- DEL-441 Item 2 work-around for the 53c876 rev <= 5 (0x15).
- Add additional checking of INQUIRY data:
Check INQUIRY data received length is at least 7. Byte 7 of
inquiry data contains device features bits and the driver might
be confused by garbage. Also check peripheral qualifier.
- Cleanup of the SCSI tasks management:
Remove the special case for 32 tags. Now the driver only uses the
scheme that allows up to 64 tags per LUN.
Merge some code from the 896 driver.
Use a 1,3,5,...MAXTAGS*2+1 tag numbering. Previous driver could
use any tag number from 1 to 253 and some non-conformant devices
might have problems with large tag numbers.
- 'no_sync' changed to 'no_disc' in the README file. This is an old
and trivial mistake that seems to demonstrate the README file is
not often read. :)
Sun Oct 4 14:00 1998 Gerard Roudier (groudier@club-internet.fr)
* revision 3.0i
- Cosmetic changes for sparc (but not for the driver) that needs
......
@@ -4,7 +4,7 @@ Written by Gerard Roudier <groudier@club-internet.fr>
 21 Rue Carnot
 95170 DEUIL LA BARRE - FRANCE
 
-27 June 1998
+18 October 1998
 ===============================================================================
 
 1. Introduction
@@ -21,7 +21,7 @@ Written by Gerard Roudier <groudier@club-internet.fr>
 8.4 Set order type for tagged command
 8.5 Set debug mode
 8.6 Clear profile counters
-8.7 Set flag (no_sync)
+8.7 Set flag (no_disc)
 8.8 Set verbose level
 9. Configuration parameters
 10. Boot setup commands
@@ -424,7 +424,7 @@ Available commands:
 The "clearprof" command allows you to clear these counters at any time.
 
-8.7 Set flag (no_sync)
+8.7 Set flag (no_disc)
 
 setflag <target> <flag>
@@ -432,11 +432,11 @@ Available commands:
 For the moment, only one flag is available:
 
-    no_sync: not allow target to disconnect.
+    no_disc: not allow target to disconnect.
 
 Do not specify any flag in order to reset the flag. For example:
 - setflag 4
-  will reset no_sync flag for target 4, so will allow it disconnections.
+  will reset no_disc flag for target 4, so will allow it disconnections.
 - setflag all
   will allow disconnection for all devices on the SCSI bus.
@@ -1067,7 +1067,7 @@ Try to enable one feature at a time with control commands. For example:
 Will enable fast synchronous data transfer negotiation for all targets.
 
 - echo "setflag 3" >/proc/scsi/ncr53c8xx/0
-  Will reset flags (no_sync) for target 3, and so will allow it to disconnect
+  Will reset flags (no_disc) for target 3, and so will allow it to disconnect
   the SCSI Bus.
 
 - echo "settags 3 8" >/proc/scsi/ncr53c8xx/0
......
This diff is collapsed.
@@ -45,7 +45,7 @@
 /*
 **	Name and revision of the driver
 */
-#define SCSI_NCR_DRIVER_NAME	"ncr53c8xx - revision 3.0i"
+#define SCSI_NCR_DRIVER_NAME	"ncr53c8xx - revision 3.1a"
 
 /*
 **	Check supported Linux versions
@@ -468,7 +468,10 @@ typedef struct {
 	{PCI_DEVICE_ID_NCR_53C875, 0x01, "875", 6, 16, 5, \
 	 FE_WIDE|FE_ULTRA|FE_CLK80|FE_CACHE0_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM}\
 	, \
-	{PCI_DEVICE_ID_NCR_53C875, 0xff, "875", 6, 16, 5, \
+	{PCI_DEVICE_ID_NCR_53C875, 0x0f, "875", 6, 16, 5, \
 	 FE_WIDE|FE_ULTRA|FE_DBLR|FE_CACHE0_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM}\
 	, \
+	{PCI_DEVICE_ID_NCR_53C875, 0xff, "876", 6, 16, 5, \
+	 FE_WIDE|FE_ULTRA|FE_DBLR|FE_CACHE0_SET|FE_BOF|FE_DFS|FE_LDSTR|FE_PFEN|FE_RAM}\
+	, \
 	{PCI_DEVICE_ID_NCR_53C875J,0xff, "875J", 6, 16, 5, \
......
@@ -65,7 +65,8 @@
 #define SD_MINOR_NUMBER(i)	((i) & 255)
 #define MKDEV_SD_PARTITION(i)	MKDEV(SD_MAJOR_NUMBER(i), (i) & 255)
 #define MKDEV_SD(index)		MKDEV_SD_PARTITION((index) << 4)
-#define N_USED_SD_MAJORS	((sd_template.dev_max + SCSI_DISKS_PER_MAJOR - 1) / SCSI_DISKS_PER_MAJOR)
+#define N_USED_SCSI_DISKS	(sd_template.dev_max + SCSI_DISKS_PER_MAJOR - 1)
+#define N_USED_SD_MAJORS	(N_USED_SCSI_DISKS / SCSI_DISKS_PER_MAJOR)
 
 #define MAX_RETRIES 5
@@ -1765,7 +1766,7 @@ void cleanup_module( void)
 	scsi_unregister_module(MODULE_SCSI_DEV, &sd_template);
 
 	for (i=0; i <= sd_template.dev_max / SCSI_DISKS_PER_MAJOR; i++)
-	unregister_blkdev(SD_MAJOR(i),"sd");
+		unregister_blkdev(SD_MAJOR(i),"sd");
 
 	sd_registered--;
 	if( rscsi_disks != NULL )
@@ -1783,13 +1784,13 @@ void cleanup_module( void)
 	for (sdgd = gendisk_head; sdgd; sdgd = sdgd->next)
 	{
-		if (sdgd->next >= sd_gendisks && sdgd->next <= LAST_SD_GENDISK)
+		if (sdgd->next >= sd_gendisks && sdgd->next <= LAST_SD_GENDISK.max_nr)
 			removed++, sdgd->next = sdgd->next->next;
 		else sdgd = sdgd->next;
 	}
-	if (removed != N_USED_SCSI_DISKS)
+	if (removed != N_USED_SD_MAJORS)
 		printk("%s %d sd_gendisks in disk chain",
-			removed > N_USED_SCSI_DISKS ? "total" : "just", removed);
+			removed > N_USED_SD_MAJORS ? "total" : "just", removed);
 }
......
@@ -101,6 +101,7 @@
 /*****************************************************************************/
 
 #include <linux/config.h>
+#include <linux/version.h>
 #include <linux/module.h>
 #include <linux/string.h>
......
@@ -626,7 +626,7 @@ struct dentry * open_namei(const char * pathname, int flag, int mode)
 		if (!inode)
 			goto exit;
 
-		error = -EACCES;
+		error = -ELOOP;
 		if (S_ISLNK(inode->i_mode))
 			goto exit;
......
@@ -14,6 +14,7 @@
  * Copyright (C) 1995, 1996, 1997 Olaf Kirch <okir@monad.swb.de>
  */
 
 #include <linux/config.h>
+#include <linux/version.h>
 #include <linux/sched.h>
 #include <linux/errno.h>
......
@@ -49,7 +49,6 @@ utf8_mbtowc(__u16 *p, const __u8 *s, int n)
 	int c0, c, nc;
 	struct utf8_table *t;
 
-	printk("utf8_mbtowc\n");
 	nc = 0;
 	c0 = *s;
 	l = c0;
@@ -80,11 +79,9 @@ utf8_mbstowcs(__u16 *pwcs, const __u8 *s, int n)
 	const __u8 *ip;
 	int size;
 
-	printk("\nutf8_mbstowcs: n=%d\n", n);
 	op = pwcs;
 	ip = s;
 	while (*ip && n > 0) {
-		printk(" %02x", *ip);
 		if (*ip & 0x80) {
 			size = utf8_mbtowc(op, ip, n);
 			if (size == -1) {
......
@@ -130,22 +130,20 @@ int do_select(int n, fd_set_buffer *fds, unsigned long timeout)
 	int retval;
 	int i;
 
-	lock_kernel();
 	wait = NULL;
 	current->timeout = timeout;
 	if (timeout) {
-		struct poll_table_entry *entry = (struct poll_table_entry *)
-			__get_free_page(GFP_KERNEL);
-		if (!entry) {
-			retval = -ENOMEM;
-			goto out_nowait;
-		}
+		struct poll_table_entry *entry = (struct poll_table_entry *) __get_free_page(GFP_KERNEL);
+		if (!entry)
+			return -ENOMEM;
 		wait_table.nr = 0;
 		wait_table.entry = entry;
 		wait = &wait_table;
 	}
+	lock_kernel();
 	retval = max_select_fd(n, fds);
 	if (retval < 0)
 		goto out;
......
@@ -11,7 +11,6 @@
 #ifndef __ASM_MIPS_FLOPPY_H
 #define __ASM_MIPS_FLOPPY_H
 
-#include <linux/config.h>
 #include <asm/bootinfo.h>
 #include <asm/jazz.h>
 #include <asm/jazzdma.h>
......
@@ -272,7 +272,7 @@ extern int remap_page_range(unsigned long from, unsigned long to, unsigned long
 extern int zeromap_page_range(unsigned long from, unsigned long size, pgprot_t prot);
 extern void vmtruncate(struct inode * inode, unsigned long offset);
-extern void handle_mm_fault(struct task_struct *tsk,struct vm_area_struct *vma, unsigned long address, int write_access);
+extern int handle_mm_fault(struct task_struct *tsk,struct vm_area_struct *vma, unsigned long address, int write_access);
 extern void make_pages_present(unsigned long addr, unsigned long end);
 
 extern int pgt_cache_water[2];
@@ -329,18 +329,11 @@ extern void put_cached_page(unsigned long);
  */
 extern int free_memory_available(void);
 extern struct task_struct * kswapd_task;
-extern inline void kswapd_notify(unsigned int gfp_mask)
-{
-	if (kswapd_task) {
-		wake_up_process(kswapd_task);
-		if (gfp_mask & __GFP_WAIT) {
-			current->policy |= SCHED_YIELD;
-			schedule();
-		}
-	}
-}
+#define wakeup_kswapd() do { \
+	if (kswapd_task->state & TASK_INTERRUPTIBLE) \
+		wake_up_process(kswapd_task); \
+} while (0)
 
 /* vma is the first one with address < vma->vm_end,
  * and even address < vma->vm_start. Have to extend vma. */
 static inline int expand_stack(struct vm_area_struct * vma, unsigned long address)
......
@@ -83,7 +83,6 @@ static inline void remove_page_from_hash_queue(struct page * page)
 static inline void __add_page_to_hash_queue(struct page * page, struct page **p)
 {
 	page_cache_size++;
-	set_bit(PG_referenced, &page->flags);
 	page->age = PAGE_AGE_VALUE;
 	if((page->next_hash = *p) != NULL)
 		(*p)->pprev_hash = &page->next_hash;
......
@@ -208,7 +208,7 @@ struct parport {
 	int number;		/* port index - the `n' in `parportn' */
 	spinlock_t pardevice_lock;
 	spinlock_t waitlist_lock;
-	spinlock_t cad_lock;
+	rwlock_t cad_lock;
 };
/* parport_register_port registers a new parallel port at the given address (if
......
@@ -293,7 +293,7 @@ static inline void add_to_page_cache(struct page * page,
 	struct page **hash)
 {
 	atomic_inc(&page->count);
-	page->flags &= ~((1 << PG_uptodate) | (1 << PG_error));
+	page->flags = (page->flags & ~((1 << PG_uptodate) | (1 << PG_error))) | (1 << PG_referenced);
 	page->offset = offset;
 	add_page_to_inode_queue(inode, page);
 	__add_page_to_hash_queue(page, hash);
@@ -328,7 +328,6 @@ static unsigned long try_to_read_ahead(struct file * file,
 	 */
 	page = mem_map + MAP_NR(page_cache);
 	add_to_page_cache(page, inode, offset, hash);
-	set_bit(PG_referenced, &page->flags);
 	inode->i_op->readpage(file, page);
 	page_cache = 0;
 }
......
@@ -629,7 +629,7 @@ unsigned long put_dirty_page(struct task_struct * tsk, unsigned long page, unsig
  * change only once the write actually happens. This avoids a few races,
  * and potentially makes it more efficient.
  */
-static void do_wp_page(struct task_struct * tsk, struct vm_area_struct * vma,
+static int do_wp_page(struct task_struct * tsk, struct vm_area_struct * vma,
 	unsigned long address, pte_t *page_table)
 {
 	pte_t pte;
@@ -665,30 +665,31 @@ static void do_wp_page(struct task_struct * tsk, struct vm_area_struct * vma,
 		set_pte(page_table, pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot))));
 		free_page(old_page);
 		flush_tlb_page(vma, address);
-		return;
+		return 1;
 	}
 	flush_cache_page(vma, address);
 	set_pte(page_table, BAD_PAGE);
 	flush_tlb_page(vma, address);
 	free_page(old_page);
 	oom(tsk);
-	return;
+	return 0;
 }
 	if (PageSwapCache(page_map))
 		delete_from_swap_cache(page_map);
 	flush_cache_page(vma, address);
 	set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
 	flush_tlb_page(vma, address);
-	return;
+end_wp_page:
+	if (new_page)
+		free_page(new_page);
+	return 1;
 
 bad_wp_page:
 	printk("do_wp_page: bogus page at address %08lx (%08lx)\n",address,old_page);
 	send_sig(SIGKILL, tsk, 1);
-end_wp_page:
 	if (new_page)
 		free_page(new_page);
-	return;
+	return 0;
 }
/*
......@@ -777,30 +778,50 @@ void vmtruncate(struct inode * inode, unsigned long offset)
}
static inline void do_swap_page(struct task_struct * tsk,
static int do_swap_page(struct task_struct * tsk,
struct vm_area_struct * vma, unsigned long address,
pte_t * page_table, pte_t entry, int write_access)
{
pte_t page;
lock_kernel();
if (!vma->vm_ops || !vma->vm_ops->swapin) {
swap_in(tsk, vma, page_table, pte_val(entry), write_access);
flush_page_to_ram(pte_page(*page_table));
return;
} else {
pte_t page = vma->vm_ops->swapin(vma, address - vma->vm_start + vma->vm_offset, pte_val(entry));
if (pte_val(*page_table) != pte_val(entry)) {
free_page(pte_page(page));
} else {
if (atomic_read(&mem_map[MAP_NR(pte_page(page))].count) > 1 &&
!(vma->vm_flags & VM_SHARED))
page = pte_wrprotect(page);
++vma->vm_mm->rss;
++tsk->maj_flt;
flush_page_to_ram(pte_page(page));
set_pte(page_table, page);
}
}
page = vma->vm_ops->swapin(vma, address - vma->vm_start + vma->vm_offset, pte_val(entry));
if (pte_val(*page_table) != pte_val(entry)) {
free_page(pte_page(page));
return;
unlock_kernel();
return 1;
}
/*
* This only needs the MM semaphore
*/
static int do_anonymous_page(struct task_struct * tsk, struct vm_area_struct * vma, pte_t *page_table, int write_access)
{
pte_t entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
if (write_access) {
unsigned long page = __get_free_page(GFP_KERNEL);
if (!page)
return 0;
clear_page(page);
entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
vma->vm_mm->rss++;
tsk->min_flt++;
flush_page_to_ram(page);
}
if (atomic_read(&mem_map[MAP_NR(pte_page(page))].count) > 1 &&
!(vma->vm_flags & VM_SHARED))
page = pte_wrprotect(page);
++vma->vm_mm->rss;
++tsk->maj_flt;
flush_page_to_ram(pte_page(page));
set_pte(page_table, page);
return;
put_page(page_table, entry);
return 1;
}
/*
@@ -811,26 +832,33 @@ static inline void do_swap_page(struct task_struct * tsk,
  *
  * As this is called only for pages that do not currently exist, we
  * do not need to flush old virtual caches or the TLB.
+ *
+ * This is called with the MM semaphore held, but without the kernel
+ * lock.
  */
-static void do_no_page(struct task_struct * tsk, struct vm_area_struct * vma,
-	unsigned long address, int write_access, pte_t *page_table, pte_t entry)
+static int do_no_page(struct task_struct * tsk, struct vm_area_struct * vma,
+	unsigned long address, int write_access, pte_t *page_table)
 {
 	unsigned long page;
+	pte_t entry;
 
-	if (!pte_none(entry))
-		goto swap_page;
-	address &= PAGE_MASK;
 	if (!vma->vm_ops || !vma->vm_ops->nopage)
-		goto anonymous_page;
+		return do_anonymous_page(tsk, vma, page_table, write_access);
 
 	/*
 	 * The third argument is "no_share", which tells the low-level code
 	 * to copy, not share the page even if sharing is possible. It's
-	 * essentially an early COW detection
+	 * essentially an early COW detection.
+	 *
+	 * We need to grab the kernel lock for this..
 	 */
-	page = vma->vm_ops->nopage(vma, address,
+	lock_kernel();
+	page = vma->vm_ops->nopage(vma, address & PAGE_MASK,
 		(vma->vm_flags & VM_SHARED)?0:write_access);
+	unlock_kernel();
 	if (!page)
-		goto sigbus;
+		return 0;
 
 	++tsk->maj_flt;
 	++vma->vm_mm->rss;
 	/*
@@ -852,32 +880,7 @@ static void do_no_page(struct task_struct * tsk, struct vm_area_struct * vma,
 		entry = pte_wrprotect(entry);
 	put_page(page_table, entry);
 
 	/* no need to invalidate: a not-present page shouldn't be cached */
-	return;
-
-anonymous_page:
-	entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
-	if (write_access) {
-		unsigned long page = __get_free_page(GFP_KERNEL);
-		if (!page)
-			goto sigbus;
-		clear_page(page);
-		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
-		vma->vm_mm->rss++;
-		tsk->min_flt++;
-		flush_page_to_ram(page);
-	}
-	put_page(page_table, entry);
-	return;
-
-sigbus:
-	force_sig(SIGBUS, current);
-	put_page(page_table, BAD_PAGE);
-	/* no need to invalidate, wasn't present */
-	return;
-
-swap_page:
-	do_swap_page(tsk, vma, address, page_table, entry, write_access);
-	return;
+	return 1;
 }
 
 /*
@@ -889,54 +892,54 @@ static void do_no_page(struct task_struct * tsk, struct vm_area_struct * vma,
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  */
-static inline void handle_pte_fault(struct task_struct *tsk,
+static inline int handle_pte_fault(struct task_struct *tsk,
 	struct vm_area_struct * vma, unsigned long address,
 	int write_access, pte_t * pte)
 {
 	pte_t entry = *pte;
 
 	if (!pte_present(entry)) {
-		do_no_page(tsk, vma, address, write_access, pte, entry);
-		return;
+		if (pte_none(entry))
+			return do_no_page(tsk, vma, address, write_access, pte);
+		return do_swap_page(tsk, vma, address, pte, entry, write_access);
 	}
 
 	entry = pte_mkyoung(entry);
 	set_pte(pte, entry);
 	flush_tlb_page(vma, address);
 	if (!write_access)
-		return;
+		return 1;
 	if (pte_write(entry)) {
 		entry = pte_mkdirty(entry);
 		set_pte(pte, entry);
 		flush_tlb_page(vma, address);
-		return;
+		return 1;
 	}
-	do_wp_page(tsk, vma, address, pte);
+	return do_wp_page(tsk, vma, address, pte);
 }
 
 /*
  * By the time we get here, we already hold the mm semaphore
 */
-void handle_mm_fault(struct task_struct *tsk, struct vm_area_struct * vma,
+int handle_mm_fault(struct task_struct *tsk, struct vm_area_struct * vma,
 	unsigned long address, int write_access)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = pgd_offset(vma->vm_mm, address);
 	pmd = pmd_alloc(pgd, address);
-	if (!pmd)
-		goto no_memory;
-	pte = pte_alloc(pmd, address);
-	if (!pte)
-		goto no_memory;
-	lock_kernel();
-	handle_pte_fault(tsk, vma, address, write_access, pte);
-	unlock_kernel();
-	update_mmu_cache(vma, address, *pte);
-	return;
-no_memory:
-	oom(tsk);
+	if (pmd) {
+		pte_t * pte = pte_alloc(pmd, address);
+		if (pte) {
+			if (handle_pte_fault(tsk, vma, address, write_access, pte)) {
+				update_mmu_cache(vma, address, *pte);
+				return 1;
+			}
+		}
+	}
+	return 0;
 }
/*
......
@@ -269,11 +269,16 @@ unsigned long __get_free_pages(int gfp_mask, unsigned long order)
 
 	/*
 	 * If we failed to find anything, we'll return NULL, but we'll
-	 * wake up kswapd _now_ and even wait for it synchronously if
-	 * we can.. This way we'll at least make some forward progress
+	 * wake up kswapd _now_ and even yield to it if we can..
+	 * This way we'll at least make some forward progress
 	 * over time.
 	 */
-	kswapd_notify(gfp_mask);
+	wakeup_kswapd();
+	if (gfp_mask & __GFP_WAIT) {
+		current->policy |= SCHED_YIELD;
+		schedule();
+	}
+
 nopage:
 	return 0;
 }
......
@@ -118,8 +118,13 @@ static inline int try_to_swap_out(struct task_struct * tsk, struct vm_area_struc
 	}
 
 	if (pte_young(pte)) {
+		/*
+		 * Transfer the "accessed" bit from the page
+		 * tables to the global page map.
+		 */
 		set_pte(page_table, pte_mkold(pte));
-		touch_page(page_map);
+		set_bit(PG_referenced, &page_map->flags);
+
 		/*
 		 * We should test here to see if we want to recover any
 		 * swap cache page here. We do this if the page seeing
@@ -132,10 +137,6 @@ static inline int try_to_swap_out(struct task_struct * tsk, struct vm_area_struc
 		return 0;
 	}
 
-	age_page(page_map);
-	if (page_map->age)
-		return 0;
-
 	if (pte_dirty(pte)) {
 		if (vma->vm_ops && vma->vm_ops->swapout) {
 			pid_t pid = tsk->pid;
@@ -305,8 +306,9 @@ static inline int swap_out_pgd(struct task_struct * tsk, struct vm_area_struct *
 }
 
 static int swap_out_vma(struct task_struct * tsk, struct vm_area_struct * vma,
-	pgd_t *pgdir, unsigned long start, int gfp_mask)
+	unsigned long address, int gfp_mask)
 {
+	pgd_t *pgdir;
 	unsigned long end;
 
 	/* Don't swap out areas like shared memory which have their
@@ -314,12 +316,14 @@ static int swap_out_vma(struct task_struct * tsk, struct vm_area_struct * vma,
 	if (vma->vm_flags & (VM_SHM | VM_LOCKED))
 		return 0;
 
+	pgdir = pgd_offset(tsk->mm, address);
+
 	end = vma->vm_end;
-	while (start < end) {
-		int result = swap_out_pgd(tsk, vma, pgdir, start, end, gfp_mask);
+	while (address < end) {
+		int result = swap_out_pgd(tsk, vma, pgdir, address, end, gfp_mask);
 		if (result)
 			return result;
-		start = (start + PGDIR_SIZE) & PGDIR_MASK;
+		address = (address + PGDIR_SIZE) & PGDIR_MASK;
 		pgdir++;
 	}
 	return 0;
@@ -339,22 +343,23 @@ static int swap_out_process(struct task_struct * p, int gfp_mask)
 	 * Find the proper vm-area
 	 */
 	vma = find_vma(p->mm, address);
-	if (!vma) {
-		p->swap_address = 0;
-		return 0;
+	if (vma) {
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		for (;;) {
+			int result = swap_out_vma(p, vma, address, gfp_mask);
+			if (result)
+				return result;
+			vma = vma->vm_next;
+			if (!vma)
+				break;
+			address = vma->vm_start;
+		}
 	}
-	if (address < vma->vm_start)
-		address = vma->vm_start;
-
-	for (;;) {
-		int result = swap_out_vma(p, vma, pgd_offset(p->mm, address), address, gfp_mask);
-		if (result)
-			return result;
-		vma = vma->vm_next;
-		if (!vma)
-			break;
-		address = vma->vm_start;
-	}
 
 	/* We didn't find anything for the process */
+	p->swap_cnt = 0;
 	p->swap_address = 0;
 	return 0;
 }
@@ -415,20 +420,12 @@ static int swap_out(unsigned int priority, int gfp_mask)
 		}
 		pbest->swap_cnt--;
 
-		switch (swap_out_process(pbest, gfp_mask)) {
-		case 0:
-			/*
-			 * Clear swap_cnt so we don't look at this task
-			 * again until we've tried all of the others.
-			 * (We didn't block, so the task is still here.)
-			 */
-			pbest->swap_cnt = 0;
-			break;
-		case 1:
+		/*
+		 * Nonzero means we cleared out something, but only "1" means
+		 * that we actually free'd up a page as a result.
+		 */
+		if (swap_out_process(pbest, gfp_mask) == 1)
 			return 1;
-		default:
-			break;
-		};
 	}
 out:
 	return 0;
@@ -540,7 +537,7 @@ int kswapd(void *unused)
 	init_swap_timer();
 	kswapd_task = current;
 	while (1) {
-		int tries;
+		unsigned long start_time;
 
 		current->state = TASK_INTERRUPTIBLE;
 		flush_signals(current);
@@ -548,36 +545,12 @@ int kswapd(void *unused)
 		schedule();
 		swapstats.wakeups++;
 
-		/*
-		 * Do the background pageout: be
-		 * more aggressive if we're really
-		 * low on free memory.
-		 *
-		 * We try page_daemon.tries_base times, divided by
-		 * an 'urgency factor'. In practice this will mean
-		 * a value of pager_daemon.tries_base / 8 or 4 = 64
-		 * or 128 pages at a time.
-		 * This gives us 64 (or 128) * 4k * 4 (times/sec) =
-		 * 1 (or 2) MB/s swapping bandwidth in low-priority
-		 * background paging. This number rises to 8 MB/s
-		 * when the priority is highest (but then we'll be
-		 * woken up more often and the rate will be even
-		 * higher).
-		 */
-		tries = pager_daemon.tries_base;
-		tries >>= 4*free_memory_available();
-
+		start_time = jiffies;
 		do {
 			do_try_to_free_page(0);
-			/*
-			 * Syncing large chunks is faster than swapping
-			 * synchronously (less head movement). -- Rik.
-			 */
-			if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster)
-				run_task_queue(&tq_disk);
 			if (free_memory_available() > 1)
 				break;
-		} while (--tries > 0);
+		} while (jiffies != start_time);
 	}
 	/* As if we could ever get here - maybe we want to make this killable */
 	kswapd_task = NULL;
@@ -53,6 +53,9 @@
 #
 # 090398 Axel Boldt (boldt@math.ucsb.edu) - allow for empty lines in help
 #        texts.
+#
+# 102598 Michael Chastain (mec@shout.net) - put temporary files in
+#        current directory, not in /tmp.
 #
 # Make sure we're really running bash.
@@ -506,9 +509,9 @@ if [ -f $DEFAULTS ]; then
 	echo "# Using defaults found in" $DEFAULTS
 	echo "#"
 	. $DEFAULTS
-	sed -e 's/# \(.*\) is not.*/\1=n/' < $DEFAULTS > /tmp/conf.$$
-	. /tmp/conf.$$
-	rm /tmp/conf.$$
+	sed -e 's/# \(.*\) is not.*/\1=n/' < $DEFAULTS > .config-is-not.$$
+	. .config-is-not.$$
+	rm .config-is-not.$$
 else
 	echo "#"
 	echo "# No defaults found"
......