Commit 93d33a48 authored by Andrew Morton, committed by Linus Torvalds

[PATCH] laptop mode

From: Bart Samwel <bart@samwel.tk>

Adds /proc/sys/vm/laptop_mode: a special knob which says "this is a laptop".
In this mode the kernel will attempt to avoid spinning disks up.

Algorithm: the idea is to hold dirty data in memory for a long time, but to
flush everything which has been accumulated if the disk happens to spin up
for other reasons.

- Whenever a disk request completes (read or write), schedule a timer a few
  seconds hence.  If the timer was already pending, reset it to a few seconds
  hence.

- When the timer expires, write back the whole world.  We use
  sync_filesystems() for this because it will force ext3 journal commits as
  well.

- In balance_dirty_pages(), kick off background writeback when we hit the
  high threshold (dirty_ratio), not when we hit the low threshold.  This has
  the effect of causing "lumpy" writeback which is something I spent a year
  fixing, but in laptop mode, it is desirable.

- In try_to_free_pages(), only kick pdflush if the VM is getting into
  distress: we want to keep scanning for clean pages, deferring writeback.

- In page reclaim, avoid writing back the odd random dirty page off the
  LRU: only start I/O if the scanning is working harder.

The effect is to perform a sync() a few seconds after all I/O has ceased.

The value written into /proc/sys/vm/laptop_mode determines, in
seconds, the delay between the final I/O and the flush.
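
As a rough illustration of the timer behaviour described above (not part of
the patch), the following shell sketch models when the flush would fire for a
given delay and a list of I/O completion times. The function name flush_times
and its arguments are invented for this example. It ignores the I/O generated
by the flush itself, which the patch handles by cancelling the timer in
laptop_sync_completion().

```shell
#!/bin/sh
# Hypothetical model of the laptop-mode timer, for illustration only.
# $1            = delay in seconds (the value written to laptop_mode)
# remaining args = I/O completion times in seconds, in ascending order
flush_times () {
    delay=$1
    shift
    deadline=""
    for t in "$@"; do
        # A completion later than the pending deadline means the timer
        # has already expired and the flush has happened.
        if [ -n "$deadline" ] && [ "$t" -gt "$deadline" ]; then
            echo "flush at ${deadline}s"
        fi
        # Every completion (re)arms the timer, as mod_timer() does.
        deadline=$((t + delay))
    done
    # The last armed timer eventually expires as well.
    if [ -n "$deadline" ]; then
        echo "flush at ${deadline}s"
    fi
}

# I/O at 0s, 2s and 4s keeps pushing the flush back; it fires at 9s.
# A later read at 20s schedules another flush at 25s.
flush_times 5 0 2 4 20
```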

Additionally, the patch adds tools which help answer the question "why the
heck does my disk spin up all the time?".  The user may set
/proc/sys/vm/block_dump to a non-zero value and the kernel will print out
information which will identify the process which is performing disk reads or
which is dirtying pagecache.

The user should probably disable syslogd before setting block_dump.
parent 77fe0a19

How to conserve battery power using laptop-mode
-----------------------------------------------

Document Author: Bart Samwel (bart@samwel.tk)
Date created: January 2, 2004
Last modified: April 3, 2004

Introduction
------------

Laptop mode is used to minimize the time that the hard disk needs to be spun up,
in order to conserve battery power on laptops. Users have reported significant
power savings.

Contents
--------

* Introduction
* The short story
* Caveats
* The details
* Tips & Tricks
* Control script
* ACPI integration
* Monitoring tool

The short story
---------------

If you just want to use it, run the laptop_mode control script (which is included
in the "Control script" section of this document) as follows:

    # laptop_mode start

Then set your harddisk spindown time to a relatively low value with hdparm:

    hdparm -S 4 /dev/hda

The value -S 4 means 20 seconds idle time before spindown. Your harddisk will
now only spin up when a disk cache miss occurs, or at least once every 10
minutes to write back any pending changes.

To stop laptop_mode, run "laptop_mode stop".
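
The -S value is an encoded timeout rather than a number of seconds. The sketch
below decodes it following the encoding described in the hdparm manual page
(1 to 240 are multiples of 5 seconds, 241 to 251 are units of 30 minutes). The
helper name spindown_seconds is made up for this example, and the remaining
special values are simply reported as "special".

```shell
#!/bin/sh
# Hypothetical helper: translate an "hdparm -S" value into seconds.
spindown_seconds () {
    v=$1
    if [ "$v" -eq 0 ]; then
        echo "disabled"                    # 0 turns the timeout off
    elif [ "$v" -le 240 ]; then
        echo $((v * 5))                    # 1..240: multiples of 5 seconds
    elif [ "$v" -le 251 ]; then
        echo $(((v - 240) * 30 * 60))      # 241..251: units of 30 minutes
    else
        echo "special"                     # 252..255: see the hdparm manual
    fi
}

spindown_seconds 4      # value used in this document for battery operation
spindown_seconds 244    # value used by the ACPI example for AC operation
```

The two calls print 20 and 7200, i.e. 20 seconds and 2 hours.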

Caveats
-------

* The downside of laptop mode is that you have a chance of losing up
  to 10 minutes of work. If you cannot afford this, don't use it! It's
  wise to turn OFF laptop mode when you're almost out of battery --
  although this will make the battery run out faster, at least you'll
  lose less work when it actually runs out. I'm still looking for someone
  to submit instructions on how to turn off laptop mode when battery is low,
  e.g., using ACPI events. I don't have a laptop myself, so if you do and
  you care to contribute such instructions, please do.

* Most desktop hard drives have a very limited lifetime measured in spindown
  cycles, typically about 50,000 times (it's usually listed on the spec sheet).
  Check your drive's rating, and don't wear down your drive's lifetime if you
  don't need to.

* If you mount some of your ext3/reiserfs filesystems with the -n option, then
  the control script will not be able to remount them correctly. You must set
  DO_REMOUNTS=0 in the control script, otherwise it will remount them with the
  wrong options -- or it will fail because it cannot write to /etc/mtab.

* If you have your filesystems listed as type "auto" in fstab, like I did, then
  the control script will not recognize them as filesystems that need remounting.

* If you have XFS, make SURE that you set the XFS_HZ value in the control script
  correctly, to the value of HZ of your running kernel. Laptop mode will not
  work correctly if it is set too low, and you may lose data if it is set too
  high. The reason for this problem is that XFS does not export its sysctl
  variables in centisecs (like most other subsystems do) but in "jiffies",
  which is an internal kernel measure. Once this is fixed things will get better.
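
Regarding the first caveat, one possible starting point for a low-battery
check is sketched below. It is untested on real hardware and rests on an
assumption: that the ACPI battery state file (for example
/proc/acpi/battery/BAT0/state) contains a line of the form
"remaining capacity:      450 mAh". The function name battery_is_low and the
threshold are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the remaining capacity found in an
# ACPI battery state file is at or below a threshold (same unit as the file).
# ASSUMPTION: the file has a line like "remaining capacity:      450 mAh".
battery_is_low () {
    state_file=$1
    threshold=$2
    remaining=$(awk '/^remaining capacity:/ { print $3 }' "$state_file")
    [ -n "$remaining" ] && [ "$remaining" -le "$threshold" ]
}

# Demonstration on a sample file instead of real hardware.
sample=$(mktemp)
printf 'present:                 yes\nremaining capacity:      450 mAh\n' > "$sample"
if battery_is_low "$sample" 500; then
    echo "battery low: would run 'laptop_mode stop'"
fi
rm -f "$sample"
```

On a real system the check could be run periodically, or from an ACPI battery
event, with the echo replaced by a call to the control script.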

The details
-----------

Laptop-mode is controlled by the flag /proc/sys/vm/laptop_mode. When this
flag is set, any physical disk read operation (that might have caused the
hard disk to spin up) causes Linux to flush all dirty blocks. The result
of this is that after a disk has spun down, it will not be spun up anymore
to write dirty blocks, because those blocks had already been written
immediately after the most recent read operation.

To increase the effectiveness of the laptop_mode strategy, the laptop_mode
control script increases dirty_expire_centisecs and dirty_writeback_centisecs in
/proc/sys/vm to about 10 minutes (by default), which means that pages that are
dirtied are not forced to be written to disk as often. The control script also
changes the dirty background ratio, so that background writeback of dirty pages
is not done anymore. Combined with a higher commit value (also 10 minutes) for
ext3 or ReiserFS filesystems (also done automatically by the control script),
this results in concentration of disk activity in a small time interval which
occurs only once every 10 minutes, or whenever the disk is forced to spin up by
a cache miss. The disk can then be spun down in the periods of inactivity.
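
The arithmetic behind those settings can be sketched as follows. This is a
dry run that only prints values. It mirrors the conversions done by the
control script (seconds to centiseconds for the VM knobs, seconds times HZ for
XFS). The function name laptop_mode_values is invented here, and the default
of 1000 for HZ is the same assumption the script makes in XFS_HZ.

```shell
#!/bin/sh
# Sketch: print the values that correspond to a given MAX_AGE.
# $1 = MAX_AGE in seconds, $2 = kernel HZ (assumed 1000 if omitted)
laptop_mode_values () {
    max_age=$1
    xfs_hz=${2:-1000}
    echo "dirty_writeback_centisecs=$((100 * max_age))"
    echo "dirty_expire_centisecs=$((100 * max_age))"
    echo "commit=$max_age"                  # ext3/reiserfs mount option, seconds
    echo "xfs_age=$((xfs_hz * max_age))"    # XFS knobs are in jiffies
}

laptop_mode_values 600
```

For the default MAX_AGE of 600 seconds this prints 60000 for both centisecs
values, commit=600, and 600000 jiffies for XFS.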

If you want to find out which process caused the disk to spin up, you can
gather information by setting the flag /proc/sys/vm/block_dump. When this flag
is set, Linux reports all disk read and write operations that take place, and
all block dirtyings done to files. This makes it possible to debug why a disk
needs to spin up, and to increase battery life even more. The output of
block_dump is written to the kernel output, and it can be retrieved using
"dmesg". When you use block_dump, you may want to turn off klogd, otherwise
the output of block_dump will be logged, causing disk activity that is not
normally there.
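
A small filter can make the block_dump output easier to read. The sketch below
counts events per process name and event type. It assumes the message formats
used by this patch ("comm(pid): READ block N on dev", "comm(pid): WRITE block
N on dev" and "comm(pid): dirtied file") and process names without spaces. The
function name summarize_block_dump is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: summarize block_dump messages read from stdin.
summarize_block_dump () {
    awk '
        /\): (READ|WRITE) block / || /\): dirtied / {
            comm = $1
            sub(/\(.*/, "", comm)        # strip the "(pid):" suffix
            count[comm " " $2]++         # $2 is READ, WRITE or dirtied
        }
        END {
            for (key in count)
                print count[key], key
        }
    ' | sort -k1,1nr -k2
}

# Demonstration on canned input; on a real system use: dmesg | summarize_block_dump
printf '%s\n' \
    'bash(1234): READ block 5678 on hda1' \
    'bash(1234): READ block 5686 on hda1' \
    'pdflush(7): WRITE block 99 on hda1' \
    'vim(42): dirtied file' | summarize_block_dump
```

The busiest process and event type is listed first, which is usually the one
worth investigating.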

If 10 minutes is too much or too little downtime for you, you can configure
this downtime as follows. In the control script, set the MAX_AGE value to the
maximum number of seconds of disk downtime that you would like. You should
then set your filesystem's commit interval to the same value. The dirty ratio
is also configurable from the control script.

If you don't like the idea of the control script remounting your filesystems
for you, you can change DO_REMOUNTS to 0 in the script.

Thanks to Kiko Piris, the control script can be used to enable laptop mode on
both the Linux 2.4 and 2.6 series.

Tips & Tricks
-------------

* Bartek Kania reports getting up to 50 minutes of extra battery life (on top
  of his regular 3 to 3.5 hours) using very aggressive power management (hdparm
  -B1) and a spindown time of 5 seconds (hdparm -S1).

* You can spin down the disk while playing MP3, by setting the disk readahead
  to 8MB (hdparm -a 16384). Effectively, the disk will read a complete MP3 at
  once, and will then spin down while the MP3 is playing. (Thanks to Bartek
  Kania.)

* Drew Scott Daniels observed: "I don't know why, but when I decrease the number
  of colours that my display uses it consumes less battery power. I've seen
  this on powerbooks too. I hope that this is a piece of information that
  might be useful to the Laptop Mode patch or it's users."

* One thing which will cause disks to spin up is not-present application
  and dynamic library text pages. The kernel will load program text off disk
  on-demand, so each time you invoke an application feature for the first
  time, the kernel needs to spin the disk up to go and fetch that part of the
  application.

  So it is useful to increase the disk readahead parameter greatly, so that
  the kernel will pull all of the executable's pages into memory on the first
  pagefault.

  The supplied script does this.

* In syslog.conf, you can prefix entries with a dash ``-'' to omit syncing the
  file after every logging. When you're using laptop-mode and your disk doesn't
  spin down, this is a likely culprit.

* Richard Atterer observed that laptop mode does not work well with noflushd
  (http://noflushd.sourceforge.net/); it seems that noflushd prevents
  laptop-mode from doing its thing.
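
The readahead values in the tips above are given in 512-byte sectors, which is
the unit both "hdparm -a" and "blockdev --setra" expect. The conversion from
kilobytes is a plain multiplication by two, as this small sketch shows. The
helper name kb_to_sectors is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: convert a readahead size in kilobytes to 512-byte sectors.
kb_to_sectors () {
    echo $(($1 * 2))
}

kb_to_sectors 8192    # 8MB, the MP3 tip: gives the 16384 passed to hdparm -a
kb_to_sectors 4096    # READAHEAD default in the control script
```

The second call gives 8192, which is what the control script passes to
"blockdev --setra" for its default READAHEAD of 4096 kilobytes.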

Control script
--------------

Please note that this control script works for the Linux 2.4 and 2.6 series.

--------------------CONTROL SCRIPT BEGIN------------------------------------------
#! /bin/sh
# start or stop laptop_mode, best run by a power management daemon when
# ac gets connected/disconnected from a laptop
#
# install as /sbin/laptop_mode
#
# Contributors to this script: Kiko Piris
# Bart Samwel
# Micha Feigin
# Andrew Morton
# Dax Kelson
#
# Original Linux 2.4 version by: Jens Axboe
# Remove an option (the first parameter) of the form option=<number> from
# a mount options string (the rest of the parameters).
parse_mount_opts () {
OPT="$1"
shift
echo "$*" | \
sed 's/.*/,&,/' | \
sed 's/,'"$OPT"'=[0-9]*,/,/g' | \
sed 's/,,*/,/g' | \
sed 's/^,//' | \
sed 's/,$//' | \
cat -
}
# Remove an option (the first parameter) without any arguments from
# a mount option string (the rest of the parameters).
parse_nonumber_mount_opts () {
OPT="$1"
shift
echo "$*" | \
sed 's/.*/,&,/' | \
sed 's/,'"$OPT"',/,/g' | \
sed 's/,,*/,/g' | \
sed 's/^,//' | \
sed 's/,$//' | \
cat -
}
# Find out the state of a yes/no option (e.g. "atime"/"noatime") in
# fstab for a given filesystem, and use this state to replace the
# value of the option in another mount options string. The device
# is the first argument, the option name the second, and the default
# value the third. The remainder is the mount options string.
#
# Example:
# parse_yesno_opts_wfstab /dev/hda1 atime atime defaults,noatime
#
# If fstab contains, say, "rw" for this filesystem, then the result
# will be "defaults,atime".
parse_yesno_opts_wfstab () {
L_DEV=$1
shift
OPT=$1
shift
DEF_OPT=$1
shift
L_OPTS="$*"
PARSEDOPTS1="$(parse_nonumber_mount_opts $OPT $L_OPTS)"
PARSEDOPTS1="$(parse_nonumber_mount_opts no$OPT $PARSEDOPTS1)"
# Watch for a default atime in fstab
FSTAB_OPTS="$(tr '\t' ' ' < /etc/fstab | grep ^\ *"$L_DEV " | awk '{ print $4 }')"
if [ -z "$(echo "$FSTAB_OPTS" | grep "$OPT")" ] ; then
# option not specified in fstab -- choose the default.
echo "$PARSEDOPTS1,$DEF_OPT"
else
# option specified in fstab: extract the value and use it
if [ -z "$(echo "$FSTAB_OPTS" | grep "no$OPT")" ] ; then
# no$OPT not found -- so we must have $OPT.
echo "$PARSEDOPTS1,$OPT"
else
echo "$PARSEDOPTS1,no$OPT"
fi
fi
}
# Find out the state of a numbered option (e.g. "commit=NNN") in
# fstab for a given filesystem, and use this state to replace the
# value of the option in another mount options string. The device
# is the first argument, and the option name the second. The
# remainder is the mount options string in which the replacement
# must be done.
#
# Example:
# parse_mount_opts_wfstab /dev/hda1 commit defaults,commit=7
#
# If fstab contains, say, "commit=3,rw" for this filesystem, then the
# result will be "rw,commit=3".
parse_mount_opts_wfstab () {
L_DEV=$1
shift
OPT=$1
shift
L_OPTS="$*"
PARSEDOPTS1="$(parse_mount_opts $OPT $L_OPTS)"
# Watch for a default commit in fstab
FSTAB_OPTS="$(tr '\t' ' ' < /etc/fstab | grep ^\ *"$L_DEV " | awk '{ print $4 }')"
if [ -z "$(echo "$FSTAB_OPTS" | grep "$OPT=")" ] ; then
# option not specified in fstab: set it to 0
echo "$PARSEDOPTS1,$OPT=0"
else
# option specified in fstab: extract the value, and use it
echo -n "$PARSEDOPTS1,$OPT="
echo "$FSTAB_OPTS" | \
sed 's/.*/,&,/' | \
sed 's/.*,'"$OPT"'=//' | \
sed 's/,.*//' | \
cat -
fi
}
KLEVEL="$(uname -r | cut -c1-3)"
case "$KLEVEL" in
"2.4"|"2.6")
true
;;
*)
echo "Unhandled kernel version: $KLEVEL ('uname -r' = '$(uname -r)')"
exit 1
;;
esac
# Shall we remount journaled fs. with appropriate commit interval? (1=yes)
DO_REMOUNTS=1
# age time, in seconds. should be put into a sysconfig file
MAX_AGE=600
# Dirty synchronous ratio. At this percentage of dirty pages the process which
# calls write() does its own writeback
DIRTY_RATIO=40
#
# Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been
# exceeded, the kernel will wake pdflush which will then reduce the amount
# of dirty memory to dirty_background_ratio. Set this nice and low, so once
# some writeout has commenced, we do a lot of it.
#
DIRTY_BACKGROUND_RATIO=5
READAHEAD=4096 # kilobytes
# kernel default dirty buffer age
DEF_AGE=30
DEF_UPDATE=5
DEF_DIRTY_BACKGROUND_RATIO=10
DEF_DIRTY_RATIO=40
DEF_XFS_AGE_BUFFER=15
DEF_XFS_SYNC_INTERVAL=30
# This must be adjusted manually to the value of HZ in the running kernel,
# until the XFS people change their external interfaces to work in centisecs
# like the rest of the external world. Unfortunately this cannot be automated. :(
XFS_HZ=1000
if [ ! -e /proc/sys/vm/laptop_mode ]; then
echo "Kernel is not patched with laptop_mode patch."
exit 1
fi
if [ ! -w /proc/sys/vm/laptop_mode ]; then
echo "You do not have enough privileges to enable laptop_mode."
exit 1
fi
case "$1" in
start)
AGE=$((100*$MAX_AGE))
XFS_AGE=$(($XFS_HZ*$MAX_AGE))
echo -n "Starting laptop_mode"
if [ -d /proc/sys/vm/pagebuf ] ; then
# This only needs to be set, not reset -- it is only used when
# laptop mode is enabled.
echo $XFS_AGE > /proc/sys/vm/pagebuf/lm_flush_age
echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
elif [ -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
# The same goes for these.
echo $XFS_AGE > /proc/sys/fs/xfs/lm_age_buffer
echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
elif [ -f /proc/sys/fs/xfs/age_buffer ] ; then
# But not for these -- they are also used in normal
# operation.
echo $XFS_AGE > /proc/sys/fs/xfs/age_buffer
echo $XFS_AGE > /proc/sys/fs/xfs/sync_interval
fi
case "$KLEVEL" in
"2.4")
echo "1" > /proc/sys/vm/laptop_mode
echo "30 500 0 0 $AGE $AGE 60 20 0" > /proc/sys/vm/bdflush
;;
"2.6")
echo "5" > /proc/sys/vm/laptop_mode
echo "$AGE" > /proc/sys/vm/dirty_writeback_centisecs
echo "$AGE" > /proc/sys/vm/dirty_expire_centisecs
echo "$DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
echo "$DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
;;
esac
if [ $DO_REMOUNTS -eq 1 ]; then
cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
PARSEDOPTS="$(parse_mount_opts "$OPTS")"
case "$FST" in
"ext3"|"reiserfs")
PARSEDOPTS="$(parse_mount_opts commit "$OPTS")"
mount $DEV -t $FST $MP -o remount,$PARSEDOPTS,commit=$MAX_AGE,noatime
;;
"xfs")
mount $DEV -t $FST $MP -o remount,$OPTS,noatime
;;
esac
if [ -b $DEV ] ; then
blockdev --setra $(($READAHEAD * 2)) $DEV
fi
done
fi
echo "."
;;
stop)
U_AGE=$((100*$DEF_UPDATE))
B_AGE=$((100*$DEF_AGE))
echo -n "Stopping laptop_mode"
echo "0" > /proc/sys/vm/laptop_mode
if [ -f /proc/sys/fs/xfs/age_buffer ] && [ ! -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
# These need to be restored though, if there are no lm_*.
echo "$(($XFS_HZ*$DEF_XFS_AGE_BUFFER))" > /proc/sys/fs/xfs/age_buffer
echo "$(($XFS_HZ*$DEF_XFS_SYNC_INTERVAL))" > /proc/sys/fs/xfs/sync_interval
fi
case "$KLEVEL" in
"2.4")
echo "30 500 0 0 $U_AGE $B_AGE 60 20 0" > /proc/sys/vm/bdflush
;;
"2.6")
echo "$U_AGE" > /proc/sys/vm/dirty_writeback_centisecs
echo "$B_AGE" > /proc/sys/vm/dirty_expire_centisecs
echo "$DEF_DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
echo "$DEF_DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
;;
esac
if [ $DO_REMOUNTS -eq 1 ]; then
cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
# Reset commit and atime options to defaults.
case "$FST" in
"ext3"|"reiserfs")
PARSEDOPTS="$(parse_mount_opts_wfstab $DEV commit $OPTS)"
PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $PARSEDOPTS)"
mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
;;
"xfs")
PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $OPTS)"
mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
;;
esac
if [ -b $DEV ] ; then
blockdev --setra 256 $DEV
fi
done
fi
echo "."
;;
*)
echo "Usage: $0 {start|stop}"
exit 1
;;
esac
exit 0
--------------------CONTROL SCRIPT END--------------------------------------------
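
To see what the option-stripping helper in the script does without touching
any mounts, it can be run on its own. The copy below is condensed into a
single sed invocation but is intended to behave like parse_mount_opts above.
The sample option strings are made up for the demonstration.

```shell
#!/bin/sh
# Condensed copy of parse_mount_opts from the control script:
# remove "<option>=<number>" from a comma-separated mount option string.
parse_mount_opts () {
    opt="$1"
    shift
    echo "$*" | sed \
        -e 's/.*/,&,/' \
        -e 's/,'"$opt"'=[0-9]*,/,/g' \
        -e 's/,,*/,/g' \
        -e 's/^,//' \
        -e 's/,$//'
}

# Strip an existing commit interval, then build the remount option string
# the same way the "start" branch does for ext3/reiserfs.
OPTS="rw,commit=7,errors=remount-ro"
MAX_AGE=600
PARSEDOPTS="$(parse_mount_opts commit "$OPTS")"
echo "remount,$PARSEDOPTS,commit=$MAX_AGE,noatime"
```

This prints remount,rw,errors=remount-ro,commit=600,noatime.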

ACPI integration
----------------

Dax Kelson submitted this so that the ACPI acpid daemon will
kick off the laptop_mode script and run hdparm.

---------------------------/etc/acpi/events/ac_adapter BEGIN-------------------------------------------
event=ac_adapter
action=/etc/acpi/actions/battery.sh
---------------------------/etc/acpi/events/ac_adapter END-------------------------------------------
---------------------------/etc/acpi/actions/battery.sh BEGIN-------------------------------------------
#!/bin/sh
# cpu throttling
# cat /proc/acpi/processor/CPU0/throttling for more info
ACAD_THR=0
BATT_THR=2
# spindown time for HD (man hdparm for valid values)
# I prefer 2 hours for acad and 20 seconds for batt
ACAD_HD=244
BATT_HD=4
# ac/battery event handler
status=`awk '/^state: / { print $2 }' /proc/acpi/ac_adapter/AC/state`
case $status in
"on-line")
echo "Setting HD spindown to 2 hours"
/sbin/laptop_mode stop
/sbin/hdparm -S $ACAD_HD /dev/hda > /dev/null 2>&1
/sbin/hdparm -B 255 /dev/hda > /dev/null 2>&1
#echo -n $ACAD_CPU:$ACAD_THR > /proc/acpi/processor/CPU0/limit
exit 0
;;
"off-line")
echo "Setting HD spindown to 20 seconds"
/sbin/laptop_mode start
/sbin/hdparm -S $BATT_HD /dev/hda > /dev/null 2>&1
/sbin/hdparm -B 1 /dev/hda > /dev/null 2>&1
#echo -n $BATT_CPU:$BATT_THR > /proc/acpi/processor/CPU0/limit
exit 0
;;
esac
---------------------------/etc/acpi/actions/battery.sh END-------------------------------------------
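
Before installing the handler, its decision logic can be tried out on a sample
file. The sketch below uses the same awk expression as battery.sh but only
prints which control script action would be taken. The function name ac_action
and the sample file contents are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical dry-run version of the AC adapter check in battery.sh.
# $1 = path to an AC adapter state file (e.g. /proc/acpi/ac_adapter/AC/state)
ac_action () {
    status=$(awk '/^state: / { print $2 }' "$1")
    case "$status" in
        on-line)  echo "stop" ;;     # on mains power: disable laptop mode
        off-line) echo "start" ;;    # on battery: enable laptop mode
        *)        echo "unknown" ;;
    esac
}

# Demonstration with a sample state file.
sample=$(mktemp)
printf 'state:                   off-line\n' > "$sample"
echo "laptop_mode $(ac_action "$sample")"
rm -f "$sample"
```

With the sample file above the sketch prints "laptop_mode start".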

Monitoring tool
---------------

Bartek Kania submitted this; it can be used to measure how much time your disk
spends spun up/down.

---------------------------dslm.c BEGIN-------------------------------------------
/*
* Simple Disk Sleep Monitor
* by Bartek Kania
* Licenced under the GPL
*/
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <time.h>
#include <string.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>
#ifdef DEBUG
#define D(x) x
#else
#define D(x)
#endif
int endit = 0;
/* Check if the disk is in powersave-mode
* Most of the code is stolen from hdparm.
* 1 = active, 0 = standby/sleep, -1 = unknown */
int check_powermode(int fd)
{
unsigned char args[4] = {WIN_CHECKPOWERMODE1,0,0,0};
int state;
if (ioctl(fd, HDIO_DRIVE_CMD, &args)
&& (args[0] = WIN_CHECKPOWERMODE2) /* try again with 0x98 */
&& ioctl(fd, HDIO_DRIVE_CMD, &args)) {
if (errno != EIO || args[0] != 0 || args[1] != 0) {
state = -1; /* "unknown"; */
} else
state = 0; /* "sleeping"; */
} else {
state = (args[2] == 255) ? 1 : 0;
}
D(printf(" drive state is: %d\n", state));
return state;
}
char *state_name(int i)
{
if (i == -1) return "unknown";
if (i == 0) return "sleeping";
if (i == 1) return "active";
return "internal error";
}
char *myctime(time_t time)
{
char *ts = ctime(&time);
ts[strlen(ts) - 1] = 0;
return ts;
}
void measure(int fd)
{
time_t start_time;
int last_state;
time_t last_time;
int curr_state;
time_t curr_time = 0;
time_t time_diff;
time_t active_time = 0;
time_t sleep_time = 0;
time_t unknown_time = 0;
time_t total_time = 0;
int changes = 0;
float tmp;
printf("Starting measurements\n");
last_state = check_powermode(fd);
start_time = last_time = time(0);
printf(" System is in state %s\n\n", state_name(last_state));
while(!endit) {
sleep(1);
curr_state = check_powermode(fd);
if (curr_state != last_state || endit) {
changes++;
curr_time = time(0);
time_diff = curr_time - last_time;
if (last_state == 1) active_time += time_diff;
else if (last_state == 0) sleep_time += time_diff;
else unknown_time += time_diff;
last_state = curr_state;
last_time = curr_time;
printf("%s: State-change to %s\n", myctime(curr_time),
state_name(curr_state));
}
}
changes--; /* Compensate for SIGINT */
total_time = time(0) - start_time;
printf("\nTotal running time: %lus\n", total_time);
printf(" State changed %d times\n", changes);
tmp = (float)sleep_time / (float)total_time * 100;
printf(" Time in sleep state: %lus (%.2f%%)\n", sleep_time, tmp);
tmp = (float)active_time / (float)total_time * 100;
printf(" Time in active state: %lus (%.2f%%)\n", active_time, tmp);
tmp = (float)unknown_time / (float)total_time * 100;
printf(" Time in unknown state: %lus (%.2f%%)\n", unknown_time, tmp);
}
void ender(int s)
{
endit = 1;
}
void usage()
{
puts("usage: dslm [-w <time>] <disk>");
exit(0);
}
int main(int ac, char **av)
{
int fd;
char *disk = 0;
int settle_time = 60;
/* Parse the simple command-line */
if (ac == 2)
disk = av[1];
else if (ac == 4) {
settle_time = atoi(av[2]);
disk = av[3];
} else
usage();
if ((fd = open(disk, O_RDONLY|O_NONBLOCK)) < 0) {
printf("Can't open %s, because: %s\n", disk, strerror(errno));
exit(-1);
}
if (settle_time) {
printf("Waiting %d seconds for the system to settle down to "
"'normal'\n", settle_time);
sleep(settle_time);
} else
puts("Not waiting for system to settle down");
signal(SIGINT, ender);
measure(fd);
close(fd);
return 0;
}
---------------------------dslm.c END---------------------------------------------
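
For a quick spot check without compiling dslm, the power state can also be
read from "hdparm -C". The sketch below extracts the state from that command's
output. It assumes the usual " drive state is:  <state>" line, and the
function name drive_state is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: extract the power state from "hdparm -C" output on stdin.
drive_state () {
    awk -F': *' '/drive state is/ { print $2 }'
}

# Demonstration on canned output. On a real system (as root) one could poll:
#   while sleep 1; do hdparm -C /dev/hda | drive_state; done
printf '\n/dev/hda:\n drive state is:  active/idle\n' | drive_state
```

Unlike dslm, this does not accumulate time per state; it only reports the
state at the moment of each call.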
@@ -27,6 +27,7 @@
 #include <linux/completion.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
+#include <linux/writeback.h>
 /*
  * for max sense size
@@ -2471,6 +2472,16 @@ int submit_bio(int rw, struct bio *bio)
 		mod_page_state(pgpgout, count);
 	else
 		mod_page_state(pgpgin, count);
+	if (unlikely(block_dump)) {
+		char b[BDEVNAME_SIZE];
+		printk("%s(%d): %s block %Lu on %s\n",
+			current->comm, current->pid,
+			(rw & WRITE) ? "WRITE" : "READ",
+			(unsigned long long)bio->bi_sector,
+			bdevname(bio->bi_bdev,b));
+	}
 	generic_make_request(bio);
 	return 1;
 }
@@ -2754,6 +2765,9 @@ void end_that_request_last(struct request *req)
 	struct gendisk *disk = req->rq_disk;
 	struct completion *waiting = req->waiting;
+	if (unlikely(laptop_mode))
+		laptop_io_completion();
 	if (disk && blk_fs_request(req)) {
 		unsigned long duration = jiffies - req->start_time;
 		switch (rq_data_dir(req)) {
......
@@ -274,6 +274,8 @@ static void do_sync(unsigned long wait)
 	sync_inodes(wait);	/* Mappings, inodes and blockdevs, again. */
 	if (!wait)
 		printk("Emergency Sync complete\n");
+	if (unlikely(laptop_mode))
+		laptop_sync_completion();
 }
 asmlinkage long sys_sync(void)
......
@@ -75,6 +75,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if ((inode->i_state & flags) == flags)
 		return;
+	if (unlikely(block_dump))
+		printk("%s(%d): dirtied file\n", current->comm, current->pid);
 	spin_lock(&inode_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
......
@@ -159,6 +159,8 @@ enum
 	VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
 	VM_MIN_FREE_KBYTES=21,	/* Minimum free kilobytes to maintain */
 	VM_MAX_MAP_COUNT=22,	/* int: Maximum number of mmaps/address-space */
+	VM_LAPTOP_MODE=23,	/* vm laptop mode */
+	VM_BLOCK_DUMP=24,	/* block dump mode */
 };
......
@@ -72,12 +72,16 @@ static inline void wait_on_inode(struct inode *inode)
  * mm/page-writeback.c
  */
 int wakeup_bdflush(long nr_pages);
+void laptop_io_completion(void);
+void laptop_sync_completion(void);
-/* These 5 are exported to sysctl. */
+/* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
 extern int dirty_writeback_centisecs;
 extern int dirty_expire_centisecs;
+extern int block_dump;
+extern int laptop_mode;
 struct ctl_table;
 struct file;
......
@@ -744,6 +744,26 @@ static ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+	{
+		.ctl_name	= VM_LAPTOP_MODE,
+		.procname	= "laptop_mode",
+		.data		= &laptop_mode,
+		.maxlen		= sizeof(laptop_mode),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+	{
+		.ctl_name	= VM_BLOCK_DUMP,
+		.procname	= "block_dump",
+		.data		= &block_dump,
+		.maxlen		= sizeof(block_dump),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };
......
@@ -28,6 +28,7 @@
 #include <linux/smp.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/syscalls.h>
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -81,6 +82,16 @@ int dirty_writeback_centisecs = 5 * 100;
  */
 int dirty_expire_centisecs = 30 * 100;
+/*
+ * Flag that makes the machine dump writes/reads and block dirtyings.
+ */
+int block_dump;
+/*
+ * Flag that puts the machine in "laptop mode".
+ */
+int laptop_mode;
 /* End of sysctl-exported parameters */
@@ -195,7 +206,19 @@ static void balance_dirty_pages(struct address_space *mapping)
 	if (nr_reclaimable + ps.nr_writeback <= dirty_thresh)
 		dirty_exceeded = 0;
-	if (!writeback_in_progress(bdi) && nr_reclaimable > background_thresh)
+	if (writeback_in_progress(bdi))
+		return;		/* pdflush is already working this queue */
+	/*
+	 * In laptop mode, we wait until hitting the higher threshold before
+	 * starting background writeout, and then write out all the way down
+	 * to the lower threshold. So slow writers cause minimal disk activity.
+	 *
+	 * In normal mode, we start background writeout at the lower
+	 * background_thresh, to keep the amount of dirty memory low.
+	 */
+	if ((laptop_mode && pages_written) ||
+	     (!laptop_mode && (nr_reclaimable > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
@@ -289,7 +312,13 @@ int wakeup_bdflush(long nr_pages)
 	return pdflush_operation(background_writeout, nr_pages);
 }
-static struct timer_list wb_timer;
+static void wb_timer_fn(unsigned long unused);
+static void laptop_timer_fn(unsigned long unused);
+static struct timer_list wb_timer =
+			TIMER_INITIALIZER(wb_timer_fn, 0, 0);
+static struct timer_list laptop_mode_wb_timer =
+			TIMER_INITIALIZER(laptop_timer_fn, 0, 0);
 /*
  * Periodic writeback of "old" data.
@@ -368,7 +397,36 @@ static void wb_timer_fn(unsigned long unused)
 {
 	if (pdflush_operation(wb_kupdate, 0) < 0)
 		mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
 }
+static void laptop_flush(unsigned long unused)
+{
+	sys_sync();
+}
+static void laptop_timer_fn(unsigned long unused)
+{
+	pdflush_operation(laptop_flush, 0);
+}
+/*
+ * We've spun up the disk and we're in laptop mode: schedule writeback
+ * of all dirty data a few seconds from now. If the flush is already scheduled
+ * then push it back - the user is still using the disk.
+ */
+void laptop_io_completion(void)
+{
+	mod_timer(&laptop_mode_wb_timer, jiffies + laptop_mode * HZ);
+}
+/*
+ * We're in laptop mode and we've just synced. The sync's writes will have
+ * caused another writeback to be scheduled by laptop_io_completion.
+ * Nothing needs to be written back anymore, so we unschedule the writeback.
+ */
+void laptop_sync_completion(void)
+{
+	del_timer(&laptop_mode_wb_timer);
+}
 /*
@@ -429,12 +487,7 @@ void __init page_writeback_init(void)
 		vm_dirty_ratio *= correction;
 		vm_dirty_ratio /= 100;
 	}
-	init_timer(&wb_timer);
-	wb_timer.expires = jiffies + (dirty_writeback_centisecs * HZ) / 100;
-	wb_timer.data = 0;
-	wb_timer.function = wb_timer_fn;
-	add_timer(&wb_timer);
+	mod_timer(&wb_timer, jiffies + (dirty_writeback_centisecs * HZ) / 100);
 	set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
 }
......
@@ -246,7 +246,8 @@ static void handle_write_error(struct address_space *mapping,
  * shrink_list returns the number of reclaimed pages
  */
 static int
-shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
+shrink_list(struct list_head *page_list, unsigned int gfp_mask,
+		int *nr_scanned, int do_writepage)
 {
 	struct address_space *mapping;
 	LIST_HEAD(ret_pages);
@@ -354,6 +355,8 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
 				goto keep_locked;
 			if (!may_write_to_queue(mapping->backing_dev_info))
 				goto keep_locked;
+			if (laptop_mode && !do_writepage)
+				goto keep_locked;
 			if (clear_page_dirty_for_io(page)) {
 				int res;
 				struct writeback_control wbc = {
@@ -473,7 +476,7 @@ shrink_list(struct list_head *page_list, unsigned int gfp_mask, int *nr_scanned)
  */
 static int
 shrink_cache(struct zone *zone, unsigned int gfp_mask,
-		int max_scan, int *total_scanned)
+		int max_scan, int *total_scanned, int do_writepage)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -521,7 +524,8 @@ shrink_cache(struct zone *zone, unsigned int gfp_mask,
 			mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
 		else
 			mod_page_state_zone(zone, pgscan_direct, nr_scan);
-		nr_freed = shrink_list(&page_list, gfp_mask, total_scanned);
+		nr_freed = shrink_list(&page_list, gfp_mask,
+					total_scanned, do_writepage);
 		*total_scanned += nr_taken;
 		if (current_is_kswapd())
 			mod_page_state(kswapd_steal, nr_freed);
@@ -735,7 +739,7 @@ refill_inactive_zone(struct zone *zone, const int nr_pages_in,
  */
 static int
 shrink_zone(struct zone *zone, int max_scan, unsigned int gfp_mask,
-		int *total_scanned, struct page_state *ps)
+		int *total_scanned, struct page_state *ps, int do_writepage)
 {
 	unsigned long ratio;
 	int count;
@@ -764,7 +768,8 @@ shrink_zone(struct zone *zone, int max_scan, unsigned int gfp_mask,
 	count = atomic_read(&zone->nr_scan_inactive);
 	if (count >= SWAP_CLUSTER_MAX) {
 		atomic_set(&zone->nr_scan_inactive, 0);
-		return shrink_cache(zone, gfp_mask, count, total_scanned);
+		return shrink_cache(zone, gfp_mask, count,
+					total_scanned, do_writepage);
 	}
 	return 0;
 }
@@ -787,7 +792,7 @@ shrink_zone(struct zone *zone, int max_scan, unsigned int gfp_mask,
  */
 static int
 shrink_caches(struct zone **zones, int priority, int *total_scanned,
-		int gfp_mask, struct page_state *ps)
+		int gfp_mask, struct page_state *ps, int do_writepage)
 {
 	int ret = 0;
 	int i;
@@ -803,7 +808,8 @@ shrink_caches(struct zone **zones, int priority, int *total_scanned,
 			continue;	/* Let kswapd poll it */
 		max_scan = zone->nr_inactive >> priority;
-		ret += shrink_zone(zone, max_scan, gfp_mask, total_scanned, ps);
+		ret += shrink_zone(zone, max_scan, gfp_mask,
+					total_scanned, ps, do_writepage);
 	}
 	return ret;
 }
@@ -833,6 +839,8 @@ int try_to_free_pages(struct zone **zones,
 	int nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	int i;
+	unsigned long total_scanned = 0;
+	int do_writepage = 0;
 	inc_page_state(allocstall);
@@ -840,13 +848,13 @@ int try_to_free_pages(struct zone **zones,
 		zones[i]->temp_priority = DEF_PRIORITY;
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		int total_scanned = 0;
+		int scanned = 0;
 		struct page_state ps;
 		get_page_state(&ps);
-		nr_reclaimed += shrink_caches(zones, priority, &total_scanned,
-						gfp_mask, &ps);
-		shrink_slab(total_scanned, gfp_mask);
+		nr_reclaimed += shrink_caches(zones, priority, &scanned,
+						gfp_mask, &ps, do_writepage);
+		shrink_slab(scanned, gfp_mask);
 		if (reclaim_state) {
 			nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -858,14 +866,20 @@ int try_to_free_pages(struct zone **zones,
 		if (!(gfp_mask & __GFP_FS))
 			break;		/* Let the caller handle it */
 		/*
-		 * Try to write back as many pages as we just scanned. Not
-		 * sure if that makes sense, but it's an attempt to avoid
-		 * creating IO storms unnecessarily
+		 * Try to write back as many pages as we just scanned. This
+		 * tends to cause slow streaming writers to write data to the
+		 * disk smoothly, at the dirtying rate, which is nice. But
+		 * that's undesirable in laptop mode, where we *want* lumpy
+		 * writeout. So in laptop mode, write out the whole world.
 		 */
-		wakeup_bdflush(total_scanned);
+		total_scanned += scanned;
+		if (total_scanned > SWAP_CLUSTER_MAX + SWAP_CLUSTER_MAX/2) {
+			wakeup_bdflush(laptop_mode ? 0 : total_scanned);
+			do_writepage = 1;
+		}
 		/* Take a nap, wait for some writeback to complete */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
+		if (scanned && priority < DEF_PRIORITY - 2)
 			blk_congestion_wait(WRITE, HZ/10);
 	}
 	if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
@@ -908,6 +922,8 @@ static int balance_pgdat(pg_data_t *pgdat, int nr_pages, struct page_state *ps)
 	int i;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long total_scanned = 0;
+	unsigned long total_reclaimed = 0;
+	int do_writepage = 0;
 	inc_page_state(pageoutrun);
@@ -969,16 +985,25 @@ static int balance_pgdat(pg_data_t *pgdat, int nr_pages, struct page_state *ps)
 			zone->temp_priority = priority;
 			max_scan = zone->nr_inactive >> priority;
 			reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,
-					&scanned, ps);
+					&scanned, ps, do_writepage);
 			total_scanned += scanned;
 			reclaim_state->reclaimed_slab = 0;
 			shrink_slab(scanned, GFP_KERNEL);
 			reclaimed += reclaim_state->reclaimed_slab;
+			total_reclaimed += reclaimed;
 			to_free -= reclaimed;
 			if (zone->all_unreclaimable)
 				continue;
 			if (zone->pages_scanned > zone->present_pages * 2)
 				zone->all_unreclaimable = 1;
+			/*
+			 * If we've done a decent amount of scanning and
+			 * the reclaim ratio is low, start doing writepage
+			 * even in laptop mode
+			 */
+			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+			    total_scanned > total_reclaimed+total_reclaimed/2)
+				do_writepage = 1;
 		}
 		if (nr_pages && to_free > 0)
 			continue;	/* swsusp: need to do more work */
@@ -997,7 +1022,7 @@ static int balance_pgdat(pg_data_t *pgdat, int nr_pages, struct page_state *ps)
 		zone->prev_priority = zone->temp_priority;
 	}
-	return nr_pages - to_free;
+	return total_reclaimed;
 }
 /*
......