Commit 0d85f8bf authored by Andrew Morton, committed by Linus Torvalds

[PATCH] direct IO updates

This patch is a performance and correctness update to the direct-IO
code: O_DIRECT and the raw driver.  It mainly affects IO against
blockdevs.

The direct_io code was returning -EINVAL for a filesystem hole.  Change
it to clear the userspace page instead.
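
The hole handling described above can be pictured with a small userspace sketch. This is not the kernel code: `toy_read_block()` and its NULL-means-hole convention are hypothetical, chosen only to show the change from failing the read to handing back zeroes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of a direct read of one block.
 * disk_block == NULL stands for "this file block is a hole".
 * The behaviour sketched here is to zero the destination for a
 * hole rather than fail the read with an error.
 */
static size_t toy_read_block(const unsigned char *disk_block,
			     unsigned char *user_buf, size_t blocksize)
{
	assert(user_buf != NULL);

	if (disk_block == NULL)
		memset(user_buf, 0, blocksize);		/* hole: clear the page */
	else
		memcpy(user_buf, disk_block, blocksize);	/* mapped: copy data */

	return blocksize;
}
```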

There were a few restrictions and weirdnesses wrt blocksize and
alignments.  The code has been reworked so we now lay out maximum-sized
BIOs at any sector alignment.

Because of this, the raw driver has been altered to set the blockdev's
soft blocksize to the minimum possible at open() time, typically 512
bytes.  There are now no performance disadvantages to using small
blocksizes, and this gives the finest possible alignment.
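
To illustrate why a smaller soft blocksize gives finer alignment, here is a minimal userspace sketch. It assumes, as a simplification not spelled out in this message, that a direct-IO request is acceptable only when both its offset and its length are multiples of the soft blocksize. `toy_dio_aligned()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: is a request aligned to the soft blocksize?
 * Assumes blocksize is a non-zero power of two.
 */
static bool toy_dio_aligned(uint64_t offset, size_t count, unsigned blocksize)
{
	uint64_t mask;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	mask = (uint64_t)blocksize - 1;

	/* Both the start and the length must fall on block boundaries. */
	return ((offset | (uint64_t)count) & mask) == 0;
}
```

Under that assumption, a 512-byte request at offset 1536 passes with a 512-byte blocksize but would be rejected with a 4096-byte one.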

There is no API here for setting or querying the soft blocksize of the
raw driver (there never was, really), which could conceivably be a
problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
fd which /dev/raw/rawN returned, but that would require that
blk_ioctl() be exported to modules again.

This code is wickedly quick.  Here's an oprofile of a single 500MHz
PIII reading from four (old) scsi disks (two aic7xxx controllers) via
the raw driver.  Aggregate throughput is 72 megabytes/second:

c013363c 24       0.0896492   __set_page_dirty_buffers
c021b8cc 24       0.0896492   ahc_linux_isr
c012b5dc 25       0.0933846   kmem_cache_free
c014d894 26       0.09712     dio_bio_complete
c01cc78c 26       0.09712     number
c0123bd4 40       0.149415    follow_page
c01eed8c 46       0.171828    end_that_request_first
c01ed410 49       0.183034    blk_recount_segments
c01ed574 65       0.2428      blk_rq_map_sg
c014db38 85       0.317508    do_direct_IO
c021b090 90       0.336185    ahc_linux_run_device_queue
c010bb78 236      0.881551    timer_interrupt
c01052d8 25354    94.707      poll_idle

A testament to the efficiency of the 2.5 block layer.

And against four IDE disks on an HPT374 controller.  Throughput is 120
megabytes/sec:

c01eed8c 80       0.292462    end_that_request_first
c01fe850 87       0.318052    hpt3xx_intrproc
c01ed574 123      0.44966     blk_rq_map_sg
c01f8f10 141      0.515464    ata_select
c014db38 153      0.559333    do_direct_IO
c010bb78 235      0.859107    timer_interrupt
c01f9144 281      1.02727     ata_irq_enable
c01ff990 290      1.06017     udma_pci_init
c01fe878 308      1.12598     hpt3xx_maskproc
c02006f8 379      1.38554     idedisk_do_request
c02356a0 609      2.22637     pci_conf1_read
c01ff8dc 611      2.23368     udma_pci_start
c01ff950 922      3.37062     udma_pci_irq_status
c01f8fac 1002     3.66308     ata_status
c01ff26c 1059     3.87146     ata_start_dma
c01feb70 1141     4.17124     hpt374_udma_stop
c01f9228 3072     11.2305     ata_out_regfile
c01052d8 15193    55.5422     poll_idle

Not so good.

One problem which has been identified with O_DIRECT is the cost of
repeated calls into the mapping's get_block() callback.  Not a big
problem with ext2 but other filesystems have more complex get_block
implementations.

So what I have done is to require that callers of generic_direct_IO()
implement the new `get_blocks()' interface.  This is a small extension
to get_block().  It gets passed another argument which indicates the
maximum number of blocks which should be mapped, and it returns the
number of blocks which it did map in bh_result->b_size.  This allows
the fs to map up to 4G of disk (or of hole) in a single get_blocks()
invocation.
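
The contract described above can be sketched in userspace as follows. Everything here is a hypothetical stand-in: `toy_extent`, `toy_bh` and `toy_get_blocks()` are not kernel types or functions, and the extent table is an invented way to model a file layout. The point is only the calling convention: the caller passes a maximum block count, and the callee reports how much of one linear run (disk or hole) it mapped, in bytes, via `b_size`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_sector_t;		/* stand-in for sector_t */

struct toy_extent {			/* one linear run of file blocks */
	toy_sector_t file_start;	/* first file block in the run */
	toy_sector_t disk_start;	/* first disk block (unused for holes) */
	unsigned long len;		/* run length in blocks */
	int is_hole;
};

struct toy_bh {				/* the few buffer_head-like fields used */
	toy_sector_t b_blocknr;
	size_t b_size;			/* bytes mapped by this call */
	int mapped;			/* 0 means the run is a hole */
};

/*
 * Map at most max_blocks blocks starting at iblock.  Only a single
 * linear run is ever returned, never a scatter list.
 * Returns 0 on success, -1 if iblock lies outside every extent.
 */
static int toy_get_blocks(const struct toy_extent *map, size_t nr_extents,
			  unsigned blkbits, toy_sector_t iblock,
			  unsigned long max_blocks, struct toy_bh *bh)
{
	size_t i;

	assert(max_blocks > 0);

	for (i = 0; i < nr_extents; i++) {
		const struct toy_extent *e = &map[i];
		unsigned long off, avail, n;

		if (iblock < e->file_start || iblock >= e->file_start + e->len)
			continue;

		off = (unsigned long)(iblock - e->file_start);
		avail = e->len - off;			/* blocks left in this run */
		n = avail < max_blocks ? avail : max_blocks;

		bh->mapped = !e->is_hole;
		bh->b_blocknr = e->is_hole ? 0 : e->disk_start + off;
		bh->b_size = (size_t)n << blkbits;	/* report the mapped length */
		return 0;
	}
	return -1;
}
```

A caller would loop, advancing `iblock` by `b_size >> blkbits` after each call, so a large contiguous file costs one mapping call per run rather than one per block.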

There are some other caveats and requirements of get_blocks() which are
documented in the comment block over fs/direct_io.c:get_more_blocks().

Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
mapping.  It certainly allows good speedups.  But it doesn't allow the
fs to return a scatter list of blocks - it only understands linear
chunks of disk.  I think that's really all it _should_ do.

I'll let get_blocks() sit for a while and wait for some feedback.  If
it is sufficient and nobody objects too much, I shall convert all
get_block() instances in the kernel to be get_blocks() instances.  And
I'll teach readahead (at least) to use the get_blocks() extension.

Delayed allocate writeback could use get_blocks().  As could
block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
mileage using it in mpage_writepages() because all our filesystems are
syncalloc, and nobody uses MAP_SHARED for much.

It will be tricky to use get_blocks() for writes, because if a ton of
blocks have been mapped into the file and then something goes wrong,
the kernel needs to either remove those blocks from the file or zero
them out.  The direct_io code zeroes them out.

btw, some time ago you mentioned that some drivers and/or hardware may
get upset if there are multiple simultaneous IOs in progress against
the same block.  Well, the raw driver has always allowed that to
happen.  O_DIRECT writes to blockdevs do as well now.

todo:

1) The driver will probably explode if someone runs BLKBSZSET while
   IO is in progress.  Need to use bdclaim() somewhere.

2) readv() and writev() need to become direct_io-aware.  At present
   we're doing stop-and-wait for each segment when performing
   readv/writev against the raw driver and O_DIRECT blockdevs.
parent 62b52f5c
......@@ -17,11 +17,9 @@
#include <linux/smp_lock.h>
#include <asm/uaccess.h>
#define dprintk(x...)
typedef struct raw_device_data_s {
struct block_device *binding;
int inuse, sector_size, sector_bits;
int inuse;
struct semaphore mutex;
} raw_device_data_t;
......@@ -65,15 +63,15 @@ __initcall(raw_init);
/*
* Open/close code for raw IO.
*
* Set the device's soft blocksize to the minimum possible. This gives the
* finest possible alignment and has no adverse impact on performance.
*/
int raw_open(struct inode *inode, struct file *filp)
{
int minor;
struct block_device * bdev;
int err;
int sector_size;
int sector_bits;
minor = minor(inode->i_rdev);
......@@ -87,12 +85,11 @@ int raw_open(struct inode *inode, struct file *filp)
}
down(&raw_devices[minor].mutex);
/*
* No, it is a normal raw device. All we need to do on open is
* to check that the device is bound, and force the underlying
* block device to a sector-size blocksize.
* to check that the device is bound.
*/
bdev = raw_devices[minor].binding;
err = -ENODEV;
if (!bdev)
......@@ -100,23 +97,19 @@ int raw_open(struct inode *inode, struct file *filp)
atomic_inc(&bdev->bd_count);
err = blkdev_get(bdev, filp->f_mode, 0, BDEV_RAW);
if (err)
goto out;
if (!err) {
int minsize = bdev_hardsect_size(bdev);
/*
* Don't change the blocksize if we already have users using
* this device
*/
if (raw_devices[minor].inuse++)
goto out;
sector_size = bdev_hardsect_size(bdev);
raw_devices[minor].sector_size = sector_size;
for (sector_bits = 0; !(sector_size & 1); )
sector_size>>=1, sector_bits++;
raw_devices[minor].sector_bits = sector_bits;
if (bdev) {
int ret;
ret = set_blocksize(bdev, minsize);
if (ret)
printk("%s: set_blocksize() failed: %d\n",
__FUNCTION__, ret);
}
raw_devices[minor].inuse++;
}
out:
up(&raw_devices[minor].mutex);
......@@ -137,24 +130,27 @@ int raw_release(struct inode *inode, struct file *filp)
return 0;
}
/* Forward ioctls to the underlying block device. */
int raw_ioctl(struct inode *inode,
struct file *flip,
struct file *filp,
unsigned int command,
unsigned long arg)
{
int minor = minor(inode->i_rdev), err;
int minor = minor(inode->i_rdev);
int err;
struct block_device *b;
err = -ENODEV;
if (minor < 1 && minor > 255)
return -ENODEV;
goto out;
b = raw_devices[minor].binding;
err = -EINVAL;
if (b && b->bd_inode && b->bd_op && b->bd_op->ioctl) {
if (b == NULL)
goto out;
if (b->bd_inode && b->bd_op && b->bd_op->ioctl)
err = b->bd_op->ioctl(b->bd_inode, NULL, command, arg);
}
out:
return err;
}
......@@ -164,12 +160,12 @@ int raw_ioctl(struct inode *inode,
*/
int raw_ctl_ioctl(struct inode *inode,
struct file *flip,
struct file *filp,
unsigned int command,
unsigned long arg)
{
struct raw_config_request rq;
int err = 0;
int err;
int minor;
switch (command) {
......@@ -178,26 +174,23 @@ int raw_ctl_ioctl(struct inode *inode,
/* First, find out which raw minor we want */
if (copy_from_user(&rq, (void *) arg, sizeof(rq))) {
err = -EFAULT;
break;
}
if (copy_from_user(&rq, (void *) arg, sizeof(rq)))
goto out;
minor = rq.raw_minor;
if (minor <= 0 || minor > MINORMASK) {
err = -EINVAL;
break;
}
if (minor <= 0 || minor > MINORMASK)
goto out;
if (command == RAW_SETBIND) {
/*
* This is like making block devices, so demand the
* same capability
*/
if (!capable(CAP_SYS_ADMIN)) {
err = -EPERM;
break;
}
if (!capable(CAP_SYS_ADMIN))
goto out;
/*
* For now, we don't need to check that the underlying
......@@ -206,24 +199,23 @@ int raw_ctl_ioctl(struct inode *inode,
* major/minor numbers make sense.
*/
if ((rq.block_major == 0 &&
rq.block_minor != 0) ||
rq.block_major > MAX_BLKDEV ||
rq.block_minor > MINORMASK) {
err = -EINVAL;
break;
}
if ((rq.block_major == 0 && rq.block_minor != 0) ||
rq.block_major > MAX_BLKDEV ||
rq.block_minor > MINORMASK)
goto out;
down(&raw_devices[minor].mutex);
err = -EBUSY;
if (raw_devices[minor].inuse) {
up(&raw_devices[minor].mutex);
err = -EBUSY;
break;
goto out;
}
if (raw_devices[minor].binding)
bdput(raw_devices[minor].binding);
raw_devices[minor].binding =
bdget(kdev_t_to_nr(mk_kdev(rq.block_major, rq.block_minor)));
bdget(kdev_t_to_nr(mk_kdev(rq.block_major,
rq.block_minor)));
up(&raw_devices[minor].mutex);
} else {
struct block_device *bdev;
......@@ -237,16 +229,18 @@ int raw_ctl_ioctl(struct inode *inode,
} else {
rq.block_major = rq.block_minor = 0;
}
err = copy_to_user((void *) arg, &rq, sizeof(rq));
if (err)
err = -EFAULT;
if (copy_to_user((void *) arg, &rq, sizeof(rq)))
goto out;
}
err = 0;
break;
default:
err = -EINVAL;
break;
}
out:
return err;
}
......@@ -257,7 +251,7 @@ ssize_t raw_read(struct file *filp, char * buf, size_t size, loff_t *offp)
ssize_t raw_write(struct file *filp, const char *buf, size_t size, loff_t *offp)
{
return rw_raw_dev(WRITE, filp, (char *) buf, size, offp);
return rw_raw_dev(WRITE, filp, (char *)buf, size, offp);
}
ssize_t
......
......@@ -24,14 +24,14 @@
#include <asm/uaccess.h>
static unsigned long max_block(struct block_device *bdev)
static sector_t max_block(struct block_device *bdev)
{
unsigned int retval = ~0U;
sector_t retval = ~0U;
loff_t sz = bdev->bd_inode->i_size;
if (sz) {
unsigned int size = block_size(bdev);
unsigned int sizebits = blksize_bits(size);
sector_t size = block_size(bdev);
unsigned sizebits = blksize_bits(size);
retval = (sz >> sizebits);
}
return retval;
......@@ -88,7 +88,9 @@ int sb_min_blocksize(struct super_block *sb, int size)
return sb_set_blocksize(sb, size);
}
static int blkdev_get_block(struct inode * inode, sector_t iblock, struct buffer_head * bh, int create)
static int
blkdev_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{
if (iblock >= max_block(inode->i_bdev))
return -EIO;
......@@ -99,12 +101,26 @@ static int blkdev_get_block(struct inode * inode, sector_t iblock, struct buffer
return 0;
}
static int
blkdev_get_blocks(struct inode *inode, sector_t iblock,
unsigned long max_blocks, struct buffer_head *bh, int create)
{
if ((iblock + max_blocks) >= max_block(inode->i_bdev))
return -EIO;
bh->b_bdev = inode->i_bdev;
bh->b_blocknr = iblock;
bh->b_size = max_blocks << inode->i_blkbits;
set_buffer_mapped(bh);
return 0;
}
static int
blkdev_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count)
{
return generic_direct_IO(rw, inode, buf, offset,
count, blkdev_get_block);
count, blkdev_get_blocks);
}
static int blkdev_writepage(struct page * page)
......
This diff is collapsed.
......@@ -606,11 +606,24 @@ static int ext2_bmap(struct address_space *mapping, long block)
return generic_block_bmap(mapping,block,ext2_get_block);
}
static int
ext2_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
struct buffer_head *bh_result, int create)
{
int ret;
ret = ext2_get_block(inode, iblock, bh_result, create);
if (ret == 0)
bh_result->b_size = (1 << inode->i_blkbits);
return ret;
}
static int
ext2_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count)
{
return generic_direct_IO(rw, inode, buf, offset, count, ext2_get_block);
return generic_direct_IO(rw, inode, buf,
offset, count, ext2_get_blocks);
}
static int
......
......@@ -293,10 +293,23 @@ static int jfs_bmap(struct address_space *mapping, long block)
return generic_block_bmap(mapping, block, jfs_get_block);
}
static int
jfs_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
struct buffer_head *bh_result, int create)
{
int ret;
ret = jfs_get_block(inode, iblock, bh_result, create);
if (ret == 0)
bh_result->b_size = (1 << inode->i_blkbits);
return ret;
}
static int jfs_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count)
{
return generic_direct_IO(rw, inode, buf, offset, count, jfs_get_block);
return generic_direct_IO(rw, inode, buf,
offset, count, jfs_get_blocks);
}
struct address_space_operations jfs_aops = {
......
......@@ -211,7 +211,11 @@ extern void mnt_init(unsigned long);
extern void files_init(unsigned long);
struct buffer_head;
typedef int (get_block_t)(struct inode*,sector_t,struct buffer_head*,int);
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
typedef int (get_blocks_t)(struct inode *inode, sector_t iblock,
unsigned long max_blocks,
struct buffer_head *bh_result, int create);
#include <linux/pipe_fs_i.h>
/* #include <linux/umsdos_fs_i.h> */
......@@ -1238,7 +1242,7 @@ extern void do_generic_file_read(struct file *, loff_t *, read_descriptor_t *, r
ssize_t generic_file_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count);
int generic_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count, get_block_t *get_block);
loff_t offset, size_t count, get_blocks_t *get_blocks);
extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
......