Commit c3cb5e19 authored by Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm

* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (48 commits)
  dm mpath: change to be request based
  dm: disable interrupt when taking map_lock
  dm: do not set QUEUE_ORDERED_DRAIN if request based
  dm: enable request based option
  dm: prepare for request based option
  dm raid1: add userspace log
  dm: calculate queue limits during resume not load
  dm log: fix create_log_context to use logical_block_size of log device
  dm targets: introduce iterate devices fn
  dm table: establish queue limits by copying table limits
  dm table: replace struct io_restrictions with struct queue_limits
  dm table: validate device logical_block_size
  dm table: ensure targets are aligned to logical_block_size
  dm ioctl: support cookies for udev
  dm: sysfs add suspended attribute
  dm table: improve warning message when devices not freed before destruction
  dm mpath: add service time load balancer
  dm mpath: add queue length load balancer
  dm mpath: add start_io and nr_bytes to path selectors
  dm snapshot: use barrier when writing exception store
  ...
parents ea94b503 f40c67f0
Device-Mapper Logging
=====================
The device-mapper logging code is used by some of the device-mapper
RAID targets to track regions of the disk that are not consistent.
A region (or portion of the address space) of the disk may be
inconsistent because a RAID stripe is currently being operated on or
a machine died while the region was being altered. In the case of
mirrors, a region would be considered dirty/inconsistent while you
are writing to it because the writes need to be replicated for all
the legs of the mirror and may not reach the legs at the same time.
Once all writes are complete, the region is considered clean again.
There is a generic logging interface that the device-mapper RAID
implementations use to perform logging operations (see
dm_dirty_log_type in include/linux/dm-dirty-log.h).  Various logging
implementations are available, each providing different capabilities.
The list includes:
Type		Files
====		=====
disk		drivers/md/dm-log.c
core		drivers/md/dm-log.c
userspace	drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
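
To give a feel for the shape of this interface, here is a minimal,
hedged sketch of a logging module (the "demo" type name and the trivial
operation bodies are hypothetical, and only a few of the
dm_dirty_log_type operations are shown; a real implementation such as
drivers/md/dm-log.c must supply the full set declared in
include/linux/dm-dirty-log.h):

	#include <linux/module.h>
	#include <linux/dm-dirty-log.h>

	static int demo_ctr(struct dm_dirty_log *log, struct dm_target *ti,
			    unsigned argc, char **argv)
	{
		return 0;	/* no per-log state kept in this sketch */
	}

	static void demo_dtr(struct dm_dirty_log *log)
	{
	}

	static int demo_in_sync(struct dm_dirty_log *log, region_t region,
				int can_block)
	{
		return 1;	/* hypothetical: report every region in-sync */
	}

	static struct dm_dirty_log_type demo_log_type = {
		.name    = "demo",	/* hypothetical log type name */
		.module  = THIS_MODULE,
		.ctr     = demo_ctr,
		.dtr     = demo_dtr,
		.in_sync = demo_in_sync,
		/* mark_region, clear_region, flush etc. omitted */
	};

	static int __init demo_log_init(void)
	{
		return dm_dirty_log_type_register(&demo_log_type);
	}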
The "disk" log type
-------------------
This log implementation commits the log state to disk. This way, the
logging state survives reboots/crashes.
The "core" log type
-------------------
This log implementation keeps the log state in memory. The log state
will not survive a reboot or crash, but there may be a small boost in
performance. This method can also be used if no storage device is
available for storing log state.
The "userspace" log type
------------------------
This log type simply provides a way to export the log API to userspace,
so log implementations can be done there.  This is done by forwarding
most logging requests to userspace, where a daemon receives and
processes them.
The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h.  Due to the frequency,
diversity, and two-way nature of the exchanges between kernel and
userspace, 'connector' is used as the interface for communication.
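
As a hedged illustration of one such exchange (the helper below is
hypothetical; the real kernel-side callers live in
drivers/md/dm-log-userspace-base.c, and dm_consult_userspace() itself
appears in full later in this commit), a flush request that carries no
payload and expects no reply data reduces to:

	#include <linux/dm-log-userspace.h>
	#include "dm-log-userspace-transfer.h"

	static int demo_flush(const char *uuid)
	{
		/* NULL/0: nothing to send; NULL/NULL: no reply data wanted. */
		return dm_consult_userspace(uuid, DM_ULOG_FLUSH,
					    NULL, 0, NULL, NULL);
	}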
There are currently two userspace log implementations that leverage this
framework - "clustered_disk" and "clustered_core".  These implementations
provide a cluster-coherent log for shared storage.  Device-mapper mirroring
can be used in a shared-storage environment when the cluster log
implementations are employed.
dm-queue-length
===============
dm-queue-length is a path selector module for device-mapper targets,
which selects a path with the least number of in-flight I/Os.
The path selector name is 'queue-length'.
Table parameters for each path: [<repeat_count>]
	<repeat_count>: The number of I/Os to dispatch using the selected
			path before switching to the next path.
			If not given, internal default is used.  To check
			the default value, see the activated table.
Status for each path: <status> <fail-count> <in-flight>
	<status>: 'A' if the path is active, 'F' if the path is failed.
	<fail-count>: The number of path failures.
	<in-flight>: The number of in-flight I/Os on the path.
Algorithm
=========
dm-queue-length increments/decrements 'in-flight' when an I/O is
dispatched/completed respectively.
dm-queue-length selects a path with the minimum 'in-flight'.
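
In code, the selection reduces to a minimum scan over the valid paths.
The following is an abridged form of ql_select_path() from the
dm-queue-length.c source added by this patch set:

	list_for_each_entry(pi, &s->valid_paths, list) {
		if (!best ||
		    (atomic_read(&pi->qlen) < atomic_read(&best->qlen)))
			best = pi;

		/* An idle path cannot be beaten; stop scanning early. */
		if (!atomic_read(&best->qlen))
			break;
	}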
Examples
========
Suppose that two paths (sda and sdb) are used, with repeat_count == 128.
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
dm-service-time
===============
dm-service-time is a path selector module for device-mapper targets,
which selects a path with the shortest estimated service time for
the incoming I/O.
The service time for each path is estimated by dividing the total size
of in-flight I/Os on a path by the performance value of the path.
The performance value is a relative throughput value among all paths
in a path-group, and it can be specified as a table argument.
The path selector name is 'service-time'.
Table parameters for each path: [<repeat_count> [<relative_throughput>]]
	<repeat_count>: The number of I/Os to dispatch using the selected
			path before switching to the next path.
			If not given, internal default is used.  To check
			the default value, see the activated table.
	<relative_throughput>: The relative throughput value of the path
			among all paths in the path-group.
			The valid range is 0-100.
			If not given, minimum value '1' is used.
			If '0' is given, the path isn't selected while
			other paths having a positive value are available.
Status for each path: <status> <fail-count> <in-flight-size> \
		      <relative_throughput>
	<status>: 'A' if the path is active, 'F' if the path is failed.
	<fail-count>: The number of path failures.
	<in-flight-size>: The size of in-flight I/Os on the path.
	<relative_throughput>: The relative throughput value of the path
			among all paths in the path-group.
Algorithm
=========
dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
dispatched and subtracts it when the I/O completes.
Basically, dm-service-time selects the path having the minimum service
time, which is calculated by:

	('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'

However, the optimizations below are used to reduce the calculation
as much as possible.
   1. If the paths have the same 'relative_throughput', skip
      the division and just compare the 'in-flight-size'.

   2. If the paths have the same 'in-flight-size', skip the division
      and just compare the 'relative_throughput'.

   3. If some paths have non-zero 'relative_throughput' and others
      have zero 'relative_throughput', ignore those paths with zero
      'relative_throughput'.
If none of the above optimizations can be applied, the service times
are calculated and compared.
If the calculated service times are equal, the path having the maximum
'relative_throughput' may be better, so 'relative_throughput' is then
compared as a tie-breaker.
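
A hypothetical worked example: with 64KiB in flight on a path of
'relative_throughput' 1, 256KiB in flight on a path of
'relative_throughput' 4, and an incoming 8KiB I/O, the estimates are
(64 + 8) / 1 = 72 versus (256 + 8) / 4 = 66, so the second path is
selected even though it is carrying more outstanding bytes.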
Examples
========
Suppose that two paths (sda and sdb) are used, with repeat_count == 128,
where sda has an average throughput of 1GB/s and sdb of 4GB/s.
A suitable 'relative_throughput' value would be '1' for sda and '4' for sdb.
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
Equivalently, '2' for sda and '8' for sdb would also work, since only the
ratio between the values matters.
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
@@ -231,6 +231,17 @@ config DM_MIRROR
 	  Allow volume managers to mirror logical volumes, also
 	  needed for live data migration tools such as 'pvmove'.

+config DM_LOG_USERSPACE
+	tristate "Mirror userspace logging (EXPERIMENTAL)"
+	depends on DM_MIRROR && EXPERIMENTAL && NET
+	select CONNECTOR
+	---help---
+	  The userspace logging module provides a mechanism for
+	  relaying the dm-dirty-log API to userspace. Log designs
+	  which are more suited to userspace implementation (e.g.
+	  shared storage logs) or experimental logs can be implemented
+	  by leveraging this framework.
+
 config DM_ZERO
 	tristate "Zero target"
 	depends on BLK_DEV_DM
@@ -249,6 +260,25 @@ config DM_MULTIPATH
 	---help---
 	  Allow volume managers to support multipath hardware.

+config DM_MULTIPATH_QL
+	tristate "I/O Path Selector based on the number of in-flight I/Os"
+	depends on DM_MULTIPATH
+	---help---
+	  This path selector is a dynamic load balancer which selects
+	  the path with the least number of in-flight I/Os.
+
+	  If unsure, say N.
+
+config DM_MULTIPATH_ST
+	tristate "I/O Path Selector based on the service time"
+	depends on DM_MULTIPATH
+	---help---
+	  This path selector is a dynamic load balancer which selects
+	  the path expected to complete the incoming I/O in the shortest
+	  time.
+
+	  If unsure, say N.
+
 config DM_DELAY
 	tristate "I/O delaying target (EXPERIMENTAL)"
 	depends on BLK_DEV_DM && EXPERIMENTAL
...
@@ -8,6 +8,8 @@ dm-multipath-y += dm-path-selector.o dm-mpath.o
 dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
 		    dm-snap-persistent.o
 dm-mirror-y	+= dm-raid1.o
+dm-log-userspace-y \
+		+= dm-log-userspace-base.o dm-log-userspace-transfer.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o
 raid6_pq-y	+= raid6algos.o raid6recov.o raid6tables.o \
@@ -36,8 +38,11 @@ obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
 obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
+obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
+obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
 obj-$(CONFIG_DM_SNAPSHOT)	+= dm-snapshot.o
 obj-$(CONFIG_DM_MIRROR)	+= dm-mirror.o dm-log.o dm-region-hash.o
+obj-$(CONFIG_DM_LOG_USERSPACE)	+= dm-log-userspace.o
 obj-$(CONFIG_DM_ZERO)		+= dm-zero.o

 quiet_cmd_unroll = UNROLL $@
...
@@ -1132,6 +1132,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad_crypt_queue;
 	}

+	ti->num_flush_requests = 1;
 	ti->private = cc;
 	return 0;

@@ -1189,6 +1190,13 @@ static int crypt_map(struct dm_target *ti, struct bio *bio,
 		     union map_info *map_context)
 {
 	struct dm_crypt_io *io;
+	struct crypt_config *cc;
+
+	if (unlikely(bio_empty_barrier(bio))) {
+		cc = ti->private;
+		bio->bi_bdev = cc->dev->bdev;
+		return DM_MAPIO_REMAPPED;
+	}

 	io = crypt_io_alloc(ti, bio, bio->bi_sector - ti->begin);

@@ -1305,9 +1313,17 @@ static int crypt_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
 	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
 }

+static int crypt_iterate_devices(struct dm_target *ti,
+				 iterate_devices_callout_fn fn, void *data)
+{
+	struct crypt_config *cc = ti->private;
+
+	return fn(ti, cc->dev, cc->start, data);
+}
+
 static struct target_type crypt_target = {
 	.name   = "crypt",
-	.version = {1, 6, 0},
+	.version = {1, 7, 0},
 	.module = THIS_MODULE,
 	.ctr    = crypt_ctr,
 	.dtr    = crypt_dtr,
@@ -1318,6 +1334,7 @@ static struct target_type crypt_target = {
 	.resume = crypt_resume,
 	.message = crypt_message,
 	.merge  = crypt_merge,
+	.iterate_devices = crypt_iterate_devices,
 };

 static int __init dm_crypt_init(void)
...
@@ -197,6 +197,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	mutex_init(&dc->timer_lock);
 	atomic_set(&dc->may_delay, 1);

+	ti->num_flush_requests = 1;
 	ti->private = dc;
 	return 0;

@@ -278,6 +279,7 @@ static int delay_map(struct dm_target *ti, struct bio *bio,
 	if ((bio_data_dir(bio) == WRITE) && (dc->dev_write)) {
 		bio->bi_bdev = dc->dev_write->bdev;
-		bio->bi_sector = dc->start_write +
-			(bio->bi_sector - ti->begin);
+		if (bio_sectors(bio))
+			bio->bi_sector = dc->start_write +
+				(bio->bi_sector - ti->begin);

@@ -316,9 +318,26 @@ static int delay_status(struct dm_target *ti, status_type_t type,
 	return 0;
 }

+static int delay_iterate_devices(struct dm_target *ti,
+				 iterate_devices_callout_fn fn, void *data)
+{
+	struct delay_c *dc = ti->private;
+	int ret = 0;
+
+	ret = fn(ti, dc->dev_read, dc->start_read, data);
+	if (ret)
+		goto out;
+
+	if (dc->dev_write)
+		ret = fn(ti, dc->dev_write, dc->start_write, data);
+
+out:
+	return ret;
+}
+
 static struct target_type delay_target = {
 	.name	     = "delay",
-	.version     = {1, 0, 2},
+	.version     = {1, 1, 0},
 	.module      = THIS_MODULE,
 	.ctr	     = delay_ctr,
 	.dtr	     = delay_dtr,
@@ -326,6 +345,7 @@ static struct target_type delay_target = {
 	.presuspend  = delay_presuspend,
 	.resume	     = delay_resume,
 	.status	     = delay_status,
+	.iterate_devices = delay_iterate_devices,
 };

 static int __init dm_delay_init(void)
...
@@ -216,7 +216,7 @@ int dm_exception_store_create(struct dm_target *ti, int argc, char **argv,
 		return -EINVAL;
 	}

-	type = get_type(argv[1]);
+	type = get_type(&persistent);
 	if (!type) {
 		ti->error = "Exception store type not recognised";
 		r = -EINVAL;
...
@@ -156,7 +156,7 @@ static inline void dm_consecutive_chunk_count_inc(struct dm_snap_exception *e)
  */
 static inline sector_t get_dev_size(struct block_device *bdev)
 {
-	return bdev->bd_inode->i_size >> SECTOR_SHIFT;
+	return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
 }

 static inline chunk_t sector_to_chunk(struct dm_exception_store *store,
...
@@ -22,6 +22,7 @@ struct dm_io_client {
 /* FIXME: can we shrink this ? */
 struct io {
 	unsigned long error_bits;
+	unsigned long eopnotsupp_bits;
 	atomic_t count;
 	struct task_struct *sleeper;
 	struct dm_io_client *client;
@@ -107,8 +108,11 @@ static inline unsigned bio_get_region(struct bio *bio)
  *---------------------------------------------------------------*/
 static void dec_count(struct io *io, unsigned int region, int error)
 {
-	if (error)
+	if (error) {
 		set_bit(region, &io->error_bits);
+		if (error == -EOPNOTSUPP)
+			set_bit(region, &io->eopnotsupp_bits);
+	}

 	if (atomic_dec_and_test(&io->count)) {
 		if (io->sleeper)
@@ -360,7 +364,9 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
 		return -EIO;
 	}

+retry:
 	io.error_bits = 0;
+	io.eopnotsupp_bits = 0;
 	atomic_set(&io.count, 1); /* see dispatch_io() */
 	io.sleeper = current;
 	io.client = client;
@@ -377,6 +383,11 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
 	}
 	set_current_state(TASK_RUNNING);

+	if (io.eopnotsupp_bits && (rw & (1 << BIO_RW_BARRIER))) {
+		rw &= ~(1 << BIO_RW_BARRIER);
+		goto retry;
+	}
+
 	if (error_bits)
 		*error_bits = io.error_bits;

@@ -397,6 +408,7 @@ static int async_io(struct dm_io_client *client, unsigned int num_regions,
 	io = mempool_alloc(client->pool, GFP_NOIO);
 	io->error_bits = 0;
+	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = NULL;
 	io->client = client;
...
@@ -276,7 +276,7 @@ static void dm_hash_remove_all(int keep_open_devices)
 	up_write(&_hash_lock);
 }

-static int dm_hash_rename(const char *old, const char *new)
+static int dm_hash_rename(uint32_t cookie, const char *old, const char *new)
 {
 	char *new_name, *old_name;
 	struct hash_cell *hc;
@@ -333,7 +333,7 @@ static int dm_hash_rename(const char *old, const char *new)
 		dm_table_put(table);
 	}

-	dm_kobject_uevent(hc->md);
+	dm_kobject_uevent(hc->md, KOBJ_CHANGE, cookie);

 	dm_put(hc->md);
 	up_write(&_hash_lock);
@@ -680,6 +680,9 @@ static int dev_remove(struct dm_ioctl *param, size_t param_size)
 	__hash_remove(hc);
 	up_write(&_hash_lock);
+
+	dm_kobject_uevent(md, KOBJ_REMOVE, param->event_nr);
+
 	dm_put(md);
 	param->data_size = 0;
 	return 0;
@@ -715,7 +718,7 @@ static int dev_rename(struct dm_ioctl *param, size_t param_size)
 		return r;

 	param->data_size = 0;
-	return dm_hash_rename(param->name, new_name);
+	return dm_hash_rename(param->event_nr, param->name, new_name);
 }

 static int dev_set_geometry(struct dm_ioctl *param, size_t param_size)
@@ -842,8 +845,11 @@ static int do_resume(struct dm_ioctl *param)
 	if (dm_suspended(md))
 		r = dm_resume(md);

-	if (!r)
+	if (!r) {
+		dm_kobject_uevent(md, KOBJ_CHANGE, param->event_nr);
 		r = __dev_status(md, param);
+	}

 	dm_put(md);
 	return r;
@@ -1044,6 +1050,12 @@ static int populate_table(struct dm_table *table,
 		next = spec->next;
 	}

+	r = dm_table_set_type(table);
+	if (r) {
+		DMWARN("unable to set table type");
+		return r;
+	}
+
 	return dm_table_complete(table);
 }

@@ -1089,6 +1101,13 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
 		goto out;
 	}

+	r = dm_table_alloc_md_mempools(t);
+	if (r) {
+		DMWARN("unable to allocate mempools for this table");
+		dm_table_destroy(t);
+		goto out;
+	}
+
 	down_write(&_hash_lock);
 	hc = dm_get_mdptr(md);
 	if (!hc || hc->md != md) {
...
@@ -53,6 +53,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}

+	ti->num_flush_requests = 1;
 	ti->private = lc;
 	return 0;

@@ -81,6 +82,7 @@ static void linear_map_bio(struct dm_target *ti, struct bio *bio)
 	struct linear_c *lc = ti->private;

 	bio->bi_bdev = lc->dev->bdev;
-	bio->bi_sector = linear_map_sector(ti, bio->bi_sector);
+	if (bio_sectors(bio))
+		bio->bi_sector = linear_map_sector(ti, bio->bi_sector);
 }

@@ -132,9 +134,17 @@ static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
 	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
 }

+static int linear_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct linear_c *lc = ti->private;
+
+	return fn(ti, lc->dev, lc->start, data);
+}
+
 static struct target_type linear_target = {
 	.name   = "linear",
-	.version = {1, 0, 3},
+	.version = {1, 1, 0},
 	.module = THIS_MODULE,
 	.ctr    = linear_ctr,
 	.dtr    = linear_dtr,
@@ -142,6 +152,7 @@ static struct target_type linear_target = {
 	.status = linear_status,
 	.ioctl  = linear_ioctl,
 	.merge  = linear_merge,
+	.iterate_devices = linear_iterate_devices,
 };

 int __init dm_linear_init(void)
...
This diff is collapsed.
/*
* Copyright (C) 2006-2009 Red Hat, Inc.
*
* This file is released under the LGPL.
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <net/sock.h>
#include <linux/workqueue.h>
#include <linux/connector.h>
#include <linux/device-mapper.h>
#include <linux/dm-log-userspace.h>
#include "dm-log-userspace-transfer.h"
static uint32_t dm_ulog_seq;
/*
* Netlink/Connector is an unreliable protocol. How long should
* we wait for a response before assuming it was lost and retrying?
* (If we do receive a response after this time, it will be discarded
* and the response to the resent request will be waited for.)
*/
#define DM_ULOG_RETRY_TIMEOUT (15 * HZ)
/*
* Pre-allocated space for speed
*/
#define DM_ULOG_PREALLOCED_SIZE 512
static struct cn_msg *prealloced_cn_msg;
static struct dm_ulog_request *prealloced_ulog_tfr;
static struct cb_id ulog_cn_id = {
.idx = CN_IDX_DM,
.val = CN_VAL_DM_USERSPACE_LOG
};
static DEFINE_MUTEX(dm_ulog_lock);
struct receiving_pkg {
struct list_head list;
struct completion complete;
uint32_t seq;
int error;
size_t *data_size;
char *data;
};
static DEFINE_SPINLOCK(receiving_list_lock);
static struct list_head receiving_list;
static int dm_ulog_sendto_server(struct dm_ulog_request *tfr)
{
int r;
struct cn_msg *msg = prealloced_cn_msg;
memset(msg, 0, sizeof(struct cn_msg));
msg->id.idx = ulog_cn_id.idx;
msg->id.val = ulog_cn_id.val;
msg->ack = 0;
msg->seq = tfr->seq;
msg->len = sizeof(struct dm_ulog_request) + tfr->data_size;
r = cn_netlink_send(msg, 0, gfp_any());
return r;
}
/*
* Parameters for this function can be either msg or tfr, but not
* both. This function fills in the reply for a waiting request.
* If just msg is given, then the reply is simply an ACK from userspace
* that the request was received.
*
* Returns: 0 on success, -ENOENT on failure
*/
static int fill_pkg(struct cn_msg *msg, struct dm_ulog_request *tfr)
{
uint32_t rtn_seq = (msg) ? msg->seq : (tfr) ? tfr->seq : 0;
struct receiving_pkg *pkg;
/*
* The 'receiving_pkg' entries in this list are statically
* allocated on the stack in 'dm_consult_userspace'.
* Each process that is waiting for a reply from the user
* space server will have an entry in this list.
*
* We are safe to do it this way because the stack space
* is unique to each process, but still addressable by
* other processes.
*/
list_for_each_entry(pkg, &receiving_list, list) {
if (rtn_seq != pkg->seq)
continue;
if (msg) {
pkg->error = -msg->ack;
/*
* If we are trying again, we will need to know our
* storage capacity. Otherwise, along with the
* error code, we make explicit that we have no data.
*/
if (pkg->error != -EAGAIN)
*(pkg->data_size) = 0;
} else if (tfr->data_size > *(pkg->data_size)) {
DMERR("Insufficient space to receive package [%u] "
"(%u vs %lu)", tfr->request_type,
tfr->data_size, *(pkg->data_size));
*(pkg->data_size) = 0;
pkg->error = -ENOSPC;
} else {
pkg->error = tfr->error;
memcpy(pkg->data, tfr->data, tfr->data_size);
*(pkg->data_size) = tfr->data_size;
}
complete(&pkg->complete);
return 0;
}
return -ENOENT;
}
/*
* This is the connector callback that delivers data
* that was sent from userspace.
*/
static void cn_ulog_callback(void *data)
{
struct cn_msg *msg = (struct cn_msg *)data;
struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
spin_lock(&receiving_list_lock);
if (msg->len == 0)
fill_pkg(msg, NULL);
else if (msg->len < sizeof(*tfr))
DMERR("Incomplete message received (expected %u, got %u): [%u]",
(unsigned)sizeof(*tfr), msg->len, msg->seq);
else
fill_pkg(NULL, tfr);
spin_unlock(&receiving_list_lock);
}
/**
* dm_consult_userspace
* @uuid: log's uuid (must be DM_UUID_LEN in size)
* @request_type: found in include/linux/dm-log-userspace.h
* @data: data to tx to the server
* @data_size: size of data in bytes
* @rdata: place to put return data from server
* @rdata_size: value-result (amount of space given/amount of space used)
*
* rdata_size is undefined on failure.
*
* Memory used to communicate with userspace is zero'ed
* before populating to ensure that no unwanted bits leak
* from kernel space to user-space. All userspace log communications
* between kernel and user space go through this function.
*
* Returns: 0 on success, -EXXX on failure
**/
int dm_consult_userspace(const char *uuid, int request_type,
char *data, size_t data_size,
char *rdata, size_t *rdata_size)
{
int r = 0;
size_t dummy = 0;
	int overhead_size =
		sizeof(struct cn_msg) + sizeof(struct dm_ulog_request);
struct dm_ulog_request *tfr = prealloced_ulog_tfr;
struct receiving_pkg pkg;
if (data_size > (DM_ULOG_PREALLOCED_SIZE - overhead_size)) {
DMINFO("Size of tfr exceeds preallocated size");
return -EINVAL;
}
if (!rdata_size)
rdata_size = &dummy;
resend:
/*
* We serialize the sending of requests so we can
* use the preallocated space.
*/
mutex_lock(&dm_ulog_lock);
memset(tfr, 0, DM_ULOG_PREALLOCED_SIZE - overhead_size);
memcpy(tfr->uuid, uuid, DM_UUID_LEN);
tfr->seq = dm_ulog_seq++;
/*
* Must be valid request type (all other bits set to
* zero). This reserves other bits for possible future
* use.
*/
tfr->request_type = request_type & DM_ULOG_REQUEST_MASK;
tfr->data_size = data_size;
if (data && data_size)
memcpy(tfr->data, data, data_size);
memset(&pkg, 0, sizeof(pkg));
init_completion(&pkg.complete);
pkg.seq = tfr->seq;
pkg.data_size = rdata_size;
pkg.data = rdata;
spin_lock(&receiving_list_lock);
list_add(&(pkg.list), &receiving_list);
spin_unlock(&receiving_list_lock);
r = dm_ulog_sendto_server(tfr);
mutex_unlock(&dm_ulog_lock);
if (r) {
DMERR("Unable to send log request [%u] to userspace: %d",
request_type, r);
spin_lock(&receiving_list_lock);
list_del_init(&(pkg.list));
spin_unlock(&receiving_list_lock);
goto out;
}
r = wait_for_completion_timeout(&(pkg.complete), DM_ULOG_RETRY_TIMEOUT);
spin_lock(&receiving_list_lock);
list_del_init(&(pkg.list));
spin_unlock(&receiving_list_lock);
if (!r) {
DMWARN("[%s] Request timed out: [%u/%u] - retrying",
(strlen(uuid) > 8) ?
(uuid + (strlen(uuid) - 8)) : (uuid),
request_type, pkg.seq);
goto resend;
}
r = pkg.error;
if (r == -EAGAIN)
goto resend;
out:
return r;
}
int dm_ulog_tfr_init(void)
{
int r;
void *prealloced;
INIT_LIST_HEAD(&receiving_list);
prealloced = kmalloc(DM_ULOG_PREALLOCED_SIZE, GFP_KERNEL);
if (!prealloced)
return -ENOMEM;
prealloced_cn_msg = prealloced;
prealloced_ulog_tfr = prealloced + sizeof(struct cn_msg);
r = cn_add_callback(&ulog_cn_id, "dmlogusr", cn_ulog_callback);
	if (r) {
		/* Registration failed: free the preallocated buffer. */
		kfree(prealloced);
		return r;
	}
return 0;
}
void dm_ulog_tfr_exit(void)
{
cn_del_callback(&ulog_cn_id);
kfree(prealloced_cn_msg);
}
/*
* Copyright (C) 2006-2009 Red Hat, Inc.
*
* This file is released under the LGPL.
*/
#ifndef __DM_LOG_USERSPACE_TRANSFER_H__
#define __DM_LOG_USERSPACE_TRANSFER_H__
#define DM_MSG_PREFIX "dm-log-userspace"
int dm_ulog_tfr_init(void);
void dm_ulog_tfr_exit(void);
int dm_consult_userspace(const char *uuid, int request_type,
char *data, size_t data_size,
char *rdata, size_t *rdata_size);
#endif /* __DM_LOG_USERSPACE_TRANSFER_H__ */
@@ -412,11 +412,12 @@ static int create_log_context(struct dm_dirty_log *log, struct dm_target *ti,
 	/*
 	 * Buffer holds both header and bitset.
 	 */
-	buf_size = dm_round_up((LOG_OFFSET << SECTOR_SHIFT) +
-			       bitset_size,
-			       ti->limits.logical_block_size);
+	buf_size =
+	    dm_round_up((LOG_OFFSET << SECTOR_SHIFT) + bitset_size,
+			bdev_logical_block_size(lc->header_location.bdev));

-	if (buf_size > dev->bdev->bd_inode->i_size) {
+	if (buf_size > i_size_read(dev->bdev->bd_inode)) {
 		DMWARN("log device %s too small: need %llu bytes",
 		       dev->name, (unsigned long long)buf_size);
 		kfree(lc);
...
This diff is collapsed.
@@ -56,7 +56,8 @@ struct path_selector_type {
 	 * the path fails.
 	 */
 	struct dm_path *(*select_path) (struct path_selector *ps,
-					unsigned *repeat_count);
+					unsigned *repeat_count,
+					size_t nr_bytes);

 	/*
 	 * Notify the selector that a path has failed.
@@ -75,7 +76,10 @@ struct path_selector_type {
 	int (*status) (struct path_selector *ps, struct dm_path *path,
 		       status_type_t type, char *result, unsigned int maxlen);

-	int (*end_io) (struct path_selector *ps, struct dm_path *path);
+	int (*start_io) (struct path_selector *ps, struct dm_path *path,
+			 size_t nr_bytes);
+	int (*end_io) (struct path_selector *ps, struct dm_path *path,
+		       size_t nr_bytes);
 };

 /* Register a path selector */
...
/*
* Copyright (C) 2004-2005 IBM Corp. All Rights Reserved.
* Copyright (C) 2006-2009 NEC Corporation.
*
* dm-queue-length.c
*
* Module Author: Stefan Bader, IBM
* Modified by: Kiyoshi Ueda, NEC
*
* This file is released under the GPL.
*
* queue-length path selector - choose a path with the least number of
* in-flight I/Os.
*/
#include "dm.h"
#include "dm-path-selector.h"
#include <linux/slab.h>
#include <linux/ctype.h>
#include <linux/errno.h>
#include <linux/module.h>
#include <asm/atomic.h>
#define DM_MSG_PREFIX "multipath queue-length"
#define QL_MIN_IO 128
#define QL_VERSION "0.1.0"
struct selector {
struct list_head valid_paths;
struct list_head failed_paths;
};
struct path_info {
struct list_head list;
struct dm_path *path;
unsigned repeat_count;
atomic_t qlen; /* the number of in-flight I/Os */
};
static struct selector *alloc_selector(void)
{
struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL);
if (s) {
INIT_LIST_HEAD(&s->valid_paths);
INIT_LIST_HEAD(&s->failed_paths);
}
return s;
}
static int ql_create(struct path_selector *ps, unsigned argc, char **argv)
{
struct selector *s = alloc_selector();
if (!s)
return -ENOMEM;
ps->context = s;
return 0;
}
static void ql_free_paths(struct list_head *paths)
{
struct path_info *pi, *next;
list_for_each_entry_safe(pi, next, paths, list) {
list_del(&pi->list);
kfree(pi);
}
}
static void ql_destroy(struct path_selector *ps)
{
struct selector *s = ps->context;
ql_free_paths(&s->valid_paths);
ql_free_paths(&s->failed_paths);
kfree(s);
ps->context = NULL;
}
static int ql_status(struct path_selector *ps, struct dm_path *path,
status_type_t type, char *result, unsigned maxlen)
{
unsigned sz = 0;
struct path_info *pi;
/* When called with NULL path, return selector status/args. */
if (!path)
DMEMIT("0 ");
else {
pi = path->pscontext;
switch (type) {
case STATUSTYPE_INFO:
DMEMIT("%d ", atomic_read(&pi->qlen));
break;
case STATUSTYPE_TABLE:
DMEMIT("%u ", pi->repeat_count);
break;
}
}
return sz;
}
static int ql_add_path(struct path_selector *ps, struct dm_path *path,
int argc, char **argv, char **error)
{
struct selector *s = ps->context;
struct path_info *pi;
unsigned repeat_count = QL_MIN_IO;
/*
* Arguments: [<repeat_count>]
* <repeat_count>: The number of I/Os before switching path.
* If not given, default (QL_MIN_IO) is used.
*/
if (argc > 1) {
*error = "queue-length ps: incorrect number of arguments";
return -EINVAL;
}
if ((argc == 1) && (sscanf(argv[0], "%u", &repeat_count) != 1)) {
*error = "queue-length ps: invalid repeat count";
return -EINVAL;
}
/* Allocate the path information structure */
pi = kmalloc(sizeof(*pi), GFP_KERNEL);
if (!pi) {
*error = "queue-length ps: Error allocating path information";
return -ENOMEM;
}
pi->path = path;
pi->repeat_count = repeat_count;
atomic_set(&pi->qlen, 0);
path->pscontext = pi;
list_add_tail(&pi->list, &s->valid_paths);
return 0;
}
static void ql_fail_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move(&pi->list, &s->failed_paths);
}
static int ql_reinstate_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move_tail(&pi->list, &s->valid_paths);
return 0;
}
/*
* Select a path having the minimum number of in-flight I/Os
*/
static struct dm_path *ql_select_path(struct path_selector *ps,
unsigned *repeat_count, size_t nr_bytes)
{
struct selector *s = ps->context;
struct path_info *pi = NULL, *best = NULL;
if (list_empty(&s->valid_paths))
return NULL;
/* Change preferred (first in list) path to evenly balance. */
list_move_tail(s->valid_paths.next, &s->valid_paths);
list_for_each_entry(pi, &s->valid_paths, list) {
if (!best ||
(atomic_read(&pi->qlen) < atomic_read(&best->qlen)))
best = pi;
if (!atomic_read(&best->qlen))
break;
}
if (!best)
return NULL;
*repeat_count = best->repeat_count;
return best->path;
}
static int ql_start_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_inc(&pi->qlen);
return 0;
}
static int ql_end_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_dec(&pi->qlen);
return 0;
}
static struct path_selector_type ql_ps = {
.name = "queue-length",
.module = THIS_MODULE,
.table_args = 1,
.info_args = 1,
.create = ql_create,
.destroy = ql_destroy,
.status = ql_status,
.add_path = ql_add_path,
.fail_path = ql_fail_path,
.reinstate_path = ql_reinstate_path,
.select_path = ql_select_path,
.start_io = ql_start_io,
.end_io = ql_end_io,
};
static int __init dm_ql_init(void)
{
int r = dm_register_path_selector(&ql_ps);
if (r < 0)
DMERR("register failed %d", r);
DMINFO("version " QL_VERSION " loaded");
return r;
}
static void __exit dm_ql_exit(void)
{
int r = dm_unregister_path_selector(&ql_ps);
if (r < 0)
DMERR("unregister failed %d", r);
}
module_init(dm_ql_init);
module_exit(dm_ql_exit);
MODULE_AUTHOR("Stefan Bader <Stefan.Bader at de.ibm.com>");
MODULE_DESCRIPTION(
"(C) Copyright IBM Corp. 2004,2005 All Rights Reserved.\n"
DM_NAME " path selector to balance the number of in-flight I/Os"
);
MODULE_LICENSE("GPL");
@@ -1283,9 +1283,23 @@ static int mirror_status(struct dm_target *ti, status_type_t type,
 	return 0;
 }

+static int mirror_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct mirror_set *ms = ti->private;
+	int ret = 0;
+	unsigned i;
+
+	for (i = 0; !ret && i < ms->nr_mirrors; i++)
+		ret = fn(ti, ms->mirror[i].dev,
+			 ms->mirror[i].offset, data);
+
+	return ret;
+}
+
 static struct target_type mirror_target = {
 	.name	 = "mirror",
-	.version = {1, 0, 20},
+	.version = {1, 12, 0},
 	.module	 = THIS_MODULE,
 	.ctr	 = mirror_ctr,
 	.dtr	 = mirror_dtr,
@@ -1295,6 +1309,7 @@ static struct target_type mirror_target = {
 	.postsuspend = mirror_postsuspend,
 	.resume	 = mirror_resume,
 	.status	 = mirror_status,
+	.iterate_devices = mirror_iterate_devices,
 };

 static int __init dm_mirror_init(void)
...
@@ -283,7 +283,7 @@ static struct dm_region *__rh_alloc(struct dm_region_hash *rh, region_t region)
 	nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
 	if (unlikely(!nreg))
-		nreg = kmalloc(sizeof(*nreg), GFP_NOIO);
+		nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);

 	nreg->state = rh->log->type->in_sync(rh->log, region, 1) ?
 		      DM_RH_CLEAN : DM_RH_NOSYNC;
...
@@ -161,7 +161,7 @@ static int rr_reinstate_path(struct path_selector *ps, struct dm_path *p)
 }

 static struct dm_path *rr_select_path(struct path_selector *ps,
-				      unsigned *repeat_count)
+				      unsigned *repeat_count, size_t nr_bytes)
 {
 	struct selector *s = (struct selector *) ps->context;
 	struct path_info *pi = NULL;
...
/*
* Copyright (C) 2007-2009 NEC Corporation. All Rights Reserved.
*
* Module Author: Kiyoshi Ueda
*
* This file is released under the GPL.
*
* Throughput oriented path selector.
*/
#include "dm.h"
#include "dm-path-selector.h"
#define DM_MSG_PREFIX "multipath service-time"
#define ST_MIN_IO 1
#define ST_MAX_RELATIVE_THROUGHPUT 100
#define ST_MAX_RELATIVE_THROUGHPUT_SHIFT 7
#define ST_MAX_INFLIGHT_SIZE ((size_t)-1 >> ST_MAX_RELATIVE_THROUGHPUT_SHIFT)
#define ST_VERSION "0.2.0"
struct selector {
struct list_head valid_paths;
struct list_head failed_paths;
};
struct path_info {
struct list_head list;
struct dm_path *path;
unsigned repeat_count;
unsigned relative_throughput;
atomic_t in_flight_size; /* Total size of in-flight I/Os */
};
static struct selector *alloc_selector(void)
{
struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL);
if (s) {
INIT_LIST_HEAD(&s->valid_paths);
INIT_LIST_HEAD(&s->failed_paths);
}
return s;
}
static int st_create(struct path_selector *ps, unsigned argc, char **argv)
{
struct selector *s = alloc_selector();
if (!s)
return -ENOMEM;
ps->context = s;
return 0;
}
static void free_paths(struct list_head *paths)
{
struct path_info *pi, *next;
list_for_each_entry_safe(pi, next, paths, list) {
list_del(&pi->list);
kfree(pi);
}
}
static void st_destroy(struct path_selector *ps)
{
struct selector *s = ps->context;
free_paths(&s->valid_paths);
free_paths(&s->failed_paths);
kfree(s);
ps->context = NULL;
}
static int st_status(struct path_selector *ps, struct dm_path *path,
status_type_t type, char *result, unsigned maxlen)
{
unsigned sz = 0;
struct path_info *pi;
if (!path)
DMEMIT("0 ");
else {
pi = path->pscontext;
switch (type) {
case STATUSTYPE_INFO:
DMEMIT("%d %u ", atomic_read(&pi->in_flight_size),
pi->relative_throughput);
break;
case STATUSTYPE_TABLE:
DMEMIT("%u %u ", pi->repeat_count,
pi->relative_throughput);
break;
}
}
return sz;
}
static int st_add_path(struct path_selector *ps, struct dm_path *path,
int argc, char **argv, char **error)
{
struct selector *s = ps->context;
struct path_info *pi;
unsigned repeat_count = ST_MIN_IO;
unsigned relative_throughput = 1;
/*
* Arguments: [<repeat_count> [<relative_throughput>]]
* <repeat_count>: The number of I/Os before switching path.
* If not given, default (ST_MIN_IO) is used.
* <relative_throughput>: The relative throughput value of
* the path among all paths in the path-group.
* The valid range: 0-<ST_MAX_RELATIVE_THROUGHPUT>
* If not given, minimum value '1' is used.
* If '0' is given, the path isn't selected while
* other paths having a positive value are
* available.
*/
if (argc > 2) {
*error = "service-time ps: incorrect number of arguments";
return -EINVAL;
}
if (argc && (sscanf(argv[0], "%u", &repeat_count) != 1)) {
*error = "service-time ps: invalid repeat count";
return -EINVAL;
}
if ((argc == 2) &&
(sscanf(argv[1], "%u", &relative_throughput) != 1 ||
relative_throughput > ST_MAX_RELATIVE_THROUGHPUT)) {
*error = "service-time ps: invalid relative_throughput value";
return -EINVAL;
}
/* allocate the path */
pi = kmalloc(sizeof(*pi), GFP_KERNEL);
if (!pi) {
*error = "service-time ps: Error allocating path context";
return -ENOMEM;
}
pi->path = path;
pi->repeat_count = repeat_count;
pi->relative_throughput = relative_throughput;
atomic_set(&pi->in_flight_size, 0);
path->pscontext = pi;
list_add_tail(&pi->list, &s->valid_paths);
return 0;
}
static void st_fail_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move(&pi->list, &s->failed_paths);
}
static int st_reinstate_path(struct path_selector *ps, struct dm_path *path)
{
struct selector *s = ps->context;
struct path_info *pi = path->pscontext;
list_move_tail(&pi->list, &s->valid_paths);
return 0;
}
/*
* Compare the estimated service time of 2 paths, pi1 and pi2,
* for the incoming I/O.
*
* Returns:
* < 0 : pi1 is better
* 0 : no difference between pi1 and pi2
* > 0 : pi2 is better
*
* Description:
* Basically, the service time is estimated by:
* ('pi->in-flight-size' + 'incoming') / 'pi->relative_throughput'
* To reduce the calculation, some optimizations are made.
* (See comments inline)
*/
static int st_compare_load(struct path_info *pi1, struct path_info *pi2,
size_t incoming)
{
size_t sz1, sz2, st1, st2;
sz1 = atomic_read(&pi1->in_flight_size);
sz2 = atomic_read(&pi2->in_flight_size);
/*
* Case 1: Both have same throughput value. Choose less loaded path.
*/
if (pi1->relative_throughput == pi2->relative_throughput)
return sz1 - sz2;
/*
* Case 2a: Both have same load. Choose higher throughput path.
* Case 2b: One path has no throughput value. Choose the other one.
*/
if (sz1 == sz2 ||
!pi1->relative_throughput || !pi2->relative_throughput)
return pi2->relative_throughput - pi1->relative_throughput;
/*
* Case 3: Calculate service time. Choose faster path.
* Service time using pi1:
* st1 = (sz1 + incoming) / pi1->relative_throughput
* Service time using pi2:
* st2 = (sz2 + incoming) / pi2->relative_throughput
*
* To avoid the division, transform the expression to use
* multiplication.
* Because ->relative_throughput > 0 here, if st1 < st2,
* the expressions below are the same meaning:
* (sz1 + incoming) / pi1->relative_throughput <
* (sz2 + incoming) / pi2->relative_throughput
* (sz1 + incoming) * pi2->relative_throughput <
* (sz2 + incoming) * pi1->relative_throughput
* So use the latter one.
*/
sz1 += incoming;
sz2 += incoming;
if (unlikely(sz1 >= ST_MAX_INFLIGHT_SIZE ||
sz2 >= ST_MAX_INFLIGHT_SIZE)) {
/*
* The size may be too big to multiply by pi->relative_throughput
* without overflowing.
* To avoid the overflow and mis-selection, shift down both.
*/
sz1 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
sz2 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
}
st1 = sz1 * pi2->relative_throughput;
st2 = sz2 * pi1->relative_throughput;
if (st1 != st2)
return st1 - st2;
/*
* Case 4: Service time is equal. Choose higher throughput path.
*/
return pi2->relative_throughput - pi1->relative_throughput;
}
static struct dm_path *st_select_path(struct path_selector *ps,
unsigned *repeat_count, size_t nr_bytes)
{
struct selector *s = ps->context;
struct path_info *pi = NULL, *best = NULL;
if (list_empty(&s->valid_paths))
return NULL;
/* Change preferred (first in list) path to evenly balance. */
list_move_tail(s->valid_paths.next, &s->valid_paths);
list_for_each_entry(pi, &s->valid_paths, list)
if (!best || (st_compare_load(pi, best, nr_bytes) < 0))
best = pi;
if (!best)
return NULL;
*repeat_count = best->repeat_count;
return best->path;
}
static int st_start_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_add(nr_bytes, &pi->in_flight_size);
return 0;
}
static int st_end_io(struct path_selector *ps, struct dm_path *path,
size_t nr_bytes)
{
struct path_info *pi = path->pscontext;
atomic_sub(nr_bytes, &pi->in_flight_size);
return 0;
}
static struct path_selector_type st_ps = {
.name = "service-time",
.module = THIS_MODULE,
.table_args = 2,
.info_args = 2,
.create = st_create,
.destroy = st_destroy,
.status = st_status,
.add_path = st_add_path,
.fail_path = st_fail_path,
.reinstate_path = st_reinstate_path,
.select_path = st_select_path,
.start_io = st_start_io,
.end_io = st_end_io,
};
static int __init dm_st_init(void)
{
int r = dm_register_path_selector(&st_ps);
if (r < 0)
DMERR("register failed %d", r);
DMINFO("version " ST_VERSION " loaded");
return r;
}
static void __exit dm_st_exit(void)
{
int r = dm_unregister_path_selector(&st_ps);
if (r < 0)
DMERR("unregister failed %d", r);
}
module_init(dm_st_init);
module_exit(dm_st_exit);
MODULE_DESCRIPTION(DM_NAME " throughput oriented path selector");
MODULE_AUTHOR("Kiyoshi Ueda <k-ueda@ct.jp.nec.com>");
MODULE_LICENSE("GPL");
@@ -636,7 +636,7 @@ static void persistent_commit_exception(struct dm_exception_store *store,
 	/*
 	 * Commit exceptions to disk.
 	 */
-	if (ps->valid && area_io(ps, WRITE))
+	if (ps->valid && area_io(ps, WRITE_BARRIER))
 		ps->valid = 0;

 	/*
...
@@ -678,6 +678,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->private = s;
 	ti->split_io = s->store->chunk_size;
+	ti->num_flush_requests = 1;

 	return 0;

@@ -1030,6 +1031,11 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;

+	if (unlikely(bio_empty_barrier(bio))) {
+		bio->bi_bdev = s->store->cow->bdev;
+		return DM_MAPIO_REMAPPED;
+	}
+
 	chunk = sector_to_chunk(s->store, bio->bi_sector);

 	/* Full snapshots are not usable */
@@ -1338,6 +1344,8 @@ static int origin_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	}

 	ti->private = dev;
+	ti->num_flush_requests = 1;
+
 	return 0;
 }

@@ -1353,6 +1361,9 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
 	struct dm_dev *dev = ti->private;
 	bio->bi_bdev = dev->bdev;

+	if (unlikely(bio_empty_barrier(bio)))
+		return DM_MAPIO_REMAPPED;
+
 	/* Only tell snapshots if this is a write */
 	return (bio_rw(bio) == WRITE) ? do_origin(dev, bio) : DM_MAPIO_REMAPPED;
 }
...
@@ -167,6 +167,7 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	sc->stripes = stripes;
 	sc->stripe_width = width;
 	ti->split_io = chunk_size;
+	ti->num_flush_requests = stripes;

 	sc->chunk_mask = ((sector_t) chunk_size) - 1;
 	for (sc->chunk_shift = 0; chunk_size; sc->chunk_shift++)
@@ -211,10 +212,18 @@ static int stripe_map(struct dm_target *ti, struct bio *bio,
 		      union map_info *map_context)
 {
 	struct stripe_c *sc = (struct stripe_c *) ti->private;
+	sector_t offset, chunk;
+	uint32_t stripe;

-	sector_t offset = bio->bi_sector - ti->begin;
-	sector_t chunk = offset >> sc->chunk_shift;
-	uint32_t stripe = sector_div(chunk, sc->stripes);
+	if (unlikely(bio_empty_barrier(bio))) {
+		BUG_ON(map_context->flush_request >= sc->stripes);
+		bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev;
+		return DM_MAPIO_REMAPPED;
+	}
+
+	offset = bio->bi_sector - ti->begin;
+	chunk = offset >> sc->chunk_shift;
+	stripe = sector_div(chunk, sc->stripes);

 	bio->bi_bdev = sc->stripe[stripe].dev->bdev;
 	bio->bi_sector = sc->stripe[stripe].physical_start +
@@ -304,15 +313,31 @@ static int stripe_end_io(struct dm_target *ti, struct bio *bio,
 	return error;
 }

+static int stripe_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct stripe_c *sc = ti->private;
+	int ret = 0;
+	unsigned i = 0;
+
+	do
+		ret = fn(ti, sc->stripe[i].dev,
+			 sc->stripe[i].physical_start, data);
+	while (!ret && ++i < sc->stripes);
+
+	return ret;
+}
+
 static struct target_type stripe_target = {
 	.name   = "striped",
-	.version = {1, 1, 0},
+	.version = {1, 2, 0},
 	.module = THIS_MODULE,
 	.ctr    = stripe_ctr,
 	.dtr    = stripe_dtr,
 	.map    = stripe_map,
 	.end_io = stripe_end_io,
 	.status = stripe_status,
+	.iterate_devices = stripe_iterate_devices,
 };

 int __init dm_stripe_init(void)
...
@@ -57,12 +57,21 @@ static ssize_t dm_attr_uuid_show(struct mapped_device *md, char *buf)
 	return strlen(buf);
 }

+static ssize_t dm_attr_suspended_show(struct mapped_device *md, char *buf)
+{
+	sprintf(buf, "%d\n", dm_suspended(md));
+
+	return strlen(buf);
+}
+
 static DM_ATTR_RO(name);
 static DM_ATTR_RO(uuid);
+static DM_ATTR_RO(suspended);

 static struct attribute *dm_attrs[] = {
 	&dm_attr_name.attr,
 	&dm_attr_uuid.attr,
+	&dm_attr_suspended.attr,
 	NULL,
 };
...
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
@@ -57,6 +57,7 @@ header-y += dlmconstants.h
 header-y += dlm_device.h
 header-y += dlm_netlink.h
 header-y += dm-ioctl.h
+header-y += dm-log-userspace.h
 header-y += dn.h
 header-y += dqblk_xfs.h
 header-y += efs_fs_sb.h
...
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.