Commit f0ba4377 authored by Mauro Carvalho Chehab's avatar Mauro Carvalho Chehab Committed by Jonathan Corbet

docs: convert docs to ReST and rename to *.rst

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: default avatarBjorn Helgaas <bhelgaas@google.com>
Acked-by: default avatarMark Brown <broonie@kernel.org>
Signed-off-by: default avatarJonathan Corbet <corbet@lwn.net>
parent 8ea61889
=============================
Guidance for writing policies Guidance for writing policies
============================= =============================
...@@ -30,7 +31,7 @@ multiqueue (mq) ...@@ -30,7 +31,7 @@ multiqueue (mq)
This policy is now an alias for smq (see below). This policy is now an alias for smq (see below).
The following tunables are accepted, but have no effect: The following tunables are accepted, but have no effect::
'sequential_threshold <#nr_sequential_ios>' 'sequential_threshold <#nr_sequential_ios>'
'random_threshold <#nr_random_ios>' 'random_threshold <#nr_random_ios>'
...@@ -56,7 +57,9 @@ mq policy's hints to be dropped. Also, performance of the cache may ...@@ -56,7 +57,9 @@ mq policy's hints to be dropped. Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots degrade slightly until smq recalculates the origin device's hotspots
that should be cached. that should be cached.
Memory usage: Memory usage
^^^^^^^^^^^^
The mq policy used a lot of memory; 88 bytes per cache block on a 64 The mq policy used a lot of memory; 88 bytes per cache block on a 64
bit machine. bit machine.
...@@ -69,7 +72,9 @@ cache block). ...@@ -69,7 +72,9 @@ cache block).
All this means smq uses ~25bytes per cache block. Still a lot of All this means smq uses ~25bytes per cache block. Still a lot of
memory, but a substantial improvement nontheless. memory, but a substantial improvement nontheless.
Level balancing: Level balancing
^^^^^^^^^^^^^^^
mq placed entries in different levels of the multiqueue structures mq placed entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)). This meant the bottom based on their hit count (~ln(hit count)). This meant the bottom
levels generally had the most entries, and the top ones had very levels generally had the most entries, and the top ones had very
...@@ -94,7 +99,9 @@ is used to decide which blocks to promote. If the hotspot queue is ...@@ -94,7 +99,9 @@ is used to decide which blocks to promote. If the hotspot queue is
performing badly then it starts moving entries more quickly between performing badly then it starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly. levels. This lets it adapt to new IO patterns very quickly.
Performance: Performance
^^^^^^^^^^^
Testing smq shows substantially better performance than mq. Testing smq shows substantially better performance than mq.
cleaner cleaner
...@@ -105,16 +112,19 @@ The cleaner writes back all dirty blocks in a cache to decommission it. ...@@ -105,16 +112,19 @@ The cleaner writes back all dirty blocks in a cache to decommission it.
Examples Examples
======== ========
The syntax for a table is: The syntax for a table is::
cache <metadata dev> <cache dev> <origin dev> <block size> cache <metadata dev> <cache dev> <origin dev> <block size>
<#feature_args> [<feature arg>]* <#feature_args> [<feature arg>]*
<policy> <#policy_args> [<policy arg>]* <policy> <#policy_args> [<policy arg>]*
The syntax to send a message using the dmsetup command is: The syntax to send a message using the dmsetup command is::
dmsetup message <mapped device> 0 sequential_threshold 1024 dmsetup message <mapped device> 0 sequential_threshold 1024
dmsetup message <mapped device> 0 random_threshold 8 dmsetup message <mapped device> 0 random_threshold 8
Using dmsetup: Using dmsetup::
dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \ dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
/dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8" /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
creates a 128GB large mapped device named 'blah' with the creates a 128GB large mapped device named 'blah' with the
......
=====
Cache
=====
Introduction Introduction
============ ============
...@@ -24,10 +28,13 @@ scenarios (eg. a vm image server). ...@@ -24,10 +28,13 @@ scenarios (eg. a vm image server).
Glossary Glossary
======== ========
Migration - Movement of the primary copy of a logical block from one Migration
Movement of the primary copy of a logical block from one
device to the other. device to the other.
Promotion - Migration from slow device to fast device. Promotion
Demotion - Migration from fast device to slow device. Migration from slow device to fast device.
Demotion
Migration from fast device to slow device.
The origin device always contains a copy of the logical block, which The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device may be out of date or kept in sync with the copy on the cache device
...@@ -169,45 +176,53 @@ Target interface ...@@ -169,45 +176,53 @@ Target interface
Constructor Constructor
----------- -----------
::
cache <metadata dev> <cache dev> <origin dev> <block size> cache <metadata dev> <cache dev> <origin dev> <block size>
<#feature args> [<feature arg>]* <#feature args> [<feature arg>]*
<policy> <#policy args> [policy args]* <policy> <#policy args> [policy args]*
metadata dev : fast device holding the persistent metadata ================ =======================================================
cache dev : fast device holding cached data blocks metadata dev fast device holding the persistent metadata
origin dev : slow device holding original data blocks cache dev fast device holding cached data blocks
block size : cache unit size in sectors origin dev slow device holding original data blocks
block size cache unit size in sectors
#feature args : number of feature arguments passed #feature args number of feature arguments passed
feature args : writethrough or passthrough (The default is writeback.) feature args writethrough or passthrough (The default is writeback.)
policy : the replacement policy to use policy the replacement policy to use
#policy args : an even number of arguments corresponding to #policy args an even number of arguments corresponding to
key/value pairs passed to the policy key/value pairs passed to the policy
policy args : key/value pairs passed to the policy policy args key/value pairs passed to the policy
E.g. 'sequential_threshold 1024' E.g. 'sequential_threshold 1024'
See cache-policies.txt for details. See cache-policies.txt for details.
================ =======================================================
Optional feature arguments are: Optional feature arguments are:
writethrough : write through caching that prohibits cache block
==================== ========================================================
writethrough write through caching that prohibits cache block
content from being different from origin block content. content from being different from origin block content.
Without this argument, the default behaviour is to write Without this argument, the default behaviour is to write
back cache block contents later for performance reasons, back cache block contents later for performance reasons,
so they may differ from the corresponding origin blocks. so they may differ from the corresponding origin blocks.
passthrough : a degraded mode useful for various cache coherency passthrough a degraded mode useful for various cache coherency
situations (e.g., rolling back snapshots of situations (e.g., rolling back snapshots of
underlying storage). Reads and writes always go to underlying storage). Reads and writes always go to
the origin. If a write goes to a cached origin the origin. If a write goes to a cached origin
block, then the cache block is invalidated. block, then the cache block is invalidated.
To enable passthrough mode the cache must be clean. To enable passthrough mode the cache must be clean.
metadata2 : use version 2 of the metadata. This stores the dirty bits metadata2 use version 2 of the metadata. This stores the dirty
in a separate btree, which improves speed of shutting bits in a separate btree, which improves speed of
down the cache. shutting down the cache.
no_discard_passdown : disable passing down discards from the cache no_discard_passdown disable passing down discards from the cache
to the origin's data device. to the origin's data device.
==================== ========================================================
A policy called 'default' is always registered. This is an alias for A policy called 'default' is always registered. This is an alias for
the policy we currently think is giving best all round performance. the policy we currently think is giving best all round performance.
...@@ -218,54 +233,61 @@ the characteristics of a specific policy, always request it by name. ...@@ -218,54 +233,61 @@ the characteristics of a specific policy, always request it by name.
Status Status
------ ------
<metadata block size> <#used metadata blocks>/<#total metadata blocks> ::
<cache block size> <#used cache blocks>/<#total cache blocks>
<#read hits> <#read misses> <#write hits> <#write misses> <metadata block size> <#used metadata blocks>/<#total metadata blocks>
<#demotions> <#promotions> <#dirty> <#features> <features>* <cache block size> <#used cache blocks>/<#total cache blocks>
<#core args> <core args>* <policy name> <#policy args> <policy args>* <#read hits> <#read misses> <#write hits> <#write misses>
<cache metadata mode> <#demotions> <#promotions> <#dirty> <#features> <features>*
<#core args> <core args>* <policy name> <#policy args> <policy args>*
<cache metadata mode>
metadata block size : Fixed block size for each metadata block in
========================= =====================================================
metadata block size Fixed block size for each metadata block in
sectors sectors
#used metadata blocks : Number of metadata blocks used #used metadata blocks Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks #total metadata blocks Total number of metadata blocks
cache block size : Configurable block size for the cache device cache block size Configurable block size for the cache device
in sectors in sectors
#used cache blocks : Number of blocks resident in the cache #used cache blocks Number of blocks resident in the cache
#total cache blocks : Total number of cache blocks #total cache blocks Total number of cache blocks
#read hits : Number of times a READ bio has been mapped #read hits Number of times a READ bio has been mapped
to the cache to the cache
#read misses : Number of times a READ bio has been mapped #read misses Number of times a READ bio has been mapped
to the origin to the origin
#write hits : Number of times a WRITE bio has been mapped #write hits Number of times a WRITE bio has been mapped
to the cache to the cache
#write misses : Number of times a WRITE bio has been #write misses Number of times a WRITE bio has been
mapped to the origin mapped to the origin
#demotions : Number of times a block has been removed #demotions Number of times a block has been removed
from the cache from the cache
#promotions : Number of times a block has been moved to #promotions Number of times a block has been moved to
the cache the cache
#dirty : Number of blocks in the cache that differ #dirty Number of blocks in the cache that differ
from the origin from the origin
#feature args : Number of feature args to follow #feature args Number of feature args to follow
feature args : 'writethrough' (optional) feature args 'writethrough' (optional)
#core args : Number of core arguments (must be even) #core args Number of core arguments (must be even)
core args : Key/value pairs for tuning the core core args Key/value pairs for tuning the core
e.g. migration_threshold e.g. migration_threshold
policy name : Name of the policy policy name Name of the policy
#policy args : Number of policy arguments to follow (must be even) #policy args Number of policy arguments to follow (must be even)
policy args : Key/value pairs e.g. sequential_threshold policy args Key/value pairs e.g. sequential_threshold
cache metadata mode : ro if read-only, rw if read-write cache metadata mode ro if read-only, rw if read-write
In serious cases where even a read-only mode is deemed unsafe
no further I/O will be permitted and the status will just In serious cases where even a read-only mode is
contain the string 'Fail'. The userspace recovery tools deemed unsafe no further I/O will be permitted and
should then be used. the status will just contain the string 'Fail'.
needs_check : 'needs_check' if set, '-' if not set The userspace recovery tools should then be used.
A metadata operation has failed, resulting in the needs_check needs_check 'needs_check' if set, '-' if not set
flag being set in the metadata's superblock. The metadata A metadata operation has failed, resulting in the
device must be deactivated and checked/repaired before the needs_check flag being set in the metadata's
cache can be made fully operational again. '-' indicates superblock. The metadata device must be
needs_check is not set. deactivated and checked/repaired before the
cache can be made fully operational again.
'-' indicates needs_check is not set.
========================= =====================================================
Messages Messages
-------- --------
...@@ -274,11 +296,12 @@ Policies will have different tunables, specific to each one, so we ...@@ -274,11 +296,12 @@ Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper need a generic way of getting and setting these. Device-mapper
messages are used. (A sysfs interface would also be possible.) messages are used. (A sysfs interface would also be possible.)
The message format is: The message format is::
<key> <value> <key> <value>
E.g. E.g.::
dmsetup message my_cache 0 sequential_threshold 1024 dmsetup message my_cache 0 sequential_threshold 1024
...@@ -290,11 +313,12 @@ of values from 5 to 9. Each cblock must be expressed as a decimal ...@@ -290,11 +313,12 @@ of values from 5 to 9. Each cblock must be expressed as a decimal
value, in the future a variant message that takes cblock ranges value, in the future a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches. The cache must be in passthrough mode invalidation of larger caches. The cache must be in passthrough mode
when invalidate_cblocks is used. when invalidate_cblocks is used::
invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]* invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
E.g. E.g.::
dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789 dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
Examples Examples
...@@ -304,8 +328,10 @@ The test suite can be found here: ...@@ -304,8 +328,10 @@ The test suite can be found here:
https://github.com/jthornber/device-mapper-test-suite https://github.com/jthornber/device-mapper-test-suite
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ ::
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0' /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
mq 4 sequential_threshold 1024 random_threshold 8' mq 4 sequential_threshold 1024 random_threshold 8'
========
dm-delay dm-delay
======== ========
Device-Mapper's "delay" target delays reads and/or writes Device-Mapper's "delay" target delays reads and/or writes
and maps them to different devices. and maps them to different devices.
Parameters: Parameters::
<device> <offset> <delay> [<write_device> <write_offset> <write_delay> <device> <offset> <delay> [<write_device> <write_offset> <write_delay>
[<flush_device> <flush_offset> <flush_delay>]] [<flush_device> <flush_offset> <flush_delay>]]
...@@ -14,15 +16,16 @@ Delays are specified in milliseconds. ...@@ -14,15 +16,16 @@ Delays are specified in milliseconds.
Example scripts Example scripts
=============== ===============
[[
#!/bin/sh ::
# Create device delaying rw operation for 500ms
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed #!/bin/sh
]] # Create device delaying rw operation for 500ms
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
[[
#!/bin/sh ::
# Create device delaying only write operation for 500ms and
# splitting reads and writes to different devices $1 $2 #!/bin/sh
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed # Create device delaying only write operation for 500ms and
]] # splitting reads and writes to different devices $1 $2
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
========
dm-crypt dm-crypt
========= ========
Device-Mapper's "crypt" target provides transparent encryption of block devices Device-Mapper's "crypt" target provides transparent encryption of block devices
using the kernel crypto API. using the kernel crypto API.
...@@ -7,15 +8,20 @@ using the kernel crypto API. ...@@ -7,15 +8,20 @@ using the kernel crypto API.
For a more detailed description of supported parameters see: For a more detailed description of supported parameters see:
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
Parameters: <cipher> <key> <iv_offset> <device path> \ Parameters::
<cipher> <key> <iv_offset> <device path> \
<offset> [<#opt_params> <opt_params>] <offset> [<#opt_params> <opt_params>]
<cipher> <cipher>
Encryption cipher, encryption mode and Initial Vector (IV) generator. Encryption cipher, encryption mode and Initial Vector (IV) generator.
The cipher specifications format is: The cipher specifications format is::
cipher[:keycount]-chainmode-ivmode[:ivopts] cipher[:keycount]-chainmode-ivmode[:ivopts]
Examples:
Examples::
aes-cbc-essiv:sha256 aes-cbc-essiv:sha256
aes-xts-plain64 aes-xts-plain64
serpent-xts-plain64 serpent-xts-plain64
...@@ -25,12 +31,17 @@ Parameters: <cipher> <key> <iv_offset> <device path> \ ...@@ -25,12 +31,17 @@ Parameters: <cipher> <key> <iv_offset> <device path> \
as for the first format type. as for the first format type.
This format is mainly used for specification of authenticated modes. This format is mainly used for specification of authenticated modes.
The crypto API cipher specifications format is: The crypto API cipher specifications format is::
capi:cipher_api_spec-ivmode[:ivopts] capi:cipher_api_spec-ivmode[:ivopts]
Examples:
Examples::
capi:cbc(aes)-essiv:sha256 capi:cbc(aes)-essiv:sha256
capi:xts(aes)-plain64 capi:xts(aes)-plain64
Examples of authenticated modes:
Examples of authenticated modes::
capi:gcm(aes)-random capi:gcm(aes)-random
capi:authenc(hmac(sha256),xts(aes))-random capi:authenc(hmac(sha256),xts(aes))-random
capi:rfc7539(chacha20,poly1305)-random capi:rfc7539(chacha20,poly1305)-random
...@@ -142,21 +153,21 @@ LUKS (Linux Unified Key Setup) is now the preferred way to set up disk ...@@ -142,21 +153,21 @@ LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
encryption with dm-crypt using the 'cryptsetup' utility, see encryption with dm-crypt using the 'cryptsetup' utility, see
https://gitlab.com/cryptsetup/cryptsetup https://gitlab.com/cryptsetup/cryptsetup
[[ ::
#!/bin/sh
# Create a crypt device using dmsetup #!/bin/sh
dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0" # Create a crypt device using dmsetup
]] dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
[[ ::
#!/bin/sh
# Create a crypt device using dmsetup when encryption key is stored in keyring service #!/bin/sh
dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0" # Create a crypt device using dmsetup when encryption key is stored in keyring service
]] dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
[[ ::
#!/bin/sh
# Create a crypt device using cryptsetup and LUKS header with default cipher #!/bin/sh
cryptsetup luksFormat $1 # Create a crypt device using cryptsetup and LUKS header with default cipher
cryptsetup luksOpen $1 crypt1 cryptsetup luksFormat $1
]] cryptsetup luksOpen $1 crypt1
=========
dm-flakey dm-flakey
========= =========
...@@ -15,17 +16,26 @@ underlying devices. ...@@ -15,17 +16,26 @@ underlying devices.
Table parameters Table parameters
---------------- ----------------
::
<dev path> <offset> <up interval> <down interval> \ <dev path> <offset> <up interval> <down interval> \
[<num_features> [<feature arguments>]] [<num_features> [<feature arguments>]]
Mandatory parameters: Mandatory parameters:
<dev path>: Full pathname to the underlying block-device, or a
<dev path>:
Full pathname to the underlying block-device, or a
"major:minor" device-number. "major:minor" device-number.
<offset>: Starting sector within the device. <offset>:
<up interval>: Number of seconds device is available. Starting sector within the device.
<down interval>: Number of seconds device returns errors. <up interval>:
Number of seconds device is available.
<down interval>:
Number of seconds device returns errors.
Optional feature parameters: Optional feature parameters:
If no feature parameters are present, during the periods of If no feature parameters are present, during the periods of
unreliability, all I/O returns errors. unreliability, all I/O returns errors.
...@@ -41,17 +51,24 @@ Optional feature parameters: ...@@ -41,17 +51,24 @@ Optional feature parameters:
During <down interval>, replace <Nth_byte> of the data of During <down interval>, replace <Nth_byte> of the data of
each matching bio with <value>. each matching bio with <value>.
<Nth_byte>: The offset of the byte to replace. <Nth_byte>:
The offset of the byte to replace.
Counting starts at 1, to replace the first byte. Counting starts at 1, to replace the first byte.
<direction>: Either 'r' to corrupt reads or 'w' to corrupt writes. <direction>:
Either 'r' to corrupt reads or 'w' to corrupt writes.
'w' is incompatible with drop_writes. 'w' is incompatible with drop_writes.
<value>: The value (from 0-255) to write. <value>:
<flags>: Perform the replacement only if bio->bi_opf has all the The value (from 0-255) to write.
<flags>:
Perform the replacement only if bio->bi_opf has all the
selected flags set. selected flags set.
Examples: Examples:
Replaces the 32nd byte of READ bios with the value 1::
corrupt_bio_byte 32 r 1 0 corrupt_bio_byte 32 r 1 0
- replaces the 32nd byte of READ bios with the value 1
Replaces the 224th byte of REQ_META (=32) bios with the value 0::
corrupt_bio_byte 224 w 0 32 corrupt_bio_byte 224 w 0 32
- replaces the 224th byte of REQ_META (=32) bios with the value 0
================================
Early creation of mapped devices Early creation of mapped devices
==================================== ================================
It is possible to configure a device-mapper device to act as the root device for It is possible to configure a device-mapper device to act as the root device for
your system in two ways. your system in two ways.
...@@ -12,15 +13,17 @@ The second is to create one or more device-mappers using the module parameter ...@@ -12,15 +13,17 @@ The second is to create one or more device-mappers using the module parameter
The format is specified as a string of data separated by commas and optionally The format is specified as a string of data separated by commas and optionally
semi-colons, where: semi-colons, where:
- a comma is used to separate fields like name, uuid, flags and table - a comma is used to separate fields like name, uuid, flags and table
(specifies one device) (specifies one device)
- a semi-colon is used to separate devices. - a semi-colon is used to separate devices.
So the format will look like this: So the format will look like this::
dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+] dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
Where, Where::
<name> ::= The device name. <name> ::= The device name.
<uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | "" <uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
<minor> ::= The device minor number | "" <minor> ::= The device minor number | ""
...@@ -29,7 +32,7 @@ Where, ...@@ -29,7 +32,7 @@ Where,
<target_type> ::= "verity" | "linear" | ... (see list below) <target_type> ::= "verity" | "linear" | ... (see list below)
The dm line should be equivalent to the one used by the dmsetup tool with the The dm line should be equivalent to the one used by the dmsetup tool with the
--concise argument. `--concise` argument.
Target types Target types
============ ============
...@@ -38,32 +41,34 @@ Not all target types are available as there are serious risks in allowing ...@@ -38,32 +41,34 @@ Not all target types are available as there are serious risks in allowing
activation of certain DM targets without first using userspace tools to check activation of certain DM targets without first using userspace tools to check
the validity of associated metadata. the validity of associated metadata.
"cache": constrained, userspace should verify cache device ======================= =======================================================
"crypt": allowed `cache` constrained, userspace should verify cache device
"delay": allowed `crypt` allowed
"era": constrained, userspace should verify metadata device `delay` allowed
"flakey": constrained, meant for test `era` constrained, userspace should verify metadata device
"linear": allowed `flakey` constrained, meant for test
"log-writes": constrained, userspace should verify metadata device `linear` allowed
"mirror": constrained, userspace should verify main/mirror device `log-writes` constrained, userspace should verify metadata device
"raid": constrained, userspace should verify metadata device `mirror` constrained, userspace should verify main/mirror device
"snapshot": constrained, userspace should verify src/dst device `raid` constrained, userspace should verify metadata device
"snapshot-origin": allowed `snapshot` constrained, userspace should verify src/dst device
"snapshot-merge": constrained, userspace should verify src/dst device `snapshot-origin` allowed
"striped": allowed `snapshot-merge` constrained, userspace should verify src/dst device
"switch": constrained, userspace should verify dev path `striped` allowed
"thin": constrained, requires dm target message from userspace `switch` constrained, userspace should verify dev path
"thin-pool": constrained, requires dm target message from userspace `thin` constrained, requires dm target message from userspace
"verity": allowed `thin-pool` constrained, requires dm target message from userspace
"writecache": constrained, userspace should verify cache device `verity` allowed
"zero": constrained, not meant for rootfs `writecache` constrained, userspace should verify cache device
`zero` constrained, not meant for rootfs
======================= =======================================================
If the target is not listed above, it is constrained by default (not tested). If the target is not listed above, it is constrained by default (not tested).
Examples Examples
======== ========
An example of booting to a linear array made up of user-mode linux block An example of booting to a linear array made up of user-mode linux block
devices: devices::
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0 dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
...@@ -71,8 +76,8 @@ This will boot to a rw dm-linear target of 8192 sectors split across two block ...@@ -71,8 +76,8 @@ This will boot to a rw dm-linear target of 8192 sectors split across two block
devices identified by their major:minor numbers. After boot, udev will rename devices identified by their major:minor numbers. After boot, udev will rename
this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned. this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
An example of multiple device-mappers, with the dm-mod.create="..." contents is shown here An example of multiple device-mappers, with the dm-mod.create="..." contents
split on multiple lines for readability: is shown here split on multiple lines for readability::
dm-linear,,1,rw, dm-linear,,1,rw,
0 32768 linear 8:1 0, 0 32768 linear 8:1 0,
...@@ -84,30 +89,36 @@ split on multiple lines for readability: ...@@ -84,30 +89,36 @@ split on multiple lines for readability:
Other examples (per target): Other examples (per target):
"crypt": "crypt"::
dm-crypt,,8,ro, dm-crypt,,8,ro,
0 1048576 crypt aes-xts-plain64 0 1048576 crypt aes-xts-plain64
babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0 babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
/dev/sda 0 1 allow_discards /dev/sda 0 1 allow_discards
"delay": "delay"::
dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500 dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
"linear": "linear"::
dm-linear,,,rw, dm-linear,,,rw,
0 32768 linear /dev/sda1 0, 0 32768 linear /dev/sda1 0,
32768 1024000 linear /dev/sda2 0, 32768 1024000 linear /dev/sda2 0,
1056768 204800 linear /dev/sda3 0, 1056768 204800 linear /dev/sda3 0,
1261568 512000 linear /dev/sda4 0 1261568 512000 linear /dev/sda4 0
"snapshot-origin": "snapshot-origin"::
dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2 dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
"striped": "striped"::
dm-striped,,4,ro,0 1638400 striped 4 4096 dm-striped,,4,ro,0 1638400 striped 4 4096
/dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0 /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
"verity": "verity"::
dm-verity,,4,ro, dm-verity,,4,ro,
0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256 0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
......
============
dm-integrity
============
The dm-integrity target emulates a block device that has additional The dm-integrity target emulates a block device that has additional
per-sector tags that can be used for storing integrity information. per-sector tags that can be used for storing integrity information.
...@@ -35,6 +39,7 @@ zeroes. If the superblock is neither valid nor zeroed, the dm-integrity ...@@ -35,6 +39,7 @@ zeroes. If the superblock is neither valid nor zeroed, the dm-integrity
target can't be loaded. target can't be loaded.
To use the target for the first time: To use the target for the first time:
1. overwrite the superblock with zeroes 1. overwrite the superblock with zeroes
2. load the dm-integrity target with one-sector size, the kernel driver 2. load the dm-integrity target with one-sector size, the kernel driver
will format the device will format the device
...@@ -57,11 +62,14 @@ Target arguments: ...@@ -57,11 +62,14 @@ Target arguments:
the internal-hash algorithm) the internal-hash algorithm)
4. mode: 4. mode:
D - direct writes (without journal) - in this mode, journaling is
D - direct writes (without journal)
in this mode, journaling is
not used and data sectors and integrity tags are written not used and data sectors and integrity tags are written
separately. In case of crash, it is possible that the data separately. In case of crash, it is possible that the data
and integrity tag doesn't match. and integrity tag doesn't match.
J - journaled writes - data and integrity tags are written to the J - journaled writes
data and integrity tags are written to the
journal and atomicity is guaranteed. In case of crash, journal and atomicity is guaranteed. In case of crash,
either both data and tag or none of them are written. The either both data and tag or none of them are written. The
journaled mode degrades write throughput twice because the journaled mode degrades write throughput twice because the
...@@ -178,9 +186,12 @@ and the reloaded target would be non-functional. ...@@ -178,9 +186,12 @@ and the reloaded target would be non-functional.
The layout of the formatted block device: The layout of the formatted block device:
* reserved sectors (they are not used by this target, they can be used for
* reserved sectors
(they are not used by this target, they can be used for
storing LUKS metadata or for other purpose), the size of the reserved storing LUKS metadata or for other purpose), the size of the reserved
area is specified in the target arguments area is specified in the target arguments
* superblock (4kiB) * superblock (4kiB)
* magic string - identifies that the device was formatted * magic string - identifies that the device was formatted
* version * version
...@@ -192,40 +203,55 @@ The layout of the formatted block device: ...@@ -192,40 +203,55 @@ The layout of the formatted block device:
metadata and padding). The user of this target should not send metadata and padding). The user of this target should not send
bios that access data beyond the "provided data sectors" limit. bios that access data beyond the "provided data sectors" limit.
* flags * flags
SB_FLAG_HAVE_JOURNAL_MAC - a flag is set if journal_mac is used SB_FLAG_HAVE_JOURNAL_MAC
SB_FLAG_RECALCULATING - recalculating is in progress - a flag is set if journal_mac is used
SB_FLAG_DIRTY_BITMAP - journal area contains the bitmap of dirty SB_FLAG_RECALCULATING
- recalculating is in progress
SB_FLAG_DIRTY_BITMAP
- journal area contains the bitmap of dirty
blocks blocks
* log2(sectors per block) * log2(sectors per block)
* a position where recalculating finished * a position where recalculating finished
* journal * journal
The journal is divided into sections, each section contains: The journal is divided into sections, each section contains:
* metadata area (4kiB), it contains journal entries * metadata area (4kiB), it contains journal entries
every journal entry contains:
- every journal entry contains:
* logical sector (specifies where the data and tag should * logical sector (specifies where the data and tag should
be written) be written)
* last 8 bytes of data * last 8 bytes of data
* integrity tag (the size is specified in the superblock) * integrity tag (the size is specified in the superblock)
every metadata sector ends with
- every metadata sector ends with
* mac (8-bytes), all the macs in 8 metadata sectors form a * mac (8-bytes), all the macs in 8 metadata sectors form a
64-byte value. It is used to store hmac of sector 64-byte value. It is used to store hmac of sector
numbers in the journal section, to protect against a numbers in the journal section, to protect against a
possibility that the attacker tampers with sector possibility that the attacker tampers with sector
numbers in the journal. numbers in the journal.
* commit id * commit id
* data area (the size is variable; it depends on how many journal * data area (the size is variable; it depends on how many journal
entries fit into the metadata area) entries fit into the metadata area)
every sector in the data area contains:
- every sector in the data area contains:
* data (504 bytes of data, the last 8 bytes are stored in * data (504 bytes of data, the last 8 bytes are stored in
the journal entry) the journal entry)
* commit id * commit id
To test if the whole journal section was written correctly, every To test if the whole journal section was written correctly, every
512-byte sector of the journal ends with 8-byte commit id. If the 512-byte sector of the journal ends with 8-byte commit id. If the
commit id matches on all sectors in a journal section, then it is commit id matches on all sectors in a journal section, then it is
assumed that the section was written correctly. If the commit id assumed that the section was written correctly. If the commit id
doesn't match, the section was written partially and it should not doesn't match, the section was written partially and it should not
be replayed. be replayed.
* one or more runs of interleaved tags and data. Each run contains:
* one or more runs of interleaved tags and data.
Each run contains:
* tag area - it contains integrity tags. There is one tag for each * tag area - it contains integrity tags. There is one tag for each
sector in the data area sector in the data area
* data area - it contains data sectors. The number of data sectors * data area - it contains data sectors. The number of data sectors
......
=====
dm-io dm-io
===== =====
...@@ -7,7 +8,7 @@ version. ...@@ -7,7 +8,7 @@ version.
The user must set up an io_region structure to describe the desired location The user must set up an io_region structure to describe the desired location
of the I/O. Each io_region indicates a block-device along with the starting of the I/O. Each io_region indicates a block-device along with the starting
sector and size of the region. sector and size of the region::
struct io_region { struct io_region {
struct block_device *bdev; struct block_device *bdev;
...@@ -19,7 +20,7 @@ Dm-io can read from one io_region or write to one or more io_regions. Writes ...@@ -19,7 +20,7 @@ Dm-io can read from one io_region or write to one or more io_regions. Writes
to multiple regions are specified by an array of io_region structures. to multiple regions are specified by an array of io_region structures.
The first I/O service type takes a list of memory pages as the data buffer for The first I/O service type takes a list of memory pages as the data buffer for
the I/O, along with an offset into the first page. the I/O, along with an offset into the first page::
struct page_list { struct page_list {
struct page_list *next; struct page_list *next;
...@@ -35,7 +36,7 @@ the I/O, along with an offset into the first page. ...@@ -35,7 +36,7 @@ the I/O, along with an offset into the first page.
The second I/O service type takes an array of bio vectors as the data buffer The second I/O service type takes an array of bio vectors as the data buffer
for the I/O. This service can be handy if the caller has a pre-assembled bio, for the I/O. This service can be handy if the caller has a pre-assembled bio,
but wants to direct different portions of the bio to different devices. but wants to direct different portions of the bio to different devices::
int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where, int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
int rw, struct bio_vec *bvec, int rw, struct bio_vec *bvec,
...@@ -47,7 +48,7 @@ but wants to direct different portions of the bio to different devices. ...@@ -47,7 +48,7 @@ but wants to direct different portions of the bio to different devices.
The third I/O service type takes a pointer to a vmalloc'd memory buffer as the The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
data buffer for the I/O. This service can be handy if the caller needs to do data buffer for the I/O. This service can be handy if the caller needs to do
I/O to a large region but doesn't want to allocate a large number of individual I/O to a large region but doesn't want to allocate a large number of individual
memory pages. memory pages::
int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw, int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
void *data, unsigned long *error_bits); void *data, unsigned long *error_bits);
...@@ -55,11 +56,11 @@ memory pages. ...@@ -55,11 +56,11 @@ memory pages.
void *data, io_notify_fn fn, void *context); void *data, io_notify_fn fn, void *context);
Callers of the asynchronous I/O services must include the name of a completion Callers of the asynchronous I/O services must include the name of a completion
callback routine and a pointer to some context data for the I/O. callback routine and a pointer to some context data for the I/O::
typedef void (*io_notify_fn)(unsigned long error, void *context); typedef void (*io_notify_fn)(unsigned long error, void *context);
The "error" parameter in this callback, as well as the "*error" parameter in The "error" parameter in this callback, as well as the `*error` parameter in
all of the synchronous versions, is a bitset (instead of a simple error value). all of the synchronous versions, is a bitset (instead of a simple error value).
In the case of an write-I/O to multiple regions, this bitset allows dm-io to In the case of an write-I/O to multiple regions, this bitset allows dm-io to
indicate success or failure on each individual region. indicate success or failure on each individual region.
...@@ -72,4 +73,3 @@ always available in order to avoid unnecessary waiting while performing I/O. ...@@ -72,4 +73,3 @@ always available in order to avoid unnecessary waiting while performing I/O.
When the user is finished using the dm-io services, they should call When the user is finished using the dm-io services, they should call
dm_io_put() and specify the same number of pages that were given on the dm_io_put() and specify the same number of pages that were given on the
dm_io_get() call. dm_io_get() call.
=====================
Device-Mapper Logging Device-Mapper Logging
===================== =====================
The device-mapper logging code is used by some of the device-mapper The device-mapper logging code is used by some of the device-mapper
...@@ -16,11 +17,13 @@ dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different ...@@ -16,11 +17,13 @@ dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
logging implementations are available and provide different logging implementations are available and provide different
capabilities. The list includes: capabilities. The list includes:
============== ==============================================================
Type Files Type Files
==== ===== ============== ==============================================================
disk drivers/md/dm-log.c disk drivers/md/dm-log.c
core drivers/md/dm-log.c core drivers/md/dm-log.c
userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
============== ==============================================================
The "disk" log type The "disk" log type
------------------- -------------------
......
===============
dm-queue-length dm-queue-length
=============== ===============
...@@ -6,12 +7,18 @@ which selects a path with the least number of in-flight I/Os. ...@@ -6,12 +7,18 @@ which selects a path with the least number of in-flight I/Os.
The path selector name is 'queue-length'. The path selector name is 'queue-length'.
Table parameters for each path: [<repeat_count>] Table parameters for each path: [<repeat_count>]
::
<repeat_count>: The number of I/Os to dispatch using the selected <repeat_count>: The number of I/Os to dispatch using the selected
path before switching to the next path. path before switching to the next path.
If not given, internal default is used. To check If not given, internal default is used. To check
the default value, see the activated table. the default value, see the activated table.
Status for each path: <status> <fail-count> <in-flight> Status for each path: <status> <fail-count> <in-flight>
::
<status>: 'A' if the path is active, 'F' if the path is failed. <status>: 'A' if the path is active, 'F' if the path is failed.
<fail-count>: The number of path failures. <fail-count>: The number of path failures.
<in-flight>: The number of in-flight I/Os on the path. <in-flight>: The number of in-flight I/Os on the path.
...@@ -29,11 +36,13 @@ Examples ...@@ -29,11 +36,13 @@ Examples
======== ========
In case that 2 paths (sda and sdb) are used with repeat_count == 128. In case that 2 paths (sda and sdb) are used with repeat_count == 128.
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \ ::
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
dmsetup create test dmsetup create test
# #
# dmsetup table # dmsetup table
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128 test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
# #
# dmsetup status # dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0 test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
=======
dm-raid dm-raid
======= =======
...@@ -8,49 +9,66 @@ interface. ...@@ -8,49 +9,66 @@ interface.
Mapping Table Interface Mapping Table Interface
----------------------- -----------------------
The target is named "raid" and it accepts the following parameters: The target is named "raid" and it accepts the following parameters::
<raid_type> <#raid_params> <raid_params> \ <raid_type> <#raid_params> <raid_params> \
<#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>] <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
<raid_type>: <raid_type>:
============= ===============================================================
raid0 RAID0 striping (no resilience) raid0 RAID0 striping (no resilience)
raid1 RAID1 mirroring raid1 RAID1 mirroring
raid4 RAID4 with dedicated last parity disk raid4 RAID4 with dedicated last parity disk
raid5_n RAID5 with dedicated last parity disk supporting takeover raid5_n RAID5 with dedicated last parity disk supporting takeover
Same as raid4 Same as raid4
-Transitory layout
- Transitory layout
raid5_la RAID5 left asymmetric raid5_la RAID5 left asymmetric
- rotating parity 0 with data continuation - rotating parity 0 with data continuation
raid5_ra RAID5 right asymmetric raid5_ra RAID5 right asymmetric
- rotating parity N with data continuation - rotating parity N with data continuation
raid5_ls RAID5 left symmetric raid5_ls RAID5 left symmetric
- rotating parity 0 with data restart - rotating parity 0 with data restart
raid5_rs RAID5 right symmetric raid5_rs RAID5 right symmetric
- rotating parity N with data restart - rotating parity N with data restart
raid6_zr RAID6 zero restart raid6_zr RAID6 zero restart
- rotating parity zero (left-to-right) with data restart - rotating parity zero (left-to-right) with data restart
raid6_nr RAID6 N restart raid6_nr RAID6 N restart
- rotating parity N (right-to-left) with data restart - rotating parity N (right-to-left) with data restart
raid6_nc RAID6 N continue raid6_nc RAID6 N continue
- rotating parity N (right-to-left) with data continuation - rotating parity N (right-to-left) with data continuation
raid6_n_6 RAID6 with dedicate parity disks raid6_n_6 RAID6 with dedicate parity disks
- parity and Q-syndrome on the last 2 disks; - parity and Q-syndrome on the last 2 disks;
layout for takeover from/to raid4/raid5_n layout for takeover from/to raid4/raid5_n
raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk
- layout for takeover from raid5_la from/to raid6 - layout for takeover from raid5_la from/to raid6
raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk
- layout for takeover from raid5_ra from/to raid6 - layout for takeover from raid5_ra from/to raid6
raid6_ls_6 Same as "raid5_ls" dedicated last Q-syndrome disk raid6_ls_6 Same as "raid5_ls" dedicated last Q-syndrome disk
- layout for takeover from raid5_ls from/to raid6 - layout for takeover from raid5_ls from/to raid6
raid6_rs_6 Same as "raid5_rs" dedicated last Q-syndrome disk raid6_rs_6 Same as "raid5_rs" dedicated last Q-syndrome disk
- layout for takeover from raid5_rs from/to raid6 - layout for takeover from raid5_rs from/to raid6
raid10 Various RAID10 inspired algorithms chosen by additional params raid10 Various RAID10 inspired algorithms chosen by additional params
(see raid10_format and raid10_copies below) (see raid10_format and raid10_copies below)
- RAID10: Striped Mirrors (aka 'Striping on top of mirrors') - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
- RAID1E: Integrated Adjacent Stripe Mirroring - RAID1E: Integrated Adjacent Stripe Mirroring
- RAID1E: Integrated Offset Stripe Mirroring - RAID1E: Integrated Offset Stripe Mirroring
- and other similar RAID10 variants - and other similar RAID10 variants
============= ===============================================================
Reference: Chapter 4 of Reference: Chapter 4 of
http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
...@@ -58,33 +76,41 @@ The target is named "raid" and it accepts the following parameters: ...@@ -58,33 +76,41 @@ The target is named "raid" and it accepts the following parameters:
<#raid_params>: The number of parameters that follow. <#raid_params>: The number of parameters that follow.
<raid_params> consists of <raid_params> consists of
Mandatory parameters: Mandatory parameters:
<chunk_size>: Chunk size in sectors. This parameter is often known as <chunk_size>:
Chunk size in sectors. This parameter is often known as
"stripe size". It is the only mandatory parameter and "stripe size". It is the only mandatory parameter and
is placed first. is placed first.
followed by optional parameters (in any order): followed by optional parameters (in any order):
[sync|nosync] Force or prevent RAID initialization. [sync|nosync]
Force or prevent RAID initialization.
[rebuild <idx>] Rebuild drive number 'idx' (first drive is 0). [rebuild <idx>]
Rebuild drive number 'idx' (first drive is 0).
[daemon_sleep <ms>] [daemon_sleep <ms>]
Interval between runs of the bitmap daemon that Interval between runs of the bitmap daemon that
clear bits. A longer interval means less bitmap I/O but clear bits. A longer interval means less bitmap I/O but
resyncing after a failure is likely to take longer. resyncing after a failure is likely to take longer.
[min_recovery_rate <kB/sec/disk>] Throttle RAID initialization [min_recovery_rate <kB/sec/disk>]
[max_recovery_rate <kB/sec/disk>] Throttle RAID initialization Throttle RAID initialization
[write_mostly <idx>] Mark drive index 'idx' write-mostly. [max_recovery_rate <kB/sec/disk>]
[max_write_behind <sectors>] See '--write-behind=' (man mdadm) Throttle RAID initialization
[stripe_cache <sectors>] Stripe cache size (RAID 4/5/6 only) [write_mostly <idx>]
Mark drive index 'idx' write-mostly.
[max_write_behind <sectors>]
See '--write-behind=' (man mdadm)
[stripe_cache <sectors>]
Stripe cache size (RAID 4/5/6 only)
[region_size <sectors>] [region_size <sectors>]
The region_size multiplied by the number of regions is the The region_size multiplied by the number of regions is the
logical size of the array. The bitmap records the device logical size of the array. The bitmap records the device
synchronisation state for each region. synchronisation state for each region.
[raid10_copies <# copies>] [raid10_copies <# copies>], [raid10_format <near|far|offset>]
[raid10_format <near|far|offset>]
These two options are used to alter the default layout of These two options are used to alter the default layout of
a RAID10 configuration. The number of copies is can be a RAID10 configuration. The number of copies is can be
specified, but the default is 2. There are also three specified, but the default is 2. There are also three
...@@ -93,13 +119,17 @@ The target is named "raid" and it accepts the following parameters: ...@@ -93,13 +119,17 @@ The target is named "raid" and it accepts the following parameters:
respect to mirroring. If these options are left unspecified, respect to mirroring. If these options are left unspecified,
or 'raid10_copies 2' and/or 'raid10_format near' are given, or 'raid10_copies 2' and/or 'raid10_format near' are given,
then the layouts for 2, 3 and 4 devices are: then the layouts for 2, 3 and 4 devices are:
======== ========== ==============
2 drives 3 drives 4 drives 2 drives 3 drives 4 drives
-------- ---------- -------------- ======== ========== ==============
A1 A1 A1 A1 A2 A1 A1 A2 A2 A1 A1 A1 A1 A2 A1 A1 A2 A2
A2 A2 A2 A3 A3 A3 A3 A4 A4 A2 A2 A2 A3 A3 A3 A3 A4 A4
A3 A3 A4 A4 A5 A5 A5 A6 A6 A3 A3 A4 A4 A5 A5 A5 A6 A6
A4 A4 A5 A6 A6 A7 A7 A8 A8 A4 A4 A5 A6 A6 A7 A7 A8 A8
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
======== ========== ==============
The 2-device layout is equivalent 2-way RAID1. The 4-device The 2-device layout is equivalent 2-way RAID1. The 4-device
layout is what a traditional RAID10 would look like. The layout is what a traditional RAID10 would look like. The
3-device layout is what might be called a 'RAID1E - Integrated 3-device layout is what might be called a 'RAID1E - Integrated
...@@ -107,8 +137,10 @@ The target is named "raid" and it accepts the following parameters: ...@@ -107,8 +137,10 @@ The target is named "raid" and it accepts the following parameters:
If 'raid10_copies 2' and 'raid10_format far', then the layouts If 'raid10_copies 2' and 'raid10_format far', then the layouts
for 2, 3 and 4 devices are: for 2, 3 and 4 devices are:
======== ============ ===================
2 drives 3 drives 4 drives 2 drives 3 drives 4 drives
-------- -------------- -------------------- ======== ============ ===================
A1 A2 A1 A2 A3 A1 A2 A3 A4 A1 A2 A1 A2 A3 A1 A2 A3 A4
A3 A4 A4 A5 A6 A5 A6 A7 A8 A3 A4 A4 A5 A6 A5 A6 A7 A8
A5 A6 A7 A8 A9 A9 A10 A11 A12 A5 A6 A7 A8 A9 A9 A10 A11 A12
...@@ -117,11 +149,14 @@ The target is named "raid" and it accepts the following parameters: ...@@ -117,11 +149,14 @@ The target is named "raid" and it accepts the following parameters:
A4 A3 A6 A4 A5 A6 A5 A8 A7 A4 A3 A6 A4 A5 A6 A5 A8 A7
A6 A5 A9 A7 A8 A10 A9 A12 A11 A6 A5 A9 A7 A8 A10 A9 A12 A11
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
======== ============ ===================
If 'raid10_copies 2' and 'raid10_format offset', then the If 'raid10_copies 2' and 'raid10_format offset', then the
layouts for 2, 3 and 4 devices are: layouts for 2, 3 and 4 devices are:
======== ========== ================
2 drives 3 drives 4 drives 2 drives 3 drives 4 drives
-------- ------------ ----------------- ======== ========== ================
A1 A2 A1 A2 A3 A1 A2 A3 A4 A1 A2 A1 A2 A3 A1 A2 A3 A4
A2 A1 A3 A1 A2 A2 A1 A4 A3 A2 A1 A3 A1 A2 A2 A1 A4 A3
A3 A4 A4 A5 A6 A5 A6 A7 A8 A3 A4 A4 A5 A6 A5 A6 A7 A8
...@@ -129,6 +164,8 @@ The target is named "raid" and it accepts the following parameters: ...@@ -129,6 +164,8 @@ The target is named "raid" and it accepts the following parameters:
A5 A6 A7 A8 A9 A9 A10 A11 A12 A5 A6 A7 A8 A9 A9 A10 A11 A12
A6 A5 A9 A7 A8 A10 A9 A12 A11 A6 A5 A9 A7 A8 A10 A9 A12 A11
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
======== ========== ================
Here we see layouts closely akin to 'RAID1E - Integrated Here we see layouts closely akin to 'RAID1E - Integrated
Offset Stripe Mirroring'. Offset Stripe Mirroring'.
...@@ -190,20 +227,23 @@ The target is named "raid" and it accepts the following parameters: ...@@ -190,20 +227,23 @@ The target is named "raid" and it accepts the following parameters:
Example Tables Example Tables
-------------- --------------
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)
0 1960893648 raid \ ::
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)
0 1960893648 raid \
raid4 1 2048 \ raid4 1 2048 \
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
# RAID4 - 4 data drives, 1 parity (with metadata devices) # RAID4 - 4 data drives, 1 parity (with metadata devices)
# Chunk size of 1MiB, force RAID initialization, # Chunk size of 1MiB, force RAID initialization,
# min recovery rate at 20 kiB/sec/disk # min recovery rate at 20 kiB/sec/disk
0 1960893648 raid \ 0 1960893648 raid \
raid4 4 2048 sync min_recovery_rate 20 \ raid4 4 2048 sync min_recovery_rate 20 \
5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82 5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
...@@ -219,41 +259,58 @@ Arguments that can be repeated are ordered by value. ...@@ -219,41 +259,58 @@ Arguments that can be repeated are ordered by value.
'dmsetup status' yields information on the state and health of the array. 'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for The output is as follows (normally a single line, but expanded here for
clarity): clarity)::
1: <s> <l> raid \
2: <raid_type> <#devices> <health_chars> \ 1: <s> <l> raid \
3: <sync_ratio> <sync_action> <mismatch_cnt> 2: <raid_type> <#devices> <health_chars> \
3: <sync_ratio> <sync_action> <mismatch_cnt>
Line 1 is the standard output produced by device-mapper. Line 1 is the standard output produced by device-mapper.
Line 2 & 3 are produced by the raid target and are best explained by example:
Line 2 & 3 are produced by the raid target and are best explained by example::
0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0 0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
Here we can see the RAID type is raid4, there are 5 devices - all of Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial which are 'A'live, and the array is 2/490221568 complete with its initial
recovery. Here is a fuller description of the individual fields: recovery. Here is a fuller description of the individual fields:
=============== =========================================================
<raid_type> Same as the <raid_type> used to create the array. <raid_type> Same as the <raid_type> used to create the array.
<health_chars> One char for each device, indicating: 'A' = alive and <health_chars> One char for each device, indicating:
in-sync, 'a' = alive but not in-sync, 'D' = dead/failed.
- 'A' = alive and in-sync
- 'a' = alive but not in-sync
- 'D' = dead/failed.
<sync_ratio> The ratio indicating how much of the array has undergone <sync_ratio> The ratio indicating how much of the array has undergone
the process described by 'sync_action'. If the the process described by 'sync_action'. If the
'sync_action' is "check" or "repair", then the process 'sync_action' is "check" or "repair", then the process
of "resync" or "recover" can be considered complete. of "resync" or "recover" can be considered complete.
<sync_action> One of the following possible states: <sync_action> One of the following possible states:
idle - No synchronization action is being performed.
frozen - The current action has been halted. idle
resync - Array is undergoing its initial synchronization - No synchronization action is being performed.
frozen
- The current action has been halted.
resync
- Array is undergoing its initial synchronization
or is resynchronizing after an unclean shutdown or is resynchronizing after an unclean shutdown
(possibly aided by a bitmap). (possibly aided by a bitmap).
recover - A device in the array is being rebuilt or recover
- A device in the array is being rebuilt or
replaced. replaced.
check - A user-initiated full check of the array is check
- A user-initiated full check of the array is
being performed. All blocks are read and being performed. All blocks are read and
checked for consistency. The number of checked for consistency. The number of
discrepancies found are recorded in discrepancies found are recorded in
<mismatch_cnt>. No changes are made to the <mismatch_cnt>. No changes are made to the
array by this action. array by this action.
repair - The same as "check", but discrepancies are repair
- The same as "check", but discrepancies are
corrected. corrected.
reshape - The array is undergoing a reshape. reshape
- The array is undergoing a reshape.
<mismatch_cnt> The number of discrepancies found between mirror copies <mismatch_cnt> The number of discrepancies found between mirror copies
in RAID1/10 or wrong parity values found in RAID4/5/6. in RAID1/10 or wrong parity values found in RAID4/5/6.
This value is valid only after a "check" of the array This value is valid only after a "check" of the array
...@@ -261,10 +318,11 @@ recovery. Here is a fuller description of the individual fields: ...@@ -261,10 +318,11 @@ recovery. Here is a fuller description of the individual fields:
<data_offset> The current data offset to the start of the user data on <data_offset> The current data offset to the start of the user data on
each component device of a raid set (see the respective each component device of a raid set (see the respective
raid parameter to support out-of-place reshaping). raid parameter to support out-of-place reshaping).
<journal_char> 'A' - active write-through journal device. <journal_char> - 'A' - active write-through journal device.
'a' - active write-back journal device. - 'a' - active write-back journal device.
'D' - dead journal device. - 'D' - dead journal device.
'-' - no journal device. - '-' - no journal device.
=============== =========================================================
Message Interface Message Interface
...@@ -272,12 +330,15 @@ Message Interface ...@@ -272,12 +330,15 @@ Message Interface
The dm-raid target will accept certain actions through the 'message' interface. The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.) These actions ('man dmsetup' for more information on the message interface.) These actions
include: include:
"idle" - Halt the current sync action.
"frozen" - Freeze the current sync action. ========= ================================================
"resync" - Initiate/continue a resync. "idle" Halt the current sync action.
"recover"- Initiate/continue a recover process. "frozen" Freeze the current sync action.
"check" - Initiate a check (i.e. a "scrub") of the array. "resync" Initiate/continue a resync.
"repair" - Initiate a repair of the array. "recover" Initiate/continue a recover process.
"check" Initiate a check (i.e. a "scrub") of the array.
"repair" Initiate a repair of the array.
========= ================================================
Discard Support Discard Support
...@@ -307,48 +368,52 @@ increasingly whitelisted in the kernel and can thus be trusted. ...@@ -307,48 +368,52 @@ increasingly whitelisted in the kernel and can thus be trusted.
For trusted devices, the following dm-raid module parameter can be set For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6: to safely enable discard support for RAID 4/5/6:
'devices_handle_discards_safely' 'devices_handle_discards_safely'
Version History Version History
--------------- ---------------
1.0.0 Initial version. Support for RAID 4/5/6
1.1.0 Added support for RAID 1 ::
1.2.0 Handle creation of arrays that contain failed devices.
1.3.0 Added support for RAID 10 1.0.0 Initial version. Support for RAID 4/5/6
1.3.1 Allow device replacement/rebuild for RAID 10 1.1.0 Added support for RAID 1
1.3.2 Fix/improve redundancy checking for RAID10 1.2.0 Handle creation of arrays that contain failed devices.
1.4.0 Non-functional change. Removes arg from mapping function. 1.3.0 Added support for RAID 10
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5). 1.3.1 Allow device replacement/rebuild for RAID 10
1.4.2 Add RAID10 "far" and "offset" algorithm support. 1.3.2 Fix/improve redundancy checking for RAID10
1.5.0 Add message interface to allow manipulation of the sync_action. 1.4.0 Non-functional change. Removes arg from mapping function.
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
1.4.2 Add RAID10 "far" and "offset" algorithm support.
1.5.0 Add message interface to allow manipulation of the sync_action.
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt. New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1 Add ability to restore transiently failed devices on resume. 1.5.1 Add ability to restore transiently failed devices on resume.
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check". 1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0 Add discard support (and devices_handle_discard_safely module param). 1.6.0 Add discard support (and devices_handle_discard_safely module param).
1.7.0 Add support for MD RAID0 mappings. 1.7.0 Add support for MD RAID0 mappings.
1.8.0 Explicitly check for compatible flags in the superblock metadata 1.8.0 Explicitly check for compatible flags in the superblock metadata
and reject to start the raid set if any are set by a newer and reject to start the raid set if any are set by a newer
target version, thus avoiding data corruption on a raid set target version, thus avoiding data corruption on a raid set
with a reshape in progress. with a reshape in progress.
1.9.0 Add support for RAID level takeover/reshape/region size 1.9.0 Add support for RAID level takeover/reshape/region size
and set size reduction. and set size reduction.
1.9.1 Fix activation of existing RAID 4/10 mapped devices 1.9.1 Fix activation of existing RAID 4/10 mapped devices
1.9.2 Don't emit '- -' on the status table line in case the constructor 1.9.2 Don't emit '- -' on the status table line in case the constructor
fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
'D' on the status line. If '- -' is passed into the constructor, emit 'D' on the status line. If '- -' is passed into the constructor, emit
'- -' on the table line and '-' as the status line health character. '- -' on the table line and '-' as the status line health character.
1.10.0 Add support for raid4/5/6 journal device 1.10.0 Add support for raid4/5/6 journal device
1.10.1 Fix data corruption on reshape request 1.10.1 Fix data corruption on reshape request
1.11.0 Fix table line argument order 1.11.0 Fix table line argument order
(wrong raid10_copies/raid10_format sequence) (wrong raid10_copies/raid10_format sequence)
1.11.1 Add raid4/5/6 journal write-back support via journal_mode option 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available 1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A') 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an 1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an
state races. state races.
1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen 1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
1.14.0 Fix reshape race on small devices. Fix stripe adding reshape 1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
deadlock/potential data corruption. Update superblock when deadlock/potential data corruption. Update superblock when
specific devices are requested via rebuild. Fix RAID leg specific devices are requested via rebuild. Fix RAID leg
rebuild errors. rebuild errors.
===============
dm-service-time dm-service-time
=============== ===============
...@@ -12,24 +13,33 @@ in a path-group, and it can be specified as a table argument. ...@@ -12,24 +13,33 @@ in a path-group, and it can be specified as a table argument.
The path selector name is 'service-time'. The path selector name is 'service-time'.
Table parameters for each path: [<repeat_count> [<relative_throughput>]] Table parameters for each path:
<repeat_count>: The number of I/Os to dispatch using the selected
[<repeat_count> [<relative_throughput>]]
<repeat_count>:
The number of I/Os to dispatch using the selected
path before switching to the next path. path before switching to the next path.
If not given, internal default is used. To check If not given, internal default is used. To check
the default value, see the activated table. the default value, see the activated table.
<relative_throughput>: The relative throughput value of the path <relative_throughput>:
The relative throughput value of the path
among all paths in the path-group. among all paths in the path-group.
The valid range is 0-100. The valid range is 0-100.
If not given, minimum value '1' is used. If not given, minimum value '1' is used.
If '0' is given, the path isn't selected while If '0' is given, the path isn't selected while
other paths having a positive value are available. other paths having a positive value are available.
Status for each path: <status> <fail-count> <in-flight-size> \ Status for each path:
<relative_throughput>
<status>: 'A' if the path is active, 'F' if the path is failed. <status> <fail-count> <in-flight-size> <relative_throughput>
<fail-count>: The number of path failures. <status>:
<in-flight-size>: The size of in-flight I/Os on the path. 'A' if the path is active, 'F' if the path is failed.
<relative_throughput>: The relative throughput value of the path <fail-count>:
The number of path failures.
<in-flight-size>:
The size of in-flight I/Os on the path.
<relative_throughput>:
The relative throughput value of the path
among all paths in the path-group. among all paths in the path-group.
...@@ -39,7 +49,7 @@ Algorithm ...@@ -39,7 +49,7 @@ Algorithm
dm-service-time adds the I/O size to 'in-flight-size' when the I/O is dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
dispatched and subtracts when completed. dispatched and subtracts when completed.
Basically, dm-service-time selects a path having minimum service time Basically, dm-service-time selects a path having minimum service time
which is calculated by: which is calculated by::
('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput' ('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'
...@@ -67,25 +77,25 @@ Examples ...@@ -67,25 +77,25 @@ Examples
======== ========
In case that 2 paths (sda and sdb) are used with repeat_count == 128 In case that 2 paths (sda and sdb) are used with repeat_count == 128
and sda has an average throughput 1GB/s and sdb has 4GB/s, and sda has an average throughput 1GB/s and sdb has 4GB/s,
'relative_throughput' value may be '1' for sda and '4' for sdb. 'relative_throughput' value may be '1' for sda and '4' for sdb::
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \ # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
dmsetup create test dmsetup create test
# #
# dmsetup table # dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4 test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
# #
# dmsetup status # dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4 test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
Or '2' for sda and '8' for sdb would be also true. Or '2' for sda and '8' for sdb would be also true::
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \ # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
dmsetup create test dmsetup create test
# #
# dmsetup table # dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8 test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
# #
# dmsetup status # dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8 test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
====================
device-mapper uevent
====================
The device-mapper uevent code adds the capability to device-mapper to create The device-mapper uevent code adds the capability to device-mapper to create
and send kobject uevents (uevents). Previously device-mapper events were only and send kobject uevents (uevents). Previously device-mapper events were only
available through the ioctl interface. The advantage of the uevents interface available through the ioctl interface. The advantage of the uevents interface
...@@ -6,92 +10,101 @@ the event avoiding the need to query the state of the device-mapper device after ...@@ -6,92 +10,101 @@ the event avoiding the need to query the state of the device-mapper device after
the event is received. the event is received.
There are two functions currently for device-mapper events. The first function There are two functions currently for device-mapper events. The first function
listed creates the event and the second function sends the event(s). listed creates the event and the second function sends the event(s)::
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti, void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
const char *path, unsigned nr_valid_paths) const char *path, unsigned nr_valid_paths)
void dm_send_uevents(struct list_head *events, struct kobject *kobj) void dm_send_uevents(struct list_head *events, struct kobject *kobj)
The variables added to the uevent environment are: The variables added to the uevent environment are:
Variable Name: DM_TARGET Variable Name: DM_TARGET
Uevent Action(s): KOBJ_CHANGE ------------------------
Type: string :Uevent Action(s): KOBJ_CHANGE
Description: :Type: string
Value: Name of device-mapper target that generated the event. :Description:
:Value: Name of device-mapper target that generated the event.
Variable Name: DM_ACTION Variable Name: DM_ACTION
Uevent Action(s): KOBJ_CHANGE ------------------------
Type: string :Uevent Action(s): KOBJ_CHANGE
Description: :Type: string
Value: Device-mapper specific action that caused the uevent action. :Description:
PATH_FAILED - A path has failed. :Value: Device-mapper specific action that caused the uevent action.
PATH_FAILED - A path has failed;
PATH_REINSTATED - A path has been reinstated. PATH_REINSTATED - A path has been reinstated.
Variable Name: DM_SEQNUM Variable Name: DM_SEQNUM
Uevent Action(s): KOBJ_CHANGE ------------------------
Type: unsigned integer :Uevent Action(s): KOBJ_CHANGE
Description: A sequence number for this specific device-mapper device. :Type: unsigned integer
Value: Valid unsigned integer range. :Description: A sequence number for this specific device-mapper device.
:Value: Valid unsigned integer range.
Variable Name: DM_PATH Variable Name: DM_PATH
Uevent Action(s): KOBJ_CHANGE ----------------------
Type: string :Uevent Action(s): KOBJ_CHANGE
Description: Major and minor number of the path device pertaining to this :Type: string
event. :Description: Major and minor number of the path device pertaining to this
Value: Path name in the form of "Major:Minor" event.
:Value: Path name in the form of "Major:Minor"
Variable Name: DM_NR_VALID_PATHS Variable Name: DM_NR_VALID_PATHS
Uevent Action(s): KOBJ_CHANGE --------------------------------
Type: unsigned integer :Uevent Action(s): KOBJ_CHANGE
Description: :Type: unsigned integer
Value: Valid unsigned integer range. :Description:
:Value: Valid unsigned integer range.
Variable Name: DM_NAME Variable Name: DM_NAME
Uevent Action(s): KOBJ_CHANGE ----------------------
Type: string :Uevent Action(s): KOBJ_CHANGE
Description: Name of the device-mapper device. :Type: string
Value: Name :Description: Name of the device-mapper device.
:Value: Name
Variable Name: DM_UUID Variable Name: DM_UUID
Uevent Action(s): KOBJ_CHANGE ----------------------
Type: string :Uevent Action(s): KOBJ_CHANGE
Description: UUID of the device-mapper device. :Type: string
Value: UUID. (Empty string if there isn't one.) :Description: UUID of the device-mapper device.
:Value: UUID. (Empty string if there isn't one.)
An example of the uevents generated as captured by udevmonitor is shown An example of the uevents generated as captured by udevmonitor is shown
below. below
1.) Path failure. 1.) Path failure::
UEVENT[1192521009.711215] change@/block/dm-3
ACTION=change UEVENT[1192521009.711215] change@/block/dm-3
DEVPATH=/block/dm-3 ACTION=change
SUBSYSTEM=block DEVPATH=/block/dm-3
DM_TARGET=multipath SUBSYSTEM=block
DM_ACTION=PATH_FAILED DM_TARGET=multipath
DM_SEQNUM=1 DM_ACTION=PATH_FAILED
DM_PATH=8:32 DM_SEQNUM=1
DM_NR_VALID_PATHS=0 DM_PATH=8:32
DM_NAME=mpath2 DM_NR_VALID_PATHS=0
DM_UUID=mpath-35333333000002328 DM_NAME=mpath2
MINOR=3 DM_UUID=mpath-35333333000002328
MAJOR=253 MINOR=3
SEQNUM=1130 MAJOR=253
SEQNUM=1130
2.) Path reinstate.
UEVENT[1192521132.989927] change@/block/dm-3 2.) Path reinstate::
ACTION=change
DEVPATH=/block/dm-3 UEVENT[1192521132.989927] change@/block/dm-3
SUBSYSTEM=block ACTION=change
DM_TARGET=multipath DEVPATH=/block/dm-3
DM_ACTION=PATH_REINSTATED SUBSYSTEM=block
DM_SEQNUM=2 DM_TARGET=multipath
DM_PATH=8:32 DM_ACTION=PATH_REINSTATED
DM_NR_VALID_PATHS=1 DM_SEQNUM=2
DM_NAME=mpath2 DM_PATH=8:32
DM_UUID=mpath-35333333000002328 DM_NR_VALID_PATHS=1
MINOR=3 DM_NAME=mpath2
MAJOR=253 DM_UUID=mpath-35333333000002328
SEQNUM=1131 MINOR=3
MAJOR=253
SEQNUM=1131
========
dm-zoned dm-zoned
======== ========
...@@ -133,12 +134,13 @@ A zoned block device must first be formatted using the dmzadm tool. This ...@@ -133,12 +134,13 @@ A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets. metadata sets on the device and initialize the metadata sets.
Ex: Ex::
dmzadm --format /dev/sdxx dmzadm --format /dev/sdxx
For a formatted device, the target can be created normally with the For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex: underlying zoned block device name. Ex::
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}` echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
dmsetup create dmz-`basename ${dev}`
======
dm-era
======
Introduction Introduction
============ ============
...@@ -14,12 +18,14 @@ coherency after rolling back a vendor snapshot. ...@@ -14,12 +18,14 @@ coherency after rolling back a vendor snapshot.
Constructor Constructor
=========== ===========
era <metadata dev> <origin dev> <block size> era <metadata dev> <origin dev> <block size>
metadata dev : fast device holding the persistent metadata ================ ======================================================
origin dev : device holding data blocks that may change metadata dev fast device holding the persistent metadata
block size : block size of origin data device, granularity that is origin dev device holding data blocks that may change
block size block size of origin data device, granularity that is
tracked by the target tracked by the target
================ ======================================================
Messages Messages
======== ========
...@@ -49,14 +55,16 @@ Status ...@@ -49,14 +55,16 @@ Status
<metadata block size> <#used metadata blocks>/<#total metadata blocks> <metadata block size> <#used metadata blocks>/<#total metadata blocks>
<current era> <held metadata root | '-'> <current era> <held metadata root | '-'>
metadata block size : Fixed block size for each metadata block in ========================= ==============================================
metadata block size Fixed block size for each metadata block in
sectors sectors
#used metadata blocks : Number of metadata blocks used #used metadata blocks Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks #total metadata blocks Total number of metadata blocks
current era : The current era current era The current era
held metadata root : The location, in blocks, of the metadata root held metadata root The location, in blocks, of the metadata root
that has been 'held' for userspace read that has been 'held' for userspace read
access. '-' indicates there is no held root access. '-' indicates there is no held root
========================= ==============================================
Detailed use case Detailed use case
================= =================
...@@ -88,7 +96,7 @@ Memory usage ...@@ -88,7 +96,7 @@ Memory usage
The target uses a bitset to record writes in the current era. It also The target uses a bitset to record writes in the current era. It also
has a spare bitset ready for switching over to a new era. Other than has a spare bitset ready for switching over to a new era. Other than
that it uses a few 4k blocks for updating metadata. that it uses a few 4k blocks for updating metadata::
(4 * nr_blocks) bytes + buffers (4 * nr_blocks) bytes + buffers
......
:orphan:
=============
Device Mapper
=============
.. toctree::
:maxdepth: 1
cache-policies
cache
delay
dm-crypt
dm-flakey
dm-init
dm-integrity
dm-io
dm-log
dm-queue-length
dm-raid
dm-service-time
dm-uevent
dm-zoned
era
kcopyd
linear
log-writes
persistent-data
snapshot
statistics
striped
switch
thin-provisioning
unstriped
verity
writecache
zero
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
======
kcopyd kcopyd
====== ======
...@@ -7,7 +8,7 @@ notification. It is used by dm-snapshot and dm-mirror. ...@@ -7,7 +8,7 @@ notification. It is used by dm-snapshot and dm-mirror.
Users of kcopyd must first create a client and indicate how many memory pages Users of kcopyd must first create a client and indicate how many memory pages
to set aside for their copy jobs. This is done with a call to to set aside for their copy jobs. This is done with a call to
kcopyd_client_create(). kcopyd_client_create()::
int kcopyd_client_create(unsigned int num_pages, int kcopyd_client_create(unsigned int num_pages,
struct kcopyd_client **result); struct kcopyd_client **result);
...@@ -16,7 +17,7 @@ To start a copy job, the user must set up io_region structures to describe ...@@ -16,7 +17,7 @@ To start a copy job, the user must set up io_region structures to describe
the source and destinations of the copy. Each io_region indicates a the source and destinations of the copy. Each io_region indicates a
block-device along with the starting sector and size of the region. The source block-device along with the starting sector and size of the region. The source
of the copy is given as one io_region structure, and the destinations of the of the copy is given as one io_region structure, and the destinations of the
copy are given as an array of io_region structures. copy are given as an array of io_region structures::
struct io_region { struct io_region {
struct block_device *bdev; struct block_device *bdev;
...@@ -26,7 +27,7 @@ copy are given as an array of io_region structures. ...@@ -26,7 +27,7 @@ copy are given as an array of io_region structures.
To start the copy, the user calls kcopyd_copy(), passing in the client To start the copy, the user calls kcopyd_copy(), passing in the client
pointer, pointers to the source and destination io_regions, the name of a pointer, pointers to the source and destination io_regions, the name of a
completion callback routine, and a pointer to some context data for the copy. completion callback routine, and a pointer to some context data for the copy::
int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from, int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
unsigned int num_dests, struct io_region *dests, unsigned int num_dests, struct io_region *dests,
...@@ -41,7 +42,6 @@ write error occurred during the copy. ...@@ -41,7 +42,6 @@ write error occurred during the copy.
When a user is done with all their copy jobs, they should call When a user is done with all their copy jobs, they should call
kcopyd_client_destroy() to delete the kcopyd client, which will release the kcopyd_client_destroy() to delete the kcopyd client, which will release the
associated memory pages. associated memory pages::
void kcopyd_client_destroy(struct kcopyd_client *kc); void kcopyd_client_destroy(struct kcopyd_client *kc);
=========
dm-linear
=========
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
device onto a linear range of another device. This is the basic building
block of logical volume managers.
Parameters: <dev path> <offset>
<dev path>:
Full pathname to the underlying block-device, or a
"major:minor" device-number.
<offset>:
Starting sector within the device.
Example scripts
===============
::
#!/bin/sh
# Create an identity mapping for a device
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
::
#!/bin/sh
# Join 2 devices together
size1=`blockdev --getsz $1`
size2=`blockdev --getsz $2`
echo "0 $size1 linear $1 0
$size1 $size2 linear $2 0" | dmsetup create joined
::
#!/usr/bin/perl -w
# Split a device into 4M chunks and then join them together in reverse order.
my $name = "reverse";
my $extent_size = 4 * 1024 * 2;
my $dev = $ARGV[0];
my $table = "";
my $count = 0;
if (!defined($dev)) {
die("Please specify a device.\n");
}
my $dev_size = `blockdev --getsz $dev`;
my $extents = int($dev_size / $extent_size) -
(($dev_size % $extent_size) ? 1 : 0);
while ($extents > 0) {
my $this_start = $count * $extent_size;
$extents--;
$count++;
my $this_offset = $extents * $extent_size;
$table .= "$this_start $extent_size linear $dev $this_offset\n";
}
`echo \"$table\" | dmsetup create $name`;
dm-linear
=========
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
device onto a linear range of another device. This is the basic building
block of logical volume managers.
Parameters: <dev path> <offset>
<dev path>: Full pathname to the underlying block-device, or a
"major:minor" device-number.
<offset>: Starting sector within the device.
Example scripts
===============
[[
#!/bin/sh
# Create an identity mapping for a device
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
]]
[[
#!/bin/sh
# Join 2 devices together
size1=`blockdev --getsz $1`
size2=`blockdev --getsz $2`
echo "0 $size1 linear $1 0
$size1 $size2 linear $2 0" | dmsetup create joined
]]
[[
#!/usr/bin/perl -w
# Split a device into 4M chunks and then join them together in reverse order.
my $name = "reverse";
my $extent_size = 4 * 1024 * 2;
my $dev = $ARGV[0];
my $table = "";
my $count = 0;
if (!defined($dev)) {
die("Please specify a device.\n");
}
my $dev_size = `blockdev --getsz $dev`;
my $extents = int($dev_size / $extent_size) -
(($dev_size % $extent_size) ? 1 : 0);
while ($extents > 0) {
my $this_start = $count * $extent_size;
$extents--;
$count++;
my $this_offset = $extents * $extent_size;
$table .= "$this_start $extent_size linear $dev $this_offset\n";
}
`echo \"$table\" | dmsetup create $name`;
]]
=============
dm-log-writes dm-log-writes
============= =============
...@@ -25,11 +26,11 @@ completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to ...@@ -25,11 +26,11 @@ completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures. Consider the simulate the worst case scenario with regard to power failures. Consider the
following example (W means write, C means complete): following example (W means write, C means complete):
W1,W2,W3,C3,C2,Wflush,C1,Cflush W1,W2,W3,C3,C2,Wflush,C1,Cflush
The log would show the following The log would show the following:
W3,W2,flush,W1.... W3,W2,flush,W1....
Again this is to simulate what is actually on disk, this allows us to detect Again this is to simulate what is actually on disk, this allows us to detect
cases where a power failure at a particular point in time would create an cases where a power failure at a particular point in time would create an
...@@ -42,11 +43,11 @@ Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would ...@@ -42,11 +43,11 @@ Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request. Consider the following example: request. Consider the following example:
WRITE block 1, DISCARD block 1, FLUSH WRITE block 1, DISCARD block 1, FLUSH
If we logged DISCARD when it completed, the replay would look like this If we logged DISCARD when it completed, the replay would look like this:
DISCARD 1, WRITE 1, FLUSH DISCARD 1, WRITE 1, FLUSH
which isn't quite what happened and wouldn't be caught during the log replay. which isn't quite what happened and wouldn't be caught during the log replay.
...@@ -57,15 +58,19 @@ i) Constructor ...@@ -57,15 +58,19 @@ i) Constructor
log-writes <dev_path> <log_dev_path> log-writes <dev_path> <log_dev_path>
dev_path : Device that all of the IO will go to normally. ============= ==============================================
log_dev_path : Device where the log entries are written to. dev_path Device that all of the IO will go to normally.
log_dev_path Device where the log entries are written to.
============= ==============================================
ii) Status ii) Status
<#logged entries> <highest allocated sector> <#logged entries> <highest allocated sector>
#logged entries : Number of logged entries =========================== ========================
highest allocated sector : Highest allocated sector #logged entries Number of logged entries
highest allocated sector Highest allocated sector
=========================== ========================
iii) Messages iii) Messages
...@@ -75,7 +80,7 @@ iii) Messages ...@@ -75,7 +80,7 @@ iii) Messages
For example say you want to fsck a file system after every For example say you want to fsck a file system after every
write, but first you need to replay up to the mkfs to make sure write, but first you need to replay up to the mkfs to make sure
we're fsck'ing something reasonable, you would do something like we're fsck'ing something reasonable, you would do something like
this: this::
mkfs.btrfs -f /dev/mapper/log mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs dmsetup message log 0 mark mkfs
...@@ -97,42 +102,42 @@ Example usage ...@@ -97,42 +102,42 @@ Example usage
============= =============
Say you want to test fsync on your file system. You would do something like Say you want to test fsync on your file system. You would do something like
this: this::
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE" dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs dmsetup message log 0 mark mkfs
mount /dev/mapper/log /mnt/btrfs-test mount /dev/mapper/log /mnt/btrfs-test
<some test that does fsync at the end> <some test that does fsync at the end>
dmsetup message log 0 mark fsync dmsetup message log 0 mark fsync
md5sum /mnt/btrfs-test/foo md5sum /mnt/btrfs-test/foo
umount /mnt/btrfs-test umount /mnt/btrfs-test
dmsetup remove log dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
mount /dev/sdb /mnt/btrfs-test mount /dev/sdb /mnt/btrfs-test
md5sum /mnt/btrfs-test/foo md5sum /mnt/btrfs-test/foo
<verify md5sum's are correct> <verify md5sum's are correct>
Another option is to do a complicated file system operation and verify the file Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with: system is consistent during the entire operation. You could do this with:
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE" dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs dmsetup message log 0 mark mkfs
mount /dev/mapper/log /mnt/btrfs-test mount /dev/mapper/log /mnt/btrfs-test
<fsstress to dirty the fs> <fsstress to dirty the fs>
btrfs filesystem balance /mnt/btrfs-test btrfs filesystem balance /mnt/btrfs-test
umount /mnt/btrfs-test umount /mnt/btrfs-test
dmsetup remove log dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
btrfsck /dev/sdb btrfsck /dev/sdb
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \ replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
--fsck "btrfsck /dev/sdb" --check fua --fsck "btrfsck /dev/sdb" --check fua
And that will replay the log until it sees a FUA request, run the fsck command And that will replay the log until it sees a FUA request, run the fsck command
......
===============
Persistent data
===============
Introduction Introduction
============ ============
......
==============================
Device-mapper snapshot support Device-mapper snapshot support
============================== ==============================
Device-mapper allows you, without massive data copying: Device-mapper allows you, without massive data copying:
*) To create snapshots of any block device i.e. mountable, saved states of - To create snapshots of any block device i.e. mountable, saved states of
the block device which are also writable without interfering with the the block device which are also writable without interfering with the
original content; original content;
*) To create device "forks", i.e. multiple different versions of the - To create device "forks", i.e. multiple different versions of the
same data stream. same data stream.
*) To merge a snapshot of a block device back into the snapshot's origin - To merge a snapshot of a block device back into the snapshot's origin
device. device.
In the first two cases, dm copies only the chunks of data that get In the first two cases, dm copies only the chunks of data that get
changed and uses a separate copy-on-write (COW) block device for changed and uses a separate copy-on-write (COW) block device for
...@@ -22,7 +23,7 @@ the origin device. ...@@ -22,7 +23,7 @@ the origin device.
There are three dm targets available: There are three dm targets available:
snapshot, snapshot-origin, and snapshot-merge. snapshot, snapshot-origin, and snapshot-merge.
*) snapshot-origin <origin> - snapshot-origin <origin>
which will normally have one or more snapshots based on it. which will normally have one or more snapshots based on it.
Reads will be mapped directly to the backing device. For each write, the Reads will be mapped directly to the backing device. For each write, the
...@@ -30,7 +31,7 @@ original data will be saved in the <COW device> of each snapshot to keep ...@@ -30,7 +31,7 @@ original data will be saved in the <COW device> of each snapshot to keep
its visible content unchanged, at least until the <COW device> fills up. its visible content unchanged, at least until the <COW device> fills up.
*) snapshot <origin> <COW device> <persistent?> <chunksize> - snapshot <origin> <COW device> <persistent?> <chunksize>
A snapshot of the <origin> block device is created. Changed chunks of A snapshot of the <origin> block device is created. Changed chunks of
<chunksize> sectors will be stored on the <COW device>. Writes will <chunksize> sectors will be stored on the <COW device>. Writes will
...@@ -83,25 +84,25 @@ When you create the first LVM2 snapshot of a volume, four dm devices are used: ...@@ -83,25 +84,25 @@ When you create the first LVM2 snapshot of a volume, four dm devices are used:
source volume), whose table is replaced by a "snapshot-origin" mapping source volume), whose table is replaced by a "snapshot-origin" mapping
from device #1. from device #1.
A fixed naming scheme is used, so with the following commands: A fixed naming scheme is used, so with the following commands::
lvcreate -L 1G -n base volumeGroup lvcreate -L 1G -n base volumeGroup
lvcreate -L 100M --snapshot -n snap volumeGroup/base lvcreate -L 100M --snapshot -n snap volumeGroup/base
we'll have this situation (with volumes in above order): we'll have this situation (with volumes in above order)::
# dmsetup table|grep volumeGroup # dmsetup table|grep volumeGroup
volumeGroup-base-real: 0 2097152 linear 8:19 384 volumeGroup-base-real: 0 2097152 linear 8:19 384
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536 volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16 volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
volumeGroup-base: 0 2097152 snapshot-origin 254:11 volumeGroup-base: 0 2097152 snapshot-origin 254:11
# ls -lL /dev/mapper/volumeGroup-* # ls -lL /dev/mapper/volumeGroup-*
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
How snapshot-merge is used by LVM2 How snapshot-merge is used by LVM2
...@@ -114,27 +115,28 @@ merging snapshot after it completes. The "snapshot" that hands over its ...@@ -114,27 +115,28 @@ merging snapshot after it completes. The "snapshot" that hands over its
COW device to the "snapshot-merge" is deactivated (unless using lvchange COW device to the "snapshot-merge" is deactivated (unless using lvchange
--refresh); but if it is left active it will simply return I/O errors. --refresh); but if it is left active it will simply return I/O errors.
A snapshot will merge into its origin with the following command: A snapshot will merge into its origin with the following command::
lvconvert --merge volumeGroup/snap lvconvert --merge volumeGroup/snap
we'll now have this situation: we'll now have this situation::
# dmsetup table|grep volumeGroup # dmsetup table|grep volumeGroup
volumeGroup-base-real: 0 2097152 linear 8:19 384 volumeGroup-base-real: 0 2097152 linear 8:19 384
volumeGroup-base-cow: 0 204800 linear 8:19 2097536 volumeGroup-base-cow: 0 204800 linear 8:19 2097536
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16 volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
# ls -lL /dev/mapper/volumeGroup-* # ls -lL /dev/mapper/volumeGroup-*
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
How to determine when a merging is complete How to determine when a merging is complete
=========================================== ===========================================
The snapshot-merge and snapshot status lines end with: The snapshot-merge and snapshot status lines end with:
<sectors_allocated>/<total_sectors> <metadata_sectors> <sectors_allocated>/<total_sectors> <metadata_sectors>
Both <sectors_allocated> and <total_sectors> include both data and metadata. Both <sectors_allocated> and <total_sectors> include both data and metadata.
...@@ -142,35 +144,37 @@ During merging, the number of sectors allocated gets smaller and ...@@ -142,35 +144,37 @@ During merging, the number of sectors allocated gets smaller and
smaller. Merging has finished when the number of sectors holding data smaller. Merging has finished when the number of sectors holding data
is zero, in other words <sectors_allocated> == <metadata_sectors>. is zero, in other words <sectors_allocated> == <metadata_sectors>.
Here is a practical example (using a hybrid of lvm and dmsetup commands): Here is a practical example (using a hybrid of lvm and dmsetup commands)::
# lvs # lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup owi-a- 4.00g base volumeGroup owi-a- 4.00g
snap volumeGroup swi-a- 1.00g base 18.97 snap volumeGroup swi-a- 1.00g base 18.97
# dmsetup status volumeGroup-snap # dmsetup status volumeGroup-snap
0 8388608 snapshot 397896/2097152 1560 0 8388608 snapshot 397896/2097152 1560
^^^^ metadata sectors ^^^^ metadata sectors
# lvconvert --merge -b volumeGroup/snap # lvconvert --merge -b volumeGroup/snap
Merging of volume snap started. Merging of volume snap started.
# lvs volumeGroup/snap # lvs volumeGroup/snap
LV VG Attr LSize Origin Snap% Move Log Copy% Convert LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup Owi-a- 4.00g 17.23 base volumeGroup Owi-a- 4.00g 17.23
# dmsetup status volumeGroup-base # dmsetup status volumeGroup-base
0 8388608 snapshot-merge 281688/2097152 1104 0 8388608 snapshot-merge 281688/2097152 1104
# dmsetup status volumeGroup-base # dmsetup status volumeGroup-base
0 8388608 snapshot-merge 180480/2097152 712 0 8388608 snapshot-merge 180480/2097152 712
# dmsetup status volumeGroup-base # dmsetup status volumeGroup-base
0 8388608 snapshot-merge 16/2097152 16 0 8388608 snapshot-merge 16/2097152 16
Merging has finished. Merging has finished.
# lvs ::
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup owi-a- 4.00g base volumeGroup owi-a- 4.00g
=============
DM statistics DM statistics
============= =============
...@@ -11,7 +12,7 @@ Individual statistics will be collected for each step-sized area within ...@@ -11,7 +12,7 @@ Individual statistics will be collected for each step-sized area within
the range specified. the range specified.
The I/O statistics counters for each step-sized area of a region are The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats (see: in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see:
Documentation/iostats.txt). But two extra counters (12 and 13) are Documentation/iostats.txt). But two extra counters (12 and 13) are
provided: total time spent reading and writing. When the histogram provided: total time spent reading and writing. When the histogram
argument is used, the 14th parameter is reported that represents the argument is used, the 14th parameter is reported that represents the
...@@ -32,40 +33,45 @@ on each other's data. ...@@ -32,40 +33,45 @@ on each other's data.
The creation of DM statistics will allocate memory via kmalloc or The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space. At most, 1/4 of the overall system fallback to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much memory may be allocated by DM statistics. The admin can see how much
memory is used by reading memory is used by reading:
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
Messages Messages
======== ========
@stats_create <range> <step> @stats_create <range> <step> [<number_of_optional_arguments> <optional_arguments>...] [<program_id> [<aux_data>]]
[<number_of_optional_arguments> <optional_arguments>...]
[<program_id> [<aux_data>]]
Create a new region and return the region_id. Create a new region and return the region_id.
<range> <range>
"-" - whole device "-"
"<start_sector>+<length>" - a range of <length> 512-byte sectors whole device
"<start_sector>+<length>"
a range of <length> 512-byte sectors
starting with <start_sector>. starting with <start_sector>.
<step> <step>
"<area_size>" - the range is subdivided into areas each containing "<area_size>"
the range is subdivided into areas each containing
<area_size> sectors. <area_size> sectors.
"/<number_of_areas>" - the range is subdivided into the specified "/<number_of_areas>"
the range is subdivided into the specified
number of areas. number of areas.
<number_of_optional_arguments> <number_of_optional_arguments>
The number of optional arguments The number of optional arguments
<optional_arguments> <optional_arguments>
The following optional arguments are supported The following optional arguments are supported:
precise_timestamps - use precise timer with nanosecond resolution
precise_timestamps
use precise timer with nanosecond resolution
instead of the "jiffies" variable. When this argument is instead of the "jiffies" variable. When this argument is
used, the resulting times are in nanoseconds instead of used, the resulting times are in nanoseconds instead of
milliseconds. Precise timestamps are a little bit slower milliseconds. Precise timestamps are a little bit slower
to obtain than jiffies-based timestamps. to obtain than jiffies-based timestamps.
histogram:n1,n2,n3,n4,... - collect histogram of latencies. The histogram:n1,n2,n3,n4,...
collect histogram of latencies. The
numbers n1, n2, etc are times that represent the boundaries numbers n1, n2, etc are times that represent the boundaries
of the histogram. If precise_timestamps is not used, the of the histogram. If precise_timestamps is not used, the
times are in milliseconds, otherwise they are in times are in milliseconds, otherwise they are in
...@@ -96,21 +102,18 @@ Messages ...@@ -96,21 +102,18 @@ Messages
@stats_list message, but it doesn't use this value for anything. @stats_list message, but it doesn't use this value for anything.
@stats_delete <region_id> @stats_delete <region_id>
Delete the region with the specified id. Delete the region with the specified id.
<region_id> <region_id>
region_id returned from @stats_create region_id returned from @stats_create
@stats_clear <region_id> @stats_clear <region_id>
Clear all the counters except the in-flight i/o counters. Clear all the counters except the in-flight i/o counters.
<region_id> <region_id>
region_id returned from @stats_create region_id returned from @stats_create
@stats_list [<program_id>] @stats_list [<program_id>]
List all regions registered with @stats_create. List all regions registered with @stats_create.
<program_id> <program_id>
...@@ -127,7 +130,6 @@ Messages ...@@ -127,7 +130,6 @@ Messages
if they were specified when creating the region. if they were specified when creating the region.
@stats_print <region_id> [<starting_line> <number_of_lines>] @stats_print <region_id> [<starting_line> <number_of_lines>]
Print counters for each step-sized area of a region. Print counters for each step-sized area of a region.
<region_id> <region_id>
...@@ -143,10 +145,11 @@ Messages ...@@ -143,10 +145,11 @@ Messages
Output format for each step-sized area of a region: Output format for each step-sized area of a region:
<start_sector>+<length> counters <start_sector>+<length>
counters
The first 11 counters have the same meaning as The first 11 counters have the same meaning as
/sys/block/*/stat or /proc/diskstats. `/sys/block/*/stat or /proc/diskstats`.
Please refer to Documentation/iostats.txt for details. Please refer to Documentation/iostats.txt for details.
...@@ -163,11 +166,11 @@ Messages ...@@ -163,11 +166,11 @@ Messages
11. the weighted number of milliseconds spent doing I/Os 11. the weighted number of milliseconds spent doing I/Os
Additional counters: Additional counters:
12. the total time spent reading in milliseconds 12. the total time spent reading in milliseconds
13. the total time spent writing in milliseconds 13. the total time spent writing in milliseconds
@stats_print_clear <region_id> [<starting_line> <number_of_lines>] @stats_print_clear <region_id> [<starting_line> <number_of_lines>]
Atomically print and then clear all the counters except the Atomically print and then clear all the counters except the
in-flight i/o counters. Useful when the client consuming the in-flight i/o counters. Useful when the client consuming the
statistics does not want to lose any statistics (those updated statistics does not want to lose any statistics (those updated
...@@ -185,7 +188,6 @@ Messages ...@@ -185,7 +188,6 @@ Messages
If omitted, all lines are printed and then cleared. If omitted, all lines are printed and then cleared.
@stats_set_aux <region_id> <aux_data> @stats_set_aux <region_id> <aux_data>
Store auxiliary data aux_data for the specified region. Store auxiliary data aux_data for the specified region.
<region_id> <region_id>
...@@ -201,23 +203,23 @@ Examples ...@@ -201,23 +203,23 @@ Examples
======== ========
Subdivide the DM device 'vol' into 100 pieces and start collecting Subdivide the DM device 'vol' into 100 pieces and start collecting
statistics on them: statistics on them::
dmsetup message vol 0 @stats_create - /100 dmsetup message vol 0 @stats_create - /100
Set the auxiliary data string to "foo bar baz" (the escape for each Set the auxiliary data string to "foo bar baz" (the escape for each
space must also be escaped, otherwise the shell will consume them): space must also be escaped, otherwise the shell will consume them)::
dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
List the statistics: List the statistics::
dmsetup message vol 0 @stats_list dmsetup message vol 0 @stats_list
Print the statistics: Print the statistics::
dmsetup message vol 0 @stats_print 0 dmsetup message vol 0 @stats_print 0
Delete the statistics: Delete the statistics::
dmsetup message vol 0 @stats_delete 0 dmsetup message vol 0 @stats_delete 0
=========
dm-stripe dm-stripe
========= =========
...@@ -8,12 +9,16 @@ potentially provide improved I/O throughput by utilizing several physical ...@@ -8,12 +9,16 @@ potentially provide improved I/O throughput by utilizing several physical
devices in parallel. devices in parallel.
Parameters: <num devs> <chunk size> [<dev path> <offset>]+ Parameters: <num devs> <chunk size> [<dev path> <offset>]+
<num devs>: Number of underlying devices. <num devs>:
<chunk size>: Size of each chunk of data. Must be at least as Number of underlying devices.
<chunk size>:
Size of each chunk of data. Must be at least as
large as the system's PAGE_SIZE. large as the system's PAGE_SIZE.
<dev path>: Full pathname to the underlying block-device, or a <dev path>:
Full pathname to the underlying block-device, or a
"major:minor" device-number. "major:minor" device-number.
<offset>: Starting sector within the device. <offset>:
Starting sector within the device.
One or more underlying devices can be specified. The striped device size must One or more underlying devices can be specified. The striped device size must
be a multiple of the chunk size multiplied by the number of underlying devices. be a multiple of the chunk size multiplied by the number of underlying devices.
...@@ -22,36 +27,35 @@ be a multiple of the chunk size multiplied by the number of underlying devices. ...@@ -22,36 +27,35 @@ be a multiple of the chunk size multiplied by the number of underlying devices.
Example scripts Example scripts
=============== ===============
[[ ::
#!/usr/bin/perl -w
# Create a striped device across any number of underlying devices. The device #!/usr/bin/perl -w
# will be called "stripe_dev" and have a chunk-size of 128k. # Create a striped device across any number of underlying devices. The device
# will be called "stripe_dev" and have a chunk-size of 128k.
my $chunk_size = 128 * 2; my $chunk_size = 128 * 2;
my $dev_name = "stripe_dev"; my $dev_name = "stripe_dev";
my $num_devs = @ARGV; my $num_devs = @ARGV;
my @devs = @ARGV; my @devs = @ARGV;
my ($min_dev_size, $stripe_dev_size, $i); my ($min_dev_size, $stripe_dev_size, $i);
if (!$num_devs) { if (!$num_devs) {
die("Specify at least one device\n"); die("Specify at least one device\n");
} }
$min_dev_size = `blockdev --getsz $devs[0]`; $min_dev_size = `blockdev --getsz $devs[0]`;
for ($i = 1; $i < $num_devs; $i++) { for ($i = 1; $i < $num_devs; $i++) {
my $this_size = `blockdev --getsz $devs[$i]`; my $this_size = `blockdev --getsz $devs[$i]`;
$min_dev_size = ($min_dev_size < $this_size) ? $min_dev_size = ($min_dev_size < $this_size) ?
$min_dev_size : $this_size; $min_dev_size : $this_size;
} }
$stripe_dev_size = $min_dev_size * $num_devs; $stripe_dev_size = $min_dev_size * $num_devs;
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs); $stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
$table = "0 $stripe_dev_size striped $num_devs $chunk_size"; $table = "0 $stripe_dev_size striped $num_devs $chunk_size";
for ($i = 0; $i < $num_devs; $i++) { for ($i = 0; $i < $num_devs; $i++) {
$table .= " $devs[$i] 0"; $table .= " $devs[$i] 0";
} }
`echo $table | dmsetup create $dev_name`;
]]
`echo $table | dmsetup create $dev_name`;
=========
dm-switch dm-switch
========= =========
...@@ -67,24 +68,22 @@ b-tree can achieve. ...@@ -67,24 +68,22 @@ b-tree can achieve.
Construction Parameters Construction Parameters
======================= =======================
<num_paths> <region_size> <num_optional_args> [<optional_args>...] <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
[<dev_path> <offset>]+ <num_paths>
<num_paths>
The number of paths across which to distribute the I/O. The number of paths across which to distribute the I/O.
<region_size> <region_size>
The number of 512-byte sectors in a region. Each region can be redirected The number of 512-byte sectors in a region. Each region can be redirected
to any of the available paths. to any of the available paths.
<num_optional_args> <num_optional_args>
The number of optional arguments. Currently, no optional arguments The number of optional arguments. Currently, no optional arguments
are supported and so this must be zero. are supported and so this must be zero.
<dev_path> <dev_path>
The block device that represents a specific path to the device. The block device that represents a specific path to the device.
<offset> <offset>
The offset of the start of data on the specific <dev_path> (in units The offset of the start of data on the specific <dev_path> (in units
of 512-byte sectors). This number is added to the sector number when of 512-byte sectors). This number is added to the sector number when
forwarding the request to the specific path. Typically it is zero. forwarding the request to the specific path. Typically it is zero.
...@@ -122,17 +121,21 @@ Example ...@@ -122,17 +121,21 @@ Example
Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size. the same size.
Create a switch device with 64kB region size: Create a switch device with 64kB region size::
dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0` dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0" switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
Set mappings for the first 7 entries to point to devices switch0, switch1, Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1: switch2, switch0, switch1, switch2, switch1::
dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1 dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
Set repetitive mapping. This command: Set repetitive mapping. This command::
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10 dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to:
is equivalent to::
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \ dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
=================
Thin provisioning
=================
Introduction Introduction
============ ============
...@@ -95,6 +99,8 @@ previously.) ...@@ -95,6 +99,8 @@ previously.)
Using an existing pool device Using an existing pool device
----------------------------- -----------------------------
::
dmsetup create pool \ dmsetup create pool \
--table "0 20971520 thin-pool $metadata_dev $data_dev \ --table "0 20971520 thin-pool $metadata_dev $data_dev \
$data_block_size $low_water_mark" $data_block_size $low_water_mark"
...@@ -154,7 +160,7 @@ Thin provisioning ...@@ -154,7 +160,7 @@ Thin provisioning
i) Creating a new thinly-provisioned volume. i) Creating a new thinly-provisioned volume.
To create a new thinly- provisioned volume you must send a message to an To create a new thinly- provisioned volume you must send a message to an
active pool device, /dev/mapper/pool in this example. active pool device, /dev/mapper/pool in this example::
dmsetup message /dev/mapper/pool 0 "create_thin 0" dmsetup message /dev/mapper/pool 0 "create_thin 0"
...@@ -164,7 +170,7 @@ i) Creating a new thinly-provisioned volume. ...@@ -164,7 +170,7 @@ i) Creating a new thinly-provisioned volume.
ii) Using a thinly-provisioned volume. ii) Using a thinly-provisioned volume.
Thinly-provisioned volumes are activated using the 'thin' target: Thinly-provisioned volumes are activated using the 'thin' target::
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0" dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
...@@ -181,6 +187,8 @@ i) Creating an internal snapshot. ...@@ -181,6 +187,8 @@ i) Creating an internal snapshot.
must suspend it before creating the snapshot to avoid corruption. must suspend it before creating the snapshot to avoid corruption.
This is NOT enforced at the moment, so please be careful! This is NOT enforced at the moment, so please be careful!
::
dmsetup suspend /dev/mapper/thin dmsetup suspend /dev/mapper/thin
dmsetup message /dev/mapper/pool 0 "create_snap 1 0" dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
dmsetup resume /dev/mapper/thin dmsetup resume /dev/mapper/thin
...@@ -198,14 +206,14 @@ ii) Using an internal snapshot. ...@@ -198,14 +206,14 @@ ii) Using an internal snapshot.
activating or removing them both. (This differs from conventional activating or removing them both. (This differs from conventional
device-mapper snapshots.) device-mapper snapshots.)
Activate it exactly the same way as any other thinly-provisioned volume: Activate it exactly the same way as any other thinly-provisioned volume::
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1" dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
External snapshots External snapshots
------------------ ------------------
You can use an external _read only_ device as an origin for a You can use an external **read only** device as an origin for a
thinly-provisioned volume. Any read to an unprovisioned area of the thinly-provisioned volume. Any read to an unprovisioned area of the
thin device will be passed through to the origin. Writes trigger thin device will be passed through to the origin. Writes trigger
the allocation of new blocks as usual. the allocation of new blocks as usual.
...@@ -223,11 +231,13 @@ i) Creating a snapshot of an external device ...@@ -223,11 +231,13 @@ i) Creating a snapshot of an external device
This is the same as creating a thin device. This is the same as creating a thin device.
You don't mention the origin at this stage. You don't mention the origin at this stage.
::
dmsetup message /dev/mapper/pool 0 "create_thin 0" dmsetup message /dev/mapper/pool 0 "create_thin 0"
ii) Using a snapshot of an external device. ii) Using a snapshot of an external device.
Append an extra parameter to the thin target specifying the origin: Append an extra parameter to the thin target specifying the origin::
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image" dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
...@@ -240,6 +250,8 @@ Deactivation ...@@ -240,6 +250,8 @@ Deactivation
All devices using a pool must be deactivated before the pool itself All devices using a pool must be deactivated before the pool itself
can be. can be.
::
dmsetup remove thin dmsetup remove thin
dmsetup remove snap dmsetup remove snap
dmsetup remove pool dmsetup remove pool
...@@ -252,25 +264,32 @@ Reference ...@@ -252,25 +264,32 @@ Reference
i) Constructor i) Constructor
::
thin-pool <metadata dev> <data dev> <data block size (sectors)> \ thin-pool <metadata dev> <data dev> <data block size (sectors)> \
<low water mark (blocks)> [<number of feature args> [<arg>]*] <low water mark (blocks)> [<number of feature args> [<arg>]*]
Optional feature arguments: Optional feature arguments:
skip_block_zeroing: Skip the zeroing of newly-provisioned blocks. skip_block_zeroing:
Skip the zeroing of newly-provisioned blocks.
ignore_discard: Disable discard support. ignore_discard:
Disable discard support.
no_discard_passdown: Don't pass discards down to the underlying no_discard_passdown:
Don't pass discards down to the underlying
data device, but just remove the mapping. data device, but just remove the mapping.
read_only: Don't allow any changes to be made to the pool read_only:
Don't allow any changes to be made to the pool
metadata. This mode is only available after the metadata. This mode is only available after the
thin-pool has been created and first used in full thin-pool has been created and first used in full
read/write mode. It cannot be specified on initial read/write mode. It cannot be specified on initial
thin-pool creation. thin-pool creation.
error_if_no_space: Error IOs, instead of queueing, if no space. error_if_no_space:
Error IOs, instead of queueing, if no space.
Data block size must be between 64KB (128 sectors) and 1GB Data block size must be between 64KB (128 sectors) and 1GB
(2097152 sectors) inclusive. (2097152 sectors) inclusive.
...@@ -278,6 +297,8 @@ i) Constructor ...@@ -278,6 +297,8 @@ i) Constructor
ii) Status ii) Status
::
<transaction id> <used metadata blocks>/<total metadata blocks> <transaction id> <used metadata blocks>/<total metadata blocks>
<used data blocks>/<total data blocks> <held metadata root> <used data blocks>/<total data blocks> <held metadata root>
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
...@@ -336,13 +357,11 @@ ii) Status ...@@ -336,13 +357,11 @@ ii) Status
iii) Messages iii) Messages
create_thin <dev id> create_thin <dev id>
Create a new thinly-provisioned device. Create a new thinly-provisioned device.
<dev id> is an arbitrary unique 24-bit identifier chosen by <dev id> is an arbitrary unique 24-bit identifier chosen by
the caller. the caller.
create_snap <dev id> <origin id> create_snap <dev id> <origin id>
Create a new snapshot of another thinly-provisioned device. Create a new snapshot of another thinly-provisioned device.
<dev id> is an arbitrary unique 24-bit identifier chosen by <dev id> is an arbitrary unique 24-bit identifier chosen by
the caller. the caller.
...@@ -350,11 +369,9 @@ iii) Messages ...@@ -350,11 +369,9 @@ iii) Messages
of which the new device will be a snapshot. of which the new device will be a snapshot.
delete <dev id> delete <dev id>
Deletes a thin device. Irreversible. Deletes a thin device. Irreversible.
set_transaction_id <current id> <new id> set_transaction_id <current id> <new id>
Userland volume managers, such as LVM, need a way to Userland volume managers, such as LVM, need a way to
synchronise their external metadata with the internal metadata of the synchronise their external metadata with the internal metadata of the
pool target. The thin-pool target offers to store an pool target. The thin-pool target offers to store an
...@@ -364,14 +381,12 @@ iii) Messages ...@@ -364,14 +381,12 @@ iii) Messages
compare-and-swap message. compare-and-swap message.
reserve_metadata_snap reserve_metadata_snap
Reserve a copy of the data mapping btree for use by userland. Reserve a copy of the data mapping btree for use by userland.
This allows userland to inspect the mappings as they were when This allows userland to inspect the mappings as they were when
this message was executed. Use the pool's status command to this message was executed. Use the pool's status command to
get the root block associated with the metadata snapshot. get the root block associated with the metadata snapshot.
release_metadata_snap release_metadata_snap
Release a previously reserved copy of the data mapping btree. Release a previously reserved copy of the data mapping btree.
'thin' target 'thin' target
...@@ -379,6 +394,8 @@ iii) Messages ...@@ -379,6 +394,8 @@ iii) Messages
i) Constructor i) Constructor
::
thin <pool dev> <dev id> [<external origin dev>] thin <pool dev> <dev id> [<external origin dev>]
pool dev: pool dev:
...@@ -402,7 +419,6 @@ provisioned as and when needed. ...@@ -402,7 +419,6 @@ provisioned as and when needed.
ii) Status ii) Status
<nr mapped sectors> <highest mapped sector> <nr mapped sectors> <highest mapped sector>
If the pool has encountered device errors and failed, the status If the pool has encountered device errors and failed, the status
will just contain the string 'Fail'. The userspace recovery will just contain the string 'Fail'. The userspace recovery
tools should then be used. tools should then be used.
......
================================
Device-mapper "unstriped" target
================================
Introduction Introduction
============ ============
...@@ -34,46 +38,46 @@ striped target to combine the 4 devices into one. It then will use ...@@ -34,46 +38,46 @@ striped target to combine the 4 devices into one. It then will use
the unstriped target ontop of the striped device to access the the unstriped target ontop of the striped device to access the
individual backing loop devices. We write data to the newly exposed individual backing loop devices. We write data to the newly exposed
unstriped devices and verify the data written matches the correct unstriped devices and verify the data written matches the correct
underlying device on the striped array. underlying device on the striped array::
#!/bin/bash #!/bin/bash
MEMBER_SIZE=$((128 * 1024 * 1024)) MEMBER_SIZE=$((128 * 1024 * 1024))
NUM=4 NUM=4
SEQ_END=$((${NUM}-1)) SEQ_END=$((${NUM}-1))
CHUNK=256 CHUNK=256
BS=4096 BS=4096
RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512)) RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}" DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
COUNT=$((${MEMBER_SIZE} / ${BS})) COUNT=$((${MEMBER_SIZE} / ${BS}))
for i in $(seq 0 ${SEQ_END}); do for i in $(seq 0 ${SEQ_END}); do
dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
losetup /dev/loop${i} member-${i} losetup /dev/loop${i} member-${i}
DM_PARMS+=" /dev/loop${i} 0" DM_PARMS+=" /dev/loop${i} 0"
done done
echo $DM_PARMS | dmsetup create raid0 echo $DM_PARMS | dmsetup create raid0
for i in $(seq 0 ${SEQ_END}); do for i in $(seq 0 ${SEQ_END}); do
echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i} echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
done; done;
for i in $(seq 0 ${SEQ_END}); do for i in $(seq 0 ${SEQ_END}); do
dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
diff /dev/mapper/set-${i} member-${i} diff /dev/mapper/set-${i} member-${i}
done; done;
for i in $(seq 0 ${SEQ_END}); do for i in $(seq 0 ${SEQ_END}); do
dmsetup remove set-${i} dmsetup remove set-${i}
done done
dmsetup remove raid0 dmsetup remove raid0
for i in $(seq 0 ${SEQ_END}); do for i in $(seq 0 ${SEQ_END}); do
losetup -d /dev/loop${i} losetup -d /dev/loop${i}
rm -f member-${i} rm -f member-${i}
done done
Another example Another example
--------------- ---------------
...@@ -81,7 +85,7 @@ Another example ...@@ -81,7 +85,7 @@ Another example
Intel NVMe drives contain two cores on the physical device. Intel NVMe drives contain two cores on the physical device.
Each core of the drive has segregated access to its LBA range. Each core of the drive has segregated access to its LBA range.
The current LBA model has a RAID 0 128k chunk on each core, resulting The current LBA model has a RAID 0 128k chunk on each core, resulting
in a 256k stripe across the two cores: in a 256k stripe across the two cores::
Core 0: Core 1: Core 0: Core 1:
__________ __________ __________ __________
...@@ -108,17 +112,24 @@ Example dmsetup usage ...@@ -108,17 +112,24 @@ Example dmsetup usage
unstriped ontop of Intel NVMe device that has 2 cores unstriped ontop of Intel NVMe device that has 2 cores
----------------------------------------------------- -----------------------------------------------------
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0' ::
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
There will now be two devices that expose Intel NVMe core 0 and 1 There will now be two devices that expose Intel NVMe core 0 and 1
respectively: respectively::
/dev/mapper/nvmset0
/dev/mapper/nvmset1 /dev/mapper/nvmset0
/dev/mapper/nvmset1
unstriped ontop of striped with 4 drives using 128K chunk size unstriped ontop of striped with 4 drives using 128K chunk size
-------------------------------------------------------------- --------------------------------------------------------------
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0' ::
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0' dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
=========
dm-verity dm-verity
========== =========
Device-Mapper's "verity" target provides transparent integrity checking of Device-Mapper's "verity" target provides transparent integrity checking of
block devices using a cryptographic digest provided by the kernel crypto API. block devices using a cryptographic digest provided by the kernel crypto API.
...@@ -7,6 +8,9 @@ This target is read-only. ...@@ -7,6 +8,9 @@ This target is read-only.
Construction Parameters Construction Parameters
======================= =======================
::
<version> <dev> <hash_dev> <version> <dev> <hash_dev>
<data_block_size> <hash_block_size> <data_block_size> <hash_block_size>
<num_data_blocks> <hash_start_block> <num_data_blocks> <hash_start_block>
...@@ -160,7 +164,9 @@ calculating the parent node. ...@@ -160,7 +164,9 @@ calculating the parent node.
The tree looks something like: The tree looks something like:
alg = sha256, num_blocks = 32768, block_size = 4096 alg = sha256, num_blocks = 32768, block_size = 4096
::
[ root ] [ root ]
/ . . . \ / . . . \
...@@ -189,6 +195,7 @@ block boundary) are the hash blocks which are stored a depth at a time ...@@ -189,6 +195,7 @@ block boundary) are the hash blocks which are stored a depth at a time
The full specification of kernel parameters and on-disk metadata format The full specification of kernel parameters and on-disk metadata format
is available at the cryptsetup project's wiki page is available at the cryptsetup project's wiki page
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity
Status Status
...@@ -198,7 +205,8 @@ If any check failed, C (for Corruption) is returned. ...@@ -198,7 +205,8 @@ If any check failed, C (for Corruption) is returned.
Example Example
======= =======
Set up a device: Set up a device::
# dmsetup create vroot --readonly --table \ # dmsetup create vroot --readonly --table \
"0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\ "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
"4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\ "4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\
...@@ -209,11 +217,13 @@ the hash tree or activate the kernel device. This is available from ...@@ -209,11 +217,13 @@ the hash tree or activate the kernel device. This is available from
the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/ the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
(as a libcryptsetup extension). (as a libcryptsetup extension).
Create hash on the device: Create hash on the device::
# veritysetup format /dev/sda1 /dev/sda2 # veritysetup format /dev/sda1 /dev/sda2
... ...
Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
Activate the device: Activate the device::
# veritysetup create vroot /dev/sda1 /dev/sda2 \ # veritysetup create vroot /dev/sda1 /dev/sda2 \
4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
=================
Writecache target
=================
The writecache target caches writes on persistent memory or on SSD. It The writecache target caches writes on persistent memory or on SSD. It
doesn't cache reads because reads are supposed to be cached in page cache doesn't cache reads because reads are supposed to be cached in page cache
in normal RAM. in normal RAM.
...@@ -6,15 +10,18 @@ When the device is constructed, the first sector should be zeroed or the ...@@ -6,15 +10,18 @@ When the device is constructed, the first sector should be zeroed or the
first sector should contain valid superblock from previous invocation. first sector should contain valid superblock from previous invocation.
Constructor parameters: Constructor parameters:
1. type of the cache device - "p" or "s" 1. type of the cache device - "p" or "s"
p - persistent memory
s - SSD - p - persistent memory
- s - SSD
2. the underlying device that will be cached 2. the underlying device that will be cached
3. the cache device 3. the cache device
4. block size (4096 is recommended; the maximum block size is the page 4. block size (4096 is recommended; the maximum block size is the page
size) size)
5. the number of optional parameters (the parameters with an argument 5. the number of optional parameters (the parameters with an argument
count as two) count as two)
start_sector n (default: 0) start_sector n (default: 0)
offset from the start of cache device in 512-byte sectors offset from the start of cache device in 512-byte sectors
high_watermark n (default: 50) high_watermark n (default: 50)
...@@ -43,6 +50,7 @@ Constructor parameters: ...@@ -43,6 +50,7 @@ Constructor parameters:
applicable only to persistent memory - don't use the FUA applicable only to persistent memory - don't use the FUA
flag when writing back data and send the FLUSH request flag when writing back data and send the FLUSH request
afterwards afterwards
- some underlying devices perform better with fua, some - some underlying devices perform better with fua, some
with nofua. The user should test it with nofua. The user should test it
...@@ -60,6 +68,7 @@ Messages: ...@@ -60,6 +68,7 @@ Messages:
flush the cache device on next suspend. Use this message flush the cache device on next suspend. Use this message
when you are going to remove the cache device. The proper when you are going to remove the cache device. The proper
sequence for removing the cache device is: sequence for removing the cache device is:
1. send the "flush_on_suspend" message 1. send the "flush_on_suspend" message
2. load an inactive table with a linear target that maps 2. load an inactive table with a linear target that maps
to the underlying device to the underlying device
......
=======
dm-zero dm-zero
======= =======
...@@ -18,20 +19,19 @@ filesystem limitations. ...@@ -18,20 +19,19 @@ filesystem limitations.
To create a sparse device, start by creating a dm-zero device that's the To create a sparse device, start by creating a dm-zero device that's the
desired size of the sparse device. For this example, we'll assume a 10TB desired size of the sparse device. For this example, we'll assume a 10TB
sparse device. sparse device::
TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
echo "0 $TEN_TERABYTES zero" | dmsetup create zero1 echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
Then create a snapshot of the zero device, using any available block-device as Then create a snapshot of the zero device, using any available block-device as
the COW device. The size of the COW device will determine the amount of real the COW device. The size of the COW device will determine the amount of real
space available to the sparse device. For this example, we'll assume /dev/sdb1 space available to the sparse device. For this example, we'll assume /dev/sdb1
is an available 10GB partition. is an available 10GB partition::
echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \ echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
dmsetup create sparse1 dmsetup create sparse1
This will create a 10TB sparse device called /dev/mapper/sparse1 that has This will create a 10TB sparse device called /dev/mapper/sparse1 that has
10GB of actual storage space available. If more than 10GB of data is written 10GB of actual storage space available. If more than 10GB of data is written
to this device, it will start returning I/O errors. to this device, it will start returning I/O errors.
...@@ -417,9 +417,9 @@ will then have to be provided beforehand in the normal way. ...@@ -417,9 +417,9 @@ will then have to be provided beforehand in the normal way.
[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/ [DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt [DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.rst
[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt [DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.rst
[FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html [FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html
......
...@@ -453,7 +453,7 @@ config DM_INIT ...@@ -453,7 +453,7 @@ config DM_INIT
Enable "dm-mod.create=" parameter to create mapped devices at init time. Enable "dm-mod.create=" parameter to create mapped devices at init time.
This option is useful to allow mounting rootfs without requiring an This option is useful to allow mounting rootfs without requiring an
initramfs. initramfs.
See Documentation/device-mapper/dm-init.txt for dm-mod.create="..." See Documentation/device-mapper/dm-init.rst for dm-mod.create="..."
format. format.
If unsure, say N. If unsure, say N.
......
...@@ -25,7 +25,7 @@ static char *create; ...@@ -25,7 +25,7 @@ static char *create;
* Format: dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+] * Format: dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
* Table format: <start_sector> <num_sectors> <target_type> <target_args> * Table format: <start_sector> <num_sectors> <target_type> <target_args>
* *
* See Documentation/device-mapper/dm-init.txt for dm-mod.create="..." format * See Documentation/device-mapper/dm-init.rst for dm-mod.create="..." format
* details. * details.
*/ */
......
...@@ -3558,7 +3558,7 @@ static void raid_status(struct dm_target *ti, status_type_t type, ...@@ -3558,7 +3558,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
* v1.5.0+: * v1.5.0+:
* *
* Sync action: * Sync action:
* See Documentation/device-mapper/dm-raid.txt for * See Documentation/device-mapper/dm-raid.rst for
* information on each of these states. * information on each of these states.
*/ */
DMEMIT(" %s", sync_action); DMEMIT(" %s", sync_action);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment