Commit 2ef176f1 authored by Linus Torvalds

Merge tag 'dm-3.14-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - dm-cache memory allocation failure fix
 - fix DM's Kconfig indentation
 - dm-snapshot metadata corruption fix for bug introduced in 3.14-rc1
 - important refcount < 0 fix for the DM persistent data library's space
   map metadata interface, which resolves corruption reported by a few
   dm-thinp users

and last but not least:

 - more extensive fixes than ideal for dm-thinp's data resize capability
   (which has had growing pains, much like the -ENOSPC handling we've
   seen filesystems go through as they mature).

   The end result is that dm-thinp now handles metadata operation
   failures and out-of-data-space conditions much better than before.

* tag 'dm-3.14-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm space map metadata: fix refcount decrement below 0 which caused corruption
  dm thin: fix Documentation for held metadata root feature
  dm thin: fix noflush suspend IO queueing
  dm thin: fix deadlock in __requeue_bio_list
  dm thin: fix out of data space handling
  dm thin: ensure user takes action to validate data and metadata consistency
  dm thin: synchronize the pool mode during suspend
  dm snapshot: fix metadata corruption
  dm: fix Kconfig indentation
  dm cache mq: fix memory allocation failure for large cache devices
parents b053940d cebc2de4
Documentation/device-mapper/cache.txt
@@ -124,12 +124,11 @@ the default being 204800 sectors (or 100MB).
 Updating on-disk metadata
 -------------------------
 
-On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
-written.  If no such requests are made then commits will occur every
-second.  This means the cache behaves like a physical disk that has a
-write cache (the same is true of the thin-provisioning target).  If
-power is lost you may lose some recent writes.  The metadata should
-always be consistent in spite of any crash.
+On-disk metadata is committed every time a FLUSH or FUA bio is written.
+If no such requests are made then commits will occur every second.  This
+means the cache behaves like a physical disk that has a volatile write
+cache.  If power is lost you may lose some recent writes.  The metadata
+should always be consistent in spite of any crash.
 
 The 'dirty' state for a cache block changes far too frequently for us
 to keep updating it on the fly.  So we treat it as a hint.  In normal
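The FLUSH/FUA rule above is easy to see in code. A minimal sketch, assuming the 3.14-era block-layer flags (the helper name follows dm-thin's bio_triggers_commit convention; the cache target's actual logic adds more conditions):

#include <linux/bio.h>

/*
 * Sketch: a bio forces an immediate metadata commit when it carries an
 * explicit durability request (FLUSH or FUA); everything else is
 * covered by the once-per-second periodic commit described above.
 */
static bool bio_triggers_commit(struct bio *bio)
{
	return bio->bi_rw & (REQ_FLUSH | REQ_FUA);
}

The same policy, and the same volatile-write-cache caveat, applies to the thin-provisioning target, as the thin-provisioning.txt hunk below now states explicitly.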
Documentation/device-mapper/thin-provisioning.txt
@@ -116,6 +116,35 @@ Resuming a device with a new table itself triggers an event so the
 userspace daemon can use this to detect a situation where a new table
 already exceeds the threshold.
 
+A low water mark for the metadata device is maintained in the kernel and
+will trigger a dm event if free space on the metadata device drops below
+it.
+
+Updating on-disk metadata
+-------------------------
+
+On-disk metadata is committed every time a FLUSH or FUA bio is written.
+If no such requests are made then commits will occur every second.  This
+means the thin-provisioning target behaves like a physical disk that has
+a volatile write cache.  If power is lost you may lose some recent
+writes.  The metadata should always be consistent in spite of any crash.
+
+If data space is exhausted the pool will either error or queue IO
+according to the configuration (see: error_if_no_space).  If metadata
+space is exhausted or a metadata operation fails: the pool will error IO
+until the pool is taken offline and repair is performed to 1) fix any
+potential inconsistencies and 2) clear the flag that imposes repair.
+Once the pool's metadata device is repaired it may be resized, which
+will allow the pool to return to normal operation.  Note that if a pool
+is flagged as needing repair, the pool's data and metadata devices
+cannot be resized until repair is performed.  It should also be noted
+that when the pool's metadata space is exhausted the current metadata
+transaction is aborted.  Given that the pool will cache IO whose
+completion may have already been acknowledged to upper IO layers
+(e.g. filesystem) it is strongly suggested that consistency checks
+(e.g. fsck) be performed on those layers when repair of the pool is
+required.
+
 Thin provisioning
 -----------------
@@ -258,10 +287,9 @@ ii) Status
 	should register for the event and then check the target's status.
 
     held metadata root:
-	The location, in sectors, of the metadata root that has been
-	'held' for userspace read access.  '-' indicates there is no
-	held root.  This feature is not yet implemented so '-' is
-	always returned.
+	The location, in blocks, of the metadata root that has been
+	'held' for userspace read access.  '-' indicates there is no
+	held root.
 
     discard_passdown|no_discard_passdown
 	Whether or not discards are actually being passed down to the
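The repair flow documented above is driven from the pool's error paths using the needs_check flag that this merge introduces (see the dm-thin-metadata.c hunk further down). A hedged sketch of that path, simplified from dm-thin's real error handling, which also aborts the in-flight metadata transaction; the function name here is hypothetical:

/*
 * Sketch: on a metadata operation failure, persist the needs_check
 * flag (only an offline repair can clear it) and degrade the pool so
 * that IO errors until the admin intervenes, as described above.
 */
static void example_metadata_op_failed(struct pool *pool, const char *op, int r)
{
	DMERR_LIMIT("%s: metadata operation '%s' failed: error = %d",
		    dm_device_name(pool->pool_md), op, r);

	if (dm_pool_metadata_set_needs_check(pool->pmd))
		DMERR("could not set needs_check flag in metadata");

	set_pool_mode(pool, PM_FAIL);
}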
drivers/md/Kconfig
@@ -254,16 +254,6 @@ config DM_THIN_PROVISIONING
 	---help---
 	  Provides thin provisioning and snapshots that share a data store.
 
-config DM_DEBUG_BLOCK_STACK_TRACING
-	boolean "Keep stack trace of persistent data block lock holders"
-	depends on STACKTRACE_SUPPORT && DM_PERSISTENT_DATA
-	select STACKTRACE
-	---help---
-	  Enable this for messages that may help debug problems with the
-	  block manager locking used by thin provisioning and caching.
-
-	  If unsure, say N.
-
 config DM_CACHE
 	tristate "Cache target (EXPERIMENTAL)"
 	depends on BLK_DEV_DM
drivers/md/dm-cache-policy-mq.c
@@ -872,7 +872,7 @@ static void mq_destroy(struct dm_cache_policy *p)
 {
 	struct mq_policy *mq = to_mq_policy(p);
 
-	kfree(mq->table);
+	vfree(mq->table);
 	epool_exit(&mq->cache_pool);
 	epool_exit(&mq->pre_cache_pool);
 	kfree(mq);
@@ -1245,7 +1245,7 @@ static struct dm_cache_policy *mq_create(dm_cblock_t cache_size,
 
 	mq->nr_buckets = next_power(from_cblock(cache_size) / 2, 16);
 	mq->hash_bits = ffs(mq->nr_buckets) - 1;
-	mq->table = kzalloc(sizeof(*mq->table) * mq->nr_buckets, GFP_KERNEL);
+	mq->table = vzalloc(sizeof(*mq->table) * mq->nr_buckets);
 	if (!mq->table)
 		goto bad_alloc_table;
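Why the kzalloc() here was failing: kmalloc-class allocations must be physically contiguous, which in practice caps them at a few megabytes, while this hash table grows linearly with the cache size. A userspace back-of-the-envelope sketch; the 1 TiB device and 64 KiB block figures are assumptions chosen only to show the scale:

#include <stdio.h>

/* Mirrors the semantics of the mq policy's next_power(n, min):
 * round n up to a power of two, but never below min. */
static unsigned long long next_power(unsigned long long n,
				     unsigned long long min)
{
	unsigned long long p = min;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	/* Assumed example: 1 TiB cache device, 64 KiB cache blocks. */
	unsigned long long cache_blocks = (1ULL << 40) / (64 * 1024);
	unsigned long long nr_buckets = next_power(cache_blocks / 2, 16);

	/* Each bucket is one pointer-sized hash head on 64-bit. */
	printf("buckets: %llu, table: %llu MiB\n",
	       nr_buckets, nr_buckets * sizeof(void *) >> 20);
	return 0;
}

This prints a 64 MiB table: hopeless for kzalloc(), but fine for vzalloc(), which only needs virtually contiguous pages and is matched by the vfree() in mq_destroy() above.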
drivers/md/dm-snap-persistent.c
@@ -546,6 +546,9 @@ static int read_exceptions(struct pstore *ps,
 		r = insert_exceptions(ps, area, callback, callback_context,
 				      &full);
 
+		if (!full)
+			memcpy(ps->area, area, ps->store->chunk_size << SECTOR_SHIFT);
+
 		dm_bufio_release(bp);
 
 		dm_bufio_forget(client, chunk);
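Why this single memcpy fixes the metadata corruption is worth spelling out (this explanation is inferred from the fix, not quoted from the commit): ps->area is the persistent store's cached copy of the current on-disk exception area, and the write path rewrites that entire area from the cache when a new exception is added. After the dm-bufio conversion, exceptions were parsed straight out of bufio's buffer, leaving ps->area stale, so the next exception written out pushed stale bytes to disk. Copying back the last, not-yet-full area restores the old invariant. A toy userspace model of that invariant, with hypothetical names:

#include <string.h>

#define AREA_SIZE 4096	/* stands in for chunk_size << SECTOR_SHIFT */

/* The store caches the bytes of the last, partially filled area. */
struct toy_store {
	unsigned char area[AREA_SIZE];
};

/*
 * Reader side: even when exceptions are parsed from an external buffer
 * (dm-bufio's, in the real code), a not-yet-full area must be copied
 * into the cache, because the writer below rewrites all of it.
 */
static void toy_read_area(struct toy_store *ps, const unsigned char *buf,
			  int full)
{
	if (!full)
		memcpy(ps->area, buf, AREA_SIZE);
}

/* Writer side: appending an exception flushes the whole cached area. */
static const unsigned char *toy_area_to_flush(struct toy_store *ps)
{
	return ps->area;
}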
drivers/md/dm-thin-metadata.c
@@ -76,7 +76,7 @@
 
 #define THIN_SUPERBLOCK_MAGIC 27022010
 #define THIN_SUPERBLOCK_LOCATION 0
-#define THIN_VERSION 1
+#define THIN_VERSION 2
 #define THIN_METADATA_CACHE_SIZE 64
 #define SECTOR_TO_BLOCK_SHIFT 3
@@ -1755,3 +1755,38 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
 
 	return r;
 }
+
+int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
+{
+	int r;
+	struct dm_block *sblock;
+	struct thin_disk_superblock *disk_super;
+
+	down_write(&pmd->root_lock);
+	pmd->flags |= THIN_METADATA_NEEDS_CHECK_FLAG;
+
+	r = superblock_lock(pmd, &sblock);
+	if (r) {
+		DMERR("couldn't read superblock");
+		goto out;
+	}
+
+	disk_super = dm_block_data(sblock);
+	disk_super->flags = cpu_to_le32(pmd->flags);
+
+	dm_bm_unlock(sblock);
+out:
+	up_write(&pmd->root_lock);
+	return r;
+}
+
+bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd)
+{
+	bool needs_check;
+
+	down_read(&pmd->root_lock);
+	needs_check = pmd->flags & THIN_METADATA_NEEDS_CHECK_FLAG;
+	up_read(&pmd->root_lock);
+
+	return needs_check;
+}
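These two helpers pair up with checks elsewhere in this series: the documentation hunk above says a flagged pool's data and metadata devices cannot be resized. A hedged sketch of such a gate; only the dm_pool_metadata_needs_check() call is from the patch, the rest is an illustrative assumption:

/*
 * Sketch: refuse to grow the pool's devices while the needs_check
 * flag is set, forcing the offline thin_repair pass first.
 */
static int example_check_resize_allowed(struct pool *pool)
{
	if (dm_pool_metadata_needs_check(pool->pmd)) {
		DMERR("cannot resize: needs_check flag set, run thin_repair first");
		return -EINVAL;
	}

	return 0;
}

The THIN_VERSION bump to 2 in the first hunk reflects the same on-disk change: the superblock can now carry the new flag.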
drivers/md/dm-thin-metadata.h
@@ -25,6 +25,11 @@
 
 /*----------------------------------------------------------------*/
 
+/*
+ * Thin metadata superblock flags.
+ */
+#define THIN_METADATA_NEEDS_CHECK_FLAG (1 << 0)
+
 struct dm_pool_metadata;
 struct dm_thin_device;
 
@@ -202,6 +207,12 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
 					dm_sm_threshold_fn fn,
 					void *context);
 
+/*
+ * Updates the superblock immediately.
+ */
+int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd);
+bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
+
 /*----------------------------------------------------------------*/
 
 #endif
drivers/md/persistent-data/Kconfig
@@ -6,3 +6,13 @@ config DM_PERSISTENT_DATA
 	---help---
 	 Library providing immutable on-disk data structure support for
 	 device-mapper targets such as the thin provisioning target.
+
+config DM_DEBUG_BLOCK_STACK_TRACING
+	boolean "Keep stack trace of persistent data block lock holders"
+	depends on STACKTRACE_SUPPORT && DM_PERSISTENT_DATA
+	select STACKTRACE
+	---help---
+	  Enable this for messages that may help debug problems with the
+	  block manager locking used by thin provisioning and caching.
+
+	  If unsure, say N.
drivers/md/persistent-data/dm-space-map-metadata.c
@@ -91,6 +91,69 @@ struct block_op {
 	dm_block_t block;
 };
 
+struct bop_ring_buffer {
+	unsigned begin;
+	unsigned end;
+	struct block_op bops[MAX_RECURSIVE_ALLOCATIONS + 1];
+};
+
+static void brb_init(struct bop_ring_buffer *brb)
+{
+	brb->begin = 0;
+	brb->end = 0;
+}
+
+static bool brb_empty(struct bop_ring_buffer *brb)
+{
+	return brb->begin == brb->end;
+}
+
+static unsigned brb_next(struct bop_ring_buffer *brb, unsigned old)
+{
+	unsigned r = old + 1;
+	return (r >= (sizeof(brb->bops) / sizeof(*brb->bops))) ? 0 : r;
+}
+
+static int brb_push(struct bop_ring_buffer *brb,
+		    enum block_op_type type, dm_block_t b)
+{
+	struct block_op *bop;
+	unsigned next = brb_next(brb, brb->end);
+
+	/*
+	 * We don't allow the last bop to be filled, this way we can
+	 * differentiate between full and empty.
+	 */
+	if (next == brb->begin)
+		return -ENOMEM;
+
+	bop = brb->bops + brb->end;
+	bop->type = type;
+	bop->block = b;
+
+	brb->end = next;
+
+	return 0;
+}
+
+static int brb_pop(struct bop_ring_buffer *brb, struct block_op *result)
+{
+	struct block_op *bop;
+
+	if (brb_empty(brb))
+		return -ENODATA;
+
+	bop = brb->bops + brb->begin;
+	result->type = bop->type;
+	result->block = bop->block;
+
+	brb->begin = brb_next(brb, brb->begin);
+
+	return 0;
+}
+
+/*----------------------------------------------------------------*/
+
 struct sm_metadata {
 	struct dm_space_map sm;
@@ -101,25 +164,20 @@ struct sm_metadata {
 	unsigned recursion_count;
 	unsigned allocated_this_transaction;
-	unsigned nr_uncommitted;
-	struct block_op uncommitted[MAX_RECURSIVE_ALLOCATIONS];
 
+	struct bop_ring_buffer uncommitted;
 	struct threshold threshold;
 };
 
 static int add_bop(struct sm_metadata *smm, enum block_op_type type, dm_block_t b)
 {
-	struct block_op *op;
+	int r = brb_push(&smm->uncommitted, type, b);
 
-	if (smm->nr_uncommitted == MAX_RECURSIVE_ALLOCATIONS) {
+	if (r) {
 		DMERR("too many recursive allocations");
 		return -ENOMEM;
 	}
 
-	op = smm->uncommitted + smm->nr_uncommitted++;
-	op->type = type;
-	op->block = b;
-
 	return 0;
 }
@@ -158,11 +216,17 @@ static int out(struct sm_metadata *smm)
 		return -ENOMEM;
 	}
 
-	if (smm->recursion_count == 1 && smm->nr_uncommitted) {
-		while (smm->nr_uncommitted && !r) {
-			smm->nr_uncommitted--;
-			r = commit_bop(smm, smm->uncommitted +
-				       smm->nr_uncommitted);
+	if (smm->recursion_count == 1) {
+		while (!brb_empty(&smm->uncommitted)) {
+			struct block_op bop;
+
+			r = brb_pop(&smm->uncommitted, &bop);
+			if (r) {
+				DMERR("bug in bop ring buffer");
+				break;
+			}
+
+			r = commit_bop(smm, &bop);
 			if (r)
 				break;
 		}
@@ -217,7 +281,8 @@ static int sm_metadata_get_nr_free(struct dm_space_map *sm, dm_block_t *count)
 static int sm_metadata_get_count(struct dm_space_map *sm, dm_block_t b,
 				 uint32_t *result)
 {
-	int r, i;
+	int r;
+	unsigned i;
 	struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
 	unsigned adjustment = 0;
@@ -225,8 +290,10 @@ static int sm_metadata_get_count(struct dm_space_map *sm, dm_block_t b,
 	 * We may have some uncommitted adjustments to add.  This list
 	 * should always be really short.
 	 */
-	for (i = 0; i < smm->nr_uncommitted; i++) {
-		struct block_op *op = smm->uncommitted + i;
+	for (i = smm->uncommitted.begin;
+	     i != smm->uncommitted.end;
+	     i = brb_next(&smm->uncommitted, i)) {
+		struct block_op *op = smm->uncommitted.bops + i;
 
 		if (op->block != b)
 			continue;
@@ -254,7 +321,8 @@ static int sm_metadata_get_count(struct dm_space_map *sm, dm_block_t b,
 static int sm_metadata_count_is_more_than_one(struct dm_space_map *sm,
 					      dm_block_t b, int *result)
 {
-	int r, i, adjustment = 0;
+	int r, adjustment = 0;
+	unsigned i;
 	struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
 	uint32_t rc;
@@ -262,8 +330,11 @@ static int sm_metadata_count_is_more_than_one(struct dm_space_map *sm,
 	 * We may have some uncommitted adjustments to add.  This list
 	 * should always be really short.
 	 */
-	for (i = 0; i < smm->nr_uncommitted; i++) {
-		struct block_op *op = smm->uncommitted + i;
+	for (i = smm->uncommitted.begin;
+	     i != smm->uncommitted.end;
+	     i = brb_next(&smm->uncommitted, i)) {
+		struct block_op *op = smm->uncommitted.bops + i;
 
 		if (op->block != b)
 			continue;
@@ -671,7 +742,7 @@ int dm_sm_metadata_create(struct dm_space_map *sm,
 	smm->begin = superblock + 1;
 	smm->recursion_count = 0;
 	smm->allocated_this_transaction = 0;
-	smm->nr_uncommitted = 0;
+	brb_init(&smm->uncommitted);
 	threshold_init(&smm->threshold);
 
 	memcpy(&smm->sm, &bootstrap_ops, sizeof(smm->sm));
@@ -715,7 +786,7 @@ int dm_sm_metadata_open(struct dm_space_map *sm,
 	smm->begin = 0;
 	smm->recursion_count = 0;
 	smm->allocated_this_transaction = 0;
-	smm->nr_uncommitted = 0;
+	brb_init(&smm->uncommitted);
 	threshold_init(&smm->threshold);
 
 	memcpy(&smm->old_ll, &smm->ll, sizeof(smm->old_ll));
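The bop_ring_buffer change above is the refcount fix from the shortlog. Two things change visibly in the diff: pending block ops are now committed FIFO instead of LIFO (the old loop walked the array backwards with nr_uncommitted--), and ops queued while the commit loop runs are handled cleanly by the ring. The ring tells full from empty by never filling the last slot, so the array has MAX_RECURSIVE_ALLOCATIONS + 1 slots but holds at most MAX_RECURSIVE_ALLOCATIONS ops. A standalone userspace reduction of the same discipline (my own, with 4 slots so at most 3 entries fit):

#include <assert.h>

#define SLOTS 4	/* stands in for MAX_RECURSIVE_ALLOCATIONS + 1 */

struct ring {
	unsigned begin, end;
	int vals[SLOTS];
};

static unsigned nxt(unsigned i)
{
	return (i + 1) % SLOTS;
}

static int push(struct ring *r, int v)
{
	if (nxt(r->end) == r->begin)
		return -1;		/* "full": one slot always unused */
	r->vals[r->end] = v;
	r->end = nxt(r->end);
	return 0;
}

static int pop(struct ring *r, int *v)
{
	if (r->begin == r->end)
		return -1;		/* "empty" is unambiguous */
	*v = r->vals[r->begin];
	r->begin = nxt(r->begin);
	return 0;
}

int main(void)
{
	struct ring r = { 0, 0, { 0 } };
	int v;

	assert(push(&r, 1) == 0);
	assert(push(&r, 2) == 0);
	assert(push(&r, 3) == 0);
	assert(push(&r, 4) == -1);	/* only SLOTS - 1 items fit */
	assert(pop(&r, &v) == 0 && v == 1);
	assert(push(&r, 4) == 0);	/* space reclaimed after pop */
	return 0;
}

Note also how sm_metadata_get_count() walks the live ring from begin to end with brb_next() without consuming entries, so pending adjustments still apply to lookups.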