Commit 086b34c9 authored by unknown's avatar unknown

WL#3071 Maria checkpoint

Fixing bad comments (I remember my maths' teacher "one late night you'll
obey to the simplifications made by your tired neurons"; exactly
what happened here). In Checkpoint, when we flush a table's state
we must flush all log records (WAL), not only those before checkpoint
started.


storage/maria/ma_bitmap.c:
  there was a flaw in reasoning, bug does exist.
storage/maria/ma_blockrec.c:
  moving piece of comment to ma_checkpoint.c
storage/maria/ma_checkpoint.c:
  Comments.
  When checkpoint flushes a state, WAL imposes that all records up
  to this state have been flushed, not only up to checkpoint_start_log_horizon.
storage/maria/ma_recovery.c:
  finishing comment.
parent c2084d2a
......@@ -533,16 +533,25 @@ static my_bool _ma_read_bitmap_page(MARIA_SHARE *share,
Inexistent or half-created page (could be crash in the middle of
_ma_bitmap_create_first(), before appending maria_bitmap_marker).
*/
/*
We are updating data_file_length before writing any log record for the
row operation. What if now state is flushed by a checkpoint with the
new value, and crash before the checkpoint record is written, recovery
may not even open the table (no log records) so not fix
data_file_length ("WAL violation")? In fact this is ok:
- checkpoint flushes state only if share->id!=0
- so if state was flushed, table had share->id!=0, so had a
LOGREC_FILE_ID (or was in previous checkpoint record), so recovery will
meet and open it and fix data_file_length.
/**
@todo RECOVERY BUG
We are updating data_file_length before writing any log record for the
row operation. What if now state is flushed by a checkpoint with the
new value, and crash before the checkpoint record is written, recovery
may not even open the table (no log records) so not fix
data_file_length ("WAL violation")?
Scenario: assume share->id==0, then:
thread 1 (here) thread 2 (checkpoint)
update data_file_length
copy state to memory, flush log
set share->id and write FILE_ID (not flushed)
see share->id!=0 so flush state
crash
FILE_ID will be missing, Recovery will not open table and not fix
data_file_length. This bug should be fixed with other "checkpoint vs
bitmap" bugs.
One possibility will be logging a standalone LOGREC_CREATE_BITMAP in a
separate transaction (using dummy_transaction_object).
*/
share->state.state.data_file_length= end_of_page;
bzero(bitmap->map, bitmap->block_size);
......@@ -598,6 +607,12 @@ static my_bool _ma_change_bitmap_page(MARIA_HA *info,
if (bitmap->changed)
{
/**
@todo RECOVERY BUG this is going to flush the bitmap page possibly to
disk even though it could be over-allocated with not yet any REDO-UNDO
complete group (WAL violation: no way to undo the over-allocation if
crash). See also collect_tables().
*/
if (write_changed_bitmap(info->s, bitmap))
DBUG_RETURN(1);
bitmap->changed= 0;
......
......@@ -1444,17 +1444,9 @@ static my_bool write_tail(MARIA_HA *info,
{
/*
We are modifying a state member before writing the UNDO; this is a WAL
violation: assume this setting is made, checkpoint flushes new state,
and crash happens before the UNDO is written: how to undo the bad state?
Fortunately for data_file_length this is ok: as long as we change
data_file_length after writing any REDO or UNDO we are safe:
- checkpoint flushes state only if it's older than
checkpoint_start_log_horizon, and flushes log up to that horizon first
- so if checkpoint flushed state with new data_file_length, REDO is in
log so LOGREC_FILE_ID too, recovery will meet and open the table thus
fix data_file_length to be the file's physical size.
Same property is currently true in all places of this file which change
data_file_length.
violation. But for data_file_length this is ok, as long as we change
data_file_length after writing any log record (FILE_ID/REDO/UNDO) (see
collect_tables()).
*/
info->state->data_file_length= position + block_size;
}
......
......@@ -176,27 +176,6 @@ static int really_execute_checkpoint(void)
DBUG_PRINT("info",("checkpoint_start_log_horizon (%lu,0x%lx)",
LSN_IN_PARTS(checkpoint_start_log_horizon)));
lsn_store(checkpoint_start_log_horizon_char, checkpoint_start_log_horizon);
/*
We are going to flush the state of some tables (in collect_tables()) if
it's older than checkpoint_start_log_horizon. Before, all records
describing how to undo this flushed state must be in the log
(WAL). Usually this means UNDOs. In the special case of data_file_length,
recovery just needs to open the table, so any LOGREC_FILE_ID/REDO/UNDO
allowing recovery to understand it must open a table, is enough.
*/
/**
Apart from data|key_file_length which are easily recoverable from the OS,
all other state members must be updated only when writing the UNDO;
otherwise, if updated before, if their new value is flushed by a
checkpoint and there is a crash before UNDO is written, their REDO group
will be missing or at least incomplete and skipped by recovery, so bad
state value will stay. For example, setting key_root before writing the
UNDO: the table would have old index page (they were pinned at time of
crash) and a new, thus wrong, key_root.
@todo RECOVERY BUG check that all code honours that.
*/
if (translog_flush(checkpoint_start_log_horizon))
goto err;
/*
STEP 2: fetch information about transactions.
......@@ -887,9 +866,32 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
*/
}
translog_unlock();
/**
We are going to flush these states.
Before, all records describing how to undo such state must be
in the log (WAL). Usually this means UNDOs. In the special case of
data|key_file_length, recovery just needs to open the table to fix the
length, so any LOGREC_FILE_ID/REDO/UNDO allowing recovery to
understand it must open a table, is enough; so as long as
data|key_file_length is updated after writing any log record it's ok:
if we copied new value above, it means the record was before
state_copies_horizon and we flush such record below.
Apart from data|key_file_length which are easily recoverable from the
real file's size, all other state members must be updated only when
writing the UNDO; otherwise, if updated before, if their new value is
flushed by a checkpoint and there is a crash before UNDO is written,
their REDO group will be missing or at least incomplete and skipped
by recovery, so bad state value will stay. For example, setting
key_root before writing the UNDO: the table would have old index
pages (they were pinned at time of crash) and a new, thus wrong,
key_root.
@todo RECOVERY BUG check that all code honours that.
*/
if (translog_flush(state_copies_horizon))
goto err;
/* now we have cached states and they are WAL-safe*/
state_copies_end= state_copy;
state_copy= state_copies;
/* so now we have cached states */
}
/* locate our state among these cached ones */
......@@ -911,15 +913,11 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
dfile= share->bitmap.file;
/*
Ignore table which has no logged writes (all its future log records will
be found naturally by Recovery). This also avoids flushing
a data_file_length changed too early by a client (before any log record
was written, giving no chance to recovery to meet and open the table,
see _ma_read_bitmap_page()).
Ignore obsolete shares (_before_ setting themselves to last_version=0
they already did all flush and sync; if we flush their state now we may
be flushing an obsolete state onto a newer one (assuming the table has
been reopened with a different share but of course same physical index
file).
be found naturally by Recovery). Ignore obsolete shares (_before_
setting themselves to last_version=0 they already did all flush and
sync; if we flush their state now we may be flushing an obsolete state
onto a newer one (assuming the table has been reopened with a different
share but of course same physical index file).
*/
if ((share->id != 0) && (share->last_version != 0))
{
......@@ -974,7 +972,6 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
It may also be a share which got last_version==0 since we checked
last_version; in this case, it flushed its state and the LSN test
above will catch it.
Last, see comments at start of really_execute_checkpoint().
*/
}
else
......@@ -1005,6 +1002,12 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
each checkpoint if the table was once written and then not anymore.
*/
}
/**
@todo RECOVERY BUG this is going to flush the bitmap page possibly to
disk even though it could be over-allocated with not yet any
REDO-UNDO complete group (WAL violation: no way to undo the
over-allocation if crash); see also _ma_change_bitmap_page().
*/
sync_error|=
_ma_flush_bitmap(share); /* after that, all is in page cache */
DBUG_ASSERT(share->pagecache == maria_pagecache);
......
......@@ -1156,13 +1156,14 @@ prototype_redo_exec_hook(REDO_INSERT_ROW_HEAD)
/**
@todo RECOVERY BUG
we stamp page with UNDO's LSN. Assume an operation logs REDO-REDO-UNDO
where the two REDOs are about the same page. Then recovery applies first
REDO and skips second REDO which is wrong. Solutions:
a) when applying REDO, keep page pinned, don't stamp it, remember it;
when seeing UNDO, unpin pages and stamp them; for BLOB pages we cannot
pin them (too large for memory) so need an additional pass in REDO phase:
- find UNDO
- execute all REDOs about this UNDO but skipping REDOs for
where the two REDOs are about the same page (that is possible only with a
head or tail page, not blob page). Then recovery applies first REDO and
skips second REDO which is wrong. Solution:
a)
* when applying REDO to head or tail, keep page pinned, don't stamp it,
* when applying REDO to blob page, stamp it with UNDO's LSN
* when seeing UNDO, unpin head/tail pages and stamp them with UNDO's
LSN.
or b) when applying REDO, stamp page with REDO's LSN (=> difference in
'cmp' between run-time and recovery, need a special 'cmp'...).
*/
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment