WL#3071 Maria checkpoint

Fixing bad comments (I remember my maths' teacher "one late night you'll obey to the simplifications made by your tired neurons"; exactly what happened here).
In Checkpoint, when we flush a table's state we must flush all log records (WAL), not only those before checkpoint started.

storage/maria/ma_bitmap.c:
  there was a flaw in reasoning, bug does exist.
storage/maria/ma_blockrec.c:
  moving piece of comment to ma_checkpoint.c
storage/maria/ma_checkpoint.c:
  Comments. When checkpoint flushes a state, WAL imposes that all records up to this state have been flushed, not only up to checkpoint_start_log_horizon.
storage/maria/ma_recovery.c:
  finishing comment.
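
The WAL rule stated in this message can be illustrated with a small toy model. This is only a sketch under stated assumptions: every name below (toy_log, toy_state, toy_checkpoint_is_wal_safe and so on) is hypothetical and none of it is Maria's real API. It shows why flushing the log only up to the horizon taken at checkpoint start may be insufficient when a state is copied later, and why flushing up to the horizon taken after copying the state is enough.

```c
#include <assert.h>

/* Toy model only: all names are hypothetical, not Maria's API. */
typedef unsigned long toy_lsn;

struct toy_log
{
  toy_lsn horizon;       /* LSN that the next appended record will get */
  toy_lsn flushed_upto;  /* records with lsn < flushed_upto are on disk */
};

struct toy_state
{
  toy_lsn last_record_lsn;         /* record which justifies this state */
  unsigned long data_file_length;
};

/* Append one record to the in-memory log and return its LSN */
static toy_lsn toy_log_append(struct toy_log *log)
{
  return log->horizon++;
}

/* Make all records below 'upto' durable */
static void toy_log_flush(struct toy_log *log, toy_lsn upto)
{
  if (upto > log->flushed_upto)
    log->flushed_upto= upto;
}

/*
  Run one checkpoint while a writer is active.
  Returns 1 if the copied state is WAL-safe (its record is on disk)
  at the moment the checkpoint would write that state, 0 otherwise.
*/
static int toy_checkpoint_is_wal_safe(int flush_to_state_copies_horizon)
{
  struct toy_log log= { 10, 10 };         /* everything so far is flushed */
  struct toy_state state= { 9, 8192 };
  struct toy_state copy;
  toy_lsn checkpoint_start_horizon, state_copies_horizon;

  checkpoint_start_horizon= log.horizon;  /* checkpoint starts */

  /* a writer logs a record, then updates the state (correct order) */
  state.last_record_lsn= toy_log_append(&log);
  state.data_file_length= 16384;

  copy= state;                            /* checkpoint copies the state */
  state_copies_horizon= log.horizon;      /* horizon after the copy */

  toy_log_flush(&log, flush_to_state_copies_horizon ?
                state_copies_horizon : checkpoint_start_horizon);

  return copy.last_record_lsn < log.flushed_upto;
}
```

In this model, toy_checkpoint_is_wal_safe(0) yields 0 because the writer's record was appended after the checkpoint started and is never flushed, while toy_checkpoint_is_wal_safe(1) yields 1.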

Commit 086b34c9 authored Nov 11, 2007 by unknown
4 changed files
--- a/storage/maria/ma_bitmap.c
+++ b/storage/maria/ma_bitmap.c
@@ -533,16 +533,25 @@ static my_bool _ma_read_bitmap_page(MARIA_SHARE *share,
      Inexistent or half-created page (could be crash in the middle of
      _ma_bitmap_create_first(), before appending maria_bitmap_marker).
    */
-    /*
-      We are updating data_file_length before writing any log record for the
-      row operation. What if now state is flushed by a checkpoint with the
-      new value, and crash before the checkpoint record is written, recovery
-      may not even open the table (no log records) so not fix
-      data_file_length ("WAL violation")? In fact this is ok:
-      - checkpoint flushes state only if share->id!=0
-      - so if state was flushed, table had share->id!=0, so had a
-      LOGREC_FILE_ID (or was in previous checkpoint record), so recovery will
-      meet and open it and fix data_file_length.
+    /**
+       @todo RECOVERY BUG
+       We are updating data_file_length before writing any log record for the
+       row operation. What if now state is flushed by a checkpoint with the
+       new value, and crash before the checkpoint record is written, recovery
+       may not even open the table (no log records) so not fix
+       data_file_length ("WAL violation")?
+       Scenario: assume share->id==0, then:
+       thread 1 (here)                thread 2 (checkpoint)
+       update data_file_length
+                                      copy state to memory, flush log
+       set share->id and write FILE_ID (not flushed)
+                                      see share->id!=0 so flush state
+                                      crash
+       FILE_ID will be missing, Recovery will not open table and not fix
+       data_file_length. This bug should be fixed with other "checkpoint vs
+       bitmap" bugs.
+       One possibility will be logging a standalone LOGREC_CREATE_BITMAP in a
+       separate transaction (using dummy_transaction_object).
    */
    share->state.state.data_file_length= end_of_page;
    bzero(bitmap->map, bitmap->block_size);
@@ -598,6 +607,12 @@ static my_bool _ma_change_bitmap_page(MARIA_HA *info,

  if (bitmap->changed)
  {
+    /**
+       @todo RECOVERY BUG this is going to flush the bitmap page possibly to
+       disk even though it could be over-allocated with not yet any REDO-UNDO
+       complete group (WAL violation: no way to undo the over-allocation if
+       crash). See also collect_tables().
+    */
    if (write_changed_bitmap(info->s, bitmap))
      DBUG_RETURN(1);
    bitmap->changed= 0;

--- a/storage/maria/ma_blockrec.c
+++ b/storage/maria/ma_blockrec.c
@@ -1444,17 +1444,9 @@ static my_bool write_tail(MARIA_HA *info,
  {
    /*
      We are modifying a state member before writing the UNDO; this is a WAL
-      violation: assume this setting is made, checkpoint flushes new state,
-      and crash happens before the UNDO is written: how to undo the bad state?
-      Fortunately for data_file_length this is ok: as long as we change
-      data_file_length after writing any REDO or UNDO we are safe:
-      - checkpoint flushes state only if it's older than
-      checkpoint_start_log_horizon, and flushes log up to that horizon first
-      - so if checkpoint flushed state with new data_file_length, REDO is in
-      log so LOGREC_FILE_ID too, recovery will meet and open the table thus
-      fix data_file_length to be the file's physical size.
-      Same property is currently true in all places of this file which change
-      data_file_length.
+      violation. But for data_file_length this is ok, as long as we change
+      data_file_length after writing any log record (FILE_ID/REDO/UNDO) (see
+      collect_tables()).
    */
    info->state->data_file_length= position + block_size;
  }

--- a/storage/maria/ma_checkpoint.c
+++ b/storage/maria/ma_checkpoint.c
@@ -176,27 +176,6 @@ static int really_execute_checkpoint(void)
  DBUG_PRINT("info",("checkpoint_start_log_horizon (%lu,0x%lx)",
                     LSN_IN_PARTS(checkpoint_start_log_horizon)));
  lsn_store(checkpoint_start_log_horizon_char, checkpoint_start_log_horizon);
-  /*
-    We are going to flush the state of some tables (in collect_tables()) if
-    it's older than checkpoint_start_log_horizon. Before, all records
-    describing how to undo this flushed state must be in the log
-    (WAL). Usually this means UNDOs. In the special case of data_file_length,
-    recovery just needs to open the table, so any LOGREC_FILE_ID/REDO/UNDO
-    allowing recovery to understand it must open a table, is enough.
-  */
-  /**
-     Apart from data|key_file_length which are easily recoverable from the OS,
-     all other state members must be updated only when writing the UNDO;
-     otherwise, if updated before, if their new value is flushed by a
-     checkpoint and there is a crash before UNDO is written, their REDO group
-     will be missing or at least incomplete and skipped by recovery, so bad
-     state value will stay. For example, setting key_root before writing the
-     UNDO: the table would have old index page (they were pinned at time of
-     crash) and a new, thus wrong, key_root.
-     @todo RECOVERY BUG check that all code honours that.
-  */
-  if (translog_flush(checkpoint_start_log_horizon))
-    goto err;

  /*
    STEP 2: fetch information about transactions.
@@ -887,9 +866,32 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
        */
      }
      translog_unlock();
+      /**
+         We are going to flush these states.
+         Before, all records describing how to undo such state must be
+         in the log (WAL). Usually this means UNDOs. In the special case of
+         data|key_file_length, recovery just needs to open the table to fix the
+         length, so any LOGREC_FILE_ID/REDO/UNDO allowing recovery to
+         understand it must open a table, is enough; so as long as
+         data|key_file_length is updated after writing any log record it's ok:
+         if we copied new value above, it means the record was before
+         state_copies_horizon and we flush such record below.
+         Apart from data|key_file_length which are easily recoverable from the
+         real file's size, all other state members must be updated only when
+         writing the UNDO; otherwise, if updated before, if their new value is
+         flushed by a checkpoint and there is a crash before UNDO is written,
+         their REDO group will be missing or at least incomplete and skipped
+         by recovery, so bad state value will stay. For example, setting
+         key_root before writing the UNDO: the table would have old index
+         pages (they were pinned at time of crash) and a new, thus wrong,
+         key_root.
+         @todo RECOVERY BUG check that all code honours that.
+      */
+      if (translog_flush(state_copies_horizon))
+        goto err;
+      /* now we have cached states and they are WAL-safe*/
      state_copies_end= state_copy;
      state_copy= state_copies;
-      /* so now we have cached states */
    }

    /* locate our state among these cached ones */
@@ -911,15 +913,11 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
    dfile= share->bitmap.file;
    /*
      Ignore table which has no logged writes (all its future log records will
-      be found naturally by Recovery). This also avoids flushing
-      a data_file_length changed too early by a client (before any log record
-      was written, giving no chance to recovery to meet and open the table,
-      see _ma_read_bitmap_page()).
-      Ignore obsolete shares (_before_ setting themselves to last_version=0
-      they already did all flush and sync; if we flush their state now we may
-      be flushing an obsolete state onto a newer one (assuming the table has
-      been reopened with a different share but of course same physical index
-      file).
+      be found naturally by Recovery). Ignore obsolete shares (_before_
+      setting themselves to last_version=0 they already did all flush and
+      sync; if we flush their state now we may be flushing an obsolete state
+      onto a newer one (assuming the table has been reopened with a different
+      share but of course same physical index file).
    */
    if ((share->id != 0) && (share->last_version != 0))
    {
@@ -974,7 +972,6 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
          It may also be a share which got last_version==0 since we checked
          last_version; in this case, it flushed its state and the LSN test
          above will catch it.
-          Last, see comments at start of really_execute_checkpoint().
        */
      }
      else
@@ -1005,6 +1002,12 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
          each checkpoint if the table was once written and then not anymore.
        */
      }
+      /**
+         @todo RECOVERY BUG this is going to flush the bitmap page possibly to
+         disk even though it could be over-allocated with not yet any
+         REDO-UNDO complete group (WAL violation: no way to undo the
+         over-allocation if crash); see also _ma_change_bitmap_page().
+      */
      sync_error|=
        _ma_flush_bitmap(share); /* after that, all is in page cache */
      DBUG_ASSERT(share->pagecache == maria_pagecache);

--- a/storage/maria/ma_recovery.c
+++ b/storage/maria/ma_recovery.c
@@ -1156,13 +1156,14 @@ prototype_redo_exec_hook(REDO_INSERT_ROW_HEAD)
  /**
     @todo RECOVERY BUG
     we stamp page with UNDO's LSN. Assume an operation logs REDO-REDO-UNDO
-     where the two REDOs are about the same page. Then recovery applies first
-     REDO and skips second REDO which is wrong. Solutions:
-     a) when applying REDO, keep page pinned, don't stamp it, remember it;
-     when seeing UNDO, unpin pages and stamp them; for BLOB pages we cannot
-     pin them (too large for memory) so need an additional pass in REDO phase:
-      - find UNDO
-      - execute all REDOs about this UNDO but skipping REDOs for
+     where the two REDOs are about the same page (that is possible only with a
+     head or tail page, not blob page). Then recovery applies first REDO and
+     skips second REDO which is wrong. Solution:
+     a)
+       * when applying REDO to head or tail, keep page pinned, don't stamp it,
+       * when applying REDO to blob page, stamp it with UNDO's LSN
+       * when seeing UNDO, unpin head/tail pages and stamp them with UNDO's
+       LSN.
     or b) when applying REDO, stamp page with REDO's LSN (=> difference in
     'cmp' between run-time and recovery, need a special 'cmp'...).
  */
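
The REDO-REDO-UNDO problem described in the ma_recovery.c comment can be reproduced with a toy model. This is a sketch under assumptions, not recovery code: toy_page, toy_apply_redo and toy_recover_group are hypothetical names, a page is reduced to a counter, and a REDO is skipped when the page stamp is not older than the REDO's LSN. Passing stamp_early=1 mimics stamping with the UNDO's LSN right after the first REDO; stamp_early=0 mimics solution a), where the stamp is deferred until the UNDO is seen.

```c
#include <assert.h>

/* Toy model only: all names are hypothetical, not Maria's API. */
typedef unsigned long toy_lsn;

struct toy_page
{
  toy_lsn lsn;  /* stamp: changes up to this LSN are already in the page */
  int value;    /* stands in for the page content */
};

/*
  Apply a REDO adding 'delta' to the page unless the stamp says the change
  is already there. Returns 1 if applied, 0 if skipped.
*/
static int toy_apply_redo(struct toy_page *page, toy_lsn redo_lsn, int delta)
{
  if (page->lsn >= redo_lsn)
    return 0;
  page->value+= delta;
  return 1;
}

/*
  Recover one group REDO(10) REDO(11) UNDO(12) where both REDOs are about
  the same page. Returns the final page content (2 is the correct result).
*/
static int toy_recover_group(int stamp_early)
{
  struct toy_page page= { 5, 0 };
  const toy_lsn redo1= 10, redo2= 11, undo= 12;

  toy_apply_redo(&page, redo1, 1);
  if (stamp_early)
    page.lsn= undo;  /* flawed: second REDO now looks already applied */
  toy_apply_redo(&page, redo2, 1);
  page.lsn= undo;    /* stamp once the UNDO is seen */
  return page.value;
}
```

With early stamping the second REDO is skipped and the function returns 1; with deferred stamping both REDOs are applied and it returns 2.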