Commit 561c18c1 authored by monty@donna.mysql.fi

Updated manual

parents 1afbe4fe 37f30684
@@ -498,7 +498,7 @@ MySQL Table Types
* HEAP:: HEAP tables
* BDB:: BDB or Berkeley_db tables
* GEMINI:: GEMINI tables
* InnoDB:: InnoDB tables
MyISAM Tables
@@ -529,12 +529,12 @@ GEMINI Tables
* GEMINI features::
* GEMINI TODO::
InnoDB Tables
* InnoDB overview::
* InnoDB start:: InnoDB startup options
* Using InnoDB tables:: Using InnoDB tables
* InnoDB restrictions:: Some restrictions on @code{InnoDB} tables:
MySQL Tutorial
@@ -4141,12 +4141,12 @@ phone back within 48 hours to discuss @code{MySQL} related issues.
@end itemize
@cindex support, BDB Tables
@cindex support, InnoDB Tables
@cindex support, GEMINI Tables
@node Table handler support, , Telephone support, Support
@subsection Support for other table handlers
To get support for @code{BDB} tables, @code{InnoDB} tables or
@code{GEMINI} tables you have to pay an additional 30% on the standard
support price for each of the table handlers you would like to have
support for.
@@ -4195,14 +4195,18 @@ For a list of sites from which you can obtain @strong{MySQL}, see
@ref{Getting MySQL, , Getting @strong{MySQL}}.
@item
To see which platforms are supported, see @ref{Which OS}. Please note that
not all supported systems are equally good for running @strong{MySQL}.
On some it is much more robust and efficient than on others - see
@ref{Which OS} for details.
@item
Several versions of @strong{MySQL} are available in both binary and
source distributions. We also provide public access to our current
source tree for those who want to see our most recent developments and
help us test new code. To determine which version and type of
distribution you should use, see @ref{Which version}. When in doubt,
use the binary distribution.
@item
Installation instructions for binary and source distributions are described
@@ -4985,7 +4989,7 @@ We use GNU Autoconf, so it is possible to port @strong{MySQL} to all modern
systems with working Posix threads and a C++ compiler. (To compile only the
client code, a C++ compiler is required but not threads.) We use and develop
the software ourselves primarily on Sun Solaris (Versions 2.5 - 2.7) and
SuSE Linux Version 7.x.
Note that for many operating systems, the native thread support works only
in the latest versions. @strong{MySQL} has been reported to compile
@@ -5035,6 +5039,75 @@ Tru64 Unix
Win95, Win98, NT, and Win2000. @xref{Windows}.
@end itemize
Note that not all platforms are suited equally well for running
@strong{MySQL}. How well a certain platform is suited for a high-load
mission-critical @strong{MySQL} server is determined by the following
factors:

@itemize
@item
General stability of the thread library. A platform may have an
excellent reputation otherwise, but if the thread library is unstable
in the code that is called by @strong{MySQL}, @strong{MySQL} will be
only as stable as the thread library, even if everything else is
perfect.
@item
The ability of the kernel and/or thread library to take advantage of
@strong{SMP} on
multi-processor systems. In other words, when a process creates a thread, it
should be possible for that thread to run on a different CPU than the original
process.
@item
The ability of the kernel and/or the thread library to run many threads
which acquire and release a mutex over a short critical region
frequently without excessive context switches. In other words, if the
implementation of @code{pthread_mutex_lock()} is too eager to yield the
CPU, this will hurt @strong{MySQL} tremendously. If this issue is not
taken care of, adding extra CPUs will actually make @strong{MySQL}
slower.
@item
General file system stability/performance.
@item
The ability of the file system to deal with large files at all, and to
deal with them efficiently, if your tables are big.
@item
Our level of expertise here at @strong{MySQL AB} with the platform. If we know
a platform well, we introduce platform-specific optimizations/fixes enabled at
compile time. We can also provide advice on configuring your system optimally
for @strong{MySQL}.
@item
The amount of testing of similar configurations we have done internally.
@item
The number of users that have successfully run @strong{MySQL} on that
platform in similar configurations. If this number is high, the chances of
hitting some platform-specific surprise are much smaller.
@end itemize
Based on the above criteria, the best platforms for running
@strong{MySQL} at this point are x86 with SuSE Linux 7.1, 2.4 kernel and
ReiserFS (or any similar Linux distribution) and Sparc with Solaris 2.7
or 2.8. FreeBSD comes third, but we really hope it will join the top
club once the thread library is improved. We also hope that at some
point we will be able to include into the top category all other
platforms on which @strong{MySQL} compiles and runs OK, but not quite
with the same level of stability and performance. This will require
some effort on our part in cooperation with the developers of the
OS/library components @strong{MySQL} depends upon. If you are
interested in making one of those components better, are in a position
to influence their development, and need more detailed instructions on
what @strong{MySQL} needs to run better, send an e-mail to
@email{internals@@lists.mysql.com}.
Please note that the comparison above is not to say that one OS is
better or worse than the other in general. We are talking about
choosing a particular OS for a dedicated purpose - running
@strong{MySQL} - and comparing platforms in that regard only. With this
in mind, the result of this comparison would be different if we included
more issues in it. And in some cases, the reason one OS is better than
the other could simply be that we have put forth more effort into
testing on and optimizing for that particular platform. We are just
stating our observations to help you decide which platform to use
@strong{MySQL} on in your setup.
@cindex MySQL binary distribution
@cindex MySQL source distribution
@cindex release numbers
@@ -5819,6 +5892,11 @@ To install the HP-UX tar.gz distribution, you must have a copy of GNU
@node Installing source, Installing source tree, Installing binary, Installing
@section Installing a MySQL Source Distribution
Before you proceed with the source installation, check first to see if our
binary is available for your platform and if it will work for you. We
put a lot of effort into making sure that our binaries are built with the
best possible options.
You need the following tools to build and install @strong{MySQL} from source:
@itemize @bullet
@@ -5846,6 +5924,20 @@ sometimes required. If you have problems, we recommend trying GNU
@code{make} 3.75 or newer.
@end itemize
If you are using a recent version of @strong{gcc}, recent enough to understand
the @code{-fno-exceptions} option, it is @strong{VERY IMPORTANT} that you use
it. Otherwise, you may compile a binary that crashes randomly. We also
recommend that you use @code{-felide-constructors} and @code{-fno-rtti} along
with @code{-fno-exceptions}. When in doubt, do the following:
@example
CFLAGS="-O3" CXX=gcc CXXFLAGS="-O3 -felide-constructors -fno-exceptions -fno-rtti" ./configure --prefix=/usr/local/mysql --enable-assembler --with-mysqld-ldflags=-all-static
@end example
On most systems this will give you a fast and stable binary.
@c texi2html fails to split chapters if I use strong for all of this.
If you run into problems, @strong{PLEASE ALWAYS USE @code{mysqlbug}} when
posting questions to @email{mysql@@lists.mysql.com}. Even if the problem
@@ -9847,7 +9939,7 @@ If you are using Gemini tables, refer to the Gemini-specific startup options.
@xref{GEMINI start}.
If you are using Innodb tables, refer to the Innodb-specific startup
options. @xref{InnoDB start}.
@node Automatic start, Command-line options, Starting server, Post-installation
@subsection Starting and Stopping MySQL Automatically
@@ -11259,7 +11351,7 @@ issue. For those of our users who are concerned with or have wondered
about transactions vis-a-vis @strong{MySQL}, there is a ``@strong{MySQL}
way'' as we have outlined above. For those where safety is more
important than speed, we recommend the @code{BDB},
@code{GEMINI} or @code{InnoDB} tables for all their critical
data. @xref{Table types}.
One final note: We are currently working on a safe replication schema
@@ -11487,11 +11579,11 @@ Entry level SQL92. ODBC levels 0-2.
@cindex updating, tables
@cindex @code{BDB} tables
@cindex @code{GEMINI} tables
@cindex @code{InnoDB} tables
The following mostly applies only for @code{ISAM}, @code{MyISAM}, and
@code{HEAP} tables. If you only use transaction-safe tables (@code{BDB},
@code{GEMINI} or @code{InnoDB} tables) in an update, you can do
@code{COMMIT} and @code{ROLLBACK} also with @strong{MySQL}.
@xref{COMMIT}.
@@ -18511,7 +18603,7 @@ reference_option:
RESTRICT | CASCADE | SET NULL | NO ACTION | SET DEFAULT
table_options:
TYPE = @{BDB | HEAP | ISAM | InnoDB | MERGE | MYISAM @}
or AUTO_INCREMENT = #
or AVG_ROW_LENGTH = #
or CHECKSUM = @{0 | 1@}
@@ -18753,7 +18845,7 @@ The different table types are:
@item GEMINI @tab Transaction-safe tables with row-level locking @xref{GEMINI}.
@item HEAP @tab The data for this table is only stored in memory. @xref{HEAP}.
@item ISAM @tab The original table handler. @xref{ISAM}.
@item InnoDB @tab Transaction-safe tables with row locking. @xref{InnoDB}.
@item MERGE @tab A collection of MyISAM tables used as one table. @xref{MERGE}.
@item MyISAM @tab The new binary portable table handler that is replacing ISAM. @xref{MyISAM}.
@end multitable
@@ -21166,7 +21258,7 @@ The following columns are returned:
@item @code{Comment} @tab The comment used when creating the table (or some information why @strong{MySQL} couldn't access the table information).
@end multitable
@code{InnoDB} tables will report the free space in the tablespace
in the table comment.
@node SHOW STATUS, SHOW VARIABLES, SHOW TABLE STATUS, SHOW
@@ -22320,7 +22412,7 @@ as soon as you execute an update, @strong{MySQL} will store the update on
disk.
If you are using transaction-safe tables (like @code{BDB},
@code{InnoDB} or @code{GEMINI}), you can put @strong{MySQL} into
non-@code{autocommit} mode with the following command:
@example
@@ -23147,7 +23239,7 @@ used them.
@cindex @code{GEMINI} table type
@cindex @code{HEAP} table type
@cindex @code{ISAM} table type
@cindex @code{InnoDB} table type
@cindex @code{MERGE} table type
@cindex MySQL table types
@cindex @code{MyISAM} table type
@@ -23158,7 +23250,7 @@ used them.
As of @strong{MySQL} Version 3.23.6, you can choose between three basic
table formats (@code{ISAM}, @code{HEAP} and @code{MyISAM}). Newer
@strong{MySQL} versions may support additional table types (@code{BDB},
@code{GEMINI} or @code{InnoDB}), depending on how you compile it.
When you create a new table, you can tell @strong{MySQL} which table
type it should use for the table. @strong{MySQL} will always create a
@@ -23173,7 +23265,7 @@ You can convert tables between different types with the @code{ALTER
TABLE} statement. @xref{ALTER TABLE, , @code{ALTER TABLE}}.
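For example, a minimal sketch of such a conversion (the table name
@code{mytable} here is hypothetical):

@example
ALTER TABLE mytable TYPE = InnoDB;
@end example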
Note that @strong{MySQL} supports two different kinds of
tables: transaction-safe tables (@code{BDB}, @code{InnoDB} or
@code{GEMINI}) and not transaction-safe tables (@code{HEAP}, @code{ISAM},
@code{MERGE}, and @code{MyISAM}).
@@ -23216,7 +23308,7 @@ of both worlds.
* HEAP:: HEAP tables
* BDB:: BDB or Berkeley_db tables
* GEMINI:: GEMINI tables
* InnoDB:: InnoDB tables
@end menu
@node MyISAM, MERGE, Table types, Table types
@@ -24181,7 +24273,7 @@ not trivial).
@end itemize
@cindex tables, @code{GEMINI}
@node GEMINI, InnoDB, BDB, Table types
@section GEMINI Tables
@menu
@@ -24262,238 +24354,1043 @@ limited by @code{gemini_connection_limit}. The default is 100 users.
NuSphere is working on removing these limitations.
@node InnoDB, , GEMINI, Table types
@section InnoDB Tables
@menu
* InnoDB overview:: InnoDB tables overview
* InnoDB start:: InnoDB startup options
* Creating an InnoDB database:: Creating an InnoDB database
* Using InnoDB tables:: Creating InnoDB tables
* Adding and removing:: Adding and removing InnoDB data and log files
* Backing up:: Backing up and recovering an InnoDB database
* Moving:: Moving an InnoDB database to another machine
* InnoDB transaction model:: InnoDB transaction model
* Implementation:: Implementation of multiversioning
* Table and index:: Table and index structures
* File space management:: File space management and disk i/o
* Error handling:: Error handling
* InnoDB restrictions:: Some restrictions on InnoDB tables
* InnoDB contact information:: InnoDB contact information
@end menu

@node InnoDB overview, InnoDB start, InnoDB, InnoDB
@subsection InnoDB tables overview

InnoDB tables are included in the @strong{MySQL} source distribution
starting from 3.23.34a and are activated in the @strong{MySQL}-max
binary.
If you have downloaded a binary version of MySQL that includes
support for InnoDB, simply follow the instructions for
installing a binary version of MySQL.
See section 4.6 'Installing a MySQL Binary Distribution'.

To compile MySQL with InnoDB support, download MySQL-3.23.34a or newer
and configure @code{MySQL} with the
@code{--with-innobase} option. Starting from MySQL-3.23.37 the option
is @code{--with-innodb}. See section
4.7 'Installing a MySQL Source Distribution'.

@example
cd /path/to/source/of/mysql-3.23.37
./configure --with-innodb
@end example

InnoDB provides MySQL with a transaction-safe table handler with
commit, rollback, and crash recovery capabilities. InnoDB does
locking on row level, and also provides an Oracle-style consistent
non-locking read in @code{SELECTS}, which increases transaction
concurrency. There is no need for lock escalation in InnoDB,
because row level locks in InnoDB fit in very small space.
Technically, InnoDB is a database backend placed under MySQL. InnoDB
has its own buffer pool for caching data and indexes in main
memory. InnoDB stores its tables and indexes in a tablespace, which
may consist of several files. This is different from, for example,
@code{MyISAM} tables where each table is stored as a separate file.
InnoDB is distributed under the GNU GPL License Version 2 (of June 1991).
In the source distribution of MySQL, InnoDB appears as a subdirectory.
@node InnoDB start
@subsection InnoDB startup options

Beginning from MySQL-3.23.37 the prefix of the options is changed
from @code{innobase_...} to @code{innodb_...}.

To use InnoDB tables you must specify configuration parameters in the
@code{[mysqld]} section of the MySQL configuration file @file{my.cnf}.
Suppose you have a Windows NT machine with 128 MB RAM and a
single 10 GB hard disk.
Below is an example of possible configuration parameters in @file{my.cnf} for
InnoDB:
@example
innodb_data_home_dir = c:\ibdata
innodb_data_file_path = ibdata1:2000M;ibdata2:2000M
set-variable = innodb_mirrored_log_groups=1
innodb_log_group_home_dir = c:\iblogs
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=30M
set-variable = innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=1
innodb_log_arch_dir = c:\iblogs
innodb_log_archive=0
set-variable = innodb_buffer_pool_size=80M
set-variable = innodb_additional_mem_pool_size=10M
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50
@end example
Suppose you have a Linux machine with 512 MB RAM and
three 20 GB hard disks (at directory paths @file{/},
@file{/dr2} and @file{/dr3}).
Below is an example of possible configuration parameters in @file{my.cnf} for
InnoDB:
@example
innodb_data_home_dir = /
innodb_data_file_path = ibdata/ibdata1:2000M;dr2/ibdata/ibdata2:2000M
set-variable = innodb_mirrored_log_groups=1
innodb_log_group_home_dir = /dr3
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=50M
set-variable = innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=1
innodb_log_arch_dir = /dr3/iblogs
innodb_log_archive=0
set-variable = innodb_buffer_pool_size=400M
set-variable = innodb_additional_mem_pool_size=20M
set-variable = innodb_file_io_threads=4
set-variable = innodb_lock_wait_timeout=50
@end example
Note that we have placed the two data files on different disks.
The reason for the name @code{innodb_data_file_path} is that
you can also specify paths to your data files, and
@code{innodb_data_home_dir} is just textually concatenated
before your data file paths, adding a possible slash or
backslash in between. InnoDB will fill the tablespace
formed by the data files from bottom up. In some cases it will
improve the performance of the database if all data is not placed
on the same physical disk. Putting log files on a different disk from
data is very often beneficial for performance.
The meanings of the configuration parameters are the following:

@multitable @columnfractions .30 .70
@item @code{innodb_data_home_dir} @tab
The common part of the directory path for all InnoDB data files.
@item @code{innodb_data_file_path} @tab
Paths to individual data files and their sizes. The full directory path
to each data file is acquired by concatenating innodb_data_home_dir to
the paths specified here. The file sizes are specified in megabytes,
hence the 'M' after the size specification above. Do not set a file size
bigger than 4000M, and on most operating systems not bigger than 2000M.
InnoDB also understands the abbreviation 'G', 1G meaning 1024M.
@item @code{innodb_mirrored_log_groups} @tab
Number of identical copies of log groups we
keep for the database. Currently this should be set to 1.
@item @code{innodb_log_group_home_dir} @tab
Directory path to InnoDB log files.
@item @code{innodb_log_files_in_group} @tab
Number of log files in the log group. InnoDB writes to the files in a
circular fashion. Value 3 is recommended here.
@item @code{innodb_log_file_size} @tab
Size of each log file in a log group in megabytes. Sensible values range
from 1M to the size of the buffer pool specified below. The bigger the
value, the less checkpoint flush activity is needed in the buffer pool,
saving disk i/o. But bigger log files also mean that recovery will be
slower in case of a crash. The file size restriction is the same as for
a data file.
@item @code{innodb_log_buffer_size} @tab
The size of the buffer which InnoDB uses to write log to the log files
on disk. Sensible values range from 1M to half the combined size of log
files. A big log buffer allows large transactions to run without a need
to write the log to disk until the transaction commit. Thus, if you have
big transactions, making the log buffer big will save disk i/o.
@item @code{innodb_flush_log_at_trx_commit} @tab
Normally this is set to 1, meaning that at a transaction commit the log
is flushed to disk, and the modifications made by the transaction become
permanent, and survive a database crash. If you are willing to
compromise this safety, and you are running small transactions, you may
set this to 0 to reduce disk i/o to the logs.
@item @code{innodb_log_arch_dir} @tab
The directory where fully written log files would be archived if we used
log archiving. The value of this parameter should currently be set the
same as @code{innodb_log_group_home_dir}.
@item @code{innodb_log_archive} @tab
This value should currently be set to 0. As recovery from a backup is
done by MySQL using its own log files, there is currently no need to
archive InnoDB log files.
@item @code{innodb_buffer_pool_size} @tab
The size of the memory buffer InnoDB uses to cache data and indexes of
its tables. The bigger you set this the less disk i/o is needed to
access data in tables. On a dedicated database server you may set this
parameter up to 90 % of the machine physical memory size. Do not set it
too large, though, because competition for physical memory may cause
paging in the operating system.
@item @code{innodb_additional_mem_pool_size} @tab
Size of a memory pool InnoDB uses to store data dictionary information
and other internal data structures. A sensible value for this might be
2M, but the more tables you have in your application the more you will
need to allocate here. If InnoDB runs out of memory in this pool, it
will start to allocate memory from the operating system, and write
warning messages to the MySQL error log.
@item @code{innodb_file_io_threads} @tab
Number of file i/o threads in InnoDB. Normally, this should be 4, but
on Windows NT disk i/o may benefit from a larger number.
@item @code{innodb_lock_wait_timeout} @tab
Timeout in seconds an InnoDB transaction may wait for a lock before
being rolled back. InnoDB automatically detects transaction deadlocks
in its own lock table and rolls back the transaction. If you use the
@code{LOCK TABLES} command, or transaction-safe table handlers other
than InnoDB in the same transaction, then a deadlock may arise which
InnoDB cannot notice. In cases like this the timeout is useful to
resolve the situation.
@end multitable
@node Creating an InnoDB database
@subsection Creating an InnoDB database
Suppose you have installed MySQL and have edited @file{my.cnf} so that
it contains the necessary InnoDB configuration parameters.
Before starting MySQL you should check that the directories you have
specified for InnoDB data files and log files exist and that you have
access rights to those directories. InnoDB
cannot create directories, only files. Check also that you have enough
disk space for the data and log files.
When you now start MySQL, InnoDB will start creating your data files
and log files. InnoDB will print something like the following:

@example
~/mysqlm/sql > mysqld
InnoDB: The first specified data file /home/heikki/data/ibdata1 did not exist:
InnoDB: a new database to be created!
InnoDB: Setting file /home/heikki/data/ibdata1 size to 134217728
InnoDB: Database physically writes the file full: wait...
InnoDB: Data file /home/heikki/data/ibdata2 did not exist: new to be created
InnoDB: Setting file /home/heikki/data/ibdata2 size to 262144000
InnoDB: Database physically writes the file full: wait...
InnoDB: Log file /home/heikki/data/logs/ib_logfile0 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile0 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile1 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile1 size to 5242880
InnoDB: Log file /home/heikki/data/logs/ib_logfile2 did not exist: new to be created
InnoDB: Setting log file /home/heikki/data/logs/ib_logfile2 size to 5242880
InnoDB: Started
mysqld: ready for connections
@end example
A new InnoDB database has now been created. You can connect to the MySQL
server with the usual MySQL client programs like @code{mysql}.
When you shut down the MySQL server with @file{mysqladmin shutdown},
InnoDB output will be like the following:
@example
010321 18:33:34 mysqld: Normal shutdown
010321 18:33:34 mysqld: Shutdown Complete
InnoDB: Starting shutdown...
InnoDB: Shutdown completed
@end example
You can now look at the data files and logs directories and you
will see the files created. The log directory will also contain
a small file named @file{ib_arch_log_0000000000}. That file
resulted from the database creation, after which InnoDB switched off
log archiving.
When MySQL is again started, the output will be like the following:
@example
~/mysqlm/sql > mysqld
InnoDB: Started
mysqld: ready for connections
@end example
@subsubsection If something goes wrong in database creation
If something goes wrong in an InnoDB database creation, you should delete
all files created by InnoDB. This means all data files, all log files,
and the small archived log file; in case you already created
some InnoDB tables, delete also the corresponding @file{.frm}
files for these tables from the MySQL database directories. Then you can
try the InnoDB database creation again.
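A minimal sketch of such a cleanup, assuming the hypothetical file
layout of the startup example above; adjust all paths to your own
configuration, and remove only the @file{.frm} files of tables you
created as InnoDB:

@example
rm /home/heikki/data/ibdata1 /home/heikki/data/ibdata2
rm /home/heikki/data/logs/ib_logfile0 /home/heikki/data/logs/ib_logfile1
rm /home/heikki/data/logs/ib_logfile2
rm /home/heikki/data/logs/ib_arch_log_0000000000
rm /usr/local/mysql/data/test/mytable.frm
@end example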
@node Using InnoDB tables
@subsection Creating InnoDB tables
Suppose you have started the MySQL client with the command
@code{mysql test}.
To create a table in the InnoDB format you must specify
@code{TYPE = InnoDB} in the table creation SQL command:
@example
CREATE TABLE CUSTOMER (A INT, B CHAR (20), INDEX (A)) TYPE = InnoDB;
@end example
This SQL command will create a table and an index on column @code{A}
into the InnoDB tablespace consisting of the data files you specified
in @file{my.cnf}. In addition MySQL will create a file
@file{CUSTOMER.frm} in the MySQL database directory @file{test}.
Internally, InnoDB will add to its own data dictionary an entry
for table @code{'test/CUSTOMER'}. Thus you can create a table
of the same name @code{CUSTOMER} in another database of MySQL, and
the table names will not collide inside InnoDB.
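For instance, a sketch of this, assuming a second database named
@code{test2} (a hypothetical name):

@example
CREATE DATABASE test2;
USE test2;
CREATE TABLE CUSTOMER (A INT, B CHAR (20), INDEX (A)) TYPE = InnoDB;
@end example

Inside InnoDB the second table is registered as @code{'test2/CUSTOMER'},
which does not collide with @code{'test/CUSTOMER'}.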
You can query the amount of free space in the InnoDB tablespace
by issuing the table status command of MySQL for any table you have
created with @code{TYPE = InnoDB}. Then the amount of free
space in the tablespace appears in the table comment section in the
output of @code{SHOW}. An example:

@example
SHOW TABLE STATUS FROM test LIKE 'CUSTOMER'
@end example

Note that the statistics @code{SHOW} gives about InnoDB tables
are only approximate: they are used in SQL optimization. Table and
index reserved sizes in bytes are accurate, though.
NOTE: @code{DROP DATABASE} does not currently work for InnoDB tables!
You must drop the tables individually. Also take care not to delete or
add @file{.frm} files to your InnoDB database manually: use
@code{CREATE TABLE} and @code{DROP TABLE} commands.
InnoDB has its own internal data dictionary, and you will get problems
if the MySQL @file{.frm} files are out of 'sync' with the InnoDB
internal data dictionary.
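Thus, to remove a database containing InnoDB tables, a sketch of the
procedure (using the names from the examples above):

@example
DROP TABLE CUSTOMER;      -- drop each InnoDB table individually
DROP DATABASE test;       -- only after all its InnoDB tables are gone
@end example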
@node Adding and removing
@subsection Adding and removing InnoDB data and log files
You cannot increase the size of an InnoDB data file. To add more space to
your tablespace you have to add a new data file. To do this you have to
shut down your MySQL database, edit the @file{my.cnf} file, adding a
new file to @code{innodb_data_file_path}, and then start MySQL
again.
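For example, a sketch of extending the Linux configuration shown
earlier with a third data file (the path and size are hypothetical):

@example
innodb_data_file_path = ibdata/ibdata1:2000M;dr2/ibdata/ibdata2:2000M;ibdata/ibdata3:1000M
@end example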
Currently you cannot remove a data file from InnoDB. To decrease the
size of your database you have to use @code{mysqldump} to dump
all your tables, create a new database, and import your tables to the
new database.

If you want to change the number or the size of your InnoDB log files,
you have to shut down MySQL and make sure that it shuts down without errors.
Then copy the old log files into a safe place just in case something
went wrong in the shutdown and you will need them to recover the
database. Then delete the old log files from the log file directory,
edit @file{my.cnf}, and start MySQL again. InnoDB will tell
you at startup that it is creating new log files.

@node Backing up
@subsection Backing up and recovering an InnoDB database
The key to safe database management is taking regular backups.
To take a 'binary' backup of your database you have to do the following:
@itemize @bullet
@item
Shut down your MySQL database and make sure it shuts down without errors.
@item
Copy all your data files into a safe place.
@item
Copy all your InnoDB log files to a safe place.
@item
Copy your @file{my.cnf} configuration file(s) to a safe place.
@item
Copy all the @file{.frm} files for your InnoDB tables into a
safe place.
@end itemize
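A minimal sketch of these steps, assuming the hypothetical Linux layout
from the configuration example earlier in this section:

@example
mysqladmin shutdown
cp /ibdata/ibdata1 /dr2/ibdata/ibdata2 /safe-place/
cp /dr3/ib_logfile0 /dr3/ib_logfile1 /dr3/ib_logfile2 /safe-place/
cp /etc/my.cnf /safe-place/
cp /usr/local/mysql/data/test/*.frm /safe-place/frm-files/
@end example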
There is currently no on-line or incremental backup tool available for
InnoDB, though they are on the TODO list.

In addition to taking the binary backups described above,
you should also regularly take dumps of your tables with
@file{mysqldump}. The reason for this is that a binary file
may be corrupted without you noticing it. Dumped tables are stored
into text files which are human-readable and much simpler than
database binary files. Seeing table corruption from dumped files
is easier, and since their format is simpler, the chance for
serious data corruption in them is smaller.
A good idea is to take the dumps at the same time you take a binary
backup of your database. You have to shut out all clients from your
database to get a consistent snapshot of all your tables into your
dumps. Then you can take the binary backup, and you will then have
a consistent snapshot of your database in two formats.

To be able to recover your InnoDB database to the present from the
binary backup described above, you have to run your MySQL database
with the general logging and log archiving of MySQL switched on. Here
by the general logging we mean the logging mechanism of the MySQL server
which is independent of InnoDB logs.

To recover from a crash of your MySQL server process, the only thing
you have to do is to restart it. InnoDB will automatically check the
logs and perform a roll-forward of the database to the present.
InnoDB will automatically roll back uncommitted transactions which were
present at the time of the crash. During recovery, InnoDB will print
out something like the following:
@example
~/mysqlm/sql > mysqld
InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 0 13674004
InnoDB: Doing recovery: scanned up to log sequence number 0 13739520
InnoDB: Doing recovery: scanned up to log sequence number 0 13805056
InnoDB: Doing recovery: scanned up to log sequence number 0 13870592
InnoDB: Doing recovery: scanned up to log sequence number 0 13936128
...
InnoDB: Doing recovery: scanned up to log sequence number 0 20555264
InnoDB: Doing recovery: scanned up to log sequence number 0 20620800
InnoDB: Doing recovery: scanned up to log sequence number 0 20664692
InnoDB: 1 uncommitted transaction(s) which must be rolled back
InnoDB: Starting rollback of uncommitted transactions
InnoDB: Rolling back trx no 16745
InnoDB: Rolling back of trx no 16745 completed
InnoDB: Rollback of uncommitted transactions completed
InnoDB: Starting an apply batch of log records to the database...
InnoDB: Apply batch completed
InnoDB: Started
mysqld: ready for connections
@end example
If your database gets corrupted or your disk fails, you have
to do the recovery from a backup. In the case of corruption, you should
first find a backup which is not corrupted. From a backup, do the recovery
using the general log files of MySQL according to the instructions in the
MySQL manual.
@subsubsection Checkpoints
InnoDB implements a checkpoint mechanism called a fuzzy
checkpoint. InnoDB will flush modified database pages from the buffer
pool in small batches; there is no need to flush the buffer pool
in one single batch, which would in practice stop processing
of user SQL statements for a while.
In crash recovery InnoDB looks for a checkpoint label written
to the log files. It knows that all modifications to the database
before the label are already present on the disk image of the database.
Then InnoDB scans the log files forward from the place of the checkpoint
applying the logged modifications to the database.
InnoDB writes to the log files in a circular fashion.
All committed modifications which make the database pages in the buffer
pool different from the images on disk must be available in the log files
in case InnoDB has to do a recovery. This means that when InnoDB starts
to reuse a log file in the circular fashion, it has to make sure that the
database page images on disk already contain the modifications
logged in the log file InnoDB is going to reuse. In other words, InnoDB
has to make a checkpoint and often this involves flushing of
modified database pages to disk.
The above explains why making your log files very big may save
disk i/o in checkpointing. It can make sense to set
the total size of the log files as big as the buffer pool or even bigger.
The drawback in big log files is that crash recovery can last longer
because there will be more log to apply to the database.
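For example, with the 400M buffer pool of the Linux configuration
earlier in this section, a sketch of a matching log configuration in
@file{my.cnf}:

@example
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=130M
@end example

Three 130M log files give roughly 400M of total log space, about the
size of the buffer pool.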
@node Moving
@subsection Moving an InnoDB database to another machine
InnoDB data and log files are binary-compatible on all platforms
if the floating point number format on the machines is the same.
You can move an InnoDB database simply by copying all the relevant
files, which we already listed in the previous section on backing up
a database. If the floating point formats on the machines are
different but you have not used @code{FLOAT} or @code{DOUBLE}
data types in your tables then the procedure is the same: just copy
the relevant files. If the formats are different and your tables
contain floating point data, you have to use @file{mysqldump}
and @file{mysqlimport} to move those tables.
A performance tip is to switch off the auto commit when you import
data into your database, assuming your tablespace has enough space for
the big rollback segment that the big import transaction will generate.
Do the commit only after importing a whole table or a segment of
a table.
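A sketch of such an import session (the @code{CUSTOMER} rows here are
hypothetical):

@example
SET AUTOCOMMIT=0;
INSERT INTO CUSTOMER VALUES (1, 'Jones');
INSERT INTO CUSTOMER VALUES (2, 'Smith');
-- ... the rest of the rows of this table ...
COMMIT;
@end example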
@node InnoDB transaction model
@subsection InnoDB transaction model
In the InnoDB transaction model the goal has been to combine the best
properties of a multiversioning database with traditional two-phase
locking. InnoDB does locking on row level and runs queries by default
as non-locking consistent reads, in the style of Oracle.
The lock table in InnoDB is stored so space-efficiently that lock
escalation is not needed: typically several users are allowed
to lock every row in the database, or any random subset of the rows,
without InnoDB running out of memory.
In InnoDB all user activity happens inside transactions. If the
auto commit mode is used in MySQL, then each SQL statement
will form a single transaction. If the auto commit mode is
switched off, then we can think of a user as always having a
transaction open. If he issues
the SQL @code{COMMIT} or @code{ROLLBACK} statement, that
ends the current transaction, and a new one starts. Both statements
will release all InnoDB locks that were set during the
current transaction. A @code{COMMIT} means that the
changes made in the current transaction are made permanent
and become visible to other users. A @code{ROLLBACK}
on the other hand cancels all modifications made by the current
transaction.
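A sketch of one explicit transaction under these rules, using the
@code{CUSTOMER} table from the earlier example:

@example
SET AUTOCOMMIT=0;
INSERT INTO CUSTOMER VALUES (10, 'Heikki');
UPDATE CUSTOMER SET B = 'Tuuri' WHERE A = 10;
COMMIT;
@end example

The @code{COMMIT} makes both changes permanent and releases all InnoDB
locks set during the transaction.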
@subsubsection Consistent read
A consistent read means that InnoDB uses its multiversioning to
present to a query a snapshot of the database at a point in time.
The query will see the changes made by exactly those transactions that
committed before that point of time, and no changes made by later
or uncommitted transactions. The exception to this rule is that the
query will see the changes made by the transaction itself which issues
the query.
When a transaction issues its first consistent read, InnoDB assigns
the snapshot, or the point of time, which all consistent reads in the
same transaction will use. The snapshot contains the changes of all
transactions that committed before the snapshot was assigned. Thus the
consistent reads within the same transaction will also be consistent
with respect to each other. You can get a fresher snapshot for your
queries by committing the current transaction and after that issuing
new queries.
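A sketch of this behavior, assuming the auto commit mode is switched
off:

@example
SELECT * FROM CUSTOMER;  -- the first read assigns the snapshot
SELECT * FROM CUSTOMER;  -- same snapshot; changes committed by others
                         -- in between are not visible
COMMIT;                  -- ends the transaction
SELECT * FROM CUSTOMER;  -- a new transaction sees a fresher snapshot
@end example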
Consistent read is the default mode in which InnoDB processes
@code{SELECT} statements. A consistent read does not set any locks
on the tables it accesses, and therefore other users are free to
modify those tables at the same time a consistent read is being performed
on the table.
@subsubsection Locking reads
A consistent read is not convenient in some circumstances.
Suppose you want to add a new row into your table @code{CHILD},
and make sure that the child already has a parent in table
@code{PARENT}.
Suppose you use a consistent read to read the table @code{PARENT}
and indeed see the parent of the child in the table. Can you now safely
add the child row to table @code{CHILD}? No, because it may
happen that meanwhile some other user has deleted the parent row
from the table @code{PARENT}, and you are not aware of that.
The solution is to perform the @code{SELECT} in a locking
mode, @code{IN SHARE MODE}.
@example
SELECT * FROM PARENT WHERE NAME = 'Jones' IN SHARE MODE;
@end example
Performing a read in share mode means that we read the latest
available data, and set a shared mode lock on the rows we read.
If the latest data belongs to a yet uncommitted transaction of another
user, we will wait until that transaction commits.
A shared mode lock prevents others from updating or deleting
the row we have read. After we see that the above query returns
the parent @code{'Jones'}, we can safely add his child
to table @code{CHILD}, and commit our transaction.
This example shows how to implement referential
integrity in your application code.
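The whole sequence would look like the following sketch (the
@code{CHILD} columns here are hypothetical):

@example
SELECT * FROM PARENT WHERE NAME = 'Jones' IN SHARE MODE;
-- the parent row is now share locked and cannot be deleted
INSERT INTO CHILD (ID, PARENT_NAME) VALUES (1, 'Jones');
COMMIT;
@end example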
Let us look at another example: we have an integer counter field in
a table @code{CHILD_CODES} which we use to assign
a unique identifier to each child we add to table @code{CHILD}.
Obviously, using a consistent read or a shared mode read
to read the present value of the counter is not a good idea, since
then two users of the database may see the same value for the
counter, and we will get a duplicate key error when we add
the two children with the same identifier to the table.
In this case there are two good ways to implement the
reading and incrementing of the counter: (1) update the counter
first by incrementing it by 1 and only after that read it,
or (2) read the counter first with
a lock mode @code{FOR UPDATE}, and increment after that:
@example
SELECT COUNTER_FIELD FROM CHILD_CODES FOR UPDATE;
UPDATE CHILD_CODES SET COUNTER_FIELD = COUNTER_FIELD + 1;
@end example
A @code{SELECT ... FOR UPDATE} will read the latest
available data setting exclusive locks on each row it reads.
Thus it sets the same locks a searched SQL @code{UPDATE} would set
on the rows.
@subsubsection Next-key locking: avoiding the 'phantom problem'
In row level locking InnoDB uses an algorithm called next-key locking.
InnoDB does the row level locking so that when it searches or
scans an index of a table, it sets shared or exclusive locks
on the index records it encounters. Thus the row level locks are
more precisely called index record locks.
The locks InnoDB sets on index records also affect the 'gap'
before that index record. If a user has a shared or exclusive
lock on record R in an index, then another user cannot insert
a new index record immediately before R in the index order.
This locking of gaps is done to prevent the so-called phantom
problem. Suppose I want to read and lock all children with identifier
bigger than 100 from table @code{CHILD},
and update some field in the selected rows.
@example
SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;
@end example

Suppose there is an index on table @code{CHILD} on column
@code{ID}. Our query will scan that index starting from
the first record where @code{ID} is bigger than 100.
Now, if the locks set on the index records did not lock out
inserts made in the gaps, a new child might meanwhile be
inserted into the table. If I now execute in my transaction
@example
SELECT * FROM CHILD WHERE ID > 100 FOR UPDATE;
@end example
again, I will see a new child in the result set the query returns.
This is against the isolation principle of transactions:
a transaction should be able to run so that the data
it has read does not change during the transaction. If we regard
a set of rows as a data item, then the new 'phantom' child would break
this isolation principle.
When InnoDB scans an index it can also lock the gap
after the last record in the index. Just that happens in the previous
example: the locks set by InnoDB will prevent any insert to
the table where @code{ID} would be bigger than 100.
You can use next-key locking to implement a uniqueness
check in your application: if you read your data in share mode
and do not see a duplicate for a row you are going to insert,
then you can safely insert your row and know that the next-key
lock set on the successor of your row during the read will prevent
anyone from inserting a duplicate of your row in the meantime. Thus
the next-key locking allows you to 'lock' the non-existence of
something in your table.
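As a minimal sketch of this technique, assuming a hypothetical table
@code{CHILD(ID INT, NAME CHAR(20))} with an index on @code{ID}:
@example
SELECT * FROM CHILD WHERE ID = 4711 LOCK IN SHARE MODE;
INSERT INTO CHILD VALUES (4711, 'Smith');
@end example
If the @code{SELECT} returns no rows, the insert is safe: the next-key
lock set during the read prevents other users from inserting a row with
@code{ID} 4711 until our transaction ends.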
@subsubsection Locks set by different SQL statements in InnoDB
@itemize @bullet
@item
@code{SELECT ... FROM ...} : this is a consistent read, reading a
snapshot of the database and setting no locks.
@item
@code{SELECT ... FROM ... LOCK IN SHARE MODE} : sets shared next-key locks
on all index records the read encounters.
@item
@code{SELECT ... FROM ... FOR UPDATE} : sets exclusive next-key locks
on all index records the read encounters.
@item
@code{INSERT INTO ... VALUES (...)} : sets an exclusive lock
on the inserted row; note that this lock is not a next-key lock
and does not prevent other users from inserting to the gap before the
inserted row. If a duplicate key error occurs, sets a shared lock
on the duplicate index record.
@item
@code{INSERT INTO T SELECT ... FROM S WHERE ...} sets an exclusive
(non-next-key) lock on each row inserted into @code{T}. Does
the search on @code{S} as a consistent read, but sets shared next-key
locks on @code{S} if the MySQL logging is on. InnoDB has to set
locks in the latter case because in roll-forward recovery from a
backup every SQL statement has to be executed in exactly the same
way as it was done originally.
@item
@code{CREATE TABLE ... SELECT ...} performs the @code{SELECT}
as a consistent read or with shared locks, like in the previous
item.
@item
@code{REPLACE} is done like an insert if there is no collision
on a unique key. Otherwise, an exclusive next-key lock is placed
on the row which has to be updated.
@item
@code{UPDATE ... SET ... WHERE ...} : sets an exclusive next-key
lock on every record the search encounters.
@item
@code{DELETE FROM ... WHERE ...} : sets an exclusive next-key
lock on every record the search encounters.
@item
@code{LOCK TABLES ...} : sets table locks. In the implementation
the MySQL layer of code sets these locks. The automatic deadlock detection
of InnoDB cannot detect deadlocks where such table locks are involved:
see the next section below. See also the section on InnoDB restrictions
about the following: since MySQL does not know about row level locks,
it is possible that you get a table lock on a table where another user
currently has row level locks. But that does not put transaction
integrity into danger.
@end itemize
@subsubsection Deadlock detection and rollback
InnoDB automatically detects a deadlock of transactions and rolls
back the transaction whose lock request was the last one to build
a deadlock, that is, a cycle in the waits-for graph of transactions.
InnoDB cannot detect deadlocks where a lock set by a MySQL
@code{LOCK TABLES} statement is involved, or if a lock set
in a table handler other than InnoDB is involved. You have to resolve
such situations by setting @code{innodb_lock_wait_timeout} in
@file{my.cnf}.
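As an illustration, a lock wait timeout of 50 seconds might be set in
@file{my.cnf} along these lines (a sketch only; the exact option syntax
depends on your MySQL version):
@example
[mysqld]
set-variable = innodb_lock_wait_timeout=50
@end example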
When InnoDB performs a complete rollback of a transaction, all the
locks of the transaction are released. However, if just a single SQL
statement is rolled back as a result of an error, some of the locks
set by the SQL statement may be preserved. This is because InnoDB
stores row locks in a format where it cannot afterwards know which
lock was set by which SQL statement.
@node Implementation
@subsection Implementation of multiversioning
Since InnoDB is a multiversioned database, it must keep information
about old versions of rows in the tablespace. This information is stored
in a data structure we call a rollback segment, after an analogous
data structure in Oracle.
InnoDB internally adds two fields to each row stored in the database.
A 6-byte field contains the transaction identifier of the last
transaction which inserted or updated the row. Also a deletion
is internally treated as an update where a special bit in the row
is set to mark it as deleted. Each row also contains a 7-byte
field called the roll pointer. The roll pointer points to an
undo log record written to the rollback segment. If the row was
updated, then the undo log record contains the information necessary
to rebuild the content of the row before it was updated.
InnoDB uses the information in the rollback segment to perform the
undo operations needed in a transaction rollback. It also uses the
information to build earlier versions of a row for a consistent
read.
Undo logs in the rollback segment are divided into insert and update
undo logs. Insert undo logs are only needed in transaction rollback
and can be discarded as soon as the transaction commits. Update undo logs
are also used in consistent reads, and they can be discarded only after
there is no transaction present for which InnoDB has assigned
a snapshot that in a consistent read could need the information
in the update undo log to build an earlier version of a database
row.
You must remember to commit your transactions regularly. Otherwise
InnoDB cannot discard data from the update undo logs, and the
rollback segment may grow too big, filling up your tablespace.
The physical size of an undo log record in the rollback segment
is typically smaller than the corresponding inserted or updated
row. You can use this information to calculate the space needed
for your rollback segment.
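As a rough worked example (the figures are assumptions made up for the
illustration): if a transaction updates 1,000,000 rows and each update
undo log record takes about 100 bytes, the rollback segment must hold
about
@example
1,000,000 x 100 bytes = 100 MB
@end example
of undo data until the transaction commits, and possibly longer if
other transactions still need the old row versions for consistent
reads.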
In our multiversioning scheme a row is not physically removed from
the database immediately when you delete it with an SQL statement.
Only when InnoDB can discard the update undo log record written for
the deletion can it also physically remove the corresponding row and
its index records from the database. This removal operation is
called a purge, and it is quite fast, usually taking the same order of
time as the SQL statement which did the deletion.
@node Table and index
@subsection Table and index structures
Every InnoDB table has a special index called the clustered index
where the data of the rows is stored. If you define a
@code{PRIMARY KEY} on your table, then the index of the primary key
will be the clustered index.
If you do not define a primary key for
your table, InnoDB will internally generate a clustered index
where the rows are ordered by the row id InnoDB assigns
to the rows in such a table. The row id is a 6-byte field which
monotonically increases as new rows are inserted. Thus the rows
ordered by the row id will be physically in the insertion order.
Accessing a row through the clustered index is fast, because
the row data will be on the same page where the index search
leads us. In many databases the data is traditionally stored on a different
page from the index record. If a table is large, the clustered
index architecture often saves a disk i/o when compared to the
traditional solution.
The records in non-clustered indexes (we also call them secondary
indexes) in InnoDB contain the primary key value for the row. InnoDB
uses this primary key value to search for the row in the clustered
index. Note that if the primary key is long, the secondary indexes
will use more space.
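For instance, in a hypothetical table like the following, the index on
@code{ID} becomes the clustered index, and every record of the secondary
index @code{NAME_IND} also carries the @code{ID} value of its row:
@example
CREATE TABLE CHILD (
  ID    INT NOT NULL,
  NAME  CHAR(20),
  PRIMARY KEY (ID),
  INDEX NAME_IND (NAME)
) TYPE = InnoDB;
@end example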
@subsubsection Physical structure of an index
All indexes in InnoDB are B-trees where the index records are
stored in the leaf pages of the tree. The default size of an index
page is 16 kB. When new records are inserted, InnoDB tries to
leave 1/16 of the page free for future insertions and updates
of the index records.
If index records are inserted in a sequential (ascending or descending)
order, the resulting index pages will be about 15/16 full.
If records are inserted in a random order, then the pages will be
from 1/2 to 15/16 full. If the fillfactor of an index page drops below 1/4,
InnoDB will try to contract the index tree to free the page.
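To make these fill factors concrete, the arithmetic for the default
16 kB page size is simply:
@example
sequential inserts:  ~15/16 full  =  ~15 kB of index records per page
random inserts:      1/2 to 15/16 full  =  8 kB to 15 kB per page
contraction begins:  below 1/4 full  =  below 4 kB per page
@end example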
@subsubsection Insert buffering
It is a common situation in a database application that the
primary key is a unique identifier and new rows are inserted in the
ascending order of the primary key. Thus the insertions to the
clustered index do not require random reads from a disk.
On the other hand, secondary indexes are usually non-unique and
insertions happen in a relatively random order into secondary indexes.
This would cause a lot of random disk i/o's without a special mechanism
used in InnoDB.
If an index record should be inserted into a non-unique secondary index,
InnoDB checks whether the secondary index page is already in the buffer
pool. If it is, InnoDB will do the insertion directly into
the index page. But if the index page is not found in the buffer
pool, InnoDB inserts the record into a special insert buffer structure.
The insert buffer is kept so small that it entirely fits in the buffer
pool, and insertions can be made to it very fast.
The insert buffer is periodically merged to the secondary index
trees in the database. Often we can merge several insertions into the
same page of the index tree, and hence save disk i/o's.
It has been measured that the insert buffer can speed up insertions
to a table up to 15 times.
@subsubsection Adaptive hash indexes
If a database fits almost entirely in main memory, then the fastest way
to perform queries on it is to use hash indexes. InnoDB has an
automatic mechanism which monitors index searches made to the indexes
defined for a table, and if InnoDB notices that queries could
benefit from building a hash index, such an index is automatically
built.
But note that the hash index is always built based on an existing
B-tree index on the table. InnoDB can build a hash index on a prefix
of any length of the key defined for the B-tree, depending on
what search pattern InnoDB observes on the B-tree index.
A hash index can be partial: it is not required that the whole
B-tree index is cached in the buffer pool. InnoDB will build
hash indexes on demand for those pages of the index which are
often accessed.
In a sense, through the adaptive hash index mechanism InnoDB adapts itself
to ample main memory, coming closer to the architecture of main memory
databases.
@subsubsection Physical record structure
@itemize @bullet
@item
Each index record in InnoDB contains a header of 6 bytes. The header
is used to link consecutive records together, and also in the row level
locking.
@item
Records in the clustered index contain fields for all user-defined
columns. In addition, there is a 6-byte field for the transaction id
and a 7-byte field for the roll pointer.
@item
If the user has not defined a primary key for a table, then each clustered
index record also contains a 6-byte row id field.
@item
Each secondary index record also contains all the fields defined
for the clustered index key.
@item
A record also contains a pointer to each field of the record.
If the total length of the fields in a record is less than 256 bytes,
then each pointer is 1 byte; otherwise 2 bytes.
@end itemize
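Putting these rules together, a rough size estimate for a clustered
index record of a hypothetical table @code{T (A CHAR(20), B INT)} with
a user-defined primary key is (all fields together are under 256 bytes,
so each field pointer takes 1 byte):
@example
record header            6 bytes
field pointers (4 x 1)   4 bytes
transaction id           6 bytes
roll pointer             7 bytes
A  CHAR(20)             20 bytes
B  INT                   4 bytes
total                  ~47 bytes
@end example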
@node File space management
@subsection File space management and disk i/o
@subsubsection Disk i/o
In disk i/o InnoDB uses asynchronous i/o. On Windows NT
it uses the native asynchronous i/o provided by the operating system.
On Unixes InnoDB uses its own simulated asynchronous i/o:
InnoDB creates a number of i/o threads to take care
of i/o operations, such as read-ahead. In a future version we will
add support for simulated aio on Windows NT and native aio on those
Unixes which have one.
On Windows NT InnoDB uses non-buffered i/o. That means that the disk
pages InnoDB reads or writes are not buffered in the operating system
file cache. This saves some memory bandwidth.
You can also use a raw disk in InnoDB, though this has not been tested yet:
just define the raw disk in place of a data file in @file{my.cnf}.
You must give the exact size in bytes of the raw disk in @file{my.cnf},
because at startup InnoDB checks that the size of the file
is the same as specified in the configuration file. Using a raw disk,
you can perform non-buffered i/o on some Unixes.
There are two read-ahead heuristics in InnoDB: sequential read-ahead
and random read-ahead. In sequential read-ahead InnoDB notices that
the access pattern to a segment in the tablespace is sequential.
Then InnoDB will post in advance a batch of reads of database pages to the
i/o system. In random read-ahead InnoDB notices that some area
in a tablespace seems to be in the process of being
fully read into the buffer pool. Then InnoDB posts the remaining
reads to the i/o system.
@subsubsection File space management
The data files you define in the configuration file form the tablespace
of InnoDB. The files are simply catenated to form the tablespace;
there is no striping in use.
Currently you cannot directly specify where in the tablespace space is
allocated for your tables, except by using the following fact: from a
newly created tablespace InnoDB will allocate space starting from the
low end.
The tablespace consists of database pages whose default size is 16 kB.
The pages are grouped into extents of 64 consecutive pages. The 'files' inside
a tablespace are called segments in InnoDB. The name of the rollback
segment is somewhat misleading because it actually contains many
segments in the tablespace.
For each index in InnoDB we allocate two segments: one is for non-leaf
nodes of the B-tree, the other is for the leaf nodes. The idea here is
to achieve better sequentiality for the leaf nodes, which contain the
data.
When a segment grows inside the tablespace, InnoDB allocates the
first 32 pages to it individually. After that InnoDB starts
to allocate whole extents to the segment.
InnoDB can add to a large segment up to 4 extents at a time to ensure
good sequentiality of data.
Some pages in the tablespace contain bitmaps of other pages, and
therefore a few extents in an InnoDB tablespace cannot be
allocated to segments as a whole, but only as individual pages.
When you issue a query @code{SHOW TABLE STATUS FROM ... LIKE ...}
to ask for the available free space in the tablespace, InnoDB will
report the space which is certainly usable in totally free extents
of the tablespace. InnoDB always reserves some extents for
clean-up and other internal purposes; these reserved extents are not
included in the free space.
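For example, assuming a database @code{test} and a table @code{CHILD}:
@example
SHOW TABLE STATUS FROM test LIKE 'CHILD';
@end example
The free space reported by InnoDB appears in the output of this
statement.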
When you delete data from a table, InnoDB will contract the corresponding
B-tree indexes. Whether that frees individual pages or extents to the
tablespace, so that the freed space becomes available for other users,
depends on the pattern of deletes. Dropping a table or deleting
all rows from it is guaranteed to release the space to other users,
but remember that deleted rows can be physically removed only in a
purge operation after they are no longer needed in transaction rollback or
consistent read.
@node Error handling
@subsection Error handling
The error handling in InnoDB is not always the same as
specified in the ANSI SQL standards. According to the ANSI
standard, any error during an SQL statement should cause the
rollback of that statement. InnoDB sometimes rolls back only
part of the statement.
The following list specifies the error handling of InnoDB.
@itemize @bullet
@item
If you run out of file space in the tablespace,
you will get the MySQL @code{'Table is full'} error
and InnoDB rolls back the SQL statement.
@item
A transaction deadlock or a timeout in a lock wait will give
@code{'Table handler error 1000000'} and InnoDB rolls back
the SQL statement.
@item
A duplicate key error only rolls back the insert of that particular row,
even in a statement like @code{INSERT INTO ... SELECT ...}.
This will probably change so that the SQL statement will be rolled
back if you have not specified the @code{IGNORE} option in your
statement.
@item
A 'row too long' error rolls back the SQL statement.
@item
Other errors are mostly detected by the MySQL layer of code, and
they roll back the corresponding SQL statement.
@end itemize
@node InnoDB restrictions, InnoDB contact information, Error handling, InnoDB
@subsection Some restrictions on InnoDB tables
@itemize @bullet
@item
You cannot create an index on a prefix of a column:
@example
CREATE TABLE T (A CHAR(20), B INT, INDEX T_IND (A(5))) TYPE = InnoDB;
@end example
The above will not work. For a MyISAM table the above would create an index
where only the first 5 characters from column @code{A} are stored.
@item
@code{INSERT DELAYED} is not supported for InnoDB tables.
@item
The MySQL @code{LOCK TABLES} operation does not know of InnoDB
row level locks set in already completed SQL statements: this means that
you can get a table lock on a table even if there still exist transactions
of other users which have row level locks on the same table. Thus
your operations on the table may have to wait if they collide with
these locks of other users. Also a deadlock is possible. However,
this does not endanger transaction integrity, because the row level
locks set by InnoDB will always take care of the integrity.
Also, a table lock prevents other transactions from acquiring more
row level locks (in a conflicting lock mode) on the table.
@item
You cannot have a key on a @code{BLOB} or @code{TEXT} column.
@item
A table cannot contain more than 1000 columns.
@item
@code{DELETE FROM TABLE} does not regenerate the table but instead
deletes all rows, one by one, which is not that fast. In future versions
of MySQL you can use @code{TRUNCATE}, which is fast.
@item
Before dropping a database with InnoDB tables one has to drop
the individual InnoDB tables first.
@item
The default database page size in InnoDB is 16 kB. By recompiling the
code one can set it from 8 kB to 64 kB.
The maximum row length is slightly less than half of a database page;
the row length also includes @code{BLOB} and @code{TEXT} type
columns. The restriction on the size of @code{BLOB} and
@code{TEXT} columns will be removed in a future version of InnoDB,
by June 2001.
@item
The maximum data or log file size is 2 GB or 4 GB depending on how large
files your operating system supports. Support for > 4 GB files will
be added to InnoDB in a future version.
@item
The maximum tablespace size is 4 billion database pages. This is also
the maximum size for a table.
@end itemize
@node InnoDB contact information, , InnoDB restrictions, InnoDB
@subsection InnoDB contact information
Contact information of Innobase Oy, producer of the InnoDB engine:
@example
Website: www.innobase.fi
Heikki.Tuuri@@innobase.inet.fi
phone: 358-9-6969 3250 (office) 358-40-5617367 (mobile)
InnoDB Oy Inc.
World Trade Center Helsinki
Aleksanterinkatu 17
P.O.Box 800
00101 Helsinki
Finland
@end example
@cindex tutorial
@cindex terminal monitor, defined
@cindex monitor, terminal
...@@ -42996,7 +43893,7 @@ Fixed a bug when using @code{HEAP} tables with @code{LIKE}.
@item
Added @code{--mysql-version} to @code{safe_mysqld}
@item
Changed @code{INNOBASE} to @code{InnoDB} (because the @code{INNOBASE}
name was already used). All @code{configure} options and @code{mysqld}
start options are now using @code{innodb} instead of @code{innobase}. This
means that you have to change any configuration files where you have used