Commit b8d05b19 authored by Rusty Russell

tdb2: update design doc.

parent 44eea6ca
...@@ -53,8 +53,8 @@ Rusty Russell, IBM Corporation ...@@ -53,8 +53,8 @@ Rusty Russell, IBM Corporation
\change_deleted 0 1283307542 \change_deleted 0 1283307542
26-July 26-July
\change_inserted 0 1284016854 \change_inserted 0 1284423485
9-September 14-September
\change_unchanged \change_unchanged
-2010 -2010
\end_layout \end_layout
...@@ -476,6 +476,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional ...@@ -476,6 +476,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional
\begin_layout Subsubsection \begin_layout Subsubsection
Proposed Solution Proposed Solution
\change_inserted 0 1284422789
\begin_inset CommandInset label
LatexCommand label
name "attributes"
\end_inset
\change_unchanged
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
...@@ -1289,13 +1300,69 @@ Proposed Solution ...@@ -1289,13 +1300,69 @@ Proposed Solution
\begin_layout Standard \begin_layout Standard
\change_inserted 0 1284016847 \change_inserted 0 1284422552
We often have extra padding at the tail of a record. We often have extra padding at the tail of a record.
If we ensure that the first byte (if any) of this padding is zero, we will If we ensure that the first byte (if any) of this padding is zero, we will
have a way for future changes to detect code which doesn't understand a have a way for future changes to detect code which doesn't understand a
new format: the new code would write (say) a 1 at the tail, and thus if new format: the new code would write (say) a 1 at the tail, and thus if
there is no tail or the first byte is 0, we would know the extension is there is no tail or the first byte is 0, we would know the extension is
not present on that record. not present on that record.
\end_layout
\begin_layout Subsection
\change_inserted 0 1284422568
TDB Does Not Use Talloc
\end_layout
\begin_layout Standard
\change_inserted 0 1284422646
Many users of TDB (particularly Samba) use the talloc allocator, and thus
have to wrap TDB in a talloc context to use it conveniently.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284422656
Proposed Solution
\end_layout
\begin_layout Standard
\change_inserted 0 1284423065
The allocation within TDB is not complicated enough to justify the use of
talloc, and I am reluctant to force another (excellent) library on TDB
users.
Nonetheless a compromise is possible.
An attribute (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) can be added later to tdb_open() to provide an alternate allocation mechanism,
specifically for talloc but usable by any other allocator (which would
ignore the
\begin_inset Quotes eld
\end_inset
context
\begin_inset Quotes erd
\end_inset
argument).
\end_layout
\begin_layout Standard
\change_inserted 0 1284423042
This would form a talloc hierarchy as expected, but the caller would still
have to attach a destructor to the tdb context returned from tdb_open to
close it.
All TDB_DATA fields would be children of the tdb_context, and the caller
would still have to manage them (using talloc_free() or talloc_steal()).
\change_unchanged \change_unchanged
\end_layout \end_layout
...@@ -1875,7 +1942,7 @@ status open ...@@ -1875,7 +1942,7 @@ status open
\begin_layout Plain Layout \begin_layout Plain Layout
\change_inserted 0 1283310945 \change_inserted 0 1284424151
Using Using
\begin_inset Formula $2^{16+N*3}$ \begin_inset Formula $2^{16+N*3}$
\end_inset \end_inset
...@@ -1886,6 +1953,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal ...@@ -1886,6 +1953,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal
byte zone. byte zone.
Zones range in factor of 8 steps. Zones range in factor of 8 steps.
Given the zone size for the zone the current record is in, we can determine
the start of the zone.
\change_unchanged \change_unchanged
\end_layout \end_layout
...@@ -2330,6 +2399,8 @@ TDB Does Not Have Snapshot Support ...@@ -2330,6 +2399,8 @@ TDB Does Not Have Snapshot Support
\begin_layout Subsubsection \begin_layout Subsubsection
Proposed Solution Proposed Solution
\change_deleted 0 1284423472
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
...@@ -2342,7 +2413,23 @@ use a real database ...@@ -2342,7 +2413,23 @@ use a real database
\begin_inset Quotes erd \begin_inset Quotes erd
\end_inset \end_inset
\change_inserted 0 1284423891
\change_deleted 0 1284423891
. .
\change_inserted 0 1284423901
(but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
...@@ -2365,6 +2452,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack ...@@ -2365,6 +2452,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack
\begin_layout Standard \begin_layout Standard
We could then implement snapshots using a similar method, using multiple We could then implement snapshots using a similar method, using multiple
different hash tables/free tables. different hash tables/free tables.
\change_inserted 0 1284423495
\end_layout \end_layout
\begin_layout Subsection \begin_layout Subsection
...@@ -2384,6 +2473,18 @@ Proposed Solution ...@@ -2384,6 +2473,18 @@ Proposed Solution
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
\change_inserted 0 1284424201
None (but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
We could solve a small part of the problem by providing read-only transactions. We could solve a small part of the problem by providing read-only transactions.
These would allow one write transaction to begin, but it could not commit These would allow one write transaction to begin, but it could not commit
until all r/o transactions are done. until all r/o transactions are done.
...@@ -2569,6 +2670,53 @@ At some later point, a sync would allow recovery of the old data into the ...@@ -2569,6 +2670,53 @@ At some later point, a sync would allow recovery of the old data into the
free lists (perhaps when the array of top-level pointers filled). free lists (perhaps when the array of top-level pointers filled).
On crash, tdb_open() would examine the array of top levels, and apply the On crash, tdb_open() would examine the array of top levels, and apply the
transactions until it encountered an invalid checksum. transactions until it encountered an invalid checksum.
\change_inserted 0 1284423555
\end_layout
\begin_layout Subsection
\change_inserted 0 1284423617
Tracing Is Fragile, Replay Is External
\end_layout
\begin_layout Standard
\change_inserted 0 1284423719
The current TDB has compile-time-enabled tracing code, but it often breaks
as it is not enabled by default.
In a similar way, the ctdb code has an external wrapper which does replay
tracing so it can coordinate cluster-wide transactions.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284423864
Proposed Solution
\begin_inset CommandInset label
LatexCommand label
name "replay-attribute"
\end_inset
\end_layout
\begin_layout Standard
\change_inserted 0 1284423850
Tridge points out that an attribute can be later added to tdb_open (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) to provide replay/trace hooks, which could become the basis for this and
future parallel transactions and snapshot support.
\change_unchanged
\end_layout \end_layout
\end_body \end_body
......
head 1.9; head 1.10;
access; access;
symbols; symbols;
locks; strict; locks; strict;
comment @# @; comment @# @;
1.10
date 2010.09.14.00.33.57; author rusty; state Exp;
branches;
next 1.9;
1.9 1.9
date 2010.09.09.07.25.12; author rusty; state Exp; date 2010.09.09.07.25.12; author rusty; state Exp;
branches; branches;
...@@ -56,9 +61,9 @@ desc ...@@ -56,9 +61,9 @@ desc
@ @
1.9 1.10
log log
@Extension mechanism. @Tracing attribute, talloc support.
@ @
text text
@#LyX 1.6.5 created this file. For more info see http://www.lyx.org/ @#LyX 1.6.5 created this file. For more info see http://www.lyx.org/
...@@ -116,8 +121,8 @@ Rusty Russell, IBM Corporation ...@@ -116,8 +121,8 @@ Rusty Russell, IBM Corporation
\change_deleted 0 1283307542 \change_deleted 0 1283307542
26-July 26-July
\change_inserted 0 1284016854 \change_inserted 0 1284423485
9-September 14-September
\change_unchanged \change_unchanged
-2010 -2010
\end_layout \end_layout
...@@ -539,6 +544,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional ...@@ -539,6 +544,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional
\begin_layout Subsubsection \begin_layout Subsubsection
Proposed Solution Proposed Solution
\change_inserted 0 1284422789
\begin_inset CommandInset label
LatexCommand label
name "attributes"
\end_inset
\change_unchanged
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
...@@ -1352,13 +1368,69 @@ Proposed Solution ...@@ -1352,13 +1368,69 @@ Proposed Solution
\begin_layout Standard \begin_layout Standard
\change_inserted 0 1284016847 \change_inserted 0 1284422552
We often have extra padding at the tail of a record. We often have extra padding at the tail of a record.
If we ensure that the first byte (if any) of this padding is zero, we will If we ensure that the first byte (if any) of this padding is zero, we will
have a way for future changes to detect code which doesn't understand a have a way for future changes to detect code which doesn't understand a
new format: the new code would write (say) a 1 at the tail, and thus if new format: the new code would write (say) a 1 at the tail, and thus if
there is no tail or the first byte is 0, we would know the extension is there is no tail or the first byte is 0, we would know the extension is
not present on that record. not present on that record.
\end_layout
\begin_layout Subsection
\change_inserted 0 1284422568
TDB Does Not Use Talloc
\end_layout
\begin_layout Standard
\change_inserted 0 1284422646
Many users of TDB (particularly Samba) use the talloc allocator, and thus
have to wrap TDB in a talloc context to use it conveniently.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284422656
Proposed Solution
\end_layout
\begin_layout Standard
\change_inserted 0 1284423065
The allocation within TDB is not complicated enough to justify the use of
talloc, and I am reluctant to force another (excellent) library on TDB
users.
Nonetheless a compromise is possible.
An attribute (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) can be added later to tdb_open() to provide an alternate allocation mechanism,
specifically for talloc but usable by any other allocator (which would
ignore the
\begin_inset Quotes eld
\end_inset
context
\begin_inset Quotes erd
\end_inset
argument).
\end_layout
\begin_layout Standard
\change_inserted 0 1284423042
This would form a talloc hierarchy as expected, but the caller would still
have to attach a destructor to the tdb context returned from tdb_open to
close it.
All TDB_DATA fields would be children of the tdb_context, and the caller
would still have to manage them (using talloc_free() or talloc_steal()).
\change_unchanged \change_unchanged
\end_layout \end_layout
...@@ -1938,7 +2010,7 @@ status open ...@@ -1938,7 +2010,7 @@ status open
\begin_layout Plain Layout \begin_layout Plain Layout
\change_inserted 0 1283310945 \change_inserted 0 1284424151
Using Using
\begin_inset Formula $2^{16+N*3}$ \begin_inset Formula $2^{16+N*3}$
\end_inset \end_inset
...@@ -1949,6 +2021,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal ...@@ -1949,6 +2021,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal
byte zone. byte zone.
Zones range in factor of 8 steps. Zones range in factor of 8 steps.
Given the zone size for the zone the current record is in, we can determine
the start of the zone.
\change_unchanged \change_unchanged
\end_layout \end_layout
...@@ -2393,6 +2467,8 @@ TDB Does Not Have Snapshot Support ...@@ -2393,6 +2467,8 @@ TDB Does Not Have Snapshot Support
\begin_layout Subsubsection \begin_layout Subsubsection
Proposed Solution Proposed Solution
\change_deleted 0 1284423472
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
...@@ -2405,7 +2481,23 @@ use a real database ...@@ -2405,7 +2481,23 @@ use a real database
\begin_inset Quotes erd \begin_inset Quotes erd
\end_inset \end_inset
\change_inserted 0 1284423891
\change_deleted 0 1284423891
. .
\change_inserted 0 1284423901
(but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
...@@ -2428,6 +2520,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack ...@@ -2428,6 +2520,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack
\begin_layout Standard \begin_layout Standard
We could then implement snapshots using a similar method, using multiple We could then implement snapshots using a similar method, using multiple
different hash tables/free tables. different hash tables/free tables.
\change_inserted 0 1284423495
\end_layout \end_layout
\begin_layout Subsection \begin_layout Subsection
...@@ -2447,6 +2541,18 @@ Proposed Solution ...@@ -2447,6 +2541,18 @@ Proposed Solution
\end_layout \end_layout
\begin_layout Standard \begin_layout Standard
\change_inserted 0 1284424201
None (but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
We could solve a small part of the problem by providing read-only transactions. We could solve a small part of the problem by providing read-only transactions.
These would allow one write transaction to begin, but it could not commit These would allow one write transaction to begin, but it could not commit
until all r/o transactions are done. until all r/o transactions are done.
...@@ -2632,6 +2738,53 @@ At some later point, a sync would allow recovery of the old data into the ...@@ -2632,6 +2738,53 @@ At some later point, a sync would allow recovery of the old data into the
free lists (perhaps when the array of top-level pointers filled). free lists (perhaps when the array of top-level pointers filled).
On crash, tdb_open() would examine the array of top levels, and apply the On crash, tdb_open() would examine the array of top levels, and apply the
transactions until it encountered an invalid checksum. transactions until it encountered an invalid checksum.
\change_inserted 0 1284423555
\end_layout
\begin_layout Subsection
\change_inserted 0 1284423617
Tracing Is Fragile, Replay Is External
\end_layout
\begin_layout Standard
\change_inserted 0 1284423719
The current TDB has compile-time-enabled tracing code, but it often breaks
as it is not enabled by default.
In a similar way, the ctdb code has an external wrapper which does replay
tracing so it can coordinate cluster-wide transactions.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284423864
Proposed Solution
\begin_inset CommandInset label
LatexCommand label
name "replay-attribute"
\end_inset
\end_layout
\begin_layout Standard
\change_inserted 0 1284423850
Tridge points out that an attribute can be later added to tdb_open (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) to provide replay/trace hooks, which could become the basis for this and
future parallel transactions and snapshot support.
\change_unchanged
\end_layout \end_layout
\end_body \end_body
...@@ -2639,6 +2792,33 @@ At some later point, a sync would allow recovery of the old data into the ...@@ -2639,6 +2792,33 @@ At some later point, a sync would allow recovery of the old data into the
@ @
1.9
log
@Extension mechanism.
@
text
@d56 2
a57 2
\change_inserted 0 1284016854
9-September
d479 11
d1303 1
a1303 1
\change_inserted 0 1284016847
d1310 56
d1945 1
a1945 1
\change_inserted 0 1283310945
d1956 2
d2402 2
d2416 4
d2421 12
d2455 2
d2476 12
d2673 47
@
1.8 1.8
log log
@Remove bogus footnote @Remove bogus footnote
......
...@@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase ...@@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase
Rusty Russell, IBM Corporation Rusty Russell, IBM Corporation
9-September-2010 14-September-2010
Abstract Abstract
...@@ -74,7 +74,7 @@ optional hashing function and an optional logging function ...@@ -74,7 +74,7 @@ optional hashing function and an optional logging function
argument. Additional arguments to open would require the argument. Additional arguments to open would require the
introduction of a tdb_open_ex2 call etc. introduction of a tdb_open_ex2 call etc.
2.1.1 Proposed Solution 2.1.1 Proposed Solution<attributes>
tdb_open() will take a linked-list of attributes: tdb_open() will take a linked-list of attributes:
...@@ -519,6 +519,28 @@ understand a new format: the new code would write (say) a 1 at ...@@ -519,6 +519,28 @@ understand a new format: the new code would write (say) a 1 at
the tail, and thus if there is no tail or the first byte is 0, we the tail, and thus if there is no tail or the first byte is 0, we
would know the extension is not present on that record. would know the extension is not present on that record.
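
As an illustration of the padding-byte check described above, here is a minimal sketch. The function name and its parameters are hypothetical and do not come from the TDB source; it only assumes that the caller already knows where a record's tail padding starts and how long it is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of TDB: 'padding' points at the tail
 * padding of a record and 'padding_len' is the number of padding bytes
 * (possibly zero). */
static bool example_record_has_extension(const unsigned char *padding,
					  size_t padding_len)
{
	/* A NULL pointer is only acceptable when there is no padding. */
	assert(padding != NULL || padding_len == 0);

	/* No tail at all: the extension cannot be present. */
	if (padding_len == 0)
		return false;

	/* A zero first byte means the record was written by code which
	 * does not know about the extension. */
	return padding[0] != 0;
}
```

Under this sketch, a record written by old code (first padding byte 0) and a record with no padding are both reported as not carrying the extension.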
2.17 TDB Does Not Use Talloc
Many users of TDB (particularly Samba) use the talloc allocator,
and thus have to wrap TDB in a talloc context to use it
conveniently.
2.17.1 Proposed Solution
The allocation within TDB is not complicated enough to justify
the use of talloc, and I am reluctant to force another
(excellent) library on TDB users. Nonetheless a compromise is
possible. An attribute (see [attributes]) can be added later to
tdb_open() to provide an alternate allocation mechanism,
specifically for talloc but usable by any other allocator (which
would ignore the “context” argument).
This would form a talloc hierarchy as expected, but the caller
would still have to attach a destructor to the tdb context
returned from tdb_open to close it. All TDB_DATA fields would be
children of the tdb_context, and the caller would still have to
manage them (using talloc_free() or talloc_steal()).
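
One possible shape for such an allocator attribute is sketched below. Every identifier here (the `example_` types and functions) is hypothetical: the design text only says that an attribute could be added later, so this is a guess at how hooks with a "context" argument might sit in a linked list of attributes, not the actual TDB2 interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names below are hypothetical; none come from the real TDB API. */
enum example_attr_type {
	EXAMPLE_ATTR_ALLOCATOR
};

/* Common header so attributes can be chained in a linked list. */
struct example_attr_base {
	enum example_attr_type type;
	struct example_attr_base *next;
};

/* Allocation hooks; 'context' is handed back on every call. */
struct example_attr_allocator {
	struct example_attr_base base;
	void *(*alloc)(void *context, size_t len);
	void (*release)(void *context, void *ptr);
	void *context;
};

/* A malloc-backed allocator simply ignores the context argument. */
static void *example_malloc_alloc(void *context, size_t len)
{
	(void)context;
	return malloc(len);
}

static void example_malloc_release(void *context, void *ptr)
{
	(void)context;
	free(ptr);
}

/* Walk the list and return the first allocator attribute, or NULL. */
static struct example_attr_allocator *
example_find_allocator(struct example_attr_base *attrs)
{
	for (; attrs != NULL; attrs = attrs->next) {
		if (attrs->type == EXAMPLE_ATTR_ALLOCATOR)
			return (struct example_attr_allocator *)attrs;
	}
	return NULL;
}

/* Allocate via the attribute if one was supplied, else fall back to malloc. */
static void *example_alloc(struct example_attr_base *attrs, size_t len)
{
	struct example_attr_allocator *a = example_find_allocator(attrs);

	if (a == NULL)
		return malloc(len);
	assert(a->alloc != NULL);
	return a->alloc(a->context, len);
}
```

A talloc user would presumably supply hooks that pass the context through to talloc, while other allocators would ignore it as the malloc-backed pair does here.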
3 Performance And Scalability Issues 3 Performance And Scalability Issues
3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST 3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
...@@ -790,7 +812,9 @@ question “what zone is this record in?” much harder (and “pick a ...@@ -790,7 +812,9 @@ question “what zone is this record in?” much harder (and “pick a
random zone”, but that's less common). It could be done with as random zone”, but that's less common). It could be done with as
few as 4 bits from the record header.[footnote: few as 4 bits from the record header.[footnote:
Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives
the maximal 2^{61} byte zone. Zones range in factor of 8 steps. the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
Given the zone size for the zone the current record is in, we can
determine the start of the zone.
] ]
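
The zone-size arithmetic in the footnote can be checked with a short sketch. The function names are hypothetical, and the way the zone start is derived is an assumption: the text does not say how zones are laid out, so this sketch supposes that zones of one size sit back to back starting at a known offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: zone size in bytes for a 4-bit zone number N,
 * using the 2^(16 + N*3) rule from the footnote. */
static uint64_t example_zone_size(unsigned int n)
{
	assert(n <= 15);	/* only 4 bits are available */
	return (uint64_t)1 << (16 + n * 3);
}

/* Hypothetical helper: start of the zone containing 'record_off'.
 * Assumption (not from the design doc): zones of this size are laid
 * out contiguously beginning at 'first_zone_off'. */
static uint64_t example_zone_start(uint64_t record_off,
				   uint64_t first_zone_off,
				   unsigned int n)
{
	uint64_t size = example_zone_size(n);

	assert(record_off >= first_zone_off);
	return record_off - (record_off - first_zone_off) % size;
}
```

With N = 0 the size is 65536 bytes and with N = 15 it is 2^61 bytes; each increment of N multiplies the size by 8, matching the footnote.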
3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented 3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
...@@ -1009,7 +1033,8 @@ we need only check for recovery if this is set. ...@@ -1009,7 +1033,8 @@ we need only check for recovery if this is set.
3.9.1 Proposed Solution 3.9.1 Proposed Solution
None. At some point you say “use a real database”. None. At some point you say “use a real database” (but see [replay-attribute]
).
But as a thought experiment, if we implemented transactions to But as a thought experiment, if we implemented transactions to
only overwrite free entries (this is tricky: there must not be a only overwrite free entries (this is tricky: there must not be a
...@@ -1038,11 +1063,11 @@ failed. ...@@ -1038,11 +1063,11 @@ failed.
3.10.1 Proposed Solution 3.10.1 Proposed Solution
We could solve a small part of the problem by providing read-only None (but see [replay-attribute]). We could solve a small part of
transactions. These would allow one write transaction to begin, the problem by providing read-only transactions. These would
but it could not commit until all r/o transactions are done. This allow one write transaction to begin, but it could not commit
would require a new RO_TRANSACTION_LOCK, which would be upgraded until all r/o transactions are done. This would require a new
on commit. RO_TRANSACTION_LOCK, which would be upgraded on commit.
3.11 Default Hash Function Is Suboptimal 3.11 Default Hash Function Is Suboptimal
...@@ -1137,3 +1162,17 @@ filled). On crash, tdb_open() would examine the array of top ...@@ -1137,3 +1162,17 @@ filled). On crash, tdb_open() would examine the array of top
levels, and apply the transactions until it encountered an levels, and apply the transactions until it encountered an
invalid checksum. invalid checksum.
3.15 Tracing Is Fragile, Replay Is External
The current TDB has compile-time-enabled tracing code, but it
often breaks as it is not enabled by default. In a similar way,
the ctdb code has an external wrapper which does replay tracing
so it can coordinate cluster-wide transactions.
3.15.1 Proposed Solution<replay-attribute>
Tridge points out that an attribute can be later added to
tdb_open (see [attributes]) to provide replay/trace hooks, which
could become the basis for this and future parallel transactions
and snapshot support.