Commit b8d05b19 authored by Rusty Russell

tdb2: update design doc.

parent 44eea6ca
......@@ -53,8 +53,8 @@ Rusty Russell, IBM Corporation
\change_deleted 0 1283307542
26-July
\change_inserted 0 1284016854
9-September
\change_inserted 0 1284423485
14-September
\change_unchanged
-2010
\end_layout
......@@ -476,6 +476,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional
\begin_layout Subsubsection
Proposed Solution
\change_inserted 0 1284422789
\begin_inset CommandInset label
LatexCommand label
name "attributes"
\end_inset
\change_unchanged
\end_layout
\begin_layout Standard
......@@ -1289,13 +1300,69 @@ Proposed Solution
\begin_layout Standard
\change_inserted 0 1284016847
\change_inserted 0 1284422552
We often have extra padding at the tail of a record.
If we ensure that the first byte (if any) of this padding is zero, we will
have a way for future changes to detect code which doesn't understand a
new format: the new code would write (say) a 1 at the tail, and thus if
there is no tail or the first byte is 0, we would know the extension is
not present on that record.
\end_layout
\begin_layout Subsection
\change_inserted 0 1284422568
TDB Does Not Use Talloc
\end_layout
\begin_layout Standard
\change_inserted 0 1284422646
Many users of TDB (particularly Samba) use the talloc allocator, and thus
have to wrap TDB in a talloc context to use it conveniently.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284422656
Proposed Solution
\end_layout
\begin_layout Standard
\change_inserted 0 1284423065
The allocation within TDB is not complicated enough to justify the use of
talloc, and I am reluctant to force another (excellent) library on TDB
users.
Nonetheless a compromise is possible.
An attribute (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) can be added later to tdb_open() to provide an alternate allocation mechanism,
specifically for talloc but usable by any other allocator (which would
ignore the
\begin_inset Quotes eld
\end_inset
context
\begin_inset Quotes erd
\end_inset
argument).
\end_layout
\begin_layout Standard
\change_inserted 0 1284423042
This would form a talloc hierarchy as expected, but the caller would still
have to attach a destructor to the tdb context returned from tdb_open to
close it.
All TDB_DATA fields would be children of the tdb_context, and the caller
would still have to manage them (using talloc_free() or talloc_steal()).
\change_unchanged
\end_layout
......@@ -1875,7 +1942,7 @@ status open
\begin_layout Plain Layout
\change_inserted 0 1283310945
\change_inserted 0 1284424151
Using
\begin_inset Formula $2^{16+N*3}$
\end_inset
......@@ -1886,6 +1953,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal
byte zone.
Zones range in factor of 8 steps.
Given the zone size for the zone the current record is in, we can determine
the start of the zone.
\change_unchanged
\end_layout
......@@ -2330,6 +2399,8 @@ TDB Does Not Have Snapshot Support
\begin_layout Subsubsection
Proposed Solution
\change_deleted 0 1284423472
\end_layout
\begin_layout Standard
......@@ -2342,7 +2413,23 @@ use a real database
\begin_inset Quotes erd
\end_inset
\change_inserted 0 1284423891
\change_deleted 0 1284423891
.
\change_inserted 0 1284423901
(but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
\end_layout
\begin_layout Standard
......@@ -2365,6 +2452,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack
\begin_layout Standard
We could then implement snapshots using a similar method, using multiple
different hash tables/free tables.
\change_inserted 0 1284423495
\end_layout
\begin_layout Subsection
......@@ -2384,6 +2473,18 @@ Proposed Solution
\end_layout
\begin_layout Standard
\change_inserted 0 1284424201
None (but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
We could solve a small part of the problem by providing read-only transactions.
These would allow one write transaction to begin, but it could not commit
until all r/o transactions are done.
......@@ -2569,6 +2670,53 @@ At some later point, a sync would allow recovery of the old data into the
free lists (perhaps when the array of top-level pointers filled).
On crash, tdb_open() would examine the array of top levels, and apply the
transactions until it encountered an invalid checksum.
\change_inserted 0 1284423555
\end_layout
\begin_layout Subsection
\change_inserted 0 1284423617
Tracing Is Fragile, Replay Is External
\end_layout
\begin_layout Standard
\change_inserted 0 1284423719
The current TDB has compile-time-enabled tracing code, but it often breaks
as it is not enabled by default.
In a similar way, the ctdb code has an external wrapper which does replay
tracing so it can coordinate cluster-wide transactions.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284423864
Proposed Solution
\begin_inset CommandInset label
LatexCommand label
name "replay-attribute"
\end_inset
\end_layout
\begin_layout Standard
\change_inserted 0 1284423850
Tridge points out that an attribute can be later added to tdb_open (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) to provide replay/trace hooks, which could become the basis for this and
future parallel transactions and snapshot support.
\change_unchanged
\end_layout
\end_body
......
head 1.9;
head 1.10;
access;
symbols;
locks; strict;
comment @# @;
1.10
date 2010.09.14.00.33.57; author rusty; state Exp;
branches;
next 1.9;
1.9
date 2010.09.09.07.25.12; author rusty; state Exp;
branches;
......@@ -56,9 +61,9 @@ desc
@
1.9
1.10
log
@Extension mechanism.
@Tracing attribute, talloc support.
@
text
@#LyX 1.6.5 created this file. For more info see http://www.lyx.org/
......@@ -116,8 +121,8 @@ Rusty Russell, IBM Corporation
\change_deleted 0 1283307542
26-July
\change_inserted 0 1284016854
9-September
\change_inserted 0 1284423485
14-September
\change_unchanged
-2010
\end_layout
......@@ -539,6 +544,17 @@ The tdb_open() call was expanded to tdb_open_ex(), which added an optional
\begin_layout Subsubsection
Proposed Solution
\change_inserted 0 1284422789
\begin_inset CommandInset label
LatexCommand label
name "attributes"
\end_inset
\change_unchanged
\end_layout
\begin_layout Standard
......@@ -1352,13 +1368,69 @@ Proposed Solution
\begin_layout Standard
\change_inserted 0 1284016847
\change_inserted 0 1284422552
We often have extra padding at the tail of a record.
If we ensure that the first byte (if any) of this padding is zero, we will
have a way for future changes to detect code which doesn't understand a
new format: the new code would write (say) a 1 at the tail, and thus if
there is no tail or the first byte is 0, we would know the extension is
not present on that record.
\end_layout
\begin_layout Subsection
\change_inserted 0 1284422568
TDB Does Not Use Talloc
\end_layout
\begin_layout Standard
\change_inserted 0 1284422646
Many users of TDB (particularly Samba) use the talloc allocator, and thus
have to wrap TDB in a talloc context to use it conveniently.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284422656
Proposed Solution
\end_layout
\begin_layout Standard
\change_inserted 0 1284423065
The allocation within TDB is not complicated enough to justify the use of
talloc, and I am reluctant to force another (excellent) library on TDB
users.
Nonetheless a compromise is possible.
An attribute (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) can be added later to tdb_open() to provide an alternate allocation mechanism,
specifically for talloc but usable by any other allocator (which would
ignore the
\begin_inset Quotes eld
\end_inset
context
\begin_inset Quotes erd
\end_inset
argument).
\end_layout
\begin_layout Standard
\change_inserted 0 1284423042
This would form a talloc hierarchy as expected, but the caller would still
have to attach a destructor to the tdb context returned from tdb_open to
close it.
All TDB_DATA fields would be children of the tdb_context, and the caller
would still have to manage them (using talloc_free() or talloc_steal()).
\change_unchanged
\end_layout
......@@ -1938,7 +2010,7 @@ status open
\begin_layout Plain Layout
\change_inserted 0 1283310945
\change_inserted 0 1284424151
Using
\begin_inset Formula $2^{16+N*3}$
\end_inset
......@@ -1949,6 +2021,8 @@ means 0 gives a minimal 65536-byte zone, 15 gives the maximal
byte zone.
Zones range in factor of 8 steps.
Given the zone size for the zone the current record is in, we can determine
the start of the zone.
\change_unchanged
\end_layout
......@@ -2393,6 +2467,8 @@ TDB Does Not Have Snapshot Support
\begin_layout Subsubsection
Proposed Solution
\change_deleted 0 1284423472
\end_layout
\begin_layout Standard
......@@ -2405,7 +2481,23 @@ use a real database
\begin_inset Quotes erd
\end_inset
\change_inserted 0 1284423891
\change_deleted 0 1284423891
.
\change_inserted 0 1284423901
(but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
\end_layout
\begin_layout Standard
......@@ -2428,6 +2520,8 @@ This would not allow arbitrary changes to the database, such as tdb_repack
\begin_layout Standard
We could then implement snapshots using a similar method, using multiple
different hash tables/free tables.
\change_inserted 0 1284423495
\end_layout
\begin_layout Subsection
......@@ -2447,6 +2541,18 @@ Proposed Solution
\end_layout
\begin_layout Standard
\change_inserted 0 1284424201
None (but see
\begin_inset CommandInset ref
LatexCommand ref
reference "replay-attribute"
\end_inset
).
\change_unchanged
We could solve a small part of the problem by providing read-only transactions.
These would allow one write transaction to begin, but it could not commit
until all r/o transactions are done.
......@@ -2632,6 +2738,53 @@ At some later point, a sync would allow recovery of the old data into the
free lists (perhaps when the array of top-level pointers filled).
On crash, tdb_open() would examine the array of top levels, and apply the
transactions until it encountered an invalid checksum.
\change_inserted 0 1284423555
\end_layout
\begin_layout Subsection
\change_inserted 0 1284423617
Tracing Is Fragile, Replay Is External
\end_layout
\begin_layout Standard
\change_inserted 0 1284423719
The current TDB has compile-time-enabled tracing code, but it often breaks
as it is not enabled by default.
In a similar way, the ctdb code has an external wrapper which does replay
tracing so it can coordinate cluster-wide transactions.
\end_layout
\begin_layout Subsubsection
\change_inserted 0 1284423864
Proposed Solution
\begin_inset CommandInset label
LatexCommand label
name "replay-attribute"
\end_inset
\end_layout
\begin_layout Standard
\change_inserted 0 1284423850
Tridge points out that an attribute can be later added to tdb_open (see
\begin_inset CommandInset ref
LatexCommand ref
reference "attributes"
\end_inset
) to provide replay/trace hooks, which could become the basis for this and
future parallel transactions and snapshot support.
\change_unchanged
\end_layout
\end_body
......@@ -2639,6 +2792,33 @@ At some later point, a sync would allow recovery of the old data into the
@
1.9
log
@Extension mechanism.
@
text
@d56 2
a57 2
\change_inserted 0 1284016854
9-September
d479 11
d1303 1
a1303 1
\change_inserted 0 1284016847
d1310 56
d1945 1
a1945 1
\change_inserted 0 1283310945
d1956 2
d2402 2
d2416 4
d2421 12
d2455 2
d2476 12
d2673 47
@
1.8
log
@Remove bogus footnote
......
......@@ -2,7 +2,7 @@ TDB2: A Redesigning The Trivial DataBase
Rusty Russell, IBM Corporation
9-September-2010
14-September-2010
Abstract
......@@ -74,7 +74,7 @@ optional hashing function and an optional logging function
argument. Additional arguments to open would require the
introduction of a tdb_open_ex2 call etc.
2.1.1 Proposed Solution
2.1.1 Proposed Solution<attributes>
tdb_open() will take a linked-list of attributes:
......@@ -519,6 +519,28 @@ understand a new format: the new code would write (say) a 1 at
the tail, and thus if there is no tail or the first byte is 0, we
would know the extension is not present on that record.
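The tail-byte check described above can be sketched in C as follows. This is only an illustration: record_has_extension() and its parameters are hypothetical names, not part of the TDB API, and the value 1 is the "(say) a 1" from the text rather than a defined constant.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a TDB function): decide whether a record
 * carries an extension, given a pointer to its tail padding and the
 * padding length in bytes.
 */
static bool record_has_extension(const unsigned char *padding,
                                 size_t padding_len)
{
    /* No tail at all: the extension cannot be present. */
    if (padding_len == 0)
        return false;

    /* Old code always writes 0 here; new code writes a non-zero marker. */
    return padding[0] != 0;
}
```

Old code that always zeroes the first padding byte therefore never produces a record that new code would mistake for an extended one.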
2.17 TDB Does Not Use Talloc
Many users of TDB (particularly Samba) use the talloc allocator,
and thus have to wrap TDB in a talloc context to use it
conveniently.
2.17.1 Proposed Solution
The allocation within TDB is not complicated enough to justify
the use of talloc, and I am reluctant to force another
(excellent) library on TDB users. Nonetheless a compromise is
possible. An attribute (see [attributes]) can be added later to
tdb_open() to provide an alternate allocation mechanism,
specifically for talloc but usable by any other allocator (which
would ignore the “context” argument).
This would form a talloc hierarchy as expected, but the caller
would still have to attach a destructor to the tdb context
returned from tdb_open to close it. All TDB_DATA fields would be
children of the tdb_context, and the caller would still have to
manage them (using talloc_free() or talloc_steal()).
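One possible shape for such an allocation attribute is sketched below. The design does not define it, so struct alloc_attribute, its field names and the malloc_* wrappers are all assumptions made for illustration. The sketch shows the non-talloc case, where the allocator simply ignores the "context" argument.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical attribute layout: a pair of callbacks plus a context
 * pointer that is handed back to them on every call.
 */
struct alloc_attribute {
    void *(*alloc)(void *context, size_t len);
    void (*release)(void *context, void *ptr);
    void *context; /* e.g. a talloc parent; unused by plain malloc */
};

/* Plain malloc-based allocator: the context argument is ignored. */
static void *malloc_alloc(void *context, size_t len)
{
    (void)context;
    return malloc(len);
}

static void malloc_release(void *context, void *ptr)
{
    (void)context;
    free(ptr);
}

static const struct alloc_attribute malloc_attribute = {
    malloc_alloc, malloc_release, NULL
};
```

A talloc-backed variant would presumably pass the parent context through to talloc_size() and release memory with talloc_free(), which is what would make returned data children of the tdb context.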
3 Performance And Scalability Issues
3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
......@@ -790,7 +812,9 @@ question “what zone is this record in?” much harder (and “pick a
random zone”, but that's less common). It could be done with as
few as 4 bits from the record header.[footnote:
Using 2^{16+N*3} means 0 gives a minimal 65536-byte zone, 15 gives
the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
Given the zone size for the zone the current record is in, we can
determine the start of the zone.
]
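The zone arithmetic in the footnote can be illustrated with a short C sketch. zone_size() follows the 2^{16+N*3} formula directly. zone_start() additionally assumes that each zone begins at a multiple of its own size, counted from offset 0; the text does not state this, so treat it as one plausible reading. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Zone size in bytes for a 4-bit zone number n (0..15): 2^(16 + 3n). */
static uint64_t zone_size(unsigned int n)
{
    assert(n <= 15);
    return (uint64_t)1 << (16 + n * 3);
}

/*
 * Start of the zone containing record_off.
 * Assumption: zones are aligned to their own size from offset 0, so the
 * start is the offset with the low bits masked off.
 */
static uint64_t zone_start(uint64_t record_off, unsigned int n)
{
    return record_off & ~(zone_size(n) - 1);
}
```

Under that alignment assumption, a record at offset 70000 sits in the zone starting at 65536 when N is 0, and in the zone starting at 0 when N is 1.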
3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
......@@ -1009,7 +1033,8 @@ we need only check for recovery if this is set.
3.9.1 Proposed Solution
None. At some point you say “use a real database”.
None. At some point you say “use a real database” (but see [replay-attribute]
).
But as a thought experiment, if we implemented transactions to
only overwrite free entries (this is tricky: there must not be a
......@@ -1038,11 +1063,11 @@ failed.
3.10.1 Proposed Solution
We could solve a small part of the problem by providing read-only
transactions. These would allow one write transaction to begin,
but it could not commit until all r/o transactions are done. This
would require a new RO_TRANSACTION_LOCK, which would be upgraded
on commit.
None (but see [replay-attribute]). We could solve a small part of
the problem by providing read-only transactions. These would
allow one write transaction to begin, but it could not commit
until all r/o transactions are done. This would require a new
RO_TRANSACTION_LOCK, which would be upgraded on commit.
3.11 Default Hash Function Is Suboptimal
......@@ -1137,3 +1162,17 @@ filled). On crash, tdb_open() would examine the array of top
levels, and apply the transactions until it encountered an
invalid checksum.
3.15 Tracing Is Fragile, Replay Is External
The current TDB has compile-time-enabled tracing code, but it
often breaks as it is not enabled by default. In a similar way,
the ctdb code has an external wrapper which does replay tracing
so it can coordinate cluster-wide transactions.
3.15.1 Proposed Solution<replay-attribute>
Tridge points out that an attribute can be later added to
tdb_open (see [attributes]) to provide replay/trace hooks, which
could become the basis for this and future parallel transactions
and snapshot support.