Commit f38caef7 authored by Kirill Smelkov

.

parent 77ccb352
@@ -6,8 +6,8 @@ This file contains notes additional to usage documentation and internal
 organization overview in wcfs.go .
 
-Invalidations to wcfs clients are delayed until they read
-=========================================================
+Invalidations to wcfs clients are delayed until block access
+============================================================
 
 Initially it was planned that wcfs would send invalidation messages to its
 clients right after receiving invalidation message from ZODB at transaction
@@ -18,7 +18,7 @@ Emitting whole δR right at transaction-boundary time requires to keep whole
 ZBigFile.blktab index in RAM. Even though from space point of view it is
 somewhat acceptable (~ 0.01% of whole-file data size, i.e. ~ 128MB of index for
 ~ 1TB of data), it is not good from time overhead point of view - initial open
-of a file this way would be potentially very slow.
+of a file this way would be potentially slow.
 
 -> we took the approach where we invalidate a block lazily only when it is
 actually accessed.
...
@@ -238,11 +238,11 @@ package main
 //
 // Wcfs is a ZODB client that translates ZODB objects into OS files as would
 // non-wcfs wendelin.core do for a ZBigFile. Contrary to non-wcfs wendelin.core,
-// it keeps bigfile data in shared cache efficiently. It is organized as follows:
+// it keeps bigfile data in shared OS cache efficiently. It is organized as follows:
 //
 // 1) 1 ZODB connection for "latest data" for whole filesystem (zhead).
-// 2) head/data of all bigfiles represent state as of zhead.At .
-// 3) for */head/data the following invariant is maintained:
+// 2) head/bigfile/* of all bigfiles represent state as of zhead.At .
+// 3) for head/bigfile/* the following invariant is maintained:
 //
 //    #blk ∈ file cache => ZBlk(#blk) + all BTree/Bucket that lead to it ∈ zhead cache
 //                         (ZBlk* in ghost state)
@@ -258,7 +258,7 @@ package main
 //    try to synchronize to kernel freeing its pagecache pages.
 //
 // 4) when we receive an invalidation message from ZODB - we process it and
-//    propagate invalidations to OS file cache of */head/data:
+//    propagate invalidations to OS file cache of head/bigfile/*:
 //
 //    invalidation message: (tid↑, []oid)
 //
@@ -277,22 +277,23 @@ package main
 //
 //    4.4) for all file/blk to invalidate we do:
 //
-//         - try to retrieve file/head/data[blk] from OS file cache;
+//         - try to retrieve head/bigfile/file[blk] from OS file cache;
 //         - if retrieved successfully -> store retrieved data back into OS file
-//           cache for file/@<rev>/data[blk], where
+//           cache for @<rev>/bigfile/file[blk], where
 //
 //           rev = max(δFtail.by(#blk)) || min(rev ∈ δFtail) || zhead.at ; see below about δFtail
 //
-//         - invalidate file/head/data[blk] in OS file cache.
+//         - invalidate head/bigfile/file[blk] in OS file cache.
 //
 //         This preserves previous data in OS file cache in case it will be needed
-//         by not-yet-uptodate clients, and makes sure file read of head/data[blk]
+//         by not-yet-uptodate clients, and makes sure file read of head/bigfile/file[blk]
 //         won't be served from OS file cache and instead will trigger a FUSE read
 //         request to wcfs.
 //
 //    4.5) no invalidation messages are sent to wcfs clients at this point(*).
 //
-//    XXX processing ZODB invalidations and serving reads are mutually exclusive.
+//    4.6) processing ZODB invalidations and serving file reads (see 7) are
+//         organized to be mutually exclusive.
 //
 // 5) after OS file cache was invalidated, we resync zhead to new database
 //    view corresponding to tid.
@@ -305,12 +306,15 @@ package main
 //    δFtail.tail describes invalidations to file we learned from ZODB invalidation.
 //    δFtail.by allows to quickly lookup information by #blk.
 //
-//    min(rev) in δFtail is min(@at) at which head/data is currently mmapped (see below).
-//    XXX min(10 minutes) of history to support initial openings
+//    min(rev) in δFtail is min(@at) at which head/bigfile/file is currently mmapped (see below).
 //
-// 7) when we receive a FUSE read(#blk) request to a file/head/data we process it as follows:
+//    to support initial openings with @at being slightly in the past, we also
+//    make sure that min(rev) is enough to cover last 10 minutes of history
+//    from head/at.
+//
+// 7) when we receive a FUSE read(#blk) request to a head/bigfile/file we process it as follows:
 //
-//    7.1) load blkdata for head/data[blk] @zhead.at .
+//    7.1) load blkdata for head/bigfile/file[blk] @zhead.at .
 //
 //         while loading this also gives upper bound estimate of when the block
 //         was last changed:
@@ -334,13 +338,13 @@ package main
 //         rev(blk) ≤ rev'(blk)        rev'(blk) = min(^^^)
 //
 //
-//    7.2) for all client@at mmappings of file/head/data:
+//    7.2) for all client@at mmappings of head/bigfile/file:
 //
 //         - rev'(blk) ≤ at: -> do nothing
 //         - rev'(blk) > at:
 //           - if blk ∈ mmapping.pinned -> do nothing
 //           - rev = max(δFtail.by(#blk) : _ ≤ at) || min(rev ∈ δFtail : rev ≤ at) || at
-//           - client.remmap(file, #blk, @rev/data)
+//           - client.remmap(file, #blk, @rev/bigfile/file)
 //           - mmapping.pinned += blk
 //
 //    remmapping is done via "invalidation protocol" exchange with client.
@@ -348,7 +352,7 @@ package main
 //    wcfs-trusted code via ptrace that wcfs injects into clients, but ptrace
 //    won't work when client thread is blocked under pagefault or syscall(~) )
 //
-//    in order to support remmapping for each file/head/data
+//    in order to support remmapping for each head/bigfile/file
 //
 //      [] of mmapping{client@at↑, pinned}
 //
@@ -360,15 +364,14 @@ package main
 //    and a client that wants @rev data will get @rev data, even if it was this
 //    "old" client that triggered the pagefault(+).
 //
-// (*) see "Invalidations to wcfs clients are delayed until they read" in notes.txt
+// (*) see "Invalidations to wcfs clients are delayed until block access" in notes.txt
 // (+) see "Changing mmapping while under pagefault is possible" in notes.txt
 // (~) see "Client cannot be ptraced while under pagefault" in notes.txt
 //
 //
-// XXX mmap(@at) open
-//
 // XXX 8) serving read from @<rev>/data + zconn(s) for historical state
 //
+// XXX For every ZODB connection a dedicated read-only transaction is maintained.
 //
 // XXX(integrate place=?) ZData - no need to keep track -> ZBlk1 is always
 // marked as changed on blk data change.
@@ -419,20 +422,12 @@ package main
 //
 // δ(BTree) in wcfs context:
 //
-// . -k(blk) -> invalidata #blk
+// . -k(blk) -> invalidate #blk
 // . +k(blk) -> invalidate #blk (e.g. if blk was previously read as hole)
 //
 //
 // ----------------------------------------
 //
-// - XXX(kill) 1 ZODB connection per 1 bigfile (each bigfile can be at its different @at,
-//   because invalidations for different bigfiles can be processed with different
-//   timings depending on clients). No harm here as different bigfiles use
-//   completely different ZODB BTree and data objects.
-//
-// For every ZODB connection a dedicated read-only transaction is maintained.
-//
-//
 // Notes on OS pagecache control:
 //
 // the cache of snapshotted bigfile can be pre-made hot, if invalidated region
...