Commit c40f3831 authored by Kirill Smelkov

.

parent 901e9fc1
@@ -28,7 +28,7 @@
 // file that represents whole ZBigFile's data.
 //
 // For a client, the primary way to access a bigfile should be to mmap
-// bigfile/<bigfileX>/head/data which represents always latest bigfile data.
+// head/bigfile/<bigfileX> which represents always latest bigfile data.
 // Clients that want to get isolation guarantee should subscribe for
 // invalidations and re-mmap invalidated regions to file with pinned bigfile revision for
 // the duration of their transaction. See "Invalidation protocol" for details.
@@ -42,119 +42,125 @@
 //
 // Top-level structure of provided filesystem is as follows:
 //
-// bigfile/
-// <oid(bigfile1)>/
+// head/ ; latest database view
 // ...
-// <oid(bigfile2)>/
+// @<rev1>/ ; database view as of revision <revX>
 // ...
-// ...
-//
-// where for a bigfileX there is bigfile/<oid(bigfileX)>/ directory, with
-// oid(bigfileX) being ZODB object-id of corresponding ZBigFile object formatted with %016x.
-//
-// Each bigfileX/ has the following structure:
-//
-// bigfile/<bigfileX>/
-// head/ ; latest bigfile revision
-// ...
-// @<tid1>/ ; bigfile revision as of transaction <tidX>
-// ...
-// @<tid2>/
+// @<rev2>/
 // ...
 // ...
 //
-// where head/ represents latest bigfile as stored in upstream ZODB, and
-// @<tidX>/ represents bigfile as of transaction <tidX>.
+// where head/ represents latest data as stored in upstream ZODB, and
+// @<revX>/ represents data as of revision <revX>.
 //
 // head/ has the following structure:
 //
-// bigfile/<bigfileX>/head/
-// data ; latest bigfile data
-// at ; data is bigfile view as of this ZODB transaction
-// invalidations ; channel that describes invalidated data regions
+// head/
+// at ; data inside head/ is as of this ZODB transaction
+// watch ; channel for bigfile invalidations
+// bigfile/ ; bigfiles' data
+// <oid(bigfile1)>
+// <oid(bigfile2)>
+// ...
 //
-// where /data represents latest bigfile data as stored in upstream ZODB. As
-// there can be some lag receiving updates from the database, /at describes
-// precisely ZODB state for which bigfile data is currently exposed. Whenever
-// bigfile data is changed in upstream ZODB, information about the changes is
-// first propagated to /invalidations, and only after that /data is
-// updated. See "Invalidation protocol" for details.
+// where /bigfile/<bigfileX> represents latest bigfile data as stored in
+// upstream ZODB. As there can be some lag receiving updates from the database,
+// /at describes precisely ZODB state for which bigfile data is currently
+// exposed. Whenever bigfile data is changed in upstream ZODB, information
+// about the changes is first propagated to /watch, and only after that
+// /bigfile/<bigfileX> is updated. See "Invalidation protocol" for details.
 //
-// @<tidX>/ has the following structure:
+// @<revX>/ has the following structure:
 //
-// bigfile/<bigfileX>/@<tidX>/
-// data ; bigfile data as of transaction <tidX>
+// @<revX>/
+// at
+// bigfile/ ; bigfiles' data as of revision <revX>
+// <oid(bigfile1)>
+// <oid(bigfile2)>
+// ...
 //
-// where /data represents bigfile data as of transaction <tidX>.
+// where /bigfile/<bigfileX> represent bigfile data as of revision <revX>.
 //
-// bigfile/<bigfileX>/ should be created by client via mkdir. Unless explicitly
-// created bigfile/<bigfileX>/ are not automatically visible in wcfs
-// filesystem. Similarly bigfile/<bigfileX>/@<tidX>/ should be too created by
-// client.
+// Unless accessed {head,@<revX>}/bigfile/<bigfileX> are not automatically visible in
+// wcfs filesystem. Similarly @<revX>/ should be explicitly created by client via mkdir.
 //
 //
 // Invalidation protocol
 //
-// XXX invalidations will be done via ptrace because we need them to be
-// synchronous (see "wcfs organization")
-//
-// In order to support isolation wcfs implements invalidation protocol that
+// In order to support isolation, wcfs implements invalidation protocol that
 // must be cooperatively followed by both wcfs and client.
 //
-// First, before client wants to mmap bigfile, it opens
-// bigfile/<bigfileX>/head/invalidations and tells wcfs through it for which
-// ZODB state it wants to get bigfile view. The server in turn reports for
-// which ZODB state head/data is current, δ describing changed bigfile region
-// between those revisions, or "wait" flag if server state is earlier compared
-// to what client wants:
+// First, client mmaps latest bigfile, but does not access it
 //
-// C: want <Cat>
-// S: have <Sat>, wait ; Sat < Cat
-// S: have <Sat>, δR(Cat,Sat) ; Sat ≥ Cat
+// mmap(head/bigfile/<bigfileX>)
 //
-// If server reply was "wait" the client does nothing and waits for next server
-// message which must come without "wait" flag set. When client receives have
-// message with δR(Cat,Sat) it has the guarantee from wcfs that head/data
-// content is for Sat ZODB revision and won't change until client sends ack
-// back to the server. The client in turn now can mmap head/data and
-// @<Cat>/data to get bigfile view as of Cat:
+// Then client opens head/watch and tells wcfs through it for which ZODB state
+// it wants to get bigfile's view.
 //
-// mmap(bigfile/<bigfileX>/head/data)
-// mmap(bigfile/<bigfileX>/@<Cat>/data, δR(Cat,Sat), MAP_FIXED) # mmaped at addresses corresponding to δR(Cat,Sat)
+// C: 1 watch <bigfileX> @<at>
 //
-// When client completes its initial mmapping it sends ack back to the server:
+// The server then, after potentially sending initial pin messages (see below),
+// reports either success or failure:
 //
-// C: ack
+// S: 1 ok
+// S: 1 error ... ; if <at> is too far away back from head/at
 //
-// From now on the server will be processing updates to bigfile coming from
-// ZODB as follows:
+// The server sends "ok" reply only after head/at is ≥ requested <at>, and
+// only after all initial pin messages are fully acknowledged by the client.
+// The client can start to use mmapped data after it gets "ok".
+// The server sends "error" reply if requested <at> is too far away back from
+// head/at.
 //
+// Upon watch request, either initially, or after sending "ok", the server will be notifying the
+// client about file blocks that client needs to pin in order to observe file's
+// data as of <at> revision:
 //
-// The filesystem server itself receives information about changed data
-// from ZODB server through regular ZODB invalidation channel (as it is ZODB
-// client itself). Then, before actually updating bigfile/<bigfileX>/head/data
-// content in changed part, it notifies through bigfile/<bigfileX>/head/invalidations
-// to clients that had opened this file (separately to each client) about the changes:
+// The filesystem server itself receives information about changed data from
+// ZODB server through regular ZODB invalidation channel (as it is ZODB client
+// itself). Then, separately for each changed file block, before actually
+// updating head/bigfile/<bigfileX> content, it notifies through head/watch to
+// clients, that had requested it (separately to each client), about the
+// changes:
 //
-// S: have <Sat>, δR(Sat_prev, Sat)
+// S: 2 pin <bigfileX> #<blk> @<rev_max>
 //
-// where Sat_prev is ZODB revision last reported to client for this bigfile,
-// and waits until they all confirm that changed file part can be updated in
-// global OS cache.
+// and waits until all clients confirm that changed file block can be updated
+// in global OS cache.
 //
-// The client in turn can now re-mmap invalidated regions to bigfile@Cat
+// The client in turn should now re-mmap requested to be pinned block to bigfile@<rev_max>
 //
-// # mmapped at addresses corresponding to δR(Sat_prev, Sat)
-// mmap(bigfile/<bigfileX>/@<Cat>/data, δR(Sat_prev, Sat), MAP_FIXED)
+// # mmapped at address corresponding to #blk
+// mmap(@<rev_max>/bigfile/<bigfileX>, #blk, MAP_FIXED)
 //
 // and must send ack back to the server when it is done:
 //
-// C: ack
+// C: 2 ack
+//
+// The server sends pin notifications only for file blocks, that are known to
+// be potentially changed after client's <at>, and <rev_max> describes the
+// upper bound for the block revision:
+//
+// <at> < <rev_max>
+//
+// The server maintains short history tail of file changes to be able to
+// support openings with <at> being slightly in the past compared to current
+// head/at. The server might reject a watch request if <at> is too far away in
+// the past from head/at. The client is advised to restart its transaction with
+// more uptodate database view if it gets watch setup error.
+//
+// A later request from the client for the same <bigfileX> but with different
+// <at>, overrides previous watch request for that file. A client can use "-"
+// instead of "@<at>" to stop watching the file.
+//
+// A single client can send several watch requests through single head/watch
+// open, as well as it can use several head/watch opens simultaneously.
+// The server sends pin notifications for all files requested to be watched via
+// every head/watch open.
 //
-// When clients are done with bigfile/<bigfileX>/@<Cat>/data (i.e. Cat
+// When clients are done with @<revX>/bigfile/<bigfileX> (i.e. client's
 // transaction ends and array is unmapped), the server sees number of opened
-// files to bigfile/<bigfileX>/@<Cat>/data drops to zero, and automatically
-// destroys bigfile/<bigfileX>/@<Cat>/ directory after reasonable timeout.
+// files to @<revX>/bigfile/<bigfileX> drops to zero, and automatically
+// destroys @<revX>/bigfile/<bigfileX> after reasonable timeout.
 //
 //
 // Protection against slow or faulty clients
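Note on the watch/pin/ack exchange described in the "Invalidation protocol" comment of this hunk: the following is a minimal client-side sketch of how the three message shapes could be formatted and parsed. It is not code from this commit. The type `PinMsg` and the functions `WatchRequest`, `ParsePin` and `Ack` are hypothetical names, and the wire details (one message per line, whitespace-separated fields, decimal request and block numbers, `<bigfileX>` and revisions treated as opaque text) are assumptions that the comment does not spell out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PinMsg is a hypothetical in-memory form of the server message
//
//	S: <reqID> pin <bigfileX> #<blk> @<rev_max>
type PinMsg struct {
	ReqID uint64 // request number that the client's ack must echo
	File  string // <bigfileX>, kept as opaque text (assumption)
	Blk   int64  // block number that follows '#'
	Rev   string // <rev_max> that follows '@', kept as opaque text (assumption)
}

// WatchRequest formats the client message "<reqID> watch <bigfileX> @<at>".
func WatchRequest(reqID uint64, file, at string) string {
	return fmt.Sprintf("%d watch %s @%s", reqID, file, at)
}

// ParsePin parses one pin message in the assumed whitespace-separated form.
func ParsePin(line string) (PinMsg, error) {
	f := strings.Fields(line)
	if len(f) != 5 || f[1] != "pin" {
		return PinMsg{}, fmt.Errorf("not a pin message: %q", line)
	}
	if !strings.HasPrefix(f[3], "#") || !strings.HasPrefix(f[4], "@") {
		return PinMsg{}, fmt.Errorf("malformed pin arguments: %q", line)
	}
	id, err := strconv.ParseUint(f[0], 10, 64)
	if err != nil {
		return PinMsg{}, fmt.Errorf("bad request number: %w", err)
	}
	blk, err := strconv.ParseInt(f[3][1:], 10, 64)
	if err != nil {
		return PinMsg{}, fmt.Errorf("bad block number: %w", err)
	}
	return PinMsg{ReqID: id, File: f[2], Blk: blk, Rev: f[4][1:]}, nil
}

// Ack formats the client reply "<reqID> ack" for a pin message.
// The client is expected to send it only after it has re-mmapped the block.
func Ack(p PinMsg) string {
	return fmt.Sprintf("%d ack", p.ReqID)
}

func main() {
	// The file and revision values below are made up for illustration.
	fmt.Println(WatchRequest(1, "0000000000000001", "03d2f4a1b2c3d4e5"))

	p, err := ParsePin("2 pin 0000000000000001 #7 @03d2f4a1b2c3d4e6")
	if err != nil {
		panic(err)
	}
	// A real client would mmap @<rev_max>/bigfile/<bigfileX> over block #7
	// with MAP_FIXED here, and only then acknowledge.
	fmt.Println(Ack(p))
}
```

Echoing the request number in the ack mirrors the `S: 2 pin ...` / `C: 2 ack` pairing in the comment; a client that holds several watches on one head/watch open would presumably need that number to match replies to requests.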
@@ -293,6 +299,7 @@ package main
 // δFtail.by allows to quickly lookup information by #blk.
 //
 // min(rev) in δFtail is min(@at) at which head/data is currently mmapped (see below).
+// XXX min(10 minutes) of history to support initial openenings
 //
 // 7) when we receive a FUSE read(#blk) request to a file/head/data we process it as follows:
 //
......
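Note on the layout change in this commit: the per-file tree bigfile/<bigfileX>/{head,@<tidX>}/data becomes a per-revision tree {head,@<revX>}/bigfile/<bigfileX>. The sketch below shows how a client might build the new paths relative to the wcfs mountpoint. `HeadPath` and `RevPath` are hypothetical helpers, not functions from the wcfs sources. The `%016x` formatting of the object id is taken from the comment text that this commit removes, so treating it as still valid is an assumption, and formatting the revision the same way is a further assumption.

```go
package main

import "fmt"

// HeadPath returns the path, relative to the wcfs mountpoint, of the file
// that exposes the latest data of the bigfile with the given ZODB object id.
// Assumption: the oid is still formatted with %016x as in the old layout.
func HeadPath(oid uint64) string {
	return fmt.Sprintf("head/bigfile/%016x", oid)
}

// RevPath returns the path of the same bigfile as of revision rev.
// Assumption: the revision is formatted with %016x as well.
func RevPath(rev, oid uint64) string {
	return fmt.Sprintf("@%016x/bigfile/%016x", rev, oid)
}

func main() {
	// Illustrative values only.
	fmt.Println(HeadPath(0x1))
	fmt.Println(RevPath(0x03d2f4a1b2c3d4e5, 0x1))
}
```

Per the updated comment, a client would mkdir the @<revX>/ directory before opening a file below it, while entries under bigfile/ appear on access.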