Commit c40f3831 authored by Kirill Smelkov

.

parent 901e9fc1
@@ -28,7 +28,7 @@
// file that represents whole ZBigFile's data.
//
// For a client, the primary way to access a bigfile should be to mmap
// bigfile/<bigfileX>/head/data which represents always latest bigfile data.
// head/bigfile/<bigfileX> which represents always latest bigfile data.
// Clients that want to get isolation guarantee should subscribe for
// invalidations and re-mmap invalidated regions to file with pinned bigfile revision for
// the duration of their transaction. See "Invalidation protocol" for details.
@@ -42,119 +42,125 @@
//
// Top-level structure of provided filesystem is as follows:
//
// bigfile/
// <oid(bigfile1)>/
// head/ ; latest database view
// ...
// <oid(bigfile2)>/
// @<rev1>/ ; database view as of revision <revX>
// ...
// ...
//
// where for a bigfileX there is bigfile/<oid(bigfileX)>/ directory, with
// oid(bigfileX) being ZODB object-id of corresponding ZBigFile object formatted with %016x.
//
// Each bigfileX/ has the following structure:
//
// bigfile/<bigfileX>/
// head/ ; latest bigfile revision
// ...
// @<tid1>/ ; bigfile revision as of transaction <tidX>
// ...
// @<tid2>/
// @<rev2>/
// ...
// ...
//
// where head/ represents latest bigfile as stored in upstream ZODB, and
// @<tidX>/ represents bigfile as of transaction <tidX>.
// where head/ represents latest data as stored in upstream ZODB, and
// @<revX>/ represents data as of revision <revX>.
//
// head/ has the following structure:
//
// bigfile/<bigfileX>/head/
// data ; latest bigfile data
// at ; data is bigfile view as of this ZODB transaction
// invalidations ; channel that describes invalidated data regions
// head/
// at ; data inside head/ is as of this ZODB transaction
// watch ; channel for bigfile invalidations
// bigfile/ ; bigfiles' data
// <oid(bigfile1)>
// <oid(bigfile2)>
// ...
//
// where /data represents latest bigfile data as stored in upstream ZODB. As
// there can be some lag receiving updates from the database, /at describes
// precisely ZODB state for which bigfile data is currently exposed. Whenever
// bigfile data is changed in upstream ZODB, information about the changes is
// first propagated to /invalidations, and only after that /data is
// updated. See "Invalidation protocol" for details.
// where /bigfile/<bigfileX> represents latest bigfile data as stored in
// upstream ZODB. As there can be some lag receiving updates from the database,
// /at describes precisely ZODB state for which bigfile data is currently
// exposed. Whenever bigfile data is changed in upstream ZODB, information
// about the changes is first propagated to /watch, and only after that
// /bigfile/<bigfileX> is updated. See "Invalidation protocol" for details.
//
// @<tidX>/ has the following structure:
// @<revX>/ has the following structure:
//
// bigfile/<bigfileX>/@<tidX>/
// data ; bigfile data as of transaction <tidX>
// @<revX>/
// at
// bigfile/ ; bigfiles' data as of revision <revX>
// <oid(bigfile1)>
// <oid(bigfile2)>
// ...
//
// where /data represents bigfile data as of transaction <tidX>.
// where /bigfile/<bigfileX> represents bigfile data as of revision <revX>.
//
// bigfile/<bigfileX>/ should be created by client via mkdir. Unless explicitly
// created bigfile/<bigfileX>/ are not automatically visible in wcfs
// filesystem. Similarly bigfile/<bigfileX>/@<tidX>/ should be too created by
// client.
// Unless accessed, {head,@<revX>}/bigfile/<bigfileX> files are not automatically visible in
// the wcfs filesystem. Similarly, @<revX>/ should be explicitly created by the client via mkdir.
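A minimal sketch of how a client might build paths for the new layout, relative to the wcfs mountpoint. The only detail taken from the comment above is that the object id is formatted with %016x. The `Oid` type and both function names are hypothetical, and the revision is passed as an already-formatted string because the text does not say how `<revX>` is encoded.

```go
package main

import "fmt"

// Oid stands in for a ZODB object id (hypothetical type for this sketch).
type Oid uint64

// headBigfilePath returns the path of the latest data of a bigfile,
// relative to the wcfs mountpoint: head/bigfile/<oid>.
func headBigfilePath(oid Oid) string {
	return fmt.Sprintf("head/bigfile/%016x", uint64(oid))
}

// revBigfilePath returns the path of bigfile data as of revision rev:
// @<rev>/bigfile/<oid>. Assumption: rev is already formatted by the
// caller; its encoding is not specified in the text above.
func revBigfilePath(rev string, oid Oid) string {
	return fmt.Sprintf("@%s/bigfile/%016x", rev, uint64(oid))
}
```

For example, `headBigfilePath(1)` yields `head/bigfile/0000000000000001`. According to the text above, the `@<rev>/` directory still has to be created by the client via mkdir before such a path can be opened.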
//
//
// Invalidation protocol
//
// XXX invalidations will be done via ptrace because we need them to be
// synchronous (see "wcfs organization")
//
// In order to support isolation wcfs implements invalidation protocol that
// In order to support isolation, wcfs implements an invalidation protocol that
// must be cooperatively followed by both wcfs and client.
//
// First, before client wants to mmap bigfile, it opens
// bigfile/<bigfileX>/head/invalidations and tells wcfs through it for which
// ZODB state it wants to get bigfile view. The server in turn reports for
// which ZODB state head/data is current, δ describing changed bigfile region
// between those revisions, or "wait" flag if server state is earlier compared
// to what client wants:
// First, the client mmaps the latest bigfile, but does not access it:
//
// C: want <Cat>
// S: have <Sat>, wait ; Sat < Cat
// S: have <Sat>, δR(Cat,Sat) ; Sat ≥ Cat
// mmap(head/bigfile/<bigfileX>)
//
// If server reply was "wait" the client does nothing and waits for next server
// message which must come without "wait" flag set. When client receives have
// message with δR(Cat,Sat) it has the guarantee from wcfs that head/data
// content is for Sat ZODB revision and won't change until client sends ack
// back to the server. The client in turn now can mmap head/data and
// @<Cat>/data to get bigfile view as of Cat:
// Then client opens head/watch and tells wcfs through it for which ZODB state
// it wants to get bigfile's view.
//
// mmap(bigfile/<bigfileX>/head/data)
// mmap(bigfile/<bigfileX>/@<Cat>/data, δR(Cat,Sat), MAP_FIXED) # mmaped at addresses corresponding to δR(Cat,Sat)
// C: 1 watch <bigfileX> @<at>
//
// When client completes its initial mmapping it sends ack back to the server:
// The server then, after potentially sending initial pin messages (see below),
// reports either success or failure:
//
// C: ack
// S: 1 ok
// S: 1 error ... ; if <at> is too far back from head/at
//
// From now on the server will be processing updates to bigfile coming from
// ZODB as follows:
// The server sends "ok" reply only after head/at is ≥ requested <at>, and
// only after all initial pin messages are fully acknowledged by the client.
// The client can start to use mmapped data after it gets "ok".
// The server sends "error" reply if requested <at> is too far back from
// head/at.
//
// Upon watch request, either initially, or after sending "ok", the server will be notifying the
// client about file blocks that client needs to pin in order to observe file's
// data as of <at> revision:
//
// The filesystem server itself receives information about changed data
// from ZODB server through regular ZODB invalidation channel (as it is ZODB
// client itself). Then, before actually updating bigfile/<bigfileX>/head/data
// content in changed part, it notifies through bigfile/<bigfileX>/head/invalidations
// to clients that had opened this file (separately to each client) about the changes:
// The filesystem server itself receives information about changed data from
// ZODB server through regular ZODB invalidation channel (as it is ZODB client
// itself). Then, separately for each changed file block, before actually
// updating head/bigfile/<bigfileX> content, it notifies, through head/watch, the
// clients that had requested it (separately for each client) about the
// changes:
//
// S: have <Sat>, δR(Sat_prev, Sat)
// S: 2 pin <bigfileX> #<blk> @<rev_max>
//
// where Sat_prev is ZODB revision last reported to client for this bigfile,
// and waits until they all confirm that changed file part can be updated in
// global OS cache.
// and waits until all clients confirm that changed file block can be updated
// in global OS cache.
//
// The client in turn can now re-mmap invalidated regions to bigfile@Cat
// The client in turn should now re-mmap the block requested to be pinned to bigfile@<rev_max>:
//
// # mmapped at addresses corresponding to δR(Sat_prev, Sat)
// mmap(bigfile/<bigfileX>/@<Cat>/data, δR(Sat_prev, Sat), MAP_FIXED)
// # mmapped at address corresponding to #blk
// mmap(@<rev_max>/bigfile/<bigfileX>, #blk, MAP_FIXED)
//
// and must send ack back to the server when it is done:
//
// C: ack
// C: 2 ack
//
// The server sends pin notifications only for file blocks that are known to
// be potentially changed after client's <at>, and <rev_max> describes the
// upper bound for the block revision:
//
// <at> < <rev_max>
//
// The server maintains short history tail of file changes to be able to
// support openings with <at> being slightly in the past compared to current
// head/at. The server might reject a watch request if <at> is too far away in
// the past from head/at. The client is advised to restart its transaction with
// a more up-to-date database view if it gets a watch setup error.
//
// A later request from the client for the same <bigfileX> but with different
// <at>, overrides previous watch request for that file. A client can use "-"
// instead of "@<at>" to stop watching the file.
//
// A single client can send several watch requests through single head/watch
// open, as well as it can use several head/watch opens simultaneously.
// The server sends pin notifications for all files requested to be watched via
// every head/watch open.
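The client side of the head/watch exchange described above could be sketched roughly as follows. The message shapes ("<id> watch <file> @<at>", "<id> pin <file> #<blk> @<rev_max>", "<id> ack") come from the comment. The exact wire encoding is not specified there, so this sketch assumes whitespace-separated fields and decimal request ids and block numbers. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pinRequest is a parsed server "pin" message (hypothetical type).
type pinRequest struct {
	ReqID  uint64 // request id from the server; echoed back in the ack
	File   string // <bigfileX>, kept verbatim
	Blk    int64  // block number taken from "#<blk>"
	RevMax string // <rev_max> taken from "@<rev_max>", kept verbatim
}

// watchRequest formats the client message "<id> watch <file> @<at>".
func watchRequest(id uint64, file, at string) string {
	return fmt.Sprintf("%d watch %s @%s", id, file, at)
}

// parsePin parses "<id> pin <file> #<blk> @<rev_max>".
// Assumption: fields are whitespace-separated; id and blk are decimal.
func parsePin(line string) (*pinRequest, error) {
	f := strings.Fields(line)
	if len(f) != 5 || f[1] != "pin" {
		return nil, fmt.Errorf("not a pin message: %q", line)
	}
	id, err := strconv.ParseUint(f[0], 10, 64)
	if err != nil {
		return nil, fmt.Errorf("bad request id %q: %w", f[0], err)
	}
	if !strings.HasPrefix(f[3], "#") || !strings.HasPrefix(f[4], "@") {
		return nil, fmt.Errorf("malformed pin arguments: %q", line)
	}
	blk, err := strconv.ParseInt(f[3][1:], 10, 64)
	if err != nil {
		return nil, fmt.Errorf("bad block number %q: %w", f[3], err)
	}
	return &pinRequest{ReqID: id, File: f[2], Blk: blk, RevMax: f[4][1:]}, nil
}

// ackFor formats the acknowledgement the client sends once the
// requested block has been re-mmapped: "<id> ack".
func ackFor(p *pinRequest) string {
	return fmt.Sprintf("%d ack", p.ReqID)
}
```

In this sketch the client would write `watchRequest(...)` to head/watch, then for every pin line received re-mmap the block from `@<rev_max>/bigfile/<bigfileX>` and reply with `ackFor(p)`. The re-mmap step itself is left out, since it needs a fixed-address mmap and a block size that the text does not give.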
//
// When clients are done with bigfile/<bigfileX>/@<Cat>/data (i.e. Cat
// When clients are done with @<revX>/bigfile/<bigfileX> (i.e. client's
// transaction ends and array is unmapped), the server sees number of opened
// files to bigfile/<bigfileX>/@<Cat>/data drops to zero, and automatically
// destroys bigfile/<bigfileX>/@<Cat>/ directory after reasonable timeout.
// files to @<revX>/bigfile/<bigfileX> drops to zero, and automatically
// destroys @<revX>/bigfile/<bigfileX> after reasonable timeout.
//
//
// Protection against slow or faulty clients
@@ -293,6 +299,7 @@ package main
// δFtail.by allows to quickly lookup information by #blk.
//
// min(rev) in δFtail is min(@at) at which head/data is currently mmapped (see below).
// XXX min(10 minutes) of history to support initial openings
//
// 7) when we receive a FUSE read(#blk) request to a file/head/data we process it as follows:
//
......