Commit 23d8da82 authored by Kirill Smelkov's avatar Kirill Smelkov

wcfs: zdata: ΔFtail

ΔFtail builds on ΔBtail and  provides ZBigFile-level history that WCFS
will use to compute which blocks of a ZBigFile need to be invalidated in
OS file cache given raw ZODB changes on ZODB invalidation message.

It also will be used by WCFS to implement isolation protocol, where on
every FUSE READ request WCFS will query ΔFtail to find out revision of
corresponding file block.

Quoting ΔFtail documentation:

---- 8< ----

ΔFtail provides ZBigFile-level history tail.

It translates ZODB object-level changes to information about which blocks of
which ZBigFile were modified, and provides service to query that information.

ΔFtail class documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

ΔFtail represents tail of revisional changes to files.

It semantically consists of

    []δF			; rev ∈ (tail, head]

where δF represents a change in files space

    δF:
    	.rev↑
    	{} file ->  {}blk | EPOCH

Only files and blocks explicitly requested to be tracked are guaranteed to
be present. In particular a block that was not explicitly requested to be
tracked, even if it was changed in δZ, is not guaranteed to be present in δF.

After file epoch (file creation, deletion, or any other change to file
object) previous track requests for that file become forgotten and have no
further effect.

ΔFtail provides the following operations:

  .Track(file, blk, path, zblk)	- add file and block reached via BTree path to tracked set.

  .Update(δZ) -> δF				- update files δ tail given raw ZODB changes
  .ForgetPast(revCut)			- forget changes ≤ revCut
  .SliceByRev(lo, hi) -> []δF		- query for all files changes with rev ∈ (lo, hi]
  .SliceByFileRev(file, lo, hi) -> []δfile	- query for changes of a file with rev ∈ (lo, hi]
  .BlkRevAt(file, #blk, at) -> blkrev	- query for what is last revision that changed
    					  file[#blk] as of @at database state.

where δfile represents a change to one file

    δfile:
    	.rev↑
    	{}blk | EPOCH

See also zodb.ΔTail and xbtree.ΔBtail

Concurrency

ΔFtail is safe to use in single-writer / multiple-readers mode. That is at
any time there should be either only sole writer, or, potentially several
simultaneous readers. The table below classifies operations:

    Writers:  Update, ForgetPast
    Readers:  Track + all queries (SliceByRev, SliceByFileRev, BlkRevAt)

Note that, in particular, it is correct to run multiple Track and queries
requests simultaneously.

ΔFtail organization
~~~~~~~~~~~~~~~~~~~

ΔFtail leverages:

    - ΔBtail to track changes to ZBigFile.blktab BTree, and
    - ΔZtail to track changes to ZBlk objects and to ZBigFile object itself.

then every query merges ΔBtail and ΔZtail data on the fly to provide
ZBigFile-level result.

Merging on the fly, contrary to computing and maintaining vδF data, is done
to avoid complexity of recomputing vδF when tracking set changes. Most of
ΔFtail complexity is, thus, located in ΔBtail, which implements BTree diff
and handles complexity of recomputing vδB when set of tracked blocks
changes after new track requests.

Changes to ZBigFile object indicate epochs. Epochs could be:

    - file creation or deletion,
    - change of ZBigFile.blksize,
    - change of ZBigFile.blktab to point to another BTree.

Epochs represent major changes to file history where file is assumed to
change so dramatically, that practically it can be considered to be a
"whole" change. In particular, WCFS, upon seeing a ZBigFile epoch,
invalidates all data in corresponding OS-level cache for the file.

The only historical data, that ΔFtail maintains by itself, is history of
epochs. That history does not need to be recomputed when more blocks become
tracked and is thus easy to maintain. It also can be maintained only in
ΔFtail because ΔBtail and ΔZtail does not "know" anything about ZBigFile.

Concurrency

In order to allow multiple Track and queries requests to be served in
parallel, ΔFtail bases its concurrency promise on ΔBtail guarantees +
snapshot-style access for vδE and ztrackInBlk in queries:

1. Track calls ΔBtail.Track and quickly updates .byFile, .byRoot and
   _RootTrack indices under a lock.

2. BlkRevAt queries ΔBtail.GetAt and then combines retrieved information
   about zblk with vδE and δZ.

3. SliceByFileRev queries ΔBtail.SliceByRootRev and then merges retrieved
   vδT data with vδZ, vδE and ztrackInBlk.

4. In queries vδE is retrieved/built in snapshot style similarly to how vδT
   is built in ΔBtail. Note that vδE needs to be built only the first time,
   and does not need to be further rebuilt, so the logic in ΔFtail is simpler
   compared to ΔBtail.

5. for ztrackInBlk - that is used by SliceByFileRev query - an atomic
   snapshot is retrieved for objects of interest. This allows to hold
   δFtail.mu lock for relatively brief time without blocking other parallel
   Track/queries requests for long.

Combined this organization allows non-overlapping queries/track-requests
to run simultaneously. (This property is essential to WCFS because otherwise
WCFS would not be able to serve several non-overlapping READ requests to one
file in parallel.)

See also "Concurrency" in ΔBtail organization for more details.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some preliminary history:

ef74aebc    X ΔFtail: Keep reference to ZBigFile via Oid, not via *ZBigFile
bf9a7405    X No longer rely on ZODB cache invariant for invalidations
46340069    X found by Random
e7b598c6    X start of ΔFtail.SliceByFileRev rework to function via merging δB and δZ histories on the fly
59c83009    X ΔFtail.SliceByFileRoot tests started to work draftly after "on-the-fly" rework
210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
bf3ace66    X ΔFtail: Rebuild vδE after first track
46624787    X ΔFtail: `go test -failfast -short -v -run Random -randseed=1626793016249041295` discovered problems
786dd336    X Size no longer tracks [0,∞) since we start tracking when zfile is non-empty
4f707117    X test that shows problem of SliceByRootRev where untracked blocks are not added uniformly into whole history
c0b7e4c3    X ΔFtail.SliceByFileRev: Fix untracked entries to be present uniformly in result
aac37c11    X zdata: Introduce T to start removing duplication in tests
bf411aa9    X zdata: Deduplicate zfile loading
b74dda09    X Start switching Track from Track(key) to Track(keycov)
aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
588a512a    X zdata: Switch SliceByFileRev not to clone Zinblk
8b5d8523    X Move tracking of which blocks were accessed from wcfs to ΔFtail
parent 305d897b
...@@ -19,7 +19,7 @@ ...@@ -19,7 +19,7 @@
// Package zdata provides access for wendelin.core in-ZODB data. // Package zdata provides access for wendelin.core in-ZODB data.
// //
// ZBlk* + ZBigFile. // ZBlk* + ZBigFile + ΔFtail for ZBigFile-level ZODB history.
package zdata package zdata
// module: "wendelin.bigfile.file_zodb" // module: "wendelin.bigfile.file_zodb"
......
This diff is collapsed.
This diff is collapsed.
// Copyright (C) 2021 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
package zdata_test
import (
_ "lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/xbtree/xbtreetest/init"
)
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment