-
Kirill Smelkov authored
ΔBtail provides BTree-level history tail that WCFS - via ΔFtail - will use to compute which blocks of a ZBigFile need to be invalidated in OS file cache given raw ZODB changes on ZODB invalidation message. It also will be used by WCFS to implement isolation protocol, where on every FUSE READ request WCFS will query ΔBtail - again via ΔFtail - to find out revision of corresponding file block. Quoting ΔBtail documentation: ---- 8< ---- ΔBtail provides BTree-level history tail. It translates ZODB object-level changes to information about which keys of which BTree were modified, and provides service to query that information. ΔBtail class documentation ~~~~~~~~~~~~~~~~~~~~~~~~~~ ΔBtail represents tail of revisional changes to BTrees. It semantically consists of []δB ; rev ∈ (tail, head] where δB represents a change in BTrees space δB: .rev↑ {} root -> {}(key, δvalue) It covers only changes to keys from tracked subset of BTrees parts. In particular a key that was not explicitly requested to be tracked, even if it was changed in δZ, is not guaranteed to be present in δB. ΔBtail provides the following operations: .Track(path) - start tracking tree nodes and keys; root=path[0], keys=path[-1].(lo,hi] .Update(δZ) -> δB - update BTree δ tail given raw ZODB changes .ForgetPast(revCut) - forget changes ≤ revCut .SliceByRev(lo, hi) -> []δB - query for all trees changes with rev ∈ (lo, hi] .SliceByRootRev(root, lo, hi) -> []δT - query for changes of a tree with rev ∈ (lo, hi] .GetAt(root, key, at) -> (value, rev) - get root[key] @at assuming root[key] ∈ tracked where δT represents a change to one tree δT: .rev↑ {}(key, δvalue) An example for tracked set is a set of visited BTree paths. There is no requirement that tracked set belongs to only one single BTree. See also zodb.ΔTail and zdata.ΔFtail Concurrency ΔBtail is safe to use in single-writer / multiple-readers mode. That is at any time there should be either only sole writer, or, potentially several simultaneous readers. The table below classifies operations: Writers: Update, ForgetPast Readers: Track + all queries (SliceByRev, SliceByRootRev, GetAt) Note that, in particular, it is correct to run multiple Track and queries requests simultaneously. ΔBtail organization ~~~~~~~~~~~~~~~~~~~ ΔBtail keeps raw ZODB history in ΔZtail and uses BTree-diff algorithm(*) to turn δZ into BTree-level diff. For each tracked BTree a separate ΔTtail is maintained with tree-level history in ΔTtail.vδT . Because it is very computationally expensive(+) to find out for an object to which BTree it belongs, ΔBtail cannot provide full BTree-level history given just ΔZtail with δZ changes. Due to this ΔBtail requires help from users, which are expected to call ΔBtail.Track(treepath) to let ΔBtail know that such and such ZODB objects constitute a path from root of a tree to some of its leaf. After Track call the objects from the path and tree keys, that are covered by leaf node, become tracked: from now-on ΔBtail will detect and provide BTree-level changes caused by any change of tracked tree objects or tracked keys. This guarantee can be provided because ΔBtail now knows that such and such objects belong to a particular tree. To manage knowledge which tree part is tracked ΔBtail uses PPTreeSubSet. This data-structure represents so-called PP-connected set of tree nodes: simply speaking it builds on some leafs and then includes parent(leaf), parent(parent(leaf)), etc. In other words it's a "parent"-closure of the leafs. The property of being PP-connected means that starting from any node from such set, it is always possible to reach root node by traversing .parent links, and that every intermediate node went-through during traversal also belongs to the set. A new Track request potentially grows tracked keys coverage. Due to this, on a query, ΔBtail needs to recompute potentially whole vδT of the affected tree. This recomputation is managed by "vδTSnapForTracked*" and "_rebuild" functions and uses the same treediff algorithm, that Update is using, but modulo PPTreeSubSet corresponding to δ key coverage. Update also potentially needs to rebuild whole vδT history, not only append new δT, because a change to tracked tree nodes can result in growth of tracked key coverage. Queries are relatively straightforward code that work on vδT snapshot. The main complexity, besides BTree-diff algorithm, lies in recomputing vδT when set of tracked keys changes, and in handling that recomputation in such a way that multiple Track and queries requests could be all served in parallel. Concurrency In order to allow multiple Track and queries requests to be served in parallel ΔBtail employs special organization of vδT rebuild process where complexity of concurrency is reduced to math on merging updates to vδT and trackSet, and on key range lookup: 1. vδT is managed under read-copy-update (RCU) discipline: before making any vδT change the mutator atomically clones whole vδT and applies its change to the clone. This way a query, once it retrieves vδT snapshot, does not need to further synchronize with vδT mutators, and can rely on that retrieved vδT snapshot will remain immutable. 2. a Track request goes through 3 states: "new", "handle-in-progress" and "handled". At each state keys/nodes of the Track are maintained in: - ΔTtail.ktrackNew and .trackNew for "new", - ΔTtail.krebuildJobs for "handle-in-progress", and - ΔBtail.trackSet for "handled". trackSet keeps nodes, and implicitly keys, from all handled Track requests. For all keys, covered by trackSet, vδT is fully computed. a new Track(keycov, path) is remembered in ktrackNew and trackNew to be further processed when a query should need keys from keycov. vδT is not yet providing data for keycov keys. when a Track request starts to be processed, its keys and nodes are moved from ktrackNew/trackNew into krebuildJobs. vδT is not yet providing data for requested-to-be-tracked keys. all trackSet, trackNew/ktrackNew and krebuildJobs are completely disjoint: trackSet ^ trackNew = ø trackSet ^ krebuildJobs = ø trackNew ^ krebuildJobs = ø 3. when a query is served, it needs to retrieve vδT snapshot that takes related previous Track requests into account. Retrieving such snapshots is implemented in vδTSnapForTracked*() family of functions: there it checks ktrackNew/trackNew, and if those sets overlap with query's keys of interest, run vδT rebuild for keys queued in ktrackNew. the main part of that rebuild can be run without any locks, because it does not use nor modify any ΔBtail data, and for δ(vδT) it just computes a fresh full vδT build modulo retrieved ktrackNew. Only after that computation is complete, ΔBtail is locked again to quickly merge in δ(vδT) update back into vδT. This organization is based on the fact that vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂ ( i.e. vδT computed for tracked set being union of T₁ and T₂ is the same as merge of vδT computed for tracked set T₁ and vδT computed for tracked set T₂ ) and that trackSet | (δPP₁|δPP₂) = (trackSet|δPP₁) | (trackSet|δPP₂) ( i.e. tracking set updated for union of δPP₁ and δPP₂ is the same as union of tracking set updated with δPP₁ and tracking set updated with δPP₂ ) these merge properties allow to run computation for δ(vδT) and δ(trackSet) independently and with ΔBtail unlocked, which in turn enables running several Track/queries in parallel. 4. while vδT rebuild is being run, krebuildJobs keeps corresponding keycov entry to indicate in-progress rebuild. Should a query need vδT for keys from that job, it first waits for corresponding job(s) to complete. Explained rebuild organization allows non-overlapping queries/track-requests to run simultaneously. (This property is essential to WCFS because otherwise WCFS would not be able to serve several non-overlapping READ requests to one file in parallel.) -------- (*) implemented in treediff.go (+) full database scan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some preliminary history: 877e64a9 X wcfs: Fix tests to pass again c32055fc X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø 78f2f88b X wcfs/xbtree: Fix treediff(a, ø) 5324547c X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø) f65f775b X wcfs/xbtree: treediff(ø, b) c75b1c6f X wcfs/xbtree: Start killing holeIdx 0fa06cbd X kadj must be taken into account as kadj^δZ ef5e5183 X treediff ret += δtkeycov f30826a6 X another bug in δtkeyconv computation 0917380e X wcfs: assert that keycov only grow 502e05c2 X found why TestΔBTailAllStructs was not effective to find δtkeycov bugs 450ba707 X Fix rebuild with ø @at2 f60528c9 X ΔBtail.Clone had bug that it was aliasing klon and orig data 9d20f8e8 X treediff: Fix BUG while computing AB coverage ddb28043 X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV 324241eb X rebuild: tests: Don't reflect.DeepEqual in inner loop 8f6e2b1e X rebuild: tests: Don't access ZODB in XGetδKV 2c0b4793 X rebuild: tests: Don't access ZODB in xtrackKeys 8f0e37f2 X rebuild: tests: Precompute kadj10·kadj21 271d953d X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test a87cc6de X rebuild: tests: Don't recompute trackSet(keys1R2) several times 01433e96 X rebuild: tests: Don't compute keyCover in trackSet 7371f9c5 X rebuild: tests: Inline _assertTrack 3e9164b3 X rebuild: tests: Don't exercise keys from keys2 that already became tracked after Track(keys1) + Update e9c4b619 X rebuild: tests: Random testing d0fe680a X δbtail += ForgetPast 210e9b07 X Fix ΔBtail.SliceByRootRev (lo,hi] handling 855ab4b8 X ΔBtail: Goodbye .KVAtTail 2f5582e6 X ΔBtail: Tweak tests to run faster in normal mode cf352737 X random testing found another failing test for rebuild... 7f7e34e0 X wcfs/xbtree: Fix update not to add duplicate extra point if rebuild - called by Update - already added it 6ad0052c X ΔBtail.Track: No need to return error aafcacdf X xbtree: GetAt test 784a6761 X xbtree: Fix KAdj definition after treediff was reworked this summer to base decisions on node keycoverage instead of particular node keys 0bb1c22e X xbtree: Verify that ForgetPast clones vδT on trim a8945cbf X Start reworking rebuild routines not to modify data inplace b74dda09 X Start switching Track from Track(key) to Track(keycov) dea85e87 X Switch GetAt to vδTSnapForTrackedKey aa0288ce X Switch SliceByRootRev to vδTSnapForTracked c4366b14 X xbtree: tests: Also verify state of ΔTtail.ktrackNew b98706ad X Track should be nop if keycov/path is already in krebuildJobs e141848a X test.go ↑ timeout 10m -> 20m
305d897b