• Kirill Smelkov's avatar
    wcfs: xbtree: ΔBtail · 2ab4be93
    Kirill Smelkov authored
    ΔBtail provides BTree-level history tail that WCFS - via ΔFtail - will
    use to compute which blocks of a ZBigFile need to be invalidated in OS
    file cache given raw ZODB changes on ZODB invalidation message.
    It also will be used by WCFS to implement isolation protocol, where on
    every FUSE READ request WCFS will query ΔBtail - again via ΔFtail - to
    find out revision of corresponding file block.
    Quoting ΔBtail documentation:
    ---- 8< ----
    ΔBtail provides BTree-level history tail.
    It translates ZODB object-level changes to information about which keys of
    which BTree were modified, and provides service to query that information.
    ΔBtail class documentation
    ΔBtail represents tail of revisional changes to BTrees.
    It semantically consists of
        []δB			; rev ∈ (tail, head]
    where δB represents a change in BTrees space
        	{} root -> {}(key, δvalue)
    It covers only changes to keys from tracked subset of BTrees parts.
    In particular a key that was not explicitly requested to be tracked, even if
    it was changed in δZ, is not guaranteed to be present in δB.
    ΔBtail provides the following operations:
      .Track(path)	- start tracking tree nodes and keys; root=path[0], keys=path[-1].(lo,hi]
      .Update(δZ) -> δB				- update BTree δ tail given raw ZODB changes
      .ForgetPast(revCut)			- forget changes ≤ revCut
      .SliceByRev(lo, hi) -> []δB		- query for all trees changes with rev ∈ (lo, hi]
      .SliceByRootRev(root, lo, hi) -> []δT	- query for changes of a tree with rev ∈ (lo, hi]
      .GetAt(root, key, at) -> (value, rev)	- get root[key] @at assuming root[key] ∈ tracked
    where δT represents a change to one tree
        	{}(key, δvalue)
    An example for tracked set is a set of visited BTree paths.
    There is no requirement that tracked set belongs to only one single BTree.
    See also zodb.ΔTail and zdata.ΔFtail
    ΔBtail is safe to use in single-writer / multiple-readers mode. That is at
    any time there should be either only sole writer, or, potentially several
    simultaneous readers. The table below classifies operations:
        Writers:  Update, ForgetPast
        Readers:  Track + all queries (SliceByRev, SliceByRootRev, GetAt)
    Note that, in particular, it is correct to run multiple Track and queries
    requests simultaneously.
    ΔBtail organization
    ΔBtail keeps raw ZODB history in ΔZtail and uses BTree-diff algorithm(*) to
    turn δZ into BTree-level diff. For each tracked BTree a separate ΔTtail is
    maintained with tree-level history in ΔTtail.vδT .
    Because it is very computationally expensive(+) to find out for an object to
    which BTree it belongs, ΔBtail cannot provide full BTree-level history given
    just ΔZtail with δZ changes. Due to this ΔBtail requires help from
    users, which are expected to call ΔBtail.Track(treepath) to let ΔBtail know
    that such and such ZODB objects constitute a path from root of a tree to some
    of its leaf. After Track call the objects from the path and tree keys, that
    are covered by leaf node, become tracked: from now-on ΔBtail will detect
    and provide BTree-level changes caused by any change of tracked tree objects
    or tracked keys. This guarantee can be provided because ΔBtail now knows
    that such and such objects belong to a particular tree.
    To manage knowledge which tree part is tracked ΔBtail uses PPTreeSubSet.
    This data-structure represents so-called PP-connected set of tree nodes:
    simply speaking it builds on some leafs and then includes parent(leaf),
    parent(parent(leaf)), etc. In other words it's a "parent"-closure of the
    leafs. The property of being PP-connected means that starting from any node
    from such set, it is always possible to reach root node by traversing
    .parent links, and that every intermediate node went-through during
    traversal also belongs to the set.
    A new Track request potentially grows tracked keys coverage. Due to this,
    on a query, ΔBtail needs to recompute potentially whole vδT of the affected
    tree. This recomputation is managed by "vδTSnapForTracked*" and "_rebuild"
    functions and uses the same treediff algorithm, that Update is using, but
    modulo PPTreeSubSet corresponding to δ key coverage. Update also potentially
    needs to rebuild whole vδT history, not only append new δT, because a
    change to tracked tree nodes can result in growth of tracked key coverage.
    Queries are relatively straightforward code that work on vδT snapshot. The
    main complexity, besides BTree-diff algorithm, lies in recomputing vδT when
    set of tracked keys changes, and in handling that recomputation in such a way
    that multiple Track and queries requests could be all served in parallel.
    In order to allow multiple Track and queries requests to be served in
    parallel ΔBtail employs special organization of vδT rebuild process where
    complexity of concurrency is reduced to math on merging updates to vδT and
    trackSet, and on key range lookup:
    1. vδT is managed under read-copy-update (RCU) discipline: before making
       any vδT change the mutator atomically clones whole vδT and applies its
       change to the clone. This way a query, once it retrieves vδT snapshot,
       does not need to further synchronize with vδT mutators, and can rely on
       that retrieved vδT snapshot will remain immutable.
    2. a Track request goes through 3 states: "new", "handle-in-progress" and
       "handled". At each state keys/nodes of the Track are maintained in:
       - ΔTtail.ktrackNew and .trackNew       for "new",
       - ΔTtail.krebuildJobs                  for "handle-in-progress", and
       - ΔBtail.trackSet                      for "handled".
       trackSet keeps nodes, and implicitly keys, from all handled Track
       requests. For all keys, covered by trackSet, vδT is fully computed.
       a new Track(keycov, path) is remembered in ktrackNew and trackNew to be
       further processed when a query should need keys from keycov. vδT is not
       yet providing data for keycov keys.
       when a Track request starts to be processed, its keys and nodes are moved
       from ktrackNew/trackNew into krebuildJobs. vδT is not yet providing data
       for requested-to-be-tracked keys.
       all trackSet, trackNew/ktrackNew and krebuildJobs are completely disjoint:
        trackSet ^ trackNew     = ø
        trackSet ^ krebuildJobs = ø
        trackNew ^ krebuildJobs = ø
    3. when a query is served, it needs to retrieve vδT snapshot that takes
       related previous Track requests into account. Retrieving such snapshots
       is implemented in vδTSnapForTracked*() family of functions: there it
       checks ktrackNew/trackNew, and if those sets overlap with query's keys
       of interest, run vδT rebuild for keys queued in ktrackNew.
       the main part of that rebuild can be run without any locks, because it
       does not use nor modify any ΔBtail data, and for δ(vδT) it just computes
       a fresh full vδT build modulo retrieved ktrackNew. Only after that
       computation is complete, ΔBtail is locked again to quickly merge in
       δ(vδT) update back into vδT.
       This organization is based on the fact that
        vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂
         ( i.e. vδT computed for tracked set being union of T₁ and T₂ is the
           same as merge of vδT computed for tracked set T₁ and vδT computed
          for tracked set T₂ )
       and that
        trackSet | (δPP₁|δPP₂) = (trackSet|δPP₁) | (trackSet|δPP₂)
        ( i.e. tracking set updated for union of δPP₁ and δPP₂ is the same
          as union of tracking set updated with δPP₁ and tracking set updated
          with δPP₂ )
       these merge properties allow to run computation for δ(vδT) and δ(trackSet)
       independently and with ΔBtail unlocked, which in turn enables running
       several Track/queries in parallel.
    4. while vδT rebuild is being run, krebuildJobs keeps corresponding keycov
       entry to indicate in-progress rebuild. Should a query need vδT for keys
       from that job, it first waits for corresponding job(s) to complete.
    Explained rebuild organization allows non-overlapping queries/track-requests
    to run simultaneously. (This property is essential to WCFS because otherwise
    WCFS would not be able to serve several non-overlapping READ requests to one
    file in parallel.)
    (*) implemented in treediff.go
    (+) full database scan
    Some preliminary history:
    kirr/wendelin.core@877e64a9    X wcfs: Fix tests to pass again
    kirr/wendelin.core@c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
    kirr/wendelin.core@78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
    kirr/wendelin.core@5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
    kirr/wendelin.core@f65f775b    X wcfs/xbtree: treediff(ø, b)
    kirr/wendelin.core@c75b1c6f    X wcfs/xbtree: Start killing holeIdx
    kirr/wendelin.core@0fa06cbd    X kadj must be taken into account as kadj^δZ
    kirr/wendelin.core@ef5e5183    X treediff ret += δtkeycov
    kirr/wendelin.core@f30826a6    X another bug in δtkeyconv computation
    kirr/wendelin.core@0917380e    X wcfs: assert that keycov only grow
    kirr/wendelin.core@502e05c2    X found why TestΔBTailAllStructs was not effective to find δtkeycov bugs
    kirr/wendelin.core@450ba707    X Fix rebuild with ø @at2
    kirr/wendelin.core@f60528c9    X ΔBtail.Clone had bug that it was aliasing klon and orig data
    kirr/wendelin.core@9d20f8e8    X treediff: Fix BUG while computing AB coverage
    kirr/wendelin.core@ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
    kirr/wendelin.core@324241eb    X rebuild: tests: Don't reflect.DeepEqual in inner loop
    kirr/wendelin.core@8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
    kirr/wendelin.core@2c0b4793    X rebuild: tests: Don't access ZODB in xtrackKeys
    kirr/wendelin.core@8f0e37f2    X rebuild: tests: Precompute kadj10·kadj21
    kirr/wendelin.core@271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
    kirr/wendelin.core@a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
    kirr/wendelin.core@01433e96    X rebuild: tests: Don't compute keyCover in trackSet
    kirr/wendelin.core@7371f9c5    X rebuild: tests: Inline _assertTrack
    kirr/wendelin.core@3e9164b3    X rebuild: tests: Don't exercise keys from keys2 that already became tracked after Track(keys1) + Update
    kirr/wendelin.core@e9c4b619    X rebuild: tests: Random testing
    kirr/wendelin.core@d0fe680a    X δbtail += ForgetPast
    kirr/wendelin.core@210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
    kirr/wendelin.core@855ab4b8    X ΔBtail: Goodbye .KVAtTail
    kirr/wendelin.core@2f5582e6    X ΔBtail: Tweak tests to run faster in normal mode
    kirr/wendelin.core@cf352737    X random testing found another failing test for rebuild...
    kirr/wendelin.core@7f7e34e0    X wcfs/xbtree: Fix update not to add duplicate extra point if rebuild  - called by Update - already added it
    kirr/wendelin.core@6ad0052c    X ΔBtail.Track: No need to return error
    kirr/wendelin.core@aafcacdf    X xbtree: GetAt test
    kirr/wendelin.core@784a6761    X xbtree: Fix KAdj definition after treediff was reworked this summer to base decisions on node keycoverage instead of particular node keys
    kirr/wendelin.core@0bb1c22e    X xbtree: Verify that ForgetPast clones vδT on trim
    kirr/wendelin.core@a8945cbf    X Start reworking rebuild routines not to modify data inplace
    kirr/wendelin.core@b74dda09    X Start switching Track from Track(key) to Track(keycov)
    kirr/wendelin.core@dea85e87    X Switch GetAt to vδTSnapForTrackedKey
    kirr/wendelin.core@aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
    kirr/wendelin.core@c4366b14    X xbtree: tests: Also verify state of ΔTtail.ktrackNew
    kirr/wendelin.core@b98706ad    X Track should be nop if keycov/path is already in krebuildJobs
    kirr/wendelin.core@e141848a    X test.go  ↑ timeout  10m -> 20m
    kirr/wendelin.core@423f77be    X wcfs: Goodby holeIdx
    kirr/wendelin.core@37c2e806    X wcfs: Teach treediff to compute not only δtrack (set of nodes), but also δ for track-key coverage
    kirr/wendelin.core@52c72dbb    X ΔBtail.rebuild started to work draftly
    kirr/wendelin.core@c9f13fc7    X Get rebuild tests to run in a sane time; Add proper random-based testing for rebuild
    kirr/wendelin.core@c7f1e3c9    X xbtree: Factor testing infrastructure bits into xbtree/xbtreetest
    kirr/wendelin.core@7602c1f4    ΔBtail concurrency