1. 16 Nov, 2021 4 commits
    • . · 76e8cdce
      Kirill Smelkov authored
    • . · a58e35d8
      Kirill Smelkov authored
    • Merge branch 'master' into t2 · 3eb2e25c
      Kirill Smelkov authored
      * master:
        wcfs: Server.stop: Make sure to remove mount entry even if we had to use FUSE abort
        tests: Don't leak WCFS log files
        tests: Remove test NEO database after test run is over
        nxdtest: Don't run test.go for multiple GOMAXPROCS
        wcfs: Make sure to remove mountpoint directory on Server.stop
        nxdtest: Run WCFS-related tests in verbose mode on testnodes
        setup: Fix egg_info after addition of δbtail.go
    • . · 257018a5
      Kirill Smelkov authored
  2. 15 Nov, 2021 1 commit
    • wcfs: Server.stop: Make sure to remove mount entry even if we had to use FUSE abort · 5f684a49
      Kirill Smelkov authored
      Server.stop currently tries to unmount, and if that fails invokes FUSE
      abort and kills wcfs.go. However it does not call unmount a second
      time after such an abort, so the filesystem remains mounted (in
      ENOTCONN state) and rmdir(mountpoint) fails.
      
      -> Fix it by calling unmount a second time if we had to abort the FUSE
      connection. In that second try use lazy unmounting, because regular
      unmount can still fail with "Device or resource busy" since there
      could still be client file descriptors left pointing to the mounted
      filesystem. With lazy unmounting, the unmount and the followup rmdir
      should, hopefully, always succeed.
      
      Here is an example test run where one test timed out and the FUSE
      connection was aborted, but the filesystem was not unmounted and the
      mountpoint directory was not deleted, which led to all followup tests
      failing in the setup assert that testmountpoint does not exist:
      
      https://nexedijs.erp5.net/#/test_result_module/20211112-1ACEA62D/22
      
      This patch should fix those followup failures + fix another leakage of
      WCFS mounts in real services.
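
      The stop sequence described above can be sketched as follows. This is
      only an illustration and not the actual Server.stop code: stop_mount
      and the unmount, abort and rmdir callables are hypothetical stand-ins
      for the real unmount, FUSE abort and directory removal steps.

```python
def stop_mount(mountpoint, unmount, abort, rmdir):
    """Unmount mountpoint and remove its directory.

    unmount(path, lazy=...), abort(path) and rmdir(path) are hypothetical
    callables supplied by the caller; they are assumptions of this sketch.
    """
    try:
        unmount(mountpoint, lazy=False)   # first, regular attempt
    except OSError:
        abort(mountpoint)                 # force the FUSE connection down
        # The mount entry is still there (ENOTCONN state), so unmount
        # again. Lazy mode tolerates leftover client file descriptors.
        unmount(mountpoint, lazy=True)
    rmdir(mountpoint)                     # finally remove the directory
```

      Passing the steps in as callables keeps the ordering testable without
      a real FUSE mount.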
  3. 12 Nov, 2021 4 commits
    • tests: Don't leak WCFS log files · 54f6e741
      Kirill Smelkov authored
      By default every WCFS run creates several files in /tmp/wcfs.*.log.* and
      without explicit cleanup those files are left hanging on testnodes. Over
      the last ~6 months we accumulated ~300K such files.

      Don't allow those files to be leaked by instructing WCFS to log to
      stderr during test run. This should also be useful to see details in the
      test output.
    • tests: Remove test NEO database after test run is over · 49251408
      Kirill Smelkov authored
      With NEO we were creating the test database in /tmp but we were not deleting
      it at the end. As a result many non-empty /tmp/neo_XXXXXX directories
      were being leaked.

      -> Fix it by creating the testdb directory ourselves and removing it at the
      end, similarly to FileStorage and ZEO.
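
      A minimal sketch of that create-then-remove pattern, assuming a
      hypothetical ScratchDB helper (the real test fixtures are organized
      differently):

```python
import os
import shutil
import tempfile


class ScratchDB:
    """Test database directory that is removed on teardown (hypothetical helper)."""

    def setup(self):
        # create the directory ourselves so that we know what to delete later
        self.path = tempfile.mkdtemp(prefix="testdb_")
        return self.path

    def teardown(self):
        # remove the directory together with whatever the storage left inside
        if os.path.isdir(self.path):
            shutil.rmtree(self.path)
```

      Because the test owns the directory, cleanup does not depend on the
      storage removing its own files.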
      
      Fixes: 7fc4ec66 (tests: Allow to test with ZEO & NEO ZODB storages)
    • nxdtest: Don't run test.go for multiple GOMAXPROCS · 45178531
      Kirill Smelkov authored
      We run tests with different GOMAXPROCS because some WCFS bugs are only
      likely to trigger when there is only 1 or 2 main OS thread(s) in WCFS.
      
      However test.go does not exercise filesystem functionality - it runs
      unit tests for ZBlk decoding, ΔBtail and similar. At the same time
      test.go:* currently takes ~50% of the time needed to run the full
      testsuite, with the main consumer being ΔBtail random testing.

      -> Run test.go only once. This should save ~1000s for each run and
      lower the time to run the whole wendelin.core testsuite on a testnode
      from ~60 to ~40 minutes.
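
      The resulting test matrix can be pictured with the sketch below. The
      run names and the set of GOMAXPROCS values are assumptions made for
      illustration; this is not the actual .nxdtest file.

```python
def build_matrix(gomaxprocs=("", "1", "2")):
    """Return (name, env) pairs describing the test runs.

    An empty GOMAXPROCS value means "use the default". Names and values
    are illustrative assumptions.
    """
    runs = []
    for n in gomaxprocs:
        suffix = ":" + n if n else ""
        env = {"GOMAXPROCS": n} if n else {}
        # filesystem-level tests depend on how many OS threads WCFS has
        runs.append(("test.py/wcfs" + suffix, env))
    # pure unit tests do not exercise the filesystem: run them once
    runs.append(("test.go", {}))
    return runs
```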
    • wcfs: Make sure to remove mountpoint directory on Server.stop · d2fd8b77
      Kirill Smelkov authored
      Else every time test.py/wcfs is run several empty directories are left
      in /dev/shm/wcfs - each corresponding to a WCFS server that was
      automatically spawned and stopped at the end of the test. Over time this
      adds up to a large number: e.g. ~20000 such directories were left on
      the testnode during the last 6 months.
  4. 09 Nov, 2021 2 commits
    • nxdtest: Run WCFS-related tests in verbose mode on testnodes · 5c13cc82
      Kirill Smelkov authored
      These are the early days of WCFS - if WCFS gets stuck for one reason or
      another we want full details, which might not be available in the default
      configuration. See added comments for details.
    • setup: Fix egg_info after addition of δbtail.go · d07824dc
      Kirill Smelkov authored
      `python setup.py egg_info` stopped working after we added non-ASCII
      files, e.g. δbtail.go in 2ab4be93 (wcfs: xbtree: ΔBtail) and δftail.go
      in f980471f (wcfs: zdata: ΔFtail):
      
          (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/nexedi/wendelin.core$ python setup.py egg_info
          running egg_info
          writing requirements to wendelin.core.egg-info/requires.txt
          writing wendelin.core.egg-info/PKG-INFO
          writing top-level names to wendelin.core.egg-info/top_level.txt
          writing dependency_links to wendelin.core.egg-info/dependency_links.txt
          writing entry points to wendelin.core.egg-info/entry_points.txt
          package init file '__init__.py' not found (or not a regular file)
          /usr/lib/python2.7/distutils/filelist.py:64: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
            sortable_files.sort()
          Traceback (most recent call last):
            File "setup.py", line 416, in <module>
              """.splitlines()]
            File "/home/kirr/src/tools/go/pygolang/golang/pyx/build.py", line 118, in setup
              setuptools_dso.setup(**kw)
            File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/setuptools_dso/__init__.py", line 37, in setup
              _setup(**kws)
            File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/setuptools/__init__.py", line 162, in setup
              return distutils.core.setup(**attrs)
            File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
              dist.run_commands()
            File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
              self.run_command(cmd)
            File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
              cmd_obj.run()
            File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/setuptools/command/egg_info.py", line 296, in run
              self.find_sources()
            File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/setuptools/command/egg_info.py", line 303, in find_sources
              mm.run()
            File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/setuptools/command/egg_info.py", line 538, in run
              self.filelist.sort()
            File "/usr/lib/python2.7/distutils/filelist.py", line 64, in sort
              sortable_files.sort()
          UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
      
      This happens because by default setuptools collects filenames as str, not
      unicode, while our git_lsfiles - also registered into the setuptools.file_finders
      entrypoint - collects filenames as unicode. Previously everything was working
      because there were no non-ASCII filenames, and so unicode vs str coercion worked
      automatically. But now that there is a filename like 'δbtail.go', it stopped
      working and raises UnicodeDecodeError.
      
      -> Fix it by adjusting git_lsfiles to collect filenames as UTF-8 encoded
      strings instead of unicode.
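
      The idea of the fix can be illustrated with the sketch below. It is
      written for Python 3, where the byte-string type is bytes, and as_utf8
      is a hypothetical helper rather than the actual git_lsfiles code. Once
      every filename is a UTF-8 byte string, sorting no longer needs any
      implicit unicode coercion.

```python
def as_utf8(names):
    """Return names with text entries encoded to UTF-8 byte strings."""
    out = []
    for name in names:
        if not isinstance(name, bytes):
            name = name.encode("utf-8")   # text -> UTF-8 bytes
        out.append(name)
    return out


# mix of text and byte-string filenames, as two different finders may return
files = as_utf8([u"setup.py", u"δbtail.go", b"wcfs.go"])
files.sort()   # all entries have the same type, so comparison is well defined
```

      The first byte of the encoded 'δbtail.go' is 0xce, which is the byte
      the ascii codec complained about in the traceback above.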
  5. 08 Nov, 2021 8 commits
    • Merge branch 'master' into t2 · eeb7a544
      Kirill Smelkov authored
      * master: (40 commits)
        fixup! wcfs: Handle ZODB invalidations
        wcfs/internal/mm: Complete the package
        fixup! wcfs: client: Provide client package to care about isolation protocol details
        lib/zodb: zconn_at: Fix how ZODB4 is asserted to be patched
        lib/zodb: zstor_2zurl: Explicitly reject MappingStorage
        bigfile/zodb: Teach ZBigFile backend to use WCFS
        wcfs: client: Provide virtmem integration
        wcfs: client: Add wczsync package to maintain WCFS connection in sync to ZODB connection
        lib/zodb: Teach zconn_at to work on ZODB4
        lib/zodb: Add ZODB.Connection.onShutdownCallback
        lib/zodb: Teach Connection.onResyncCallback to work on ZODB4
        bigfile/py: Allow PyBigFile backend to expose "mmap overlay" functionality
        bigfile/virtmem: Introduce "mmap overlay" mode
        wcfs: client: Provide client package to care about isolation protocol details
        wcfs: Provide isolation to clients
        wcfs: Handle ZODB invalidations
        wcfs: Add FileSock FUSE utility
        wcfs: zdata: ΔFtail
        wcfs: xbtree: ΔBtail
        wcfs: xbtree: BTree-diff algorithm
        ...
    • . · c0199fcd
      Kirill Smelkov authored
    • fixup! wcfs: Handle ZODB invalidations · 083251b3
      Kirill Smelkov authored
      Fix last-minute error that crept in during
      kirr/wendelin.core@4af54da9 :
      
          (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs$ go test
          # lab.nexedi.com/nexedi/wendelin.core/wcfs
          ./wcfs.go:957:4: Errorf format %s has arg sk of wrong type *lab.nexedi.com/nexedi/wendelin.core/wcfs.FileSock
      
      Amends 4430de41.
    • . · 739ebf54
      Kirill Smelkov authored
    • . · b7010b29
      Kirill Smelkov authored
    • wcfs/internal/mm: Complete the package · 482b1a10
      Kirill Smelkov authored
      Add two functions, that were developed during wendelin.core 2 α, to the
      package for completeness:
      
      - map_zero_into_ro complements map_zero_ro, but mmaps into user-provided buffer.
      - sync calls msync on the provided memory.
    • fixup! wcfs: client: Provide client package to care about isolation protocol details · a49d737e
      Kirill Smelkov authored
      Remove outdated TODO because test_wcfs_watch_before_create passes these
      days. It was fixed after ΔFtail was taught about epochs and the fix was
      reflected in kirr/wendelin.core@63ae8326.
      
      Amends 10f7153a.
    • lib/zodb: zconn_at: Fix how ZODB4 is asserted to be patched · fc0445c8
      Kirill Smelkov authored
      Fix how unpatched ZODB4 is reported to lack required patch:
      
      Before:
      
          Traceback (most recent call last):
            File "/home/kirr/src/wendelin/wendelin.core/lib/tests/test_zodb.py", line 251, in test_zconn_at
              assert zconn_at(conn1) == at0
            File "/home/kirr/src/wendelin/wendelin.core/lib/zodb.py", line 162, in zconn_at
              assert 'conn:MVCC-via-loadBefore-only' in ZODB.nxd_patches, \
          AttributeError: 'module' object has no attribute 'nxd_patches'
      
      After:
      
          Traceback (most recent call last):
            File "/home/kirr/src/wendelin/wendelin.core/lib/tests/test_zodb.py", line 251, in test_zconn_at
              assert zconn_at(conn1) == at0
            File "/home/kirr/src/wendelin/wendelin.core/lib/zodb.py", line 163, in zconn_at
              "nexedi/ZODB!1")
            File "/home/kirr/src/wendelin/wendelin.core/lib/zodb.py", line 191, in _zassertHasNXDPatch
              (zmajor, patch, details_link))
          AssertionError: ZODB4 is not patched with required Nexedi patch 'conn:MVCC-via-loadBefore-only'
                  See nexedi/ZODB!1 for details
      
      Fixes 1f866c00 (lib/zodb: Teach zconn_at to work on ZODB4).
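
      The difference between the two tracebacks can be sketched as follows.
      zassert_has_nxd_patch and its signature are assumptions made for this
      illustration; only the message wording follows the "After" output
      above. Looking the attribute up with a default turns the confusing
      AttributeError into an explicit AssertionError.

```python
def zassert_has_nxd_patch(zodb_module, zmajor, patch, details_link):
    """Assert that zodb_module carries the named Nexedi patch.

    The signature is an assumption made for this sketch.
    """
    # getattr with a default avoids AttributeError on an unpatched ZODB
    patches = getattr(zodb_module, "nxd_patches", set())
    if patch not in patches:
        raise AssertionError(
            "ZODB%s is not patched with required Nexedi patch %r\n\tSee %s for details"
            % (zmajor, patch, details_link))
```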
  6. 28 Oct, 2021 21 commits
    • lib/zodb: zstor_2zurl: Explicitly reject MappingStorage · fe9c46c9
      Kirill Smelkov authored
      It is not possible for WCFS to access data of in-RAM storage of another
      process. But without explicit explanation the error message is confusing
      - it was something like:
      
          NotImplementedError: don't know how to extract zurl from <ZODB.MappingStorage.MappingStorage object at 0x7f28f04cea10>
      
      which suggests it was just not implemented.
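
      A simplified sketch of such an explicit rejection is shown below. The
      two storage classes are local stand-ins for the ZODB ones, and the
      exception type and message text are assumptions, not the wording used
      in the patch.

```python
class MappingStorage:      # stand-in for ZODB.MappingStorage.MappingStorage
    pass


class FileStorage:         # stand-in for ZODB.FileStorage.FileStorage
    def __init__(self, path):
        self.path = path


def zstor_2zurl(zstor):
    """Return a URL for zstor; a simplified sketch of the dispatch."""
    if isinstance(zstor, FileStorage):
        return "file://" + zstor.path
    if isinstance(zstor, MappingStorage):
        # say explicitly why this cannot work instead of "don't know how"
        raise NotImplementedError(
            "in-RAM storage of another process cannot be accessed by WCFS: %r" % (zstor,))
    raise NotImplementedError("don't know how to extract zurl from %r" % (zstor,))
```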
    • bigfile/zodb: Teach ZBigFile backend to use WCFS · c5e18c74
      Kirill Smelkov authored
      By using WCFS as mmap-overlay for base data(*). WCFS-mode is still opt-in,
      with the default remaining the old full user-space virtual memory manager
      mode as initially introduced in 2015.

      Wendelin.core should now be usable in WCFS mode, in draft form.
      
      This patch is organized as follows:
      
      - file_zodb.cpp provides mmap-overlay operations for WCFS implemented via
        WCFS client library.
      - file_zodb.py is adjusted accordingly to use WCFS if requested.
        Low-level things specific to gluing to file_zodb.cpp are moved to _file_zodb.pyx.
      - the rest of the changes are driven by the main ones.

      (*) see the following patches for what mmap-overlay is:
      
      - fae045cc  (bigfile/virtmem: Introduce "mmap overlay" mode)
      - 23362204  (bigfile/py: Allow PyBigFile backend to expose "mmap overlay" functionality)
      
      Some preliminary history:
      
      kirr/wendelin.core@01916f09    X Draft demo that reading data through wcfs works
      kirr/wendelin.core@fd58082a    X Fix build on old GCC
      kirr/wendelin.core@f622e751    X tests: Stop wcfs spawned during tests
      kirr/wendelin.core@f118617b    X tests: Don't try to stop wcfs that is already exited
    • wcfs: client: Provide virtmem integration · 986cf86e
      Kirill Smelkov authored
      Provide integration with virtmem, so that WCFS Mapping can be associated
      and managed under virtmem VMA. In other words provide support so that WCFS can
      be used as ZBigFile backend in "mmap overlay" mode (see fae045cc "bigfile/virtmem:
      Introduce "mmap overlay" mode" for description of mmap-overlay mode).
      
      We'll need this functionality for ZBigFile + WCFS client integration.
      
      Virtmem integration will be tested via running whole wendelin.core functional
      testsuite in wcfs-mode after the next patch.
      
      Quoting added description:
      
      ---- 8< ----
      
      Integration with wendelin.core virtmem layer
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      This client package can be used standalone, but additionally provides
      integration with wendelin.core userspace virtual memory manager: when a
      Mapping is created, it can be associated as serving base layer for a
      particular virtmem VMA via FileH.mmap(vma=...). In that case, since virtmem
      itself adds another layer of dirty pages over read-only base provided by
      Mapping(+)
      
                       ┌──┐                      ┌──┐
                       │RW│                      │RW│    ← virtmem VMA dirty pages
                       └──┘                      └──┘
                                 +
                                                         VMA base = X@at view provided by Mapping:
      
                                                ___        /@revA/bigfile/X
              __                                           /@revB/bigfile/X
                     _                                     /@revC/bigfile/X
                                 +                         ...
           ───  ───── ──────────────────────────   ─────   /head/bigfile/X
      
      the Mapping will interact with virtmem layer to coordinate
      updates to mapping virtual memory.
      
      How it works
      ~~~~~~~~~~~~
      
      Wcfs client integrates with virtmem layer to let virtmem handle
      dirtying pages of the read-only base layer that wcfs client provides via
      isolated Mapping. For wcfs-backed bigfiles every virtmem VMA is interlinked
      with Mapping:
      
            VMA     -> BigFileH -> ZBigFile -----> Z
             ↑↓                                    O
           Mapping  -> FileH    -> wcfs server --> DB
      
      When a page is write-accessed, virtmem mmaps in a page of RAM in place of
      accessed virtual memory, copies base-layer content provided by Mapping into
      there, and marks that page as read-write.
      
      Upon receiving pin message, the pinner consults virtmem, whether
      corresponding page was already dirtied in virtmem's BigFileH (call to
      __fileh_page_isdirty), and if it was, the pinner does not remmap Mapping
      part to wcfs/@revX/f and just leaves dirty page in its place, remembering
      pin information in fileh._pinned.
      
      Once dirty pages are no longer needed (either after discard/abort or
      writeout/commit), virtmem asks wcfs client to remmap corresponding regions
      of Mapping in its place again via calls to Mapping.remmap_blk for previously
      dirtied blocks.
      
      The scheme outlined above does not need to split Mapping upon dirtying an
      inner page.
      
      See bigfile_ops interface (wendelin/bigfile/file.h) that explains base-layer
      and overlaying from virtmem point of view. For wcfs this interface is
      provided by small wcfs client wrapper in bigfile/file_zodb.cpp.
      
      (+) see bigfile_ops interface (wendelin/bigfile/file.h) that gives virtmem
          point of view on layering.
      
      ----------------------------------------
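
      The pin handling from "How it works" above can be modelled with the
      toy below. ToyFileH and its attributes are illustrative names only;
      the real client is written in C++ and remmaps memory instead of
      updating dictionaries.

```python
class ToyFileH:
    """Toy model of pin handling for one file; all names are illustrative."""

    def __init__(self):
        self.dirty = set()     # blocks currently dirtied in virtmem
        self.pinned = {}       # blk -> rev, as remembered from pin messages
        self.mapped = {}       # blk -> rev actually mapped; absent = head

    def on_pin(self, blk, rev):
        self.pinned[blk] = rev             # always remember the pin
        if blk not in self.dirty:          # a dirty page stays in place
            self.mapped[blk] = rev

    def on_clean(self, blk):
        # dirty page is no longer needed: map the base layer back in
        self.dirty.discard(blk)
        if blk in self.pinned:
            self.mapped[blk] = self.pinned[blk]
        else:
            self.mapped.pop(blk, None)
```

      In this model a pin that arrives while a block is dirty is only
      recorded, and takes effect once the dirty page is discarded or
      written out.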
      
      Some preliminary history:
      
      kirr/wendelin.core@f330bd2f    X wcfs/client: Overview += interaction with virtmem layer
    • wcfs: client: Add wczsync package to maintain WCFS connection in sync to ZODB connection · e11edc70
      Kirill Smelkov authored
      For ZBigFile + WCFS client integration we'll need to open WCFS
      connections that observe the database at the same state as the current ZODB
      connection. Later that WCFS connection needs to adjust its on-WCFS view
      in accordance with how the ZODB connection adjusts its own.
      
      Wczsync provides a function to do so: pywconnOf(zconn) will open WCFS
      connection and maintain it in sync with ZODB connection zconn.
      
      Some preliminary history:
      
      kirr/wendelin.core@8bf8f23b    X bigfile/_file_zodb: Fix logic around ZSync usage
      kirr/wendelin.core@571cb737    fixup! X bigfile/_file_zodb: Fix logic around ZSync usage
      kirr/wendelin.core@a9a82d5a    X bigfile/_file_zodb: Fix ZSync to close not only wconn, but also wconn.wc through which wconn was created
      kirr/wendelin.core@cf92937f    X wcfs: Move wconn<->zconn sync functionality into wcfs.client._wczsync
      kirr/wendelin.core@7203d7ab    X wcfs: Fix ZSync to close wconn on zdb.close, even if zconn stays alive
    • lib/zodb: Teach zconn_at to work on ZODB4 · 1f866c00
      Kirill Smelkov authored
      In 3bd82127 (lib/zodb: Add zconn_at draft (ZODB5 only)) we added
      zconn_at function to find out as of which state a ZODB connection is
      viewing the database. That was ZODB5-only however.
      
      Let's add support for ZODB4 now - by requiring ZODB4-wc2 - a version of
      ZODB4 with MVCC backported from ZODB5: nexedi/ZODB!1
      
      This makes wendelin.core work on either ZODB5 or ZODB4-wc2, but not
      plain ZODB4. However as zconn_at will be used only for WCFS-integration,
      non-wcfs mode will continue to work on all of ZODB5, ZODB4-wc2 and plain
      ZODB4.
      
      ZBigFile + WCFS client integration will use zconn_at to open WCFS
      connection that corresponds to ZODB connection.
      
      Preliminary history:
      
      kirr/wendelin.core@1c3b7750    X zconn_at for ZODB4
    • lib/zodb: Add ZODB.Connection.onShutdownCallback · 1dba3a9a
      Kirill Smelkov authored
      Add patch to ZODB.Connection to support a callback after the database is
      closed. ZBigFile + WCFS client integration will use this callback to
      close the WCFS connection when the corresponding ZODB.DB is closed.
      
      Preliminary history:
      
      kirr/wendelin.core@a26d9659    X lib/zodb: Connection += onShutdownCallback
    • lib/zodb: Teach Connection.onResyncCallback to work on ZODB4 · ceadfcc7
      Kirill Smelkov authored
      In 959ae2d0 (lib/zodb: Add patch to ZODB.Connection to support callback
      on connection DB view change) we added patch for ZODB.Connection to
      support callback when database view of the connection changes. At that
      time the patch was working for ZODB5 and ZODB4 was TODO.
      Let's add support for ZODB4 (both ZODB4 and ZODB4-wc2) now.
      
      As a reminder: ZBigFile + WCFS client integration will use this callback
      to keep WCFS connection in sync with ZODB connection.
      
      Preliminary history:
      
      kirr/wendelin.core@533a4cfa     X onResyncCallback for ZODB4
    • bigfile/py: Allow PyBigFile backend to expose "mmap overlay" functionality · 23362204
      Kirill Smelkov authored
      This patch logically continues previous change `bigfile/virtmem:
      Introduce "mmap overlay" mode` and exposes mmap-overlay functionality to
      Python: if PyBigFile backend provides .blkmmapper PyCapsule the
      mmap-related methods will be extracted from it and passed on through to
      virtmem - see _bigfile.h for details.
      
      ZBigFile will use this to hook into using WCFS.
    • bigfile/virtmem: Introduce "mmap overlay" mode · fae045cc
      Kirill Smelkov authored
      with the intention to later use WCFS through it.
      
      Before this patch virtmem had only one mode: a BigFile backend was
      providing loadblk and storeblk methods, and on every block access
      loadblk was called to load block data into allocated RAM page.
      
      However with WCFS virtmem won't need to do anything to load data -
      because loading from head/bigfile/f mmaped through the OS will be handled by
      the OS directly. Thus for wcfs, that leaves virtmem only to handle dirtying
      and writeout.
      
      -> Introduce "mmap overlay" mode into virtmem to handle WCFS-like
      BigFile backends - that can provide read-only base layer suitable for
      mmapping.
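
      The split of responsibilities can be pictured with the toy model
      below: a read-only base layer plus a layer of dirty pages that shadow
      it. OverlayView is a hypothetical illustration in plain Python; the
      real virtmem works on memory mappings in C.

```python
class OverlayView:
    """Toy model of a read-only base with a layer of dirty pages on top."""

    def __init__(self, base, blksize):
        self.base = base          # read-only base layer (bytes)
        self.blksize = blksize
        self.dirty = {}           # blk -> bytearray with modified content

    def _base_blk(self, blk):
        return self.base[blk * self.blksize:(blk + 1) * self.blksize]

    def read(self, blk):
        # a dirty page, if any, shadows the base layer
        if blk in self.dirty:
            return bytes(self.dirty[blk])
        return self._base_blk(blk)

    def write(self, blk, offset, data):
        # the first write copies base content into a private read-write page
        if blk not in self.dirty:
            self.dirty[blk] = bytearray(self._base_blk(blk))
        self.dirty[blk][offset:offset + len(data)] = data

    def discard(self):
        # after abort or writeout the base layer shows through again
        self.dirty.clear()
```

      Reads of clean blocks never touch the overlay, which corresponds to
      the OS serving them directly from the base mapping.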
      
      This patch is organized as follows:
      
      - fileh_open gains a flags argument to indicate which mode to use for the
        opened fileh. BigFileH gains a .mmap_overlay bitfield correspondingly.
        (virtmem.h)
      
      - struct bigfile_ops is extended with 3 optional methods that a BigFile
        backend might provide to support mmap-overlay mode:
      
        * mmap_setup_read,
        * remmap_blk_read, and
        * munmap
      
        (see file.h changes for documentation of this new interface)
      
      - if opened with MMAP_OVERLAY flag, virtmem uses those methods to
        organize VMA views backed by read-only base mmap layer and writeout
        for such VMAs (virtmem.c)
      
      - a test is added to exercise MMAP_OVERLAY virtmem mode (test_virtmem.c)
      
      - everything else, including bigfile.py, is switched to use
        DONT_MMAP_OVERLAY unconditionally for now.
      
      In internal comments inside virtmem the new mode is interchangeably called
      "mmap overlay" and "wcfs", even though wcfs is not yet hooked up to be used
      for mmap-overlaying.
      
      Some preliminary history:
      
      kirr/wendelin.core@fb6932a2    X Split PAGE_LOADED -> PAGE_LOADED, PAGE_LOADED_FOR_WRITE
      kirr/wendelin.core@4a20a573    X Settled on what should happen after writeout for wcfs case
      kirr/wendelin.core@f084ff9b    X Transition to all VMA under 1 fileh to be either all based on wcfs or all based on !wcfs
    • wcfs: client: Provide client package to care about isolation protocol details · 10f7153a
      Kirill Smelkov authored
      This patch follows up on the previous patch, which added the server-side part of
      isolation protocol handling, and adds a client package that takes care of
      WCFS isolation protocol details and provides to clients a simple interface to
      an isolated view of bigfile data on WCFS similar to regular files: given a
      particular revision of database @at, it provides synthetic read-only bigfile
      memory mappings with data corresponding to @at state, but using /head/bigfile/*
      most of the time to build and maintain the mappings.
      
      The patch is organized as follows:
      
      - wcfs.h and wcfs.cpp bring in usage documentation, internal overview and the
        main part of the implementation.

      - wcfs/client/client_test.py contains the tests.
      
      - The rest of the changes in wcfs/client/ are to support the implementation and tests.
      
      Quoting package documentation for reference:
      
      ---- 8< ----
      
      Package wcfs provides WCFS client.
      
      This client package takes care about WCFS isolation protocol details and
      provides to clients simple interface to isolated view of bigfile data on
      WCFS similar to regular files: given a particular revision of database @at,
      it provides synthetic read-only bigfile memory mappings with data
      corresponding to @at state, but using /head/bigfile/* most of the time to
      build and maintain the mappings.
      
      For its data a mapping to bigfile X mostly reuses kernel cache for
      /head/bigfile/X with amount of data not associated with kernel cache for
      /head/bigfile/X being proportional to δ(bigfile/X, at..head). In the usual
      case where many client workers simultaneously serve requests, their database
      views are a bit outdated, but close to head, which means that in practice
      the kernel cache for /head/bigfile/* is being used almost 100% of the time.
      
      A mapping for bigfile X@at is built from OS-level memory mappings of
      on-WCFS files as follows:
      
                                                ___        /@revA/bigfile/X
              __                                           /@revB/bigfile/X
                     _                                     /@revC/bigfile/X
                                 +                         ...
           ───  ───── ──────────────────────────   ─────   /head/bigfile/X
      
      where @revR mmaps are being dynamically added/removed by this client package
      to maintain X@at data view according to WCFS isolation protocol(*).
      
      API overview
      
       - `WCFS` represents filesystem-level connection to wcfs server.
       - `Conn` represents logical connection that provides view of data on wcfs
         filesystem as of particular database state.
       - `FileH` represents isolated file view under Conn.
       - `Mapping` represents one memory mapping of FileH.
      
      A path from WCFS to Mapping is as follows:
      
       WCFS.connect(at)                    -> Conn
       Conn.open(foid)                     -> FileH
       FileH.mmap([blk_start +blk_len))    -> Mapping
      
      A connection can be resynced to another database view via Conn.resync(at').
      
      Documentation for classes provides more thorough overview and API details.
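
      The call path above can be mirrored by the in-memory toy below. It
      only reproduces the shape of the API (WCFS.connect, Conn.open,
      FileH.mmap, Conn.resync); the history dictionary, the blksize argument
      and the read method are assumptions of this sketch, and the real
      client is a C++ library that works with OS-level memory mappings.

```python
class Mapping:
    def __init__(self, fileh, blk_start, blk_len):
        self.fileh, self.blk_start, self.blk_len = fileh, blk_start, blk_len

    def read(self):
        # data is looked up at access time, so it follows Conn.resync
        bs = self.fileh.conn.wc.blksize
        data = self.fileh.data()
        return data[self.blk_start * bs:(self.blk_start + self.blk_len) * bs]


class FileH:
    def __init__(self, conn, foid):
        self.conn, self.foid = conn, foid

    def data(self):
        # latest revision of the file that is not newer than conn.at
        history = self.conn.wc.history[self.foid]
        return max((rev, d) for rev, d in history if rev <= self.conn.at)[1]

    def mmap(self, blk_start, blk_len):
        return Mapping(self, blk_start, blk_len)


class Conn:
    def __init__(self, wc, at):
        self.wc, self.at = wc, at

    def open(self, foid):
        return FileH(self, foid)

    def resync(self, at):
        self.at = at          # switch the view to another database state


class WCFS:
    def __init__(self, history, blksize):
        # history: {foid: [(rev, bytes), ...]} - toy stand-in for the database
        self.history, self.blksize = history, blksize

    def connect(self, at):
        return Conn(self, at)
```

      In the toy, a Mapping created at one revision shows the newer data
      after Conn.resync, which is the behaviour the isolation protocol
      provides for real mappings.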
      
      --------
      
      (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
      
      .
      
      Wcfs client organization
      ~~~~~~~~~~~~~~~~~~~~~~~~
      
      Wcfs client provides to its users isolated bigfile views backed by data on
      WCFS filesystem. In the absence of Isolation property, wcfs client would
      reduce to just directly using OS-level file wcfs/head/f for a bigfile f. On
      the other hand there is a simple, but inefficient, way to support isolation:
      for @at database view of bigfile f - directly use OS-level file wcfs/@at/f.
      The latter works, but is very inefficient because OS-cache for f data is not
      shared in between two connections with @at1 and @at2 views. The cache is
      also lost when connection view of the database is resynced on transaction
      boundary. To support isolation efficiently, wcfs client uses wcfs/head/f
      most of the time, but injects wcfs/@revX/f parts into mappings to maintain
      f@at view driven by pin messages that wcfs server sends to client in
      accordance to WCFS isolation protocol(*).
      
      Wcfs server sends pin messages synchronously triggered by access to mmaped
      memory. That means that a client thread, that is accessing wcfs/head/f mmap,
      is completely blocked while wcfs server sends pins and waits to receive acks
      from all clients. In other words on-client handling of pins has to be done
      in separate thread, because wcfs server can also send pins to client that
      triggered the access.
      
      Wcfs client implements pins handling in so-called "pinner" thread(+). The
      pinner thread receives pin requests from wcfs server via watchlink handle
      opened through wcfs/head/watch. For every pin request the pinner finds
      corresponding Mappings and injects wcfs/@revX/f parts via Mapping._remmapblk
      appropriately.
      
      The same watchlink handle is used to send client-originated requests to wcfs
      server. The requests are sent to tell wcfs that client wants to observe a
      particular bigfile as of particular revision, or to stop watching it.
      Such requests originate from regular client threads - not pinner - via entry
      points like Conn.open, Conn.resync and FileH.close.
      
      Every FileH maintains fileh._pinned {} with currently pinned blk -> rev. This
      dict is updated by pinner driven by pin messages, and is used when
      new fileh Mapping is created (FileH.mmap).
      
      In wendelin.core a bigfile has semantic that it is infinite in size and
      reads as all zeros beyond region initialized with data. Memory-mapping of
      OS-level files can also go beyond file size, however accessing memory
      corresponding to file region after file.size triggers SIGBUS. To preserve
      wendelin.core semantic wcfs client mmaps-in zeros for Mapping regions after
      wcfs/head/f.size. For simplicity it is assumed that bigfiles only grow and
      never shrink. It is indeed currently so, but will have to be revisited
      if/when wendelin.core adds bigfile truncation. Wcfs client restats
      wcfs/head/f at every transaction boundary (Conn.resync) and remembers f.size
      in FileH._headfsize for use during one transaction(%).
      
      --------
      
      (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
      (+) currently, for simplicity, there is one pinner thread for each connection.
          In the future, for efficiency, it might be reworked to be one pinner thread
          that serves all connections simultaneously.
      (%) see _headWait comments on how this has to be reworked.
      
      Wcfs client locking organization
      
      Wcfs client needs to synchronize regular user threads vs each other and vs
      pinner. A major lock Conn.atMu protects changes to Conn's view of
      the database. Whenever atMu.W is taken - Conn.at is changing (Conn.resync),
      and, conversely, whenever atMu.R is taken - Conn.at is stable (roughly speaking
      Conn.resync is not running).
      
      Similarly to wcfs.go(*) several locks that protect internal data structures
      are minor to Conn.atMu - they need to be taken only under atMu.R (to
      synchronize e.g. multiple fileh open running simultaneously), but do not
      need to be taken at all if atMu.W is taken. In data structures such locks
      are noted as follows
      
           sync::Mutex xMu;    // atMu.W  |  atMu.R + xMu
      
      After atMu, Conn.filehMu protects registry of opened file handles
      (Conn._filehTab), and FileH.mmapMu protects registry of created Mappings
      (FileH.mmaps) and FileH.pinned.
      
      Several locks are RWMutex instead of just Mutex not only to allow more
      concurrency but, first and foremost, for correctness: the pinner thread, being
      a core element in handling WCFS isolation protocol, is effectively invoked
      synchronously from other threads via messages coming through wcfs server.
      For example Conn.resync sends watch request to wcfs server and waits for the
      answer. Wcfs server, in turn, might send corresponding pin messages to the
      pinner and _wait_ for the answer before answering to resync:
      
             - - - - - -
            |       .···|·····.        ---->   = request
               pinner <------.↓        <····   = response
            |           |   wcfs
               resync -------^↓
            |      `····|·····
             - - - - - -
            client process
      
      This creates the necessity to use RWMutex for locks that pinner and other
      parts of the code could be using at the same time in synchronous scenarios
      similar to the above. These locks are:
      
           - Conn.atMu
           - Conn.filehMu
      
      Note that FileH.mmapMu is regular - not RW - mutex, since nothing in wcfs
      client calls into wcfs server via watchlink with mmapMu held.
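
      The synchronous scenario from the diagram can be simulated with a small
      readers-writer lock. This is a simplified sketch under stated assumptions:
      RWMutex here is a hypothetical helper written for the example (not the
      client's sync::RWMutex), the three parties are modelled with threads and
      queues, and the client thread stands for any entry point that holds the
      lock in read mode while waiting for wcfs.

```python
import queue
import threading


class RWMutex:
    """Minimal readers-writer lock (hypothetical helper for this sketch)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def rlock(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def runlock(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def lock(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def unlock(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def simulate(timeout=2.0):
    """Client holds at_mu.R while wcfs synchronously calls back into pinner."""
    at_mu = RWMutex()
    to_wcfs, to_pinner, pin_ack, to_client = (queue.Queue() for _ in range(4))
    log = []

    def wcfs():
        to_wcfs.get()            # watch request from the client
        to_pinner.put("pin")     # pin a block before answering ...
        pin_ack.get()            # ... and wait for the pinner's ack
        to_client.put("ok")      # only then reply to the client

    def pinner():
        to_pinner.get()
        at_mu.rlock()            # shared with the waiting client thread
        log.append("pinned")
        at_mu.runlock()
        pin_ack.put("ack")

    for f in (wcfs, pinner):
        threading.Thread(target=f, daemon=True).start()

    at_mu.rlock()                # client thread keeps its view stable
    to_wcfs.put("watch")
    try:
        reply = to_client.get(timeout=timeout)
    except queue.Empty:
        reply = None             # would happen if the pinner could not get the lock
    at_mu.runlock()
    return reply, log
```

      With an exclusive lock in place of at_mu the pinner would block until the
      client releases the lock, while the client waits for a reply that depends
      on the pinner: a deadlock. Shared (read) locking avoids that.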
      
      The ordering of locks is:
      
           Conn.atMu > Conn.filehMu > FileH.mmapMu
      
      The pinner takes the following locks:
      
           - wconn.atMu.R
           - wconn.filehMu.R
           - fileh.mmapMu (to read .mmaps  +  write .pinned)
      
      (*) see "Wcfs locking organization" in wcfs.go
      
      Handling of fork
      
      When a process calls fork, OS copies its memory and creates child process
      with only 1 thread. That child inherits file descriptors and memory mappings
      from parent. To correctly continue using Conn, FileH and Mappings, the child
      must recreate pinner thread and reconnect to wcfs via reopened watchlink.
      The reason here is that without reconnection - by using watchlink file
      descriptor inherited from parent - the child would interfere into
      parent-wcfs exchange and neither parent nor child could continue normal
      protocol communication with WCFS.
      
      For simplicity, since fork is seldom used for things besides followup
      exec, wcfs client currently takes a straightforward approach by disabling
      mappings and detaching from WCFS server in the child right after fork. This
      ensures that there is no interference into parent-wcfs exchange should child
      decide not to exec and to continue running in the forked thread. Without
      this protection the interference might come even automatically via e.g.
      Python GC -> PyFileH.__del__ -> FileH.close -> message to WCFS.
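
      The "detach in the child right after fork" idea can be illustrated in Python
      with os.register_at_fork. This is only an analogy under stated assumptions:
      ToyConn and its methods are hypothetical and do not reflect how the wcfs
      client itself hooks into fork.

```python
import os


class ToyConn:
    """Hypothetical stand-in for a client connection; not the wcfs client API."""

    def __init__(self):
        self.detached = False
        self.sent = []
        # os.register_at_fork is available on Unix with Python 3.7+.
        # The hook stays registered for the lifetime of the process.
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._detach_in_child)

    def _detach_in_child(self):
        # Runs in the child right after fork: stop talking to the server.
        self.detached = True

    def send(self, msg):
        if self.detached:
            raise RuntimeError("connection was detached after fork")
        self.sent.append(msg)
```

      After the hook has run, any late message, for example one triggered by a
      destructor in the child, is rejected locally instead of reaching the
      server over the descriptor inherited from the parent.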
      
      ----------------------------------------
      
      Some preliminary history:
      
      kirr/wendelin.core@a8fa9178    X wcfs: move client tests into client/
      kirr/wendelin.core@990afac1    X wcfs/client: Package overview (draft)
      kirr/wendelin.core@3f83469c    X wcfs: client: Handle fork
      kirr/wendelin.core@0ed6b8b6    fixup! X wcfs: client: Handle fork
      kirr/wendelin.core@24378c46    X wcfs: client: Provide Conn.at()
      10f7153a
    • wcfs: Provide isolation to clients · 6f0cdaff
      Kirill Smelkov authored
      Via custom isolation protocol that both server and clients must cooperatively
      follow. This is the core change that enables file cache to be practically
      shared while each client can still be provided with isolated view of the database.
      
      This patch brings only server changes, tests + the minimum client bits to support the tests.
      The client library, that will implement isolation protocol on client side, will come next.
      
      This patch is organized as follows:
      
      - wcfs.go brings in description of the protocol, overview of how server
        implements that protocol and the implementation itself.
        See also notes.txt
      
      - wcfs_test.py brings in tests for server implementation.
        tWCFS._abort_ontimeout had to be moved into nogil mode into wcfs_test.pyx
        to avoid deadlock on the GIL (see comments in wcfs_test.pyx for details).
      
      - files added in wcfs/client/ are needed to provide client-side
        implementation of WatchLink - the message exchange protocol over
        opened head/watch file - for tests. Client-side watchlink implementation
        lives in wcfs/client/wcfs_watchlink.{h,cpp}. The other additions in
        wcfs/client/ are to support that and to expose the WatchLink to Python.
      
        Client-side bits are done right in C++ because upcoming WCFS client
        library will be implemented in C++ to work in nogil mode in order to
        avoid deadlock on the GIL because client-side pinner thread might be
        woken-up synchronously by WCFS server at any moment, including when
        another client thread already holds the GIL and is paused by WCFS.
      
      Some preliminary history:
      
      9b4a42a3    X invalidation design draftly settled
      27d91d47    X δFtail settled
      c27c1940    X mmap over under pagefault to this mmapping works
      d36b171f    X ptrace when client is under pagefault or syscall won't work
      c1f5bb19    X notes on why lazy-invalidate approach was taken
      4fbdd270    X Proof that that it is possible to change mmapping while under pagefault to it
      33e0dfce    X ΔTail draftly done
      12628943    X make sure "bye" is always processed immediately - even if a handleWatch is currently blocked
      af0a64cb    X test for "bye" canceling blocked handlers
      996dc6a8    X Fix race in test
      43915fe9    X wcfs: Don't forbid simultaneous watch requests
      941dc54b    X wcfs: threading.Lock -> sync.Mutex
      d75b2304    X wcfs: Move _abort_ontimeout to pyx/nogil
      79234659    X Notes on why eagier invalidation was rejected
      f05271b1    X Test that sysread(/head/watch) can be interrupted
      5ba816da    X restore test_wcfs_watch_robust after f05271b1.
      4bd88564    X "Invalidation protocol" -> "Isolation protocol"
      f7b54ca4    X avoid fmt::vsprintf  (now compils again with latest pygolang@master)
      0a8fcd9d    X wcfs/client: Move EOF -> pygolang
      153e02e6    X test_wcfs_watch_setup and test_wcfs_watch_setup_ahead work again
      17f98edc    X wcfs: client: os: Factor syserr -> string into _sysErrString
      7b0c301c    X wcfs: tests: Fix tFile.assertBlk not to segfault on a test failure
      b74dda09    X Start switching Track from Track(key) to Track(keycov)
      8b5d8523    X Move tracking of which blocks were accessed from wcfs to ΔFtail
      6f0cdaff
    • wcfs: Handle ZODB invalidations · 4430de41
      Kirill Smelkov authored
      Use ΔFtail.Track on every READ and, upon receiving ZODB invalidation,
      query the accumulated ΔFtail about which blocks of which files have been
      changed. Then invalidate those blocks in OS file cache.
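
      The track-on-read / invalidate-on-change cycle can be sketched with a toy
      block cache. This is an illustration under stated assumptions: ToyCachedFS,
      its loader callback and the shape of the "changed" argument are hypothetical
      and stand in for the OS file cache and for the result of querying ΔFtail.

```python
class ToyCachedFS:
    """Toy model: reads populate a block cache and record what was accessed."""

    def __init__(self, load_block):
        self._load_block = load_block   # hypothetical loader: (file, blk) -> data
        self.cache = {}                 # (file, blk) -> cached data
        self.tracked = set()            # blocks accessed so far

    def read(self, file, blk):
        key = (file, blk)
        self.tracked.add(key)           # analogous to tracking on every READ
        if key not in self.cache:
            self.cache[key] = self._load_block(file, blk)
        return self.cache[key]

    def on_invalidation(self, changed):
        """changed: {file: set of blk} derived from a database change.

        Only blocks that were read before can be stale in the cache.
        Returns the list of cache entries that were dropped.
        """
        dropped = []
        for file, blks in changed.items():
            for blk in sorted(blks):
                key = (file, blk)
                if key in self.tracked and key in self.cache:
                    del self.cache[key]
                    dropped.append(key)
        return dropped
```

      Without the on_invalidation step a second read of a changed block keeps
      returning the old cached data, which is the staleness this commit removes.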
      
      See added documentation to wcfs.go and notes.txt for details.
      
      Now the filesystem is no longer stale: it provides a view of data
      that is up to date with respect to changes on ZODB storage.
      
      Some preliminary history:
      
      kirr/wendelin.core@9b4a42a3    X invalidation design draftly settled
      kirr/wendelin.core@27d91d47    X δFtail settled
      kirr/wendelin.core@33e0dfce    X ΔTail draftly done
      kirr/wendelin.core@822366a7    X keeping fd to root opened prevents the filesystem from being unmounted
      kirr/wendelin.core@89ad3a79    X Don't keep ZBigFile activated during whole current transaction
      kirr/wendelin.core@245511ac    X Give pointer on from where to get nxd-fuse.ko
      kirr/wendelin.core@d1cd128c    X Hit FUSE-related deadlock
      kirr/wendelin.core@d134ee44    X FUSE lookup deadlock should be hopefully fixed
      kirr/wendelin.core@0e60e9ff    X wcfs: Don't noise ZWatcher trace logs with "select ..."
      kirr/wendelin.core@bf9a7405    X No longer rely on ZODB cache invariant for invalidations
      4430de41
    • wcfs: Add FileSock FUSE utility · 46f3f3fd
      Kirill Smelkov authored
      FileSock is bidirectional channel associated with opened file.
      
      FileSock provides streaming write/read operations for filesystem server that
      are correspondingly matched with read/write operations on filesystem user side.
      
      WCFS will use FileSock to implement exchange over .wcfs/zhead and,
      later, head/watch files.
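
      The pairing of server-side writes with user-side reads (and vice versa) can
      be pictured as two one-way pipes. The sketch below is an analogy only:
      ToyFileSock and its method names are hypothetical, and it uses plain OS
      pipes rather than FUSE.

```python
import os


class ToyFileSock:
    """Two one-way pipes glued into one bidirectional channel (illustration only)."""

    def __init__(self):
        self._to_user = os.pipe()     # (read_fd, write_fd): server -> user
        self._to_server = os.pipe()   # (read_fd, write_fd): user -> server

    # Filesystem-server side.
    def server_write(self, data):
        os.write(self._to_user[1], data)

    def server_read(self, n):
        return os.read(self._to_server[0], n)

    # Filesystem-user side (what read/write on the opened file would see).
    def user_read(self, n):
        return os.read(self._to_user[0], n)

    def user_write(self, data):
        os.write(self._to_server[1], data)

    def close(self):
        for fd in self._to_user + self._to_server:
            os.close(fd)
```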
      
      Some preliminary history:
      
      kirr/wendelin.core@b17aeb8c    X Change FileSock to use xio.Pipe which is io.Pipe + support for IO cancellation
      46f3f3fd
    • wcfs: zdata: ΔFtail · f980471f
      Kirill Smelkov authored
      ΔFtail builds on ΔBtail and provides ZBigFile-level history that WCFS
      will use to compute which blocks of a ZBigFile need to be invalidated in
      OS file cache given raw ZODB changes on ZODB invalidation message.
      
      It also will be used by WCFS to implement isolation protocol, where on
      every FUSE READ request WCFS will query ΔFtail to find out revision of
      corresponding file block.
      
      Quoting ΔFtail documentation:
      
      ---- 8< ----
      
      ΔFtail provides ZBigFile-level history tail.
      
      It translates ZODB object-level changes to information about which blocks of
      which ZBigFile were modified, and provides service to query that information.
      
      ΔFtail class documentation
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      ΔFtail represents tail of revisional changes to files.
      
      It semantically consists of
      
          []δF			; rev ∈ (tail, head]
      
      where δF represents a change in files space
      
          δF:
          	.rev↑
          	{} file ->  {}blk | EPOCH
      
      Only files and blocks explicitly requested to be tracked are guaranteed to
      be present. In particular a block that was not explicitly requested to be
      tracked, even if it was changed in δZ, is not guaranteed to be present in δF.
      
      After file epoch (file creation, deletion, or any other change to file
      object) previous track requests for that file become forgotten and have no
      further effect.
      
      ΔFtail provides the following operations:
      
        .Track(file, blk, path, zblk)	- add file and block reached via BTree path to tracked set.
      
        .Update(δZ) -> δF				- update files δ tail given raw ZODB changes
        .ForgetPast(revCut)			- forget changes ≤ revCut
        .SliceByRev(lo, hi) -> []δF		- query for all files changes with rev ∈ (lo, hi]
        .SliceByFileRev(file, lo, hi) -> []δfile	- query for changes of a file with rev ∈ (lo, hi]
        .BlkRevAt(file, #blk, at) -> blkrev	- query for what is last revision that changed
          					  file[#blk] as of @at database state.
      
      where δfile represents a change to one file
      
          δfile:
          	.rev↑
          	{}blk | EPOCH
      
      See also zodb.ΔTail and xbtree.ΔBtail
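
      The query semantics listed above, in particular the (lo, hi] ranges, can be
      modelled with a toy history. This is a sketch under stated assumptions, not
      the Go implementation: ToyFtail, DeltaF and EPOCH are hypothetical names,
      tracking is ignored (every change is recorded), and returning None when no
      change is found within the tail is a choice made for this example.

```python
from dataclasses import dataclass

EPOCH = "EPOCH"   # hypothetical marker for a whole-file change


@dataclass
class DeltaF:
    rev: int
    by_file: dict   # file -> set of block numbers | EPOCH


class ToyFtail:
    def __init__(self, tail):
        self.tail = tail
        self.head = tail
        self.vdF = []   # list of DeltaF with strictly increasing rev in (tail, head]

    def update(self, rev, by_file):
        assert rev > self.head
        self.vdF.append(DeltaF(rev, by_file))
        self.head = rev

    def forget_past(self, rev_cut):
        # Forget changes with rev <= rev_cut.
        self.vdF = [d for d in self.vdF if d.rev > rev_cut]
        self.tail = max(self.tail, rev_cut)

    def slice_by_rev(self, lo, hi):
        # Changes with rev in (lo, hi].
        return [d for d in self.vdF if lo < d.rev <= hi]

    def slice_by_file_rev(self, file, lo, hi):
        return [(d.rev, d.by_file[file])
                for d in self.slice_by_rev(lo, hi) if file in d.by_file]

    def blk_rev_at(self, file, blk, at):
        # Last revision <= at that changed file[blk], searched within the tail.
        for rev, change in reversed(self.slice_by_file_rev(file, self.tail, at)):
            if change == EPOCH or blk in change:
                return rev
        return None
```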
      
      Concurrency
      
      ΔFtail is safe to use in single-writer / multiple-readers mode. That is at
      any time there should be either only sole writer, or, potentially several
      simultaneous readers. The table below classifies operations:
      
          Writers:  Update, ForgetPast
          Readers:  Track + all queries (SliceByRev, SliceByFileRev, BlkRevAt)
      
      Note that, in particular, it is correct to run multiple Track and query
      requests simultaneously.
      
      ΔFtail organization
      ~~~~~~~~~~~~~~~~~~~
      
      ΔFtail leverages:
      
          - ΔBtail to track changes to ZBigFile.blktab BTree, and
          - ΔZtail to track changes to ZBlk objects and to ZBigFile object itself.
      
      then every query merges ΔBtail and ΔZtail data on the fly to provide
      ZBigFile-level result.
      
      Merging on the fly, contrary to computing and maintaining vδF data, is done
      to avoid complexity of recomputing vδF when tracking set changes. Most of
      ΔFtail complexity is, thus, located in ΔBtail, which implements BTree diff
      and handles complexity of recomputing vδB when set of tracked blocks
      changes after new track requests.
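
      Merging on the fly can be pictured as follows: a block is reported as changed
      at a revision if either its blktab entry changed (BTree-level history) or
      the ZBlk object that backs it changed (object-level history). The sketch is
      a simplification under stated assumptions: merge_file_history and the
      dictionary shapes are hypothetical, epochs are left out, and the
      ZBlk -> blocks relation is treated as fixed over time.

```python
def merge_file_history(vdT, vdZ, zblk_in_blk):
    """Merge BTree-level and object-level histories into per-file block changes.

    vdT:         {rev: set of blk}  blocks whose blktab entry changed
    vdZ:         {rev: set of oid}  ZODB objects changed at that revision
    zblk_in_blk: {oid: set of blk}  which blocks a ZBlk object backs
    Returns a list of (rev, set of changed blk) sorted by rev.
    """
    merged = {}
    for rev, blks in vdT.items():
        merged.setdefault(rev, set()).update(blks)
    for rev, oids in vdZ.items():
        for oid in oids:
            blks = zblk_in_blk.get(oid)   # objects that are not ZBlk are skipped
            if blks:
                merged.setdefault(rev, set()).update(blks)
    return sorted(merged.items())
```

      Since nothing file-level is stored besides the inputs, growing the set of
      tracked blocks only changes what the inputs contain; there is no separate
      file-level history to recompute.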
      
      Changes to ZBigFile object indicate epochs. Epochs could be:
      
          - file creation or deletion,
          - change of ZBigFile.blksize,
          - change of ZBigFile.blktab to point to another BTree.
      
      Epochs represent major changes to file history where file is assumed to
      change so dramatically, that practically it can be considered to be a
      "whole" change. In particular, WCFS, upon seeing a ZBigFile epoch,
      invalidates all data in corresponding OS-level cache for the file.
      
      The only historical data, that ΔFtail maintains by itself, is history of
      epochs. That history does not need to be recomputed when more blocks become
      tracked and is thus easy to maintain. It also can be maintained only in
      ΔFtail because ΔBtail and ΔZtail do not "know" anything about ZBigFile.
      
      Concurrency
      
      In order to allow multiple Track and query requests to be served in
      parallel, ΔFtail bases its concurrency promise on ΔBtail guarantees +
      snapshot-style access for vδE and ztrackInBlk in queries:
      
      1. Track calls ΔBtail.Track and quickly updates .byFile, .byRoot and
         _RootTrack indices under a lock.
      
      2. BlkRevAt queries ΔBtail.GetAt and then combines retrieved information
         about zblk with vδE and δZ.
      
      3. SliceByFileRev queries ΔBtail.SliceByRootRev and then merges retrieved
         vδT data with vδZ, vδE and ztrackInBlk.
      
      4. In queries vδE is retrieved/built in snapshot style similarly to how vδT
         is built in ΔBtail. Note that vδE needs to be built only the first time,
         and does not need to be further rebuilt, so the logic in ΔFtail is simpler
         compared to ΔBtail.
      
      5. for ztrackInBlk - that is used by SliceByFileRev query - an atomic
         snapshot is retrieved for objects of interest. This allows to hold
         δFtail.mu lock for relatively brief time without blocking other parallel
         Track/queries requests for long.
      
      Combined this organization allows non-overlapping queries/track-requests
      to run simultaneously. (This property is essential to WCFS because otherwise
      WCFS would not be able to serve several non-overlapping READ requests to one
      file in parallel.)
      
      See also "Concurrency" in ΔBtail organization for more details.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Some preliminary history:
      
      kirr/wendelin.core@ef74aebc    X ΔFtail: Keep reference to ZBigFile via Oid, not via *ZBigFile
      kirr/wendelin.core@bf9a7405    X No longer rely on ZODB cache invariant for invalidations
      kirr/wendelin.core@46340069    X found by Random
      kirr/wendelin.core@e7b598c6    X start of ΔFtail.SliceByFileRev rework to function via merging δB and δZ histories on the fly
      kirr/wendelin.core@59c83009    X ΔFtail.SliceByFileRoot tests started to work draftly after "on-the-fly" rework
      kirr/wendelin.core@210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
      kirr/wendelin.core@bf3ace66    X ΔFtail: Rebuild vδE after first track
      kirr/wendelin.core@46624787    X ΔFtail: `go test -failfast -short -v -run Random -randseed=1626793016249041295` discovered problems
      kirr/wendelin.core@786dd336    X Size no longer tracks [0,∞) since we start tracking when zfile is non-empty
      kirr/wendelin.core@4f707117    X test that shows problem of SliceByRootRev where untracked blocks are not added uniformly into whole history
      kirr/wendelin.core@c0b7e4c3    X ΔFtail.SliceByFileRev: Fix untracked entries to be present uniformly in result
      kirr/wendelin.core@aac37c11    X zdata: Introduce T to start removing duplication in tests
      kirr/wendelin.core@bf411aa9    X zdata: Deduplicate zfile loading
      kirr/wendelin.core@b74dda09    X Start switching Track from Track(key) to Track(keycov)
      kirr/wendelin.core@aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
      kirr/wendelin.core@588a512a    X zdata: Switch SliceByFileRev not to clone Zinblk
      kirr/wendelin.core@8b5d8523    X Move tracking of which blocks were accessed from wcfs to ΔFtail
      kirr/wendelin.core@30f5ddc7    ΔFtail += .Epoch in δf
      kirr/wendelin.core@22f5f096    X Rework ΔFtail so that BlkRevAt works with ZBigFile checkout from any at ∈ (tail, head]
      kirr/wendelin.core@0853cc9f    X ΔFtail + tests
      kirr/wendelin.core@124688f9    X ΔFtail fixes
      kirr/wendelin.core@d85bb82c    ΔFtail concurrency
      f980471f
    • wcfs: xbtree: ΔBtail · 2ab4be93
      Kirill Smelkov authored
      ΔBtail provides BTree-level history tail that WCFS - via ΔFtail - will
      use to compute which blocks of a ZBigFile need to be invalidated in OS
      file cache given raw ZODB changes on ZODB invalidation message.
      
      It also will be used by WCFS to implement isolation protocol, where on
      every FUSE READ request WCFS will query ΔBtail - again via ΔFtail - to
      find out revision of corresponding file block.
      
      Quoting ΔBtail documentation:
      
      ---- 8< ----
      
      ΔBtail provides BTree-level history tail.
      
      It translates ZODB object-level changes to information about which keys of
      which BTree were modified, and provides service to query that information.
      
      ΔBtail class documentation
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      ΔBtail represents tail of revisional changes to BTrees.
      
      It semantically consists of
      
          []δB			; rev ∈ (tail, head]
      
      where δB represents a change in BTrees space
      
          δB:
          	.rev↑
          	{} root -> {}(key, δvalue)
      
      It covers only changes to keys from tracked subset of BTrees parts.
      In particular a key that was not explicitly requested to be tracked, even if
      it was changed in δZ, is not guaranteed to be present in δB.
      
      ΔBtail provides the following operations:
      
        .Track(path)	- start tracking tree nodes and keys; root=path[0], keys=path[-1].(lo,hi]
      
        .Update(δZ) -> δB				- update BTree δ tail given raw ZODB changes
        .ForgetPast(revCut)			- forget changes ≤ revCut
        .SliceByRev(lo, hi) -> []δB		- query for all trees changes with rev ∈ (lo, hi]
        .SliceByRootRev(root, lo, hi) -> []δT	- query for changes of a tree with rev ∈ (lo, hi]
        .GetAt(root, key, at) -> (value, rev)	- get root[key] @at assuming root[key] ∈ tracked
      
      where δT represents a change to one tree
      
          δT:
          	.rev↑
          	{}(key, δvalue)
      
      An example for tracked set is a set of visited BTree paths.
      There is no requirement that tracked set belongs to only one single BTree.
      
      See also zodb.ΔTail and zdata.ΔFtail
      
      Concurrency
      
      ΔBtail is safe to use in single-writer / multiple-readers mode. That is at
      any time there should be either only sole writer, or, potentially several
      simultaneous readers. The table below classifies operations:
      
          Writers:  Update, ForgetPast
          Readers:  Track + all queries (SliceByRev, SliceByRootRev, GetAt)
      
      Note that, in particular, it is correct to run multiple Track and query
      requests simultaneously.
      
      ΔBtail organization
      ~~~~~~~~~~~~~~~~~~~
      
      ΔBtail keeps raw ZODB history in ΔZtail and uses BTree-diff algorithm(*) to
      turn δZ into BTree-level diff. For each tracked BTree a separate ΔTtail is
      maintained with tree-level history in ΔTtail.vδT .
      
      Because it is very computationally expensive(+) to find out which BTree an
      object belongs to, ΔBtail cannot provide full BTree-level history given
      just ΔZtail with δZ changes. Due to this ΔBtail requires help from
      users, which are expected to call ΔBtail.Track(treepath) to let ΔBtail know
      that such and such ZODB objects constitute a path from root of a tree to some
      of its leaf. After Track call the objects from the path and tree keys, that
      are covered by leaf node, become tracked: from now-on ΔBtail will detect
      and provide BTree-level changes caused by any change of tracked tree objects
      or tracked keys. This guarantee can be provided because ΔBtail now knows
      that such and such objects belong to a particular tree.
      
      To manage knowledge of which tree part is tracked, ΔBtail uses PPTreeSubSet.
      This data structure represents a so-called PP-connected set of tree nodes:
      simply speaking it builds on some leaves and then includes parent(leaf),
      parent(parent(leaf)), etc. In other words it's a "parent"-closure of the
      leaves. The property of being PP-connected means that starting from any node
      from such set, it is always possible to reach root node by traversing
      .parent links, and that every intermediate node went-through during
      traversal also belongs to the set.
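
      The "parent"-closure and the PP-connected property can be written down in a
      few lines. This is a minimal sketch, not PPTreeSubSet itself: the tree is
      given as a hypothetical node -> parent dictionary, with the root mapped to
      None.

```python
def pp_closure(parent, leaves):
    """Return leaves together with all of their ancestors up to the root."""
    tracked = set()
    for node in leaves:
        # Walk .parent links; stop early at nodes that are already included.
        while node is not None and node not in tracked:
            tracked.add(node)
            node = parent.get(node)
    return tracked


def is_pp_connected(parent, nodes):
    """Check that every non-root node of the set has its parent in the set."""
    return all(parent.get(n) is None or parent[n] in nodes for n in nodes)
```

      Checking only the direct parent of each node is sufficient: if every
      member's parent is also a member, then by induction the whole path to the
      root is in the set.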
      
      A new Track request potentially grows tracked keys coverage. Due to this,
      on a query, ΔBtail needs to recompute potentially whole vδT of the affected
      tree. This recomputation is managed by "vδTSnapForTracked*" and "_rebuild"
      functions and uses the same treediff algorithm, that Update is using, but
      modulo PPTreeSubSet corresponding to δ key coverage. Update also potentially
      needs to rebuild whole vδT history, not only append new δT, because a
      change to tracked tree nodes can result in growth of tracked key coverage.
      
      Queries are relatively straightforward code that work on vδT snapshot. The
      main complexity, besides BTree-diff algorithm, lies in recomputing vδT when
      set of tracked keys changes, and in handling that recomputation in such a way
      that multiple Track and query requests can all be served in parallel.
      
      Concurrency
      
      In order to allow multiple Track and query requests to be served in
      parallel ΔBtail employs special organization of vδT rebuild process where
      complexity of concurrency is reduced to math on merging updates to vδT and
      trackSet, and on key range lookup:
      
      1. vδT is managed under read-copy-update (RCU) discipline: before making
         any vδT change the mutator atomically clones whole vδT and applies its
         change to the clone. This way a query, once it retrieves vδT snapshot,
         does not need to further synchronize with vδT mutators, and can rely on
         that retrieved vδT snapshot will remain immutable.
      
      2. a Track request goes through 3 states: "new", "handle-in-progress" and
         "handled". At each state keys/nodes of the Track are maintained in:
      
         - ΔTtail.ktrackNew and .trackNew       for "new",
         - ΔTtail.krebuildJobs                  for "handle-in-progress", and
         - ΔBtail.trackSet                      for "handled".
      
         trackSet keeps nodes, and implicitly keys, from all handled Track
         requests. For all keys, covered by trackSet, vδT is fully computed.
      
         a new Track(keycov, path) is remembered in ktrackNew and trackNew to be
         further processed when a query should need keys from keycov. vδT is not
         yet providing data for keycov keys.
      
         when a Track request starts to be processed, its keys and nodes are moved
         from ktrackNew/trackNew into krebuildJobs. vδT is not yet providing data
         for requested-to-be-tracked keys.
      
         all trackSet, trackNew/ktrackNew and krebuildJobs are completely disjoint:
      
          trackSet ^ trackNew     = ø
          trackSet ^ krebuildJobs = ø
          trackNew ^ krebuildJobs = ø
      
      3. when a query is served, it needs to retrieve vδT snapshot that takes
         related previous Track requests into account. Retrieving such snapshots
         is implemented in vδTSnapForTracked*() family of functions: there it
         checks ktrackNew/trackNew, and if those sets overlap with query's keys
         of interest, runs vδT rebuild for keys queued in ktrackNew.
      
         the main part of that rebuild can be run without any locks, because it
         does not use nor modify any ΔBtail data, and for δ(vδT) it just computes
         a fresh full vδT build modulo retrieved ktrackNew. Only after that
         computation is complete, ΔBtail is locked again to quickly merge in
         δ(vδT) update back into vδT.
      
         This organization is based on the fact that
      
          vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂
      
          ( i.e. vδT computed for tracked set being union of T₁ and T₂ is the
            same as merge of vδT computed for tracked set T₁ and vδT computed
            for tracked set T₂ )
      
         and that
      
          trackSet | (δPP₁|δPP₂) = (trackSet|δPP₁) | (trackSet|δPP₂)
      
          ( i.e. tracking set updated for union of δPP₁ and δPP₂ is the same
            as union of tracking set updated with δPP₁ and tracking set updated
            with δPP₂ )
      
         these merge properties allow to run computation for δ(vδT) and δ(trackSet)
         independently and with ΔBtail unlocked, which in turn enables running
         several Track/queries in parallel.
      
      4. while vδT rebuild is being run, krebuildJobs keeps corresponding keycov
         entry to indicate in-progress rebuild. Should a query need vδT for keys
         from that job, it first waits for corresponding job(s) to complete.
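
      The read-copy-update discipline from item 1 can be sketched as follows. This
      is an illustration only: RcuHistory is a hypothetical class, and a plain
      Python list stands in for vδT.

```python
import threading


class RcuHistory:
    """History kept under a clone-then-publish discipline (illustration only)."""

    def __init__(self):
        self._mu = threading.Lock()
        self._vdT = []            # current snapshot; never modified in place

    def snapshot(self):
        # Readers take the lock only to fetch the reference.
        with self._mu:
            return self._vdT

    def append(self, entry):
        with self._mu:
            clone = list(self._vdT)   # copy
            clone.append(entry)       # update the copy
            self._vdT = clone         # publish the new snapshot
```

      A reader that obtained a snapshot before append keeps seeing the old
      contents and needs no further synchronization with mutators.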
      
      The rebuild organization explained above allows non-overlapping
      queries/track-requests to run simultaneously. (This property is essential to
      WCFS because otherwise WCFS would not be able to serve several non-overlapping
      READ requests to one file in parallel.)
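
      The merge property vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂ can be checked on a toy
      model. The model is an assumption made for this example: a history is a
      {rev: {key: δvalue}} dictionary, a tracked set is a plain set of keys, and
      restrict/merge are hypothetical helpers. In ΔBtail itself the tracked set
      consists of tree nodes with their key coverage.

```python
def restrict(history, tracked):
    """History as it would be computed for the given tracked keys only."""
    out = {}
    for rev, dkv in history.items():
        sub = {k: v for k, v in dkv.items() if k in tracked}
        if sub:
            out[rev] = sub
    return out


def merge(a, b):
    """Merge two histories revision by revision."""
    out = {rev: dict(dkv) for rev, dkv in a.items()}
    for rev, dkv in b.items():
        out.setdefault(rev, {}).update(dkv)
    return out
```

      Because the result for T₁∪T₂ equals the merge of the results for T₁ and
      T₂, the part for newly tracked keys can be computed separately and merged
      in afterwards.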
      
      --------
      
      (*) implemented in treediff.go
      (+) full database scan
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Some preliminary history:
      
      kirr/wendelin.core@877e64a9    X wcfs: Fix tests to pass again
      kirr/wendelin.core@c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
      kirr/wendelin.core@78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
      kirr/wendelin.core@5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      kirr/wendelin.core@f65f775b    X wcfs/xbtree: treediff(ø, b)
      kirr/wendelin.core@c75b1c6f    X wcfs/xbtree: Start killing holeIdx
      kirr/wendelin.core@0fa06cbd    X kadj must be taken into account as kadj^δZ
      kirr/wendelin.core@ef5e5183    X treediff ret += δtkeycov
      kirr/wendelin.core@f30826a6    X another bug in δtkeyconv computation
      kirr/wendelin.core@0917380e    X wcfs: assert that keycov only grow
      kirr/wendelin.core@502e05c2    X found why TestΔBTailAllStructs was not effective to find δtkeycov bugs
      kirr/wendelin.core@450ba707    X Fix rebuild with ø @at2
      kirr/wendelin.core@f60528c9    X ΔBtail.Clone had bug that it was aliasing klon and orig data
      kirr/wendelin.core@9d20f8e8    X treediff: Fix BUG while computing AB coverage
      kirr/wendelin.core@ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
      kirr/wendelin.core@324241eb    X rebuild: tests: Don't reflect.DeepEqual in inner loop
      kirr/wendelin.core@8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
      kirr/wendelin.core@2c0b4793    X rebuild: tests: Don't access ZODB in xtrackKeys
      kirr/wendelin.core@8f0e37f2    X rebuild: tests: Precompute kadj10·kadj21
      kirr/wendelin.core@271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
      kirr/wendelin.core@a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
      kirr/wendelin.core@01433e96    X rebuild: tests: Don't compute keyCover in trackSet
      kirr/wendelin.core@7371f9c5    X rebuild: tests: Inline _assertTrack
      kirr/wendelin.core@3e9164b3    X rebuild: tests: Don't exercise keys from keys2 that already became tracked after Track(keys1) + Update
      kirr/wendelin.core@e9c4b619    X rebuild: tests: Random testing
      kirr/wendelin.core@d0fe680a    X δbtail += ForgetPast
      kirr/wendelin.core@210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
      kirr/wendelin.core@855ab4b8    X ΔBtail: Goodbye .KVAtTail
      kirr/wendelin.core@2f5582e6    X ΔBtail: Tweak tests to run faster in normal mode
      kirr/wendelin.core@cf352737    X random testing found another failing test for rebuild...
      kirr/wendelin.core@7f7e34e0    X wcfs/xbtree: Fix update not to add duplicate extra point if rebuild  - called by Update - already added it
      kirr/wendelin.core@6ad0052c    X ΔBtail.Track: No need to return error
      kirr/wendelin.core@aafcacdf    X xbtree: GetAt test
      kirr/wendelin.core@784a6761    X xbtree: Fix KAdj definition after treediff was reworked this summer to base decisions on node keycoverage instead of particular node keys
      kirr/wendelin.core@0bb1c22e    X xbtree: Verify that ForgetPast clones vδT on trim
      kirr/wendelin.core@a8945cbf    X Start reworking rebuild routines not to modify data inplace
      kirr/wendelin.core@b74dda09    X Start switching Track from Track(key) to Track(keycov)
      kirr/wendelin.core@dea85e87    X Switch GetAt to vδTSnapForTrackedKey
      kirr/wendelin.core@aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
      kirr/wendelin.core@c4366b14    X xbtree: tests: Also verify state of ΔTtail.ktrackNew
      kirr/wendelin.core@b98706ad    X Track should be nop if keycov/path is already in krebuildJobs
      kirr/wendelin.core@e141848a    X test.go  ↑ timeout  10m -> 20m
      kirr/wendelin.core@423f77be    X wcfs: Goodby holeIdx
      kirr/wendelin.core@37c2e806    X wcfs: Teach treediff to compute not only δtrack (set of nodes), but also δ for track-key coverage
      kirr/wendelin.core@52c72dbb    X ΔBtail.rebuild started to work draftly
      kirr/wendelin.core@c9f13fc7    X Get rebuild tests to run in a sane time; Add proper random-based testing for rebuild
      kirr/wendelin.core@c7f1e3c9    X xbtree: Factor testing infrastructure bits into xbtree/xbtreetest
      kirr/wendelin.core@7602c1f4    ΔBtail concurrency
      2ab4be93
    • Kirill Smelkov's avatar
      wcfs: xbtree: BTree-diff algorithm · 80153aa5
      Kirill Smelkov authored
      This algorithm will be internally used by ΔBtail in the next patch.
      
      The algorithm would be simple if we needed to diff two trees
      completely. However, in ΔBtail only a subset of BTree nodes is tracked(*),
      and the diff has to work modulo that tracking set.
      
      No tests now because ΔBtail tests will cover treediff functionality as well.
      
      Some preliminary history:
      
      kirr/wendelin.core@78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
      kirr/wendelin.core@5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      kirr/wendelin.core@f65f775b    X wcfs/xbtree: treediff(ø, b)
      kirr/wendelin.core@c75b1c6f    X wcfs/xbtree: Start killing holeIdx
      kirr/wendelin.core@ef5e5183    X treediff ret += δtkeycov
      kirr/wendelin.core@9d20f8e8    X treediff: Fix BUG while computing AB coverage
      kirr/wendelin.core@ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
      kirr/wendelin.core@f68398c9    X wcfs: Move treediff into its own file
      
      (*) because full BTree scan is needed to discover all of its nodes.
      
      Quoting treediff documentation:
      
      ---- 8< ----
      
      treediff provides diff for BTrees
      
      Use δZConnectTracked + treediff to compute BTree-diff caused by δZ:
      
          δZConnectTracked(δZ, trackSet)                         -> δZTC, δtopsByRoot
          treediff(root, δtops, δZTC, trackSet, zconn{Old,New})  -> δT, δtrack, δtkeycov
      
      δZConnectTracked computes BTree-connected closure of δZ modulo tracked set
      and also returns δtopsByRoot to indicate which tree objects were changed and
      in which subtree parts. With that information one can call treediff for each
      changed root to compute BTree-diff and δ for trackSet itself.
      
      BTree diff algorithm
      
      diffT, diffB and δMerge constitute the diff algorithm implementation.
      diff(A,B) works on a pair of A and B whole key ranges split into regions
      covered by tree nodes. The splitting represents the current state of recursion
      into the corresponding tree. If a node in a particular key range is a Bucket, that
      bucket contributes to δ- in case of A, and to δ+ in case of B. If a node in a
      particular key range is a Tree, the algorithm may want to expand that tree
      node into its children and to recurse into some of the children.
      
      There are two phases:
      
      - Phase 1 expands A top->down driven by δZTC, adds reached buckets to δ-,
        and queues key regions of those buckets to be processed on B.
      
      - Phase 2 starts processing from queued key regions, expands them on B and
        adds reached buckets to δ+. Then it iterates to reach consistency between
        A and B because processing buckets on B side may increase δ key coverage,
        and so the corresponding key ranges have to be processed again on A. That in
        turn may increase δ key coverage again, which needs to be processed on B side,
        etc...
      
      The final δ is merge of δ- and δ+.
      
      diffT has more detailed explanation of phase 1 and phase 2 logic.
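The two-phase iteration described above can be illustrated with a toy model. This is only a sketch under strong assumptions, not the treediff implementation: each tree is flattened to a list of (lo, hi, {key: value}) buckets with half-open key ranges, there are no Tree nodes and no tracking set, and the names overlapping, toy_diff, A and B are made up for this example.

```python
def overlapping(buckets, lo, hi):
    """Return buckets whose [blo, bhi) range intersects [lo, hi)."""
    return [b for b in buckets if b[0] < hi and lo < b[1]]


def toy_diff(a, b, changed):
    """Diff flattened trees a and b starting from `changed` key ranges of a.

    Returns {key: (old, new)}; None stands for "key is absent".
    """
    d_minus, d_plus = {}, {}          # δ- collected on A, δ+ collected on B
    seen_a, seen_b = set(), set()     # bucket ranges already processed
    queue_a, queue_b = list(changed), []
    while queue_a or queue_b:
        # Expand queued ranges on A: reached buckets go to δ- and their
        # ranges are queued to be processed on B.
        for lo, hi in queue_a:
            for blo, bhi, kv in overlapping(a, lo, hi):
                if (blo, bhi) not in seen_a:
                    seen_a.add((blo, bhi))
                    d_minus.update(kv)
                    queue_b.append((blo, bhi))
        queue_a = []
        # Expand queued ranges on B: reached buckets go to δ+. A bucket on B
        # may cover more keys than was queued, so queue its range back on A.
        for lo, hi in queue_b:
            for blo, bhi, kv in overlapping(b, lo, hi):
                if (blo, bhi) not in seen_b:
                    seen_b.add((blo, bhi))
                    d_plus.update(kv)
                    queue_a.append((blo, bhi))
        queue_b = []
    # The final δ is the merge of δ- and δ+: keep only keys whose value differs.
    delta = {}
    for k in d_minus.keys() | d_plus.keys():
        old, new = d_minus.get(k), d_plus.get(k)
        if old != new:
            delta[k] = (old, new)
    return delta


# A has buckets [0,10) and [10,20); B was re-split into [0,5) and [5,20).
A = [(0, 10, {1: "a", 5: "b"}), (10, 20, {12: "c"})]
B = [(0, 5, {1: "a"}), (5, 20, {5: "B", 12: "c"})]
print(toy_diff(A, B, [(0, 10)]))
```

In this example the B bucket [5,20) pulls key 12 into δ+, so range [10,20) has to be revisited on A. Without that second pass key 12 would be reported as added, although it did not change.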
      80153aa5
    • Kirill Smelkov's avatar
      wcfs: xbtree: blib += PPTreeSubSet, ΔPPTreeSubSet · 27df5a3b
      Kirill Smelkov authored
      These data structures will be used in ΔBtail to maintain the set of tracked
      BTree nodes, and to represent δ to such a set.
      
      Some preliminary history:
      
      kirr/wendelin.core@78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
      kirr/wendelin.core@5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      kirr/wendelin.core@f65f775b    X wcfs/xbtree: treediff(ø, b)
      kirr/wendelin.core@66bc41ce    X Fix bug in PPTreeSubSet.Difference  - it was always leaving root node alive
      kirr/wendelin.core@ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
      kirr/wendelin.core@a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
      
      Quoting PPTreeSubSet and ΔPPTreeSubSet documentation:
      
      ---- 8< ----
      
      PPTreeSubSet represents PP-connected subset of tree node objects.
      
      It is
      
          PP(xleafs)
      
      where PP(node) maps node to {node, node.parent, node.parent.parent, ...} up
      to top root from where the node is reached.
      
      The nodes in the set are represented by their Oid.
      
      Usually PPTreeSubSet is built as PP(some-leafs), but in general the starting
      nodes are arbitrary. PPTreeSubSet can also have many root nodes, thus not
      necessarily representing a subset of a single tree.
      
      Usual set operations are provided: Union, Difference and Intersection.
      
      Nodes can be added into the set via AddPath. Path is reverse operation - it
      returns path to tree node given its oid.
      
      Every node in the set comes with .parent pointer.
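
A minimal sketch of such a PP-connected set, assuming nodes are identified by plain string oids and the set is stored as an oid -> parent mapping. The class name PPSet and the method names add_path and path are hypothetical stand-ins for AddPath and Path; this is not the blib implementation.

```python
class PPSet:
    """Toy PP-connected set: every node is stored with its parent pointer."""

    def __init__(self):
        self.parent = {}  # oid -> parent oid (None for a root)

    def add_path(self, path):
        """Add all nodes of a root -> ... -> node path into the set."""
        parent = None
        for oid in path:
            if oid in self.parent and self.parent[oid] != parent:
                raise ValueError(f"node {oid} already has a different parent")
            self.parent[oid] = parent
            parent = oid

    def path(self, oid):
        """Reverse of add_path: return the path from the root to oid."""
        out = []
        while oid is not None:
            out.append(oid)
            oid = self.parent[oid]
        return out[::-1]


s = PPSet()
s.add_path(["c", "41", "22"])
s.add_path(["c", "42", "43", "44"])
print(s.path("44"))
```

Because every node carries its parent, the set stays closed under PP by construction: adding a leaf always adds all of its ancestors too.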
      
      ~~~~
      
      ΔPPTreeSubSet represents a change to PPTreeSubSet.
      
      It can be applied via PPTreeSubSet.ApplyΔ .
      
      The result B of applying δ to A is:
      
          B = A.xDifference(δ.Del).xUnion(δ.Add)		(*)
      
      (*) NOTE δ.Del and δ.Add might have their leafs starting from non-leaf nodes in A/B.
          This situation arises when δ represents a change in path to particular
          node, but that node itself does not change, for example:
      
                 c*             c
                / \            /
              41*  42         41
               |    |         | \
              22   43        46  43
                    |         |   |
                   44        22  44
      
          Here nodes {c, 41} are changed, node 42 is unlinked, and node 46 is added.
          Nodes 43 and 44 stay unchanged.
      
              δ.Del = c-42-43   | c-41-22
              δ.Add = c-41-43   | c-41-46-22
      
          The second component with "-22" builds from leaf, but the first
          component with "-43" builds from non-leaf node.
      
              ΔnchildNonLeafs = {43: +1}
      
          Only complete result of applying all
      
              - xfixup(-1, ΔnchildNonLeafs)
              - δ.Del,
              - δ.Add, and
              - xfixup(+1, ΔnchildNonLeafs)
      
          produces correctly PP-connected set.
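
The example above can be replayed with a small reference-counting model. This is one plausible reading of the description, not the actual code: the set is assumed to be a dict oid -> [parent, nchild], removing a path is assumed to drop its last node and then every ancestor left without children, and the helper names add_path, del_path, fixup and apply_delta are hypothetical.

```python
def add_path(s, path):
    """Add a root -> ... -> node path; s maps oid -> [parent, nchild]."""
    parent = None
    for oid in path:
        if oid not in s:
            s[oid] = [parent, 0]
            if parent is not None:
                s[parent][1] += 1
        parent = oid


def del_path(s, path):
    """Remove the last node of path, then ancestors that became childless."""
    oid = path[-1]
    while oid is not None and oid in s and s[oid][1] == 0:
        parent = s.pop(oid)[0]
        if parent is not None and parent in s:
            s[parent][1] -= 1
        oid = parent


def fixup(s, sign, dnchild):
    """Adjust nchild of non-leaf nodes that δ paths start from."""
    for oid, n in dnchild.items():
        if oid in s:
            s[oid][1] += sign * n


def apply_delta(s, delete, add, dnchild):
    fixup(s, -1, dnchild)      # let non-leaf 43 be removed like a leaf
    for p in delete:
        del_path(s, p)
    for p in add:
        add_path(s, p)
    fixup(s, +1, dnchild)      # restore the link 43 -> 44 in the counters


# Left tree of the example: c-41-22 and c-42-43-44.
a = {}
add_path(a, ["c", "41", "22"])
add_path(a, ["c", "42", "43", "44"])

apply_delta(
    a,
    delete=[["c", "42", "43"], ["c", "41", "22"]],
    add=[["c", "41", "43"], ["c", "41", "46", "22"]],
    dnchild={"43": +1},
)

# Right tree of the example, built directly for comparison.
b = {}
add_path(b, ["c", "41", "46", "22"])
add_path(b, ["c", "41", "43", "44"])
print(a == b)
```

In this model the set is not PP-connected between the steps: after both deletions only node 44 remains, pointing to a parent that is not in the set. Only the complete sequence yields the right-hand tree.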
      27df5a3b
    • Kirill Smelkov's avatar
      wcfs: xbtree: blib += RangedMap, RangedKeySet · 1f2cd49d
      Kirill Smelkov authored
      RangedMap is Key->VALUE map with adjacent keys mapped to the same value coalesced into Ranges.
      RangedKeySet is set of Keys with adjacent keys coalesced into Ranges.
      
      These data structures will be needed for ΔBtail.
      
      For now the implementation is simple since it keeps whole map in a
      linear slice because both RangedMap and RangedKeySet will be used in
      ΔBtail to keep something proportional to δ of a change, which is assumed
      to be small or medium most of the time.
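
The idea of keeping the whole map in a linear slice and coalescing adjacent ranges can be sketched as follows. This is an illustration only: it assumes integer keys and half-open [lo, hi) ranges, and the method names set_range and get are chosen for the example rather than taken from blib.

```python
class RangedMap:
    """Toy key -> value map that stores coalesced [lo, hi) ranges in a list."""

    def __init__(self):
        self.entries = []  # sorted, non-overlapping (lo, hi, value) tuples

    def set_range(self, lo, hi, value):
        """Map every key in [lo, hi) to value."""
        if lo >= hi:
            return
        out = []
        for elo, ehi, ev in self.entries:
            if ehi <= lo or hi <= elo:      # no overlap: keep entry as is
                out.append((elo, ehi, ev))
                continue
            if elo < lo:                    # keep the part left of [lo, hi)
                out.append((elo, lo, ev))
            if hi < ehi:                    # keep the part right of [lo, hi)
                out.append((hi, ehi, ev))
        out.append((lo, hi, value))
        out.sort(key=lambda e: e[0])
        # Coalesce adjacent ranges that are mapped to the same value.
        merged = []
        for e in out:
            if merged and merged[-1][1] == e[0] and merged[-1][2] == e[2]:
                merged[-1] = (merged[-1][0], e[1], e[2])
            else:
                merged.append(e)
        self.entries = merged

    def get(self, key):
        """Return the value for key, or None if key is not mapped."""
        for lo, hi, value in self.entries:
            if lo <= key < hi:
                return value
        return None


m = RangedMap()
m.set_range(0, 5, "a")
m.set_range(5, 10, "a")   # adjacent and same value: coalesced into [0, 10)
m.set_range(3, 4, "b")    # splits [0, 10) into three ranges
print(m.entries)
```

Every operation here is linear in the number of ranges, which matches the stated assumption that the map holds something proportional to the δ of a change. A set of keys can be modelled the same way by dropping the value.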
      
      Some preliminary history:
      
      kirr/wendelin.core@6ea5920a    X xbtree: Less copy/garbage in RangedKeySet ops
      kirr/wendelin.core@3ecacd99    X need to keep Value first so that sizeof(set-entry) = sizeof(KeyRange)
      kirr/wendelin.core@a5b9b19b    X SetRange draftly works
      kirr/wendelin.core@ed2de0de    X Tests for Get
      kirr/wendelin.core@3b7b69e6    X fixes for empty set/range
      kirr/wendelin.core@6972f999    X xbtree/blib: RangedMap, RangedSet += IntersectsRange, Intersection
      kirr/wendelin.core@57be0126    X RangedMap - like RangedSet but for dict
      1f2cd49d
    • Kirill Smelkov's avatar
      wcfs: tests: Tree-based testing environment · b87edcfe
      Kirill Smelkov authored
      Add treeenv.go that combines Treegen and client side access to ZODB with
      committed trees as an extension to testing.T . The environment makes it
      easy to see which tree update was committed, what the difference is in
      terms of KV, what the state of the updated tree is, and what the state of
      the pointed-to ZBlk objects is.
      
      This will be used to test upcoming ΔBtail and ΔFtail.
      
      Main functionality is in treeenv.go; the other added files are to
      support that.
      
      Some preliminary history:
      
      kirr/wendelin.core@f07502fc    X xbtreetest: Teach T & Commit to automatically provide At in symbolic form
      kirr/wendelin.core@0d62b05e    X Adjust to btree.VGet & friends signature change to include keycov in visit callback
      kirr/wendelin.core@588a512a    X zdata: Switch SliceByFileRev not to clone Zinblk
      kirr/wendelin.core@e9c4b619    X rebuild: tests: Random testing
      kirr/wendelin.core@43090ac7    X tests: Factor-out tree-test-env into tTreeEnv
      kirr/wendelin.core@d4a523b2    X δbtail: tests: Run much faster with live ZODB cache
      kirr/wendelin.core@271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
      kirr/wendelin.core@c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
      kirr/wendelin.core@5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      kirr/wendelin.core@8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
      b87edcfe
    • Kirill Smelkov's avatar
      wcfs: Set package · b13ee09b
      Kirill Smelkov authored
      Lacking generics we have set.go.in and instantiations for Set[int64],
      Set[string], Set[Oid] and Set[Tid] - that will be used in follow-up
      patches.
      
      The set.go.in itself is mostly a generalized copy from git-backup:
      
      https://lab.nexedi.com/kirr/git-backup/blob/c9db60e8/set.go
      b13ee09b
    • Kirill Smelkov's avatar
      wcfs: tests: Treegen functionality · a8595565
      Kirill Smelkov authored
      treegen.go and treegen.py together provide a way
      
      - to commit a particular BTree topology into ZODB, and
      - to generate set of random tree topologies that all correspond to particular {k->v} dict.
      
      This will be used in upcoming ΔBtail and ΔFtail tests.
      
      See treegen.py documentation for details.
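
To give a rough idea of what "random tree topologies for the same {k->v} dict" means, here is a small sketch. It is not taken from treegen.py: it models a bucket as a dict and a tree node as a list of children, ignores separator keys and empty buckets, and the function names random_topology and flatten are made up for the example.

```python
import random


def random_topology(kv, rng, max_depth=3):
    """Return a random nested structure whose buckets together hold kv."""
    return _gen(sorted(kv), kv, rng, max_depth)


def _gen(keys, kv, rng, depth):
    # Stop with a bucket (dict) when nothing is left to split or by chance.
    if depth == 0 or len(keys) <= 1 or rng.random() < 0.3:
        return {k: kv[k] for k in keys}
    # Otherwise cut the sorted keys at random points and make a tree node
    # (list) with one child per resulting key group.
    nsplit = rng.randint(1, min(3, len(keys) - 1))
    cuts = sorted(rng.sample(range(1, len(keys)), nsplit))
    parts = [keys[i:j] for i, j in zip([0] + cuts, cuts + [len(keys)])]
    return [_gen(part, kv, rng, depth - 1) for part in parts]


def flatten(node):
    """Collect all key -> value pairs stored in the buckets under node."""
    if isinstance(node, dict):
        return dict(node)
    out = {}
    for child in node:
        out.update(flatten(child))
    return out


kv = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
for seed in range(3):
    print(random_topology(kv, random.Random(seed)))
```

Different seeds give differently shaped structures, while flatten always returns the original dict. That invariant is what makes such topologies useful as test input: the shape varies, the logical content does not.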
      
      Some preliminary history:
      
      kirr/wendelin.core@9eca74ec    X Teach AllStructs to emit topologies with values
      kirr/wendelin.core@1b962f03    X Restructure: found bug that it was not marking objects as modified
      kirr/wendelin.core@2139af2c    X treegen: Verify that tree actually saved to storage is what was requested
      kirr/wendelin.core@b5e39d4a    X wcfs/treegen: allstructs: Do not keep all tree structures in memory
      kirr/wendelin.core@e9c4b619    X rebuild: tests: Random testing
      kirr/wendelin.core@c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
      kirr/wendelin.core@4300d88a    X wcfs/xbtreetest/treegen.py: Fix it on ZODB4
      a8595565