    wcfs: client: Provide client package to care about isolation protocol details · 10f7153a
    Kirill Smelkov authored
    This patch follows up on the previous patch, which added the server-side
    part of isolation protocol handling, and adds a client package that takes
    care of the WCFS isolation protocol details and provides clients with a
    simple interface to an isolated view of bigfile data on WCFS, similar to
    regular files: given a particular revision of database @at, it provides
    synthetic read-only bigfile memory mappings with data corresponding to the
    @at state, while using /head/bigfile/* most of the time to build and
    maintain the mappings.
    
    The patch is organized as follows:
    
    - wcfs.h and wcfs.cpp bring in the usage documentation, an internal overview
      and the main part of the implementation.
    
    - wcfs/client/client_test.py contains the tests.
    
    - The rest of the changes in wcfs/client/ are to support the implementation and tests.
    
    Quoting the package documentation for reference:
    
    ---- 8< ----
    
    Package wcfs provides WCFS client.
    
    This client package takes care of the WCFS isolation protocol details and
    provides clients with a simple interface to an isolated view of bigfile
    data on WCFS, similar to regular files: given a particular revision of
    database @at, it provides synthetic read-only bigfile memory mappings with
    data corresponding to the @at state, while using /head/bigfile/* most of
    the time to build and maintain the mappings.
    
    For its data, a mapping of bigfile X mostly reuses the kernel cache for
    /head/bigfile/X, with the amount of data not backed by that cache being
    proportional to δ(bigfile/X, at..head). In the usual case, where many
    client workers serve requests simultaneously, their database views are
    slightly outdated but close to head, which means that in practice the
    kernel cache for /head/bigfile/* is used almost 100% of the time.
    
    A mapping for bigfile X@at is built from OS-level memory mappings of
    on-WCFS files as follows:
    
                                              ___        /@revA/bigfile/X
            __                                           /@revB/bigfile/X
                   _                                     /@revC/bigfile/X
                               +                         ...
         ───  ───── ──────────────────────────   ─────   /head/bigfile/X
    
    where @revR mmaps are dynamically added/removed by this client package to
    maintain the X@at data view according to the WCFS isolation protocol(*).
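
    In OS terms, such a composite view can be built with plain POSIX mmap: a
    base mapping of /head/bigfile/X whose selected block ranges are replaced,
    via MAP_FIXED, by mappings of the corresponding /@revR/bigfile/X file. The
    following standalone sketch shows only this mechanism; it is not the actual
    wcfs client code, and the paths and block size are illustrative assumptions.

     #include <fcntl.h>
     #include <sys/mman.h>
     #include <unistd.h>
     #include <cstdio>

     int main() {
         const size_t blksize = 2*1024*1024;               // assumed 2MB block size
         int fhead = open("head/bigfile/X",  O_RDONLY);    // hypothetical paths
         int frevA = open("@revA/bigfile/X", O_RDONLY);
         if (fhead < 0 || frevA < 0) { perror("open"); return 1; }

         // base mapping: blocks [0, 4) served from head/bigfile/X
         char *base = (char*)mmap(NULL, 4*blksize, PROT_READ, MAP_SHARED, fhead, 0);
         if (base == MAP_FAILED) { perror("mmap"); return 1; }

         // overlay: remap block #2 so that it shows data from @revA instead of head
         void *blk2 = mmap(base + 2*blksize, blksize, PROT_READ,
                           MAP_SHARED | MAP_FIXED, frevA, 2*blksize);
         if (blk2 == MAP_FAILED) { perror("mmap MAP_FIXED"); return 1; }

         // now blocks 0, 1 and 3 read from head/bigfile/X,
         // while block 2 reads from @revA/bigfile/X.
         return 0;
     }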
    
    API overview
    
     - `WCFS` represents a filesystem-level connection to a wcfs server.
     - `Conn` represents a logical connection that provides a view of data on
       the wcfs filesystem as of a particular database state.
     - `FileH` represents an isolated file view under Conn.
     - `Mapping` represents one memory mapping of a FileH.
    
    A path from WCFS to Mapping is as follows:
    
     WCFS.connect(at)                    -> Conn
     Conn.open(foid)                     -> FileH
     FileH.mmap([blk_start +blk_len))    -> Mapping
    
    A connection can be resynced to another database view via Conn.resync(at').
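
    For illustration, a hypothetical usage sketch of this path is given below.
    Signatures are simplified: error handling and the exact return/ownership
    conventions found in wcfs.h are omitted, and the arguments are placeholders.

     #include "wcfs/client/wcfs.h"

     // Sketch only: follows the WCFS -> Conn -> FileH -> Mapping path named above.
     // Conn/FileH/Mapping are treated as handle types here.
     void example(wcfs::WCFS &wc, zodb::Tid at, zodb::Oid foid, zodb::Tid at2) {
         wcfs::Conn    wconn = wc.connect(at);      // view of the database as of @at
         wcfs::FileH   f     = wconn->open(foid);   // isolated view of bigfile foid
         wcfs::Mapping m     = f->mmap(0, 4);       // blocks [0, 4) of f as of @at

         // ... access data through m's memory ...

         wconn->resync(at2);                        // switch the connection to @at2 view
     }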
    
    Documentation for classes provides more thorough overview and API details.
    
    --------
    
    (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
    
    .
    
    Wcfs client organization
    ~~~~~~~~~~~~~~~~~~~~~~~~
    
    Wcfs client provides its users with isolated bigfile views backed by data
    on the WCFS filesystem. In the absence of the isolation property, wcfs
    client would reduce to just directly using the OS-level file wcfs/head/f
    for a bigfile f. On the other hand, there is a simple, but inefficient, way
    to support isolation: for an @at database view of bigfile f, directly use
    the OS-level file wcfs/@at/f. The latter works, but is very inefficient
    because the OS cache for f's data is not shared between two connections
    with @at1 and @at2 views. The cache is also lost when a connection's view
    of the database is resynced on a transaction boundary. To support isolation
    efficiently, wcfs client uses wcfs/head/f most of the time, but injects
    wcfs/@revX/f parts into mappings to maintain the f@at view, driven by pin
    messages that the wcfs server sends to the client in accordance with the
    WCFS isolation protocol(*).
    
    The wcfs server sends pin messages synchronously, triggered by access to
    mmapped memory. That means that a client thread that is accessing a
    wcfs/head/f mmap is completely blocked while the wcfs server sends pins and
    waits to receive acks from all clients. In other words, on-client handling
    of pins has to be done in a separate thread, because the wcfs server can
    also send pins to the client that triggered the access.
    
    Wcfs client implements pin handling in a so-called "pinner" thread(+). The
    pinner thread receives pin requests from the wcfs server via a watchlink
    handle opened through wcfs/head/watch. For every pin request the pinner
    finds the corresponding Mappings and injects wcfs/@revX/f parts via
    Mapping._remmapblk appropriately.
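
    Schematically, the pinner amounts to a receive-handle-ack loop. The sketch
    below is only an illustration of that structure: PinReq, _readPinReq and
    _ackPinReq are hypothetical stand-ins, and the real code lives in wcfs.cpp.

     // illustrative sketch of the pinner loop (not the actual wcfs client code)
     void pinner(Conn wconn) {
         while (1) {
             PinReq req = wconn->_readPinReq();       // blocks reading wcfs/head/watch

             FileH f = wconn->_filehTab[req.foid];    // file handle being pinned
             for (Mapping m : f->mmaps)               // remap req.blk in all its mappings:
                 m->_remmapblk(req.blk, req.at);      // to /@req.at/bigfile/f, or back to /head/bigfile/f

             f->_pinned[req.blk] = req.at;            // remember the pin for future mmaps
             wconn->_ackPinReq(req);                  // tell wcfs we are done with this pin
         }
     }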
    
    The same watchlink handle is used to send client-originated requests to the
    wcfs server. The requests are sent to tell wcfs that the client wants to
    observe a particular bigfile as of a particular revision, or to stop
    watching it. Such requests originate from regular client threads - not the
    pinner - via entry points like Conn.open, Conn.resync and FileH.close.
    
    Every FileH maintains fileh._pinned {} with the currently pinned blk -> rev.
    This dict is updated by the pinner, driven by pin messages, and is used
    when a new fileh Mapping is created (FileH.mmap).
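
    For example, when a new Mapping is created it can start from a head-backed
    base mapping and then apply the remembered pins on top of it. The sketch
    below only illustrates this idea; _mmap_head is a hypothetical helper and
    the details differ in wcfs.cpp.

     // illustrative sketch: a new Mapping takes fileh._pinned into account
     Mapping _FileH::_new_mapping(int64_t blk_start, int64_t blk_len) {
         Mapping m = _mmap_head(blk_start, blk_len);    // base: /head/bigfile/f

         for (auto& [blk, rev] : _pinned)               // _pinned: blk -> rev, maintained by the pinner
             if (blk_start <= blk && blk < blk_start + blk_len)
                 m->_remmapblk(blk, rev);               // overlay /@rev/bigfile/f for pinned blocks

         return m;
     }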
    
    In wendelin.core a bigfile has the semantics of being infinite in size and
    reading as all zeros beyond the region initialized with data. A
    memory-mapping of an OS-level file can also go beyond the file size;
    however, accessing memory corresponding to the file region after file.size
    triggers SIGBUS. To preserve the wendelin.core semantics, wcfs client
    mmaps-in zeros for Mapping regions after wcfs/head/f.size. For simplicity
    it is assumed that bigfiles only grow and never shrink. This is indeed
    currently so, but will have to be revisited if/when wendelin.core adds
    bigfile truncation. Wcfs client restats wcfs/head/f at every transaction
    boundary (Conn.resync) and remembers f.size in FileH._headfsize for use
    during one transaction(%).
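
    The zero tail itself can be provided by plain anonymous memory. The
    standalone sketch below shows only this technique; it is not the actual
    wcfs client code, and the path, block size and error handling are
    simplified assumptions.

     #include <fcntl.h>
     #include <sys/mman.h>
     #include <sys/stat.h>
     #include <unistd.h>
     #include <cstdio>

     // map n_blk blocks of a file, with the part past the current file size
     // backed by anonymous zero pages instead of the file (to avoid SIGBUS).
     static void *map_with_zero_tail(const char *path, size_t blksize, size_t n_blk) {
         int fd = open(path, O_RDONLY);
         if (fd < 0) { perror("open"); return NULL; }
         struct stat st;
         if (fstat(fd, &st) < 0) { perror("fstat"); return NULL; }

         size_t len      = n_blk * blksize;
         size_t file_len = ((size_t)st.st_size < len) ? (size_t)st.st_size : len;

         // reserve the whole range as anonymous zero pages ...
         char *base = (char*)mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
         if (base == MAP_FAILED) { perror("mmap"); return NULL; }

         // ... and map the existing part of the file over its beginning
         if (file_len > 0 &&
             mmap(base, file_len, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
             perror("mmap MAP_FIXED");
             return NULL;
         }
         close(fd);
         return base;   // [0, file_len): file data; [file_len, len): zeros
     }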
    
    --------
    
    (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
    (+) currently, for simplicity, there is one pinner thread for each connection.
        In the future, for efficiency, it might be reworked to be one pinner thread
        that serves all connections simultaneously.
    (%) see _headWait comments on how this has to be reworked.
    
    Wcfs client locking organization
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    Wcfs client needs to synchronize regular user threads vs each other and vs
    the pinner. A major lock, Conn.atMu, protects changes to Conn's view of the
    database. Whenever atMu.W is taken, Conn.at is changing (Conn.resync), and
    conversely, whenever atMu.R is taken, Conn.at is stable (roughly speaking,
    Conn.resync is not running).
    
    Similarly to wcfs.go(*), several locks that protect internal data
    structures are minor to Conn.atMu: they need to be taken only under atMu.R
    (to synchronize e.g. multiple fileh opens running simultaneously), but do
    not need to be taken at all if atMu.W is taken. In data structures such
    locks are annotated as follows:
    
         sync::Mutex xMu;    // atMu.W  |  atMu.R + xMu
    
    After atMu, Conn.filehMu protects the registry of opened file handles
    (Conn._filehTab), and FileH.mmapMu protects the registry of created
    Mappings (FileH.mmaps) and FileH.pinned.
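
    Put together, the protected state and the annotation convention look
    approximately as follows. This is a condensed sketch with simplified
    stand-in types (std::shared_mutex/std::mutex instead of the sync primitives
    actually used) and an abridged field list:

     #include <cstdint>
     #include <map>
     #include <vector>
     #include <mutex>
     #include <shared_mutex>

     using Tid = uint64_t;           // database revision id  (stand-in type)
     using Oid = uint64_t;           // bigfile object id     (stand-in type)
     struct FileH; struct Mapping;

     struct Conn {
         std::shared_mutex   atMu;       // protects Conn's view of the database
         Tid                 at;         // atMu.W  |  atMu.R

         std::shared_mutex   filehMu;    // atMu.W  |  atMu.R + filehMu
         std::map<Oid, FileH*> _filehTab;        // ^^^ registry of opened file handles
     };

     struct FileH {
         std::mutex             mmapMu;  // atMu.W  |  atMu.R + mmapMu
         std::map<int64_t, Tid> _pinned;         // ^^^ currently pinned blk -> rev
         std::vector<Mapping*>  mmaps;           // ^^^ created Mappings
     };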
    
    Several locks are RWMutex instead of just Mutex not only to allow more
    concurrency, but, in the first place, for correctness: the pinner thread,
    being a core element in handling the WCFS isolation protocol, is
    effectively invoked synchronously from other threads via messages coming
    through the wcfs server. For example, Conn.resync sends a watch request to
    the wcfs server and waits for the answer. The wcfs server, in turn, might
    send corresponding pin messages to the pinner and _wait_ for the answer
    before replying to resync:
    
           - - - - - -
          |       .···|·····.        ---->   = request
             pinner <------.↓        <····   = response
          |           |   wcfs
             resync -------^↓
          |      `····|·····
           - - - - - -
          client process
    
    This creates the necessity to use RWMutex for locks that the pinner and
    other parts of the code could be using at the same time in synchronous
    scenarios similar to the above. These locks are:
    
         - Conn.atMu
         - Conn.filehMu
    
    Note that FileH.mmapMu is a regular - not RW - mutex, since nothing in the
    wcfs client calls into the wcfs server via the watchlink with mmapMu held.
    
    The ordering of locks is:
    
         Conn.atMu > Conn.filehMu > FileH.mmapMu
    
    The pinner takes the following locks:
    
         - wconn.atMu.R
         - wconn.filehMu.R
         - fileh.mmapMu (to read .mmaps  +  write .pinned)
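
    For example, a pin-handling step that respects this ordering would be
    structured approximately like this (a sketch reusing the stand-in types
    from the layout sketch above; the real code lives in wcfs.cpp):

     // handle one pin request while respecting Conn.atMu > Conn.filehMu > FileH.mmapMu
     void handle_pin(Conn& wconn, Oid foid, int64_t blk, Tid rev) {
         std::shared_lock<std::shared_mutex> atR(wconn.atMu);        // 1. wconn.atMu.R
         std::shared_lock<std::shared_mutex> filehR(wconn.filehMu);  // 2. wconn.filehMu.R

         FileH *f = wconn._filehTab.at(foid);

         std::lock_guard<std::mutex> mmapL(f->mmapMu);               // 3. fileh.mmapMu
         f->_pinned[blk] = rev;                                      // write .pinned
         // ... iterate f->mmaps and remap blk in each Mapping ...
     }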
    
    (*) see "Wcfs locking organization" in wcfs.go
    
    Handling of fork
    ~~~~~~~~~~~~~~~~
    
    When a process calls fork, the OS copies its memory and creates a child
    process with only 1 thread. That child inherits file descriptors and memory
    mappings from the parent. To correctly continue using Conn, FileH and
    Mappings, the child would have to recreate the pinner thread and reconnect
    to wcfs via a reopened watchlink. The reason is that without reconnection -
    by using the watchlink file descriptor inherited from the parent - the
    child would interfere in the parent-wcfs exchange, and neither parent nor
    child could continue normal protocol communication with WCFS.
    
    For simplicity, since fork is seldom used for anything besides a follow-up
    exec, wcfs client currently takes the straightforward approach of disabling
    mappings and detaching from the WCFS server in the child right after fork.
    This ensures that there is no interference in the parent-wcfs exchange
    should the child decide not to exec and to continue running in the forked
    thread. Without this protection the interference might come even
    automatically via e.g. Python GC -> PyFileH.__del__ -> FileH.close ->
    message to WCFS.
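
    The general POSIX mechanism for running such an "after fork in child" hook
    is pthread_atfork. The standalone sketch below only illustrates that
    mechanism; it is not necessarily how wcfs client registers its hook, and
    the detach actions are described in comments rather than implemented:

     #include <pthread.h>
     #include <unistd.h>
     #include <string.h>

     // called in the child right after fork: this is where a wcfs client would
     // mark its mappings as invalid and stop talking to the wcfs server.
     static void at_fork_child(void) {
         // e.g. close the watchlink fd, mark Conn/FileH/Mapping as detached, and
         // remap mapped ranges to zeros so stray accesses do not reach WCFS.
         const char msg[] = "child: detached from wcfs\n";
         write(2, msg, strlen(msg));
     }

     __attribute__((constructor))
     static void setup_fork_handler(void) {
         // no prepare/parent hooks are needed for this purpose
         pthread_atfork(/*prepare=*/NULL, /*parent=*/NULL, /*child=*/at_fork_child);
     }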
    
    ----------------------------------------
    
    Some preliminary history:
    
    kirr/wendelin.core@a8fa9178    X wcfs: move client tests into client/
    kirr/wendelin.core@990afac1    X wcfs/client: Package overview (draft)
    kirr/wendelin.core@3f83469c    X wcfs: client: Handle fork
    kirr/wendelin.core@0ed6b8b6    fixup! X wcfs: client: Handle fork
    kirr/wendelin.core@24378c46    X wcfs: client: Provide Conn.at()