Kirill Smelkov authored
This patch follows up on the previous patch, which added the server-side
part of isolation protocol handling, and adds a client package that takes
care of WCFS isolation protocol details and provides to clients a simple
interface to an isolated view of bigfile data on WCFS similar to regular
files: given a particular revision of database @at, it provides synthetic
read-only bigfile memory mappings with data corresponding to @at state,
but using /head/bigfile/* most of the time to build and maintain the
mappings.

The patch is organized as follows:

- wcfs.h and wcfs.cpp bring in usage documentation, internal overview and
  the main part of the implementation.
- wcfs/client/client_test.py is tests.
- The rest of the changes in wcfs/client/ are to support the
  implementation and tests.

Quoting package documentation for the reference:

---- 8< ----

Package wcfs provides WCFS client.

This client package takes care of WCFS isolation protocol details and
provides to clients a simple interface to an isolated view of bigfile data
on WCFS similar to regular files: given a particular revision of database
@at, it provides synthetic read-only bigfile memory mappings with data
corresponding to @at state, but using /head/bigfile/* most of the time to
build and maintain the mappings.

For its data a mapping to bigfile X mostly reuses kernel cache for
/head/bigfile/X, with the amount of data not associated with kernel cache
for /head/bigfile/X being proportional to δ(bigfile/X, at..head). In the
usual case where many client workers simultaneously serve requests, their
database views are a bit outdated, but close to head, which means that in
practice the kernel cache for /head/bigfile/* is being used almost 100% of
the time.

A mapping for bigfile X@at is built from OS-level memory mappings of
on-WCFS files as follows:

                                         ___        /@revA/bigfile/X
       __                                           /@revB/bigfile/X
              _                                     /@revC/bigfile/X
                           +                        ...
    ───  ───── ──────────────────────────   ─────   /head/bigfile/X

where @revR mmaps are being dynamically added/removed by this client
package to maintain X@at data view according to WCFS isolation
protocol(*).


API overview

 - `WCFS` represents filesystem-level connection to wcfs server.
 - `Conn` represents logical connection that provides view of data on wcfs
   filesystem as of particular database state.
 - `FileH` represents isolated file view under Conn.
 - `Mapping` represents one memory mapping of FileH.

A path from WCFS to Mapping is as follows:

    WCFS.connect(at)                    -> Conn
    Conn.open(foid)                     -> FileH
    FileH.mmap([blk_start +blk_len))    -> Mapping

A connection can be resynced to another database view via
Conn.resync(at').

Documentation for classes provides more thorough overview and API details.

--------

(*) see wcfs.go documentation for WCFS isolation protocol overview and
    details.

.

Wcfs client organization
~~~~~~~~~~~~~~~~~~~~~~~~

Wcfs client provides to its users isolated bigfile views backed by data on
WCFS filesystem. In the absence of Isolation property, wcfs client would
reduce to just directly using OS-level file wcfs/head/f for a bigfile f.
On the other hand there is a simple, but inefficient, way to support
isolation: for @at database view of bigfile f - directly use OS-level file
wcfs/@at/f. The latter works, but is very inefficient because OS-cache for
f data is not shared between two connections with @at1 and @at2 views. The
cache is also lost when connection view of the database is resynced on
transaction boundary. To support isolation efficiently, wcfs client uses
wcfs/head/f most of the time, but injects wcfs/@revX/f parts into mappings
to maintain f@at view driven by pin messages that wcfs server sends to
client in accordance with WCFS isolation protocol(*).

Wcfs server sends pin messages synchronously triggered by access to mmaped
memory.
That means that a client thread that is accessing wcfs/head/f mmap is
completely blocked while wcfs server sends pins and waits to receive acks
from all clients. In other words, on-client handling of pins has to be
done in a separate thread, because wcfs server can also send pins to the
client that triggered the access.

Wcfs client implements pins handling in so-called "pinner" thread(+). The
pinner thread receives pin requests from wcfs server via watchlink handle
opened through wcfs/head/watch. For every pin request the pinner finds
corresponding Mappings and injects wcfs/@revX/f parts via
Mapping._remmapblk appropriately.

The same watchlink handle is used to send client-originated requests to
wcfs server. The requests are sent to tell wcfs that client wants to
observe a particular bigfile as of particular revision, or to stop
watching it. Such requests originate from regular client threads - not
pinner - via entry points like Conn.open, Conn.resync and FileH.close.

Every FileH maintains fileh._pinned {} with currently pinned blk -> rev.
This dict is updated by pinner driven by pin messages, and is used when
new fileh Mapping is created (FileH.mmap).

In wendelin.core a bigfile has the semantics that it is infinite in size
and reads as all zeros beyond the region initialized with data.
Memory-mapping of OS-level files can also go beyond file size, however
accessing memory corresponding to file region after file.size triggers
SIGBUS. To preserve wendelin.core semantics wcfs client mmaps-in zeros for
Mapping regions after wcfs/head/f.size. For simplicity it is assumed that
bigfiles only grow and never shrink. It is indeed currently so, but will
have to be revisited if/when wendelin.core adds bigfile truncation. Wcfs
client restats wcfs/head/f at every transaction boundary (Conn.resync) and
remembers f.size in FileH._headfsize for use during one transaction(%).

--------

(*) see wcfs.go documentation for WCFS isolation protocol overview and
    details.
(+) currently, for simplicity, there is one pinner thread for each
    connection. In the future, for efficiency, it might be reworked to be
    one pinner thread that serves all connections simultaneously.
(%) see _headWait comments on how this has to be reworked.


Wcfs client locking organization

Wcfs client needs to synchronize regular user threads vs each other and vs
pinner. A major lock Conn.atMu protects changes to Conn's view of the
database. Whenever atMu.W is taken - Conn.at is changing (Conn.resync),
and conversely, whenever atMu.R is taken - Conn.at is stable (roughly
speaking Conn.resync is not running).

Similarly to wcfs.go(*) several locks that protect internal data
structures are minor to Conn.atMu - they need to be taken only under
atMu.R (to synchronize e.g. multiple fileh open running simultaneously),
but do not need to be taken at all if atMu.W is taken. In data structures
such locks are noted as follows

    sync::Mutex xMu;    // atMu.W  |  atMu.R + xMu

After atMu, Conn.filehMu protects registry of opened file handles
(Conn._filehTab), and FileH.mmapMu protects registry of created Mappings
(FileH.mmaps) and FileH.pinned.

Several locks are RWMutex instead of just Mutex not only to allow more
concurrency, but, in the first place, for correctness: the pinner thread,
being a core element in handling WCFS isolation protocol, is effectively
invoked synchronously from other threads via messages coming through wcfs
server. For example Conn.resync sends watch request to wcfs server and
waits for the answer. Wcfs server, in turn, might send corresponding pin
messages to the pinner and _wait_ for the answer before answering to
resync:

     - - - - - -
    |       .···|·····.        ---->   = request
       pinner <------.↓        <····   = response
    |           |   wcfs
       resync -------^↓
    |      `····|·····
     - - - - - -
    client process

This creates the necessity to use RWMutex for locks that pinner and other
parts of the code could be using at the same time in synchronous scenarios
similar to the above.
These locks are:

    - Conn.atMu
    - Conn.filehMu

Note that FileH.mmapMu is a regular - not RW - mutex, since nothing in wcfs
client calls into wcfs server via watchlink with mmapMu held.

The ordering of locks is:

    Conn.atMu > Conn.filehMu > FileH.mmapMu

The pinner takes the following locks:

    - wconn.atMu.R
    - wconn.filehMu.R
    - fileh.mmapMu (to read .mmaps + write .pinned)

(*) see "Wcfs locking organization" in wcfs.go


Handling of fork

When a process calls fork, OS copies its memory and creates child process
with only 1 thread. That child inherits file descriptors and memory
mappings from parent. To correctly continue using Conn, FileH and
Mappings, the child must recreate pinner thread and reconnect to wcfs via
reopened watchlink. The reason here is that without reconnection - by
using watchlink file descriptor inherited from parent - the child would
interfere with parent-wcfs exchange and neither parent nor child could
continue normal protocol communication with WCFS.

For simplicity, since fork is seldom used for things besides followup
exec, wcfs client currently takes a straightforward approach by disabling
mappings and detaching from WCFS server in the child right after fork.
This ensures that there is no interference with parent-wcfs exchange
should child decide not to exec and to continue running in the forked
thread. Without this protection the interference might come even
automatically via e.g. Python GC -> PyFileH.__del__ -> FileH.close ->
message to WCFS.

----------------------------------------

Some preliminary history:

a8fa9178    X wcfs: move client tests into client/
990afac1    X wcfs/client: Package overview (draft)
3f83469c    X wcfs: client: Handle fork
0ed6b8b6    fixup! X wcfs: client: Handle fork
24378c46    X wcfs: client: Provide Conn.at()
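
For illustration only, the "detach in the child right after fork" idea
from the fork handling notes above can be sketched in Python roughly as
follows. This is a toy model under stated assumptions, not the wcfs client
code: ForkAwareLink and demo are hypothetical names, a plain pipe stands in
for the watchlink file descriptor, and how the real client hooks into fork
is not shown here. The only library facility relied upon is the standard
os.register_at_fork hook (POSIX only).

```python
import os
import threading


class ForkAwareLink:
    """Hypothetical stand-in for a client's link to a server (not the wcfs API)."""

    def __init__(self):
        # Assumption: a pipe plays the role of the watchlink file descriptor.
        self._rfd, self._wfd = os.pipe()
        self._detached = False
        self._mu = threading.Lock()
        # Ask Python to call our hook in the child right after os.fork().
        # Note: the hook registry keeps self alive for the process lifetime.
        os.register_at_fork(after_in_child=self._detach_in_child)

    def _detach_in_child(self):
        # The child has only the forking thread; a lock copied from the parent
        # may be stuck in the locked state, so replace it instead of taking it.
        self._mu = threading.Lock()
        self._detached = True
        # Drop inherited descriptors without writing anything to them, so the
        # parent's exchange over the same pipe is not disturbed.
        os.close(self._rfd)
        os.close(self._wfd)

    def send(self, msg: bytes) -> bool:
        """Write msg to the link; return False and do nothing once detached."""
        with self._mu:
            if self._detached:
                return False
            os.write(self._wfd, msg)
            return True

    def recv(self, n: int = 1024) -> bytes:
        """Read what was written to the link (parent side of the demo only)."""
        return os.read(self._rfd, n)


def demo():
    link = ForkAwareLink()
    pid = os.fork()
    if pid == 0:
        # Child: simulate a late message, e.g. a close triggered by GC.
        sent = link.send(b"child-close\n")
        os._exit(1 if sent else 0)  # exit code 0 means "stayed silent"
    _, status = os.waitpid(pid, 0)
    child_silent = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    link.send(b"parent-msg\n")
    return child_silent, link.recv()
```

Calling demo() forks once: the child's attempt to send is turned into a
no-op by the after-fork hook, so the only data the parent later reads from
the pipe is its own message. The lock is replaced rather than acquired in
the hook because another parent thread might have held it at fork time.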
10f7153a