    wcfs: client: Provide client package to care about isolation protocol details · 10f7153a
    Kirill Smelkov authored
    This patch follows up on the previous patch, which added the server-side
    part of isolation protocol handling, and adds a client package that takes
    care of WCFS isolation protocol details and provides clients with a simple
    interface to an isolated view of bigfile data on WCFS, similar to regular
    files: given a particular revision of database @at, it provides synthetic
    read-only bigfile memory mappings with data corresponding to the @at state,
    while using /head/bigfile/* most of the time to build and maintain the
    mappings.
    
    The patch is organized as follows:
    
    - wcfs.h and wcfs.cpp bring in the usage documentation, an internal
      overview and the main part of the implementation.

    - wcfs/client/client_test.py contains the tests.

    - The rest of the changes in wcfs/client/ support the implementation and
      the tests.
    
    Quoting package documentation for the reference:
    
    ---- 8< ----
    
    Package wcfs provides a WCFS client.
    
    This client package takes care of WCFS isolation protocol details and
    provides clients with a simple interface to an isolated view of bigfile
    data on WCFS, similar to regular files: given a particular revision of
    database @at, it provides synthetic read-only bigfile memory mappings with
    data corresponding to the @at state, while using /head/bigfile/* most of
    the time to build and maintain the mappings.
    
    For its data a mapping to bigfile X mostly reuses the kernel cache for
    /head/bigfile/X, with the amount of data not associated with that kernel
    cache being proportional to δ(bigfile/X, at..head). In the usual case,
    where many client workers serve requests simultaneously, their database
    views are a bit outdated but close to head, which means that in practice
    the kernel cache for /head/bigfile/* is used almost 100% of the time.
    
    A mapping for bigfile X@at is built from OS-level memory mappings of
    on-WCFS files as follows:
    
                                              ___        /@revA/bigfile/X
            __                                           /@revB/bigfile/X
                   _                                     /@revC/bigfile/X
                               +                         ...
         ───  ───── ──────────────────────────   ─────   /head/bigfile/X
    
    where the @revR mmaps are dynamically added/removed by this client package
    to maintain the X@at data view according to the WCFS isolation protocol(*).
    
    API overview
    
     - `WCFS` represents a filesystem-level connection to the wcfs server.
     - `Conn` represents a logical connection that provides a view of data on
       the wcfs filesystem as of a particular database state.
     - `FileH` represents an isolated file view under a Conn.
     - `Mapping` represents one memory mapping of a FileH.
    
    A path from WCFS to Mapping is as follows:
    
     WCFS.connect(at)                    -> Conn
     Conn.open(foid)                     -> FileH
     FileH.mmap([blk_start +blk_len))    -> Mapping
    
    A connection can be resynced to another database view via Conn.resync(at').
    
    The documentation for the individual classes provides a more thorough
    overview and API details.
    
    --------
    
    (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
    
    .
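
    For concreteness, here is a hedged usage sketch, in C++, of the
    WCFS -> Conn -> FileH -> Mapping path from the API overview above. Exact
    signatures, return and error conventions, and field names such as mem_start
    are assumptions - wcfs.h carries the authoritative usage documentation:

     #include <wcfs/client/wcfs.h>

     // Read a few blocks of bigfile foid as of database state @at.
     // Error handling is elided; everything besides the WCFS/Conn/FileH/Mapping
     // classes and the connect/open/mmap/resync entry points is an assumption.
     void read_bigfile_at(wcfs::WCFS &wc, zodb::Tid at, zodb::Oid foid) {
         wcfs::Conn    wconn = wc.connect(at);      // isolated view as of @at
         wcfs::FileH   f     = wconn->open(foid);   // isolated file handle under wconn
         wcfs::Mapping m     = f->mmap(/*blk_start=*/0, /*blk_len=*/4);  // blocks [0, 4)

         // m->mem_start .. m->mem_stop is read-only memory with f's data as of @at,
         // backed mostly by /head/bigfile/* plus @rev parts injected by the client.

         m->unmap();
         f->close();
         wconn->close();    // or wconn->resync(at2) at the next transaction boundary
     }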
    
    Wcfs client organization
    ~~~~~~~~~~~~~~~~~~~~~~~~
    
    Wcfs client provides its users with isolated bigfile views backed by data
    on the WCFS filesystem. In the absence of the isolation property, wcfs
    client would reduce to just directly using the OS-level file wcfs/head/f
    for a bigfile f. On the other hand, there is a simple, but inefficient, way
    to support isolation: for an @at database view of bigfile f - directly use
    the OS-level file wcfs/@at/f. The latter works, but is very inefficient
    because the OS cache for f's data is not shared between two connections
    with @at1 and @at2 views. The cache is also lost when a connection's view
    of the database is resynced on a transaction boundary. To support isolation
    efficiently, wcfs client uses wcfs/head/f most of the time, but injects
    wcfs/@revX/f parts into mappings to maintain the f@at view, driven by pin
    messages that the wcfs server sends to the client in accordance with the
    WCFS isolation protocol(*).
    
    The wcfs server sends pin messages synchronously, triggered by access to
    mmaped memory. That means that a client thread which is accessing a
    wcfs/head/f mmap is completely blocked while the wcfs server sends pins and
    waits to receive acks from all clients. In other words, on-client handling
    of pins has to be done in a separate thread, because the wcfs server can
    also send pins to the client that triggered the access.
    
    Wcfs client implements pin handling in a so-called "pinner" thread(+). The
    pinner thread receives pin requests from the wcfs server via the watchlink
    handle opened through wcfs/head/watch. For every pin request the pinner
    finds the corresponding Mappings and injects wcfs/@revX/f parts via
    Mapping._remmapblk appropriately.
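
    The core of that remapping step is a plain mmap with MAP_FIXED over the
    affected block range of an already established mapping. A minimal
    self-contained sketch, assuming the block size and the path of the file as
    of @rev are given (in the client the real work is done by
    Mapping._remmapblk):

     #include <cstddef>
     #include <cstdint>
     #include <fcntl.h>
     #include <string>
     #include <sys/mman.h>
     #include <unistd.h>

     // remmapblk overlays block #blk of a mapping that starts at mem_start and
     // covers file blocks [blk_start, ...) with data from the file at revpath
     // (e.g. wcfs/@revX/bigfile/X); the rest of the mapping keeps using
     // wcfs/head/bigfile/X.
     bool remmapblk(char *mem_start, int64_t blk_start, int64_t blk,
                    size_t blksize, const std::string &revpath) {
         int fd = open(revpath.c_str(), O_RDONLY);
         if (fd < 0)
             return false;
         // MAP_FIXED replaces exactly this block's pages in the existing mapping
         void *addr = mmap(mem_start + (blk - blk_start)*blksize, blksize, PROT_READ,
                           MAP_SHARED | MAP_FIXED, fd, (off_t)(blk*blksize));
         close(fd);    // the established mapping keeps the file referenced
         return addr != MAP_FAILED;
     }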
    
    The same watchlink handle is used to send client-originated requests to the
    wcfs server. These requests tell wcfs that the client wants to observe a
    particular bigfile as of a particular revision, or to stop watching it.
    Such requests originate from regular client threads - not the pinner - via
    entry points like Conn.open, Conn.resync and FileH.close.
    
    Every FileH maintains fileh._pinned {} with the currently pinned blk -> rev
    entries. This dict is updated by the pinner, driven by pin messages, and is
    used when a new fileh Mapping is created (FileH.mmap).
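
    In code terms this means that FileH.mmap, after mapping in /head/bigfile/X,
    walks the blk -> rev dict and sets already-pinned blocks up from their
    pinned revisions right away. A sketch building on the remmapblk sketch
    above (revpath_for is a hypothetical helper producing the wcfs path of the
    file as of @rev):

     #include <cstdint>
     #include <map>
     #include <string>

     std::string revpath_for(uint64_t rev);   // hypothetical: -> path of file as of @rev
     bool remmapblk(char *mem_start, int64_t blk_start, int64_t blk,
                    size_t blksize, const std::string &revpath);   // see sketch above

     // after mmapping blocks [blk_start, blk_start+blk_len) of /head/bigfile/X at
     // mem_start, overlay every block the pinner has already pinned to an older rev
     void setup_pinned(char *mem_start, int64_t blk_start, int64_t blk_len,
                       size_t blksize, const std::map<int64_t, uint64_t> &pinned) {
         for (const auto &kv : pinned)        // blk -> rev, as in fileh._pinned
             if (blk_start <= kv.first && kv.first < blk_start + blk_len)
                 remmapblk(mem_start, blk_start, kv.first, blksize,
                           revpath_for(kv.second));
     }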
    
    In wendelin.core a bigfile has the semantic that it is infinite in size and
    reads as all zeros beyond the region initialized with data. Memory-mapping
    of OS-level files can also go beyond the file size, however accessing
    memory corresponding to a file region after file.size triggers SIGBUS. To
    preserve the wendelin.core semantic, wcfs client mmaps-in zeros for Mapping
    regions after wcfs/head/f.size. For simplicity it is assumed that bigfiles
    only grow and never shrink. This is indeed currently so, but will have to
    be revisited if/when wendelin.core adds bigfile truncation. Wcfs client
    restats wcfs/head/f at every transaction boundary (Conn.resync) and
    remembers f.size in FileH._headfsize for use during one transaction(%).
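
    A minimal sketch of that zero tail, assuming the region past _headfsize has
    already been reserved inside the Mapping: read-only anonymous pages placed
    with MAP_FIXED read as zeros and never reach the file, so no SIGBUS can be
    triggered there:

     #include <sys/mman.h>

     // back [tail_start, tail_start + tail_len) of an already reserved mapping
     // with zeros instead of file data beyond wcfs/head/f.size
     bool map_zero_tail(char *tail_start, size_t tail_len) {
         void *addr = mmap(tail_start, tail_len, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
         return addr != MAP_FAILED;    // anonymous pages read as all zeros
     }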
    
    --------
    
    (*) see wcfs.go documentation for WCFS isolation protocol overview and details.
    (+) currently, for simplicity, there is one pinner thread for each connection.
        In the future, for efficiency, it might be reworked to be one pinner thread
        that serves all connections simultaneously.
    (%) see _headWait comments on how this has to be reworked.
    
    Wcfs client locking organization
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    Wcfs client needs to synchronize regular user threads with each other and
    with the pinner. A major lock, Conn.atMu, protects changes to Conn's view
    of the database. Whenever atMu.W is taken, Conn.at is changing
    (Conn.resync); conversely, whenever atMu.R is taken, Conn.at is stable
    (roughly speaking, Conn.resync is not running).
    
    Similarly to wcfs.go(*), several locks that protect internal data
    structures are minor to Conn.atMu - they need to be taken only under atMu.R
    (to synchronize e.g. multiple fileh opens running simultaneously), but do
    not need to be taken at all if atMu.W is taken. In data structures such
    locks are noted as follows:
    
         sync::Mutex xMu;    // atMu.W  |  atMu.R + xMu
    
    After atMu, Conn.filehMu protects the registry of opened file handles
    (Conn._filehTab), and FileH.mmapMu protects the registry of created
    Mappings (FileH.mmaps) and FileH.pinned.
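
    An illustrative layout of those locks and the data they protect, using std
    primitives and stand-in types (field names follow the text above; the
    authoritative definitions and the actual synchronization primitives are in
    wcfs.h):

     #include <cstdint>
     #include <map>
     #include <memory>
     #include <mutex>
     #include <shared_mutex>
     #include <vector>

     struct Mapping;                       // one mmap of a FileH (details elided)
     typedef uint64_t Tid;                 // database revision id (stand-in)
     typedef uint64_t Oid;                 // bigfile object id    (stand-in)

     struct FileH {
         std::mutex mmapMu;                // atMu.W  |  atMu.R + mmapMu
         std::map<int64_t, Tid> pinned;    // mmapMu; blk -> rev, maintained by pinner
         std::vector<std::shared_ptr<Mapping>> mmaps;    // mmapMu
     };

     struct Conn {
         std::shared_mutex atMu;           // W: .at is changing (resync); R: .at is stable
         Tid at;                           // atMu

         std::shared_mutex filehMu;        // atMu.W  |  atMu.R + filehMu
         std::map<Oid, std::shared_ptr<FileH>> filehTab;    // filehMu
     };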
    
    Several locks are RWMutex instead of just Mutex not only to allow more
    concurrency but, in the first place, for correctness: the pinner thread,
    being the core element in handling the WCFS isolation protocol, is
    effectively invoked synchronously from other threads via messages going
    through the wcfs server. For example, Conn.resync sends a watch request to
    the wcfs server and waits for the answer. The wcfs server, in turn, might
    send corresponding pin messages to the pinner and _wait_ for the answer
    before answering resync:
    
           - - - - - -
          |       .···|·····.        ---->   = request
             pinner <------.↓        <····   = response
          |           |   wcfs
             resync -------^↓
          |      `····|·····
           - - - - - -
          client process
    
    This creates the necessity to use RWMutex for locks that the pinner and
    other parts of the code could be using at the same time in synchronous
    scenarios similar to the above. These locks are:
    
         - Conn.atMu
         - Conn.filehMu
    
    Note that FileH.mmapMu is a regular - not RW - mutex, since nothing in wcfs
    client calls into the wcfs server via the watchlink with mmapMu held.
    
    The ordering of locks is:
    
         Conn.atMu > Conn.filehMu > FileH.mmapMu
    
    The pinner takes the following locks:
    
         - wconn.atMu.R
         - wconn.filehMu.R
         - fileh.mmapMu (to read .mmaps  +  write .pinned)
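
    Using the stand-in structs from the sketch above, the pinner's handling of
    one pin request takes the locks in exactly this order (the helper and
    variable names are illustrative):

     // handle one pin request (foid, blk, rev) inside the pinner thread
     void handle_pin(Conn &wconn, Oid foid, int64_t blk, Tid rev) {
         std::shared_lock<std::shared_mutex> at_r(wconn.atMu);        // wconn.at stays stable
         std::shared_lock<std::shared_mutex> fileh_r(wconn.filehMu);  // read wconn.filehTab
         std::shared_ptr<FileH> f = wconn.filehTab.at(foid);

         std::lock_guard<std::mutex> mmap_g(f->mmapMu);   // write .pinned, read .mmaps
         f->pinned[blk] = rev;
         // ... remap blk in every Mapping from f->mmaps (see the remmapblk sketch) ...
     }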
    
    (*) see "Wcfs locking organization" in wcfs.go
    
    Handling of fork
    ~~~~~~~~~~~~~~~~
    
    When a process calls fork, the OS copies its memory and creates a child
    process with only 1 thread. That child inherits file descriptors and memory
    mappings from the parent. To correctly continue using Conn, FileH and
    Mappings, the child must recreate the pinner thread and reconnect to wcfs
    via a reopened watchlink. The reason is that without reconnection - by
    using the watchlink file descriptor inherited from the parent - the child
    would interfere in the parent-wcfs exchange and neither parent nor child
    could continue normal protocol communication with WCFS.
    
    For simplicity, since fork is seldom used for things besides a follow-up
    exec, wcfs client currently takes the straightforward approach of disabling
    mappings and detaching from the WCFS server in the child right after fork.
    This ensures that there is no interference in the parent-wcfs exchange
    should the child decide not to exec and to continue running in the forked
    process. Without this protection the interference might come even
    automatically, via e.g. Python GC -> PyFileH.__del__ -> FileH.close ->
    message to WCFS.
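
    A sketch of that "detach in the child" approach via pthread_atfork (whether
    the client registers its handler this way is an assumption; the handler
    body only outlines the steps):

     #include <pthread.h>

     static void wcfs_afterfork_child() {
         // In the child there is no pinner thread anymore and the inherited
         // watchlink fd must not be used, or the child would interfere with the
         // parent <-> wcfs protocol exchange:
         //  - mark all Conn/FileH/Mapping objects as no longer usable;
         //  - drop the inherited watchlink fd without sending anything over it.
         // (both steps elided here)
     }

     static void wcfs_install_fork_handler() {
         // register once, e.g. when the first connection to wcfs is made
         pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr,
                        /*child=*/wcfs_afterfork_child);
     }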
    
    ----------------------------------------
    
    Some preliminary history:
    
    a8fa9178    X wcfs: move client tests into client/
    990afac1    X wcfs/client: Package overview (draft)
    3f83469c    X wcfs: client: Handle fork
    0ed6b8b6    fixup! X wcfs: client: Handle fork
    24378c46    X wcfs: client: Provide Conn.at()