Commit 10f7153a authored by Kirill Smelkov

wcfs: client: Provide client package to care about isolation protocol details

This patch follows up on the previous patch, which added the server-side part
of isolation protocol handling, and adds a client package that takes care of
WCFS isolation protocol details and provides clients with a simple interface
to an isolated view of bigfile data on WCFS, similar to regular files: given a
particular database revision @at, it provides synthetic read-only bigfile
memory mappings with data corresponding to the @at state, while using
/head/bigfile/* most of the time to build and maintain the mappings.

The patch is organized as follows:

- wcfs.h and wcfs.cpp bring in usage documentation, an internal overview and
  the main part of the implementation.

- wcfs/client/client_test.py contains the tests.

- The rest of the changes in wcfs/client/ support the implementation and tests.

Quoting the package documentation for reference:

---- 8< ----

Package wcfs provides WCFS client.

This client package takes care of WCFS isolation protocol details and
provides clients with a simple interface to an isolated view of bigfile data
on WCFS, similar to regular files: given a particular database revision @at,
it provides synthetic read-only bigfile memory mappings with data
corresponding to the @at state, while using /head/bigfile/* most of the time
to build and maintain the mappings.

For its data, a mapping of bigfile X mostly reuses the kernel cache for
/head/bigfile/X; the amount of data not backed by that cache is proportional
to δ(bigfile/X, at..head). In the usual case where many client workers
simultaneously serve requests, their database views are a bit outdated, but
close to head, which means that in practice the kernel cache for
/head/bigfile/* is being used almost 100% of the time.

A mapping for bigfile X@at is built from OS-level memory mappings of
on-WCFS files as follows:

                                          ___        /@revA/bigfile/X
        __                                           /@revB/bigfile/X
               _                                     /@revC/bigfile/X
                           +                         ...
     ───  ───── ──────────────────────────   ─────   /head/bigfile/X

where @revR mmaps are being dynamically added/removed by this client package
to maintain X@at data view according to WCFS isolation protocol(*).
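
A toy model of this composition, not the actual implementation: assuming
blocks are kept in plain dicts, the X@at view takes every block from head
unless a pin overrides it with the block from some @rev. The names view_block,
head, revs and pinned are hypothetical and exist only for this sketch.

```python
def view_block(blk, pinned, head, revs):
    """Return data of block blk as seen through the X@at view.

    pinned: {blk -> rev}          blocks that must not be taken from head
    head:   {blk -> data}         stands in for /head/bigfile/X
    revs:   {rev -> {blk -> data}} stands in for /@rev/bigfile/X
    """
    rev = pinned.get(blk)
    if rev is None:
        return head[blk]       # common case: shared with head
    return revs[rev][blk]      # overridden by an @rev part


def build_view(nblk, pinned, head, revs):
    """Return the list of block data for blocks [0, nblk)."""
    return [view_block(blk, pinned, head, revs) for blk in range(nblk)]
```

With head = {0: b"h0", 1: b"h1", 2: b"h2"}, revs = {"revA": {1: b"a1"}} and
pinned = {1: "revA"}, only block 1 is taken from @revA; the rest stays on head.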

API overview

 - `WCFS` represents filesystem-level connection to wcfs server.
 - `Conn` represents logical connection that provides view of data on wcfs
   filesystem as of particular database state.
 - `FileH` represents isolated file view under Conn.
 - `Mapping` represents one memory mapping of FileH.

A path from WCFS to Mapping is as follows:

 WCFS.connect(at)                    -> Conn
 Conn.open(foid)                     -> FileH
 FileH.mmap([blk_start +blk_len))    -> Mapping

A connection can be resynced to another database view via Conn.resync(at').
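
A minimal sketch of walking that path through the Python wrappers, assuming
an object wc that exposes connect/open/mmap/unmap/close and a .mem buffer the
way PyWCFS, PyConn, PyFileH and PyMapping do in this patch. The helper
read_blocks and the Fake* classes are hypothetical stand-ins so that the
sketch runs without a mounted wcfs.

```python
def read_blocks(wc, at, foid, blk_start, blk_len):
    """Return a copy of blocks [blk_start, blk_start+blk_len) of foid as of at."""
    conn = wc.connect(at)                         # WCFS.connect(at) -> Conn
    try:
        fileh = conn.open(foid)                   # Conn.open(foid)  -> FileH
        try:
            mapping = fileh.mmap(blk_start, blk_len)   # FileH.mmap  -> Mapping
            try:
                return bytes(mapping.mem)         # copy data out of the mapping
            finally:
                mapping.unmap()
        finally:
            fileh.close()
    finally:
        conn.close()


# ---- hypothetical in-memory stand-ins, only for trying the helper ----
class FakeMapping:
    def __init__(self, data):
        self.mem = memoryview(data)
    def unmap(self):
        pass

class FakeFileH:
    blksize = 4
    def __init__(self, data):
        self._data = data
    def mmap(self, blk_start, blk_len):
        lo = blk_start * self.blksize
        return FakeMapping(self._data[lo:lo + blk_len * self.blksize])
    def close(self):
        pass

class FakeConn:
    def __init__(self, files):
        self._files = files
    def open(self, foid):
        return FakeFileH(self._files[foid])
    def close(self):
        pass

class FakeWCFS:
    def __init__(self, db):          # db: {at -> {foid -> bytes}}
        self._db = db
    def connect(self, at):
        return FakeConn(self._db[at])
```

The nested try/finally mirrors the ownership chain: a Mapping belongs to its
FileH, and a FileH belongs to its Conn, so they are released in reverse order.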

Documentation for classes provides more thorough overview and API details.

--------

(*) see wcfs.go documentation for WCFS isolation protocol overview and details.

.

Wcfs client organization
~~~~~~~~~~~~~~~~~~~~~~~~

Wcfs client provides its users with isolated bigfile views backed by data on
the WCFS filesystem. Without the isolation property, wcfs client would reduce
to directly using the OS-level file wcfs/head/f for a bigfile f. On the other
hand, there is a simple but inefficient way to support isolation: for the @at
database view of bigfile f, directly use the OS-level file wcfs/@at/f. This
works, but is very inefficient because the OS cache for f data is not shared
between two connections with @at1 and @at2 views. The cache is also lost when
the connection's view of the database is resynced on a transaction boundary.
To support isolation efficiently, wcfs client uses wcfs/head/f most of the
time, but injects wcfs/@revX/f parts into mappings to maintain the f@at view,
driven by pin messages that wcfs server sends to the client in accordance
with the WCFS isolation protocol(*).

Wcfs server sends pin messages synchronously, triggered by access to mmapped
memory. That means that a client thread that is accessing a wcfs/head/f mmap
is completely blocked while wcfs server sends pins and waits to receive acks
from all clients. In other words, on-client handling of pins has to be done
in a separate thread, because wcfs server can also send pins to the client
that triggered the access.

Wcfs client implements pins handling in so-called "pinner" thread(+). The
pinner thread receives pin requests from wcfs server via watchlink handle
opened through wcfs/head/watch. For every pin request the pinner finds
corresponding Mappings and injects wcfs/@revX/f parts via Mapping._remmapblk
appropriately.
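
Why a dedicated thread is needed can be illustrated with a small simulation.
This is only a sketch under simplifying assumptions: two queues stand in for
the watchlink, server_handle_access stands in for wcfs server handling a read
of head/f, and the revision name "revA" is made up. None of these names come
from the patch.

```python
import queue
import threading


def access_with_pinner():
    """Simulate one memory access that makes the server pin a block first."""
    pin_requests = queue.Queue()   # server -> client pinner
    acks = queue.Queue()           # client pinner -> server
    pinned = {}                    # blk -> rev, as recorded by the pinner

    def pinner():
        # Serves pin requests independently of the accessing thread.
        while True:
            req = pin_requests.get()
            if req is None:        # shutdown marker
                return
            blk, rev = req
            pinned[blk] = rev      # the real client would remmap the block here
            acks.put(blk)

    def server_handle_access(blk):
        # The server does not answer the access until the pin is acknowledged.
        pin_requests.put((blk, "revA"))
        return acks.get(timeout=5)

    t = threading.Thread(target=pinner)
    t.start()
    acked = server_handle_access(3)    # the accessing thread is blocked here
    pin_requests.put(None)
    t.join()
    return pinned, acked
```

If the accessing thread itself were responsible for serving pin requests, the
acks.get call would never return, which is the deadlock the pinner avoids.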

The same watchlink handle is used to send client-originated requests to wcfs
server. The requests are sent to tell wcfs that client wants to observe a
particular bigfile as of particular revision, or to stop watching it.
Such requests originate from regular client threads - not pinner - via entry
points like Conn.open, Conn.resync and FileH.close.

Every FileH maintains fileh._pinned {} with currently pinned blk -> rev. This
dict is updated by pinner driven by pin messages, and is used when
new fileh Mapping is created (FileH.mmap).
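
One plausible reading of how the pinned dict is consulted when a mapping is
created, written as a sketch: only pins that fall into the new mapping's block
range need an @rev override. The helper name overrides_for_mapping is
hypothetical, and the real FileH.mmap may do more than this.

```python
def overrides_for_mapping(pinned, blk_start, blk_len):
    """Return {blk: rev} for pinned blocks inside [blk_start, blk_start+blk_len)."""
    blk_stop = blk_start + blk_len
    return {blk: rev
            for blk, rev in sorted(pinned.items())
            if blk_start <= blk < blk_stop}
```

For example, with pins on blocks 1, 5 and 9, a mapping of blocks [4, 8) needs
an override only for block 5.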

In wendelin.core a bigfile has the semantics that it is infinite in size and
reads as all zeros beyond the region initialized with data. Memory-mapping of
OS-level files can also go beyond the file size; however, accessing memory
corresponding to a file region after file.size triggers SIGBUS. To preserve
wendelin.core semantics, wcfs client mmaps-in zeros for Mapping regions after
wcfs/head/f.size. For simplicity it is assumed that bigfiles only grow and
never shrink. This is currently the case, but will have to be revisited
if/when wendelin.core adds bigfile truncation. Wcfs client restats
wcfs/head/f at every transaction boundary (Conn.resync) and remembers f.size
in FileH._headfsize for use during one transaction(%).
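
The split between the file-backed part and the zero part of a mapping is plain
arithmetic. The sketch below assumes that headfsize is a multiple of blksize;
the function name split_mapping is hypothetical and the actual code may align
and round differently.

```python
def split_mapping(blk_start, blk_len, blksize, headfsize):
    """Return (data_bytes, zero_bytes) for a mapping of
    blocks [blk_start, blk_start+blk_len).

    data_bytes is backed by head/f, zero_bytes lies beyond headfsize and
    has to be backed by zeros to avoid SIGBUS on access.
    """
    start = blk_start * blksize
    stop = (blk_start + blk_len) * blksize
    data_stop = min(max(headfsize, start), stop)   # clamp into [start, stop]
    return data_stop - start, stop - data_stop
```

For a file of 3 blocks of 4096 bytes, a mapping of blocks [2, 6) has one block
of file data followed by three blocks of zeros.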

--------

(*) see wcfs.go documentation for WCFS isolation protocol overview and details.
(+) currently, for simplicity, there is one pinner thread for each connection.
    In the future, for efficiency, it might be reworked to be one pinner thread
    that serves all connections simultaneously.
(%) see _headWait comments on how this has to be reworked.

Wcfs client locking organization

Wcfs client needs to synchronize regular user threads vs each other and vs
pinner. A major lock Conn.atMu protects changes to Conn's view of the
database. Whenever atMu.W is taken - Conn.at is changing (Conn.resync), and
conversely, whenever atMu.R is taken - Conn.at is stable (roughly speaking
Conn.resync is not running).

Similarly to wcfs.go(*) several locks that protect internal data structures
are minor to Conn.atMu - they need to be taken only under atMu.R (to
synchronize e.g. multiple fileh open running simultaneously), but do not
need to be taken at all if atMu.W is taken. In data structures such locks
are noted as follows

     sync::Mutex xMu;    // atMu.W  |  atMu.R + xMu

After atMu, Conn.filehMu protects registry of opened file handles
(Conn._filehTab), and FileH.mmapMu protects registry of created Mappings
(FileH.mmaps) and FileH.pinned.

Several locks are RWMutex instead of just Mutex, not only to allow more
concurrency but, in the first place, for correctness: the pinner thread, being
the core element in handling the WCFS isolation protocol, is effectively
invoked synchronously from other threads via messages coming through wcfs
server. For example, Conn.resync sends a watch request to wcfs server and
waits for the answer. Wcfs server, in turn, might send corresponding pin
messages to the pinner and _wait_ for the answer before answering to resync:

       - - - - - -
      |       .···|·····.        ---->   = request
         pinner <------.↓        <····   = response
      |           |   wcfs
         resync -------^↓
      |      `····|·····
       - - - - - -
      client process

This creates the necessity to use RWMutex for locks that pinner and other
parts of the code could be using at the same time in synchronous scenarios
similar to the above. These locks are:

     - Conn.atMu
     - Conn.filehMu

Note that FileH.mmapMu is regular - not RW - mutex, since nothing in wcfs
client calls into wcfs server via watchlink with mmapMu held.

The ordering of locks is:

     Conn.atMu > Conn.filehMu > FileH.mmapMu
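
Reading "A > B" as "A is taken before B", the ordering can be expressed as a
small checker. This is an illustrative sketch only; LOCK_ORDER and
check_lock_order are hypothetical names, not part of the client.

```python
LOCK_ORDER = ["Conn.atMu", "Conn.filehMu", "FileH.mmapMu"]   # outermost first


def check_lock_order(acquisitions):
    """Return True if locks are acquired strictly from outer to inner.

    acquisitions is the list of lock names in the order a thread takes them.
    Skipping a level is allowed; going back to an outer lock is not.
    """
    ranks = [LOCK_ORDER.index(name) for name in acquisitions]
    return all(a < b for a, b in zip(ranks, ranks[1:]))
```

Taking atMu, then filehMu, then mmapMu passes the check, while taking mmapMu
and then atMu is reported as a violation.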

The pinner takes the following locks:

     - wconn.atMu.R
     - wconn.filehMu.R
     - fileh.mmapMu (to read .mmaps  +  write .pinned)

(*) see "Wcfs locking organization" in wcfs.go

Handling of fork

When a process calls fork, the OS copies its memory and creates a child
process with only 1 thread. That child inherits file descriptors and memory
mappings from the parent. To correctly continue using Conn, FileH and
Mappings, the child must recreate the pinner thread and reconnect to wcfs via
a reopened watchlink. The reason is that without reconnection - by using the
watchlink file descriptor inherited from the parent - the child would
interfere with the parent-wcfs exchange, and neither parent nor child could
continue normal protocol communication with WCFS.

For simplicity, since fork is seldom used for things besides a followup
exec, wcfs client currently takes a straightforward approach by disabling
mappings and detaching from WCFS server in the child right after fork. This
ensures that there is no interference with the parent-wcfs exchange should
the child decide not to exec and to continue running in the forked process.
Without this protection the interference might come even automatically via
e.g. Python GC -> PyFileH.__del__ -> FileH.close -> message to WCFS.
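
The same detach-in-child idea can be sketched in Python with
os.register_at_fork (Python 3.7+, POSIX only). The Client class and its
detached flag are hypothetical; the patch itself does this in C++ via
pthread_atfork and the IAfterFork interface.

```python
import os


class Client:
    """Toy client that marks itself detached in a forked child."""

    def __init__(self):
        self.detached = False
        # The handler runs in the child right after fork, before user code.
        # Note: handlers registered this way cannot be unregistered.
        os.register_at_fork(after_in_child=self._after_fork)

    def _after_fork(self):
        self.detached = True    # real code would disable mappings here


def child_sees_detached():
    """Fork once; return (parent_detached, child_was_detached)."""
    c = Client()
    pid = os.fork()
    if pid == 0:
        # child: report through the exit code and leave immediately
        os._exit(0 if c.detached else 1)
    _, status = os.waitpid(pid, 0)
    return c.detached, os.WEXITSTATUS(status) == 0
```

The parent keeps working with an attached client, while the child observes
the detached state without ever talking to the server.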

----------------------------------------

Some preliminary history:

a8fa9178    X wcfs: move client tests into client/
990afac1    X wcfs/client: Package overview (draft)
3f83469c    X wcfs: client: Handle fork
0ed6b8b6    fixup! X wcfs: client: Handle fork
24378c46    X wcfs: client: Provide Conn.at()
parent 6f0cdaff
@@ -28,6 +28,27 @@ represents filesystem-level connection to joined wcfs server. If wcfs server
for zurl is not yet running, it will be automatically started if join is given
`autostart=True` option.

The rest of wcfs.py merely wraps C++ wcfs client package:

- `WCFS` represents filesystem-level connection to wcfs server.
- `Conn` represents logical connection that provides view of data on wcfs
  filesystem as of particular database state.
- `FileH` represent isolated file view under Conn.
- `Mapping` represents one memory mapping of FileH.

A path from WCFS to Mapping is as follows:

  WCFS.connect(at)                    -> Conn
  Conn.open(foid)                     -> FileH
  FileH.mmap([blk_start +blk_len))    -> Mapping

Classes in wcfs.py logically mirror classes in ZODB:

  wcfs.WCFS  <->  ZODB.DB
  wcfs.Conn  <->  ZODB.Connection

Please see wcfs/client/wcfs.h for more thorough overview and further details.

Environment variables
---------------------
@@ -80,6 +101,10 @@ class Server:
#
# Use join to create it.
#
# The primary way to access wcfs is to open logical connection viewing on-wcfs
# data as of particular database state, and use that logical connection to create
# base-layer mappings. See .connect and Conn in C++ API for details.
#
# Raw files on wcfs can be accessed with ._path/._read/._stat/._open .
#
# WCFS logically mirrors ZODB.DB .
......
@@ -23,12 +23,13 @@
# Package _wcfs provides Python-wrappers for C++ wcfs client package.
#
-# It wraps WCFS and WatchLink.
+# It wraps WCFS/Conn/FileH/Mapping and WatchLink to help client_test.py unit-test
+# WCFS base-layer mmap functionality.

from golang cimport chan, structZ, string, error, refptr
-from golang cimport context
+from golang cimport context, cxx
-from libc.stdint cimport int64_t, uint64_t
+from libc.stdint cimport int64_t, uint64_t, uint8_t
from libcpp.utility cimport pair
from libcpp.vector cimport vector
@@ -78,6 +79,53 @@ cdef extern from "wcfs/client/wcfs.h" namespace "wcfs" nogil:
        string mountpoint

        pair[WatchLink, error] _openwatch()
        pair[Conn, error] connect(Tid at)

    cppclass _Conn:
        Tid at()
        pair[FileH, error] open(Oid foid)
        error close()
        error resync(Tid at)

    cppclass Conn (refptr[_Conn]):
        # Conn.X = Conn->X in C++
        Tid at "_ptr()->at" ()
        pair[FileH, error] open "_ptr()->open" (Oid foid)
        error close "_ptr()->close" ()
        error resync "_ptr()->resync" (Tid at)

    cppclass _FileH:
        size_t blksize
        error close()
        pair[Mapping, error] mmap(int64_t blk_start, int64_t blk_len) # `VMA *vma=nil` not exposed

    cppclass FileH (refptr[_FileH]):
        # FileH.X = FileH->X in C++
        size_t blksize "_ptr()->blksize"
        error close "_ptr()->close" ()
        pair[Mapping, error] mmap "_ptr()->mmap" (int64_t blk_start, int64_t blk_len)

    cppclass _Mapping:
        FileH fileh
        int64_t blk_start
        int64_t blk_stop() const
        uint8_t *mem_start
        uint8_t *mem_stop

        error unmap()

    cppclass Mapping (refptr[_Mapping]):
        # Mapping.X = Mapping->X in C++
        FileH fileh "_ptr()->fileh"
        int64_t blk_start "_ptr()->blk_start"
        int64_t blk_stop "_ptr()->blk_stop" () const
        uint8_t *mem_start "_ptr()->mem_start"
        uint8_t *mem_stop "_ptr()->mem_stop"

        error unmap "_ptr()->unmap" ()

    cxx.dict[int64_t, Tid] _tfileh_pinned(FileH wfileh)

# ---- python bits ----
@@ -85,6 +133,17 @@ cdef extern from "wcfs/client/wcfs.h" namespace "wcfs" nogil:
cdef class PyWCFS:
    cdef WCFS wc

cdef class PyConn:
    cdef Conn wconn
    cdef readonly PyWCFS wc # PyWCFS that was used to create this PyConn

cdef class PyFileH:
    cdef FileH wfileh

cdef class PyMapping:
    cdef Mapping wmmap
    cdef readonly PyFileH fileh

cdef class PyWatchLink:
    cdef WatchLink wlink
......
@@ -28,7 +28,13 @@
from golang cimport pychan, pyerror, nil
from golang cimport io

-from ZODB.utils import p64
+cdef extern from *:
+    ctypedef bint cbool "bool"
+
+from ZODB.utils import p64, u64
+from cpython cimport PyBuffer_FillInfo
+from libcpp.unordered_map cimport unordered_map

cdef class PyWCFS:
@@ -38,6 +44,130 @@ cdef class PyWCFS:
        def __set__(PyWCFS pywc, string v):
            pywc.wc.mountpoint = v

    def connect(PyWCFS pywc, pyat): # -> PyConn
        cdef Tid at = u64(pyat)
        with nogil:
            _ = wcfs_connect_pyexc(&pywc.wc, at)
            wconn = _.first
            err = _.second

        if err != nil:
            raise pyerr(err)

        cdef PyConn pywconn = PyConn.__new__(PyConn)
        pywconn.wconn = wconn
        pywconn.wc = pywc
        return pywconn

cdef class PyConn:

    def __dealloc__(PyConn pywconn):
        pywconn.wconn = nil

    def at(PyConn pywconn):
        with nogil:
            at = wconn_at_pyexc(pywconn.wconn)
        return p64(at)

    def close(PyConn pywconn):
        with nogil:
            err = wconn_close_pyexc(pywconn.wconn)
        if err != nil:
            raise pyerr(err)

    def open(PyConn pywconn, pyfoid): # -> FileH
        cdef Oid foid = u64(pyfoid)
        with nogil:
            _ = wconn_open_pyexc(pywconn.wconn, foid)
            wfileh = _.first
            err = _.second

        if err != nil:
            raise pyerr(err)

        cdef PyFileH pywfileh = PyFileH.__new__(PyFileH)
        pywfileh.wfileh = wfileh
        return pywfileh

    def resync(PyConn pywconn, pyat):
        cdef Tid at = u64(pyat)
        with nogil:
            err = wconn_resync_pyexc(pywconn.wconn, at)
        if err != nil:
            raise pyerr(err)

cdef class PyFileH:

    def __dealloc__(PyFileH pywfileh):
        pywfileh.wfileh = nil

    def close(PyFileH pywfileh):
        with nogil:
            err = wfileh_close_pyexc(pywfileh.wfileh)
        if err != nil:
            raise pyerr(err)

    def mmap(PyFileH pywfileh, int64_t blk_start, int64_t blk_len):
        with nogil:
            _ = wfileh_mmap_pyexc(pywfileh.wfileh, blk_start, blk_len)
            wmmap = _.first
            err = _.second

        if err != nil:
            raise pyerr(err)

        assert wmmap.fileh .eq (pywfileh.wfileh)

        cdef PyMapping pywmmap = PyMapping.__new__(PyMapping)
        pywmmap.wmmap = wmmap
        pywmmap.fileh = pywfileh
        return pywmmap

    property blksize:
        def __get__(PyFileH pywfileh):
            return pywfileh.wfileh.blksize

    # XXX for tests
    property pinned:
        def __get__(PyFileH pywfileh):
            # XXX cast: needed for cython to automatically convert to py dict
            cdef dict p = <unordered_map[int64_t, Tid]> _tfileh_pinned(pywfileh.wfileh)
            for blk in p:
                p[blk] = p64(p[blk]) # rev(int64) -> rev(bytes)
            return p

cdef class PyMapping:

    def __dealloc__(PyMapping pywmmap):
        # unmap just in case (double unmap is ok)
        with nogil:
            err = wmmap_unmap_pyexc(pywmmap.wmmap)
        pywmmap.wmmap = nil
        if err != nil:
            raise pyerr(err)

    property blk_start:
        def __get__(PyMapping pywmmap):
            return pywmmap.wmmap.blk_start

    property blk_stop:
        def __get__(PyMapping pywmmap):
            return pywmmap.wmmap.blk_stop()

    def __getbuffer__(PyMapping pywmmap, Py_buffer *view, int flags):
        PyBuffer_FillInfo(view, pywmmap, pywmmap.wmmap.mem_start,
                pywmmap.wmmap.mem_stop - pywmmap.wmmap.mem_start, readonly=1, flags=flags)

    property mem:
        def __get__(PyMapping pywmmap) -> memoryview:
            return memoryview(pywmmap)

    def unmap(PyMapping pywmmap):
        with nogil:
            err = wmmap_unmap_pyexc(pywmmap.wmmap)
        if err != nil:
            raise pyerr(err)

# ----------------------------------------
cdef class PyWatchLink:
@@ -153,6 +283,30 @@ cdef nogil:
    pair[WatchLink, error] wcfs_openwatch_pyexc(WCFS *wcfs) except +topyexc:
        return wcfs._openwatch()

    pair[Conn, error] wcfs_connect_pyexc(WCFS *wcfs, Tid at) except +topyexc:
        return wcfs.connect(at)

    Tid wconn_at_pyexc(Conn wconn) except +topyexc:
        return wconn.at()

    error wconn_close_pyexc(Conn wconn) except +topyexc:
        return wconn.close()

    pair[FileH, error] wconn_open_pyexc(Conn wconn, Oid foid) except +topyexc:
        return wconn.open(foid)

    error wconn_resync_pyexc(Conn wconn, Tid at) except +topyexc:
        return wconn.resync(at)

    error wfileh_close_pyexc(FileH wfileh) except +topyexc:
        return wfileh.close()

    pair[Mapping, error] wfileh_mmap_pyexc(FileH wfileh, int64_t blk_start, int64_t blk_len) except +topyexc:
        return wfileh.mmap(blk_start, blk_len)

    error wmmap_unmap_pyexc(Mapping wmmap) except +topyexc:
        return wmmap.unmap()

    error wlink_close_pyexc(WatchLink wlink) except +topyexc:
        return wlink.close()
......
@@ -18,13 +18,65 @@
// See https://www.nexedi.com/licensing for rationale and options.

// Package wcfs provides WCFS client.
//
// This client package takes care about WCFS isolation protocol details and
// provides to clients simple interface to isolated view of bigfile data on
// WCFS similar to regular files: given a particular revision of database @at,
// it provides synthetic read-only bigfile memory mappings with data
// corresponding to @at state, but using /head/bigfile/* most of the time to
// build and maintain the mappings.
//
// For its data a mapping to bigfile X mostly reuses kernel cache for
// /head/bigfile/X with amount of data not associated with kernel cache for
// /head/bigfile/X being proportional to δ(bigfile/X, at..head). In the usual
// case where many client workers simultaneously serve requests, their database
// views are a bit outdated, but close to head, which means that in practice
// the kernel cache for /head/bigfile/* is being used almost 100% of the time.
//
// A mapping for bigfile X@at is built from OS-level memory mappings of
// on-WCFS files as follows:
//
//                                          ___        /@revA/bigfile/X
//        __                                           /@revB/bigfile/X
//               _                                     /@revC/bigfile/X
//                           +                         ...
//     ───  ───── ──────────────────────────   ─────   /head/bigfile/X
//
// where @revR mmaps are being dynamically added/removed by this client package
// to maintain X@at data view according to WCFS isolation protocol(*).
//
//
// API overview
//
//  - `WCFS` represents filesystem-level connection to wcfs server.
//  - `Conn` represents logical connection that provides view of data on wcfs
//    filesystem as of particular database state.
//  - `FileH` represent isolated file view under Conn.
//  - `Mapping` represents one memory mapping of FileH.
//
// A path from WCFS to Mapping is as follows:
//
//  WCFS.connect(at)                    -> Conn
//  Conn.open(foid)                     -> FileH
//  FileH.mmap([blk_start +blk_len))    -> Mapping
//
// A connection can be resynced to another database view via Conn.resync(at').
//
// Documentation for classes provides more thorough overview and API details.
//
// --------
//
// (*) see wcfs.go documentation for WCFS isolation protocol overview and details.

#ifndef _NXD_WCFS_H_
#define _NXD_WCFS_H_

#include <golang/libgolang.h>
#include <golang/cxx.h>
#include <golang/sync.h>

#include <tuple>
#include <utility>

#include "wcfs_misc.h"
@@ -33,10 +85,15 @@
namespace wcfs {

using namespace golang;
using cxx::dict;
using cxx::set;
using std::tuple;
using std::pair;

typedef refptr<struct _Conn> Conn;
typedef refptr<struct _Mapping> Mapping;
typedef refptr<struct _FileH> FileH;
typedef refptr<struct _WatchLink> WatchLink;
struct PinReq;
@@ -45,20 +102,185 @@ struct PinReq;
//
// Use wcfs.join in Python API to create it.
//
// The primary way to access wcfs is to open logical connection viewing on-wcfs
// data as of particular database state, and use that logical connection to
// create base-layer mappings. See .connect and Conn for details.
//
// WCFS logically mirrors ZODB.DB .
// It is safe to use WCFS from multiple threads simultaneously.
struct WCFS {
    string mountpoint;

    pair<Conn, error> connect(zodb::Tid at);
    pair<WatchLink, error> _openwatch();

    string String() const;
    error _headWait(zodb::Tid at);

    // at OS-level, on-WCFS raw files can be accessed via ._path and ._open.
    string _path(const string &obj);
    tuple<os::File, error> _open(const string &path, int flags=O_RDONLY);
};

// Conn represents logical connection that provides view of data on wcfs
// filesystem as of particular database state.
//
// It uses /head/bigfile/* and notifications received from /head/watch to
// maintain isolated database view while at the same time sharing most of data
// cache in OS pagecache of /head/bigfile/*.
//
// Use WCFS.connect(at) to create Conn.
// Use .open to create new FileH.
// Use .resync to resync Conn onto different database view.
//
// Conn logically mirrors ZODB.Connection .
// It is safe to use Conn from multiple threads simultaneously.
typedef refptr<struct _Conn> Conn;
struct _Conn : os::_IAfterFork, object {
    WCFS *_wc;
    WatchLink _wlink; // watch/receive pins for mappings created under this conn

    // atMu protects .at.
    // While it is rlocked, .at is guaranteed to stay unchanged and Conn
    // viewing the database at particular state. .resync write-locks this and
    // knows noone is using the connection for reading simultaneously.
    sync::RWMutex _atMu;
    zodb::Tid _at;

    sync::RWMutex _filehMu; // _atMu.W | _atMu.R + _filehMu
    error _downErr; // !nil if connection is closed or no longer operational
    dict<zodb::Oid, FileH> _filehTab; // {} foid -> fileh

    sync::WorkGroup _pinWG; // pin/unpin messages from wcfs are served by _pinner
    func<void()> _pinCancel; // spawned under _pinWG.

    // don't new - create via WCFS.connect
private:
    _Conn();
    virtual ~_Conn();
    friend pair<Conn, error> WCFS::connect(zodb::Tid at);
public:
    void incref();
    void decref();

public:
    zodb::Tid at();
    pair<FileH, error> open(zodb::Oid foid);
    error close();
    error resync(zodb::Tid at);

    string String() const;

private:
    error _pinner(context::Context ctx);
    error __pinner(context::Context ctx);
    error _pin1(PinReq *req);
    error __pin1(PinReq *req);

    void afterFork();
};

// FileH represent isolated file view under Conn.
//
// The file view is maintained to be as of @Conn.at database state even in the
// presence of simultaneous database changes. The file view uses
// /head/<file>/data primarily and /@revX/<file>/data pin overrides.
//
// Use .mmap to map file view into memory.
//
// It is safe to use FileH from multiple threads simultaneously.
enum _FileHState {
    // NOTE order of states is semantically important
    _FileHOpening = 0, // FileH open is in progress
    _FileHOpened = 1, // FileH is opened and can be used
    _FileHClosing = 2, // FileH close is in progress
    _FileHClosed = 3, // FileH is closed
};
typedef refptr<struct _FileH> FileH;
struct _FileH : object {
    Conn wconn;
    zodb::Oid foid; // ZBigFile root object ID (does not change after fileh open)

    // protected by wconn._filehMu
    _FileHState _state; // opening/opened/closing/closed
    int _nopen; // number of times Conn.open returned this fileh

    chan<structZ> _openReady; // in-flight open completed
    error _openErr; // error result from open
    chan<structZ> _closedq; // in-flight close completed

    os::File _headf; // file object of head/file
    size_t blksize; // block size of this file (does not change after fileh open)

    // head/file size is known to be at least headfsize (size ↑=)
    // protected by .wconn._atMu
    off_t _headfsize;

    sync::Mutex _mmapMu; // atMu.W | atMu.R + _mmapMu
    dict<int64_t, zodb::Tid> _pinned; // {} blk -> rev that wcfs already sent us for this file
    vector<Mapping> _mmaps; // []Mapping ↑blk_start mappings of this file

    // don't new - create via Conn.open
private:
    _FileH();
    ~_FileH();
    friend pair<FileH, error> _Conn::open(zodb::Oid foid);
public:
    void decref();

public:
    error close();
    pair<Mapping, error> mmap(int64_t blk_start, int64_t blk_len);

    string String() const;

    error _open();
    error _closeLocked(bool force);
    void _afterFork();
};

// Mapping represents one memory mapping of FileH.
//
// The mapped memory is [.mem_start, .mem_stop)
// Use .unmap to release virtual memory resources used by mapping.
//
// Except unmap, it is safe to use Mapping from multiple threads simultaneously.
typedef refptr<struct _Mapping> Mapping;
struct _Mapping : object {
    FileH fileh;
    int64_t blk_start; // offset of this mapping in file

    // protected by fileh._mmapMu
    uint8_t *mem_start; // mmapped memory [mem_start, mem_stop)
    uint8_t *mem_stop;
    bool efaulted; // y after mapping was switched to be invalid (gives SIGSEGV on access)

    int64_t blk_stop() const {
        if (!((mem_stop - mem_start) % fileh->blksize == 0))
            panic("len(mmap) % fileh.blksize != 0");
        return blk_start + (mem_stop - mem_start) / fileh->blksize;
    }

    error unmap();

    error _remmapblk(int64_t blk, zodb::Tid at);
    error __remmapAsEfault();
    error __remmapBlkAsEfault(int64_t blk);

    // don't new - create via FileH.mmap
private:
    _Mapping();
    ~_Mapping();
    friend pair<Mapping, error> _FileH::mmap(int64_t blk_start, int64_t blk_len);
public:
    void decref();

    string String() const;
};

// for testing
dict<int64_t, zodb::Tid> _tfileh_pinned(FileH fileh);

} // wcfs::
......
@@ -23,6 +23,7 @@
#include <golang/errors.h>
#include <golang/fmt.h>
#include <golang/io.h>
#include <golang/sync.h>
using namespace golang;

#include <inttypes.h>
@@ -30,6 +31,7 @@ using namespace golang;
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#include <algorithm>
#include <memory>
@@ -40,6 +42,9 @@ namespace golang {
// os::
namespace os {

global<error> ErrClosed = errors::New("file already closed");

// TODO -> os.PathError + err=syscall.Errno
static error _pathError(const char *op, const string &path, int syserr);
static string _sysErrString(int syserr);
@@ -131,6 +136,59 @@ static error _pathError(const char *op, const string &path, int syserr) {
}

// afterfork

static sync::Mutex _afterForkMu;
static bool _afterForkInit;
static vector<IAfterFork> _afterForkList;

// _runAfterFork runs handlers registered by RegisterAfterFork.
static void _runAfterFork() {
    // we were just forked: This is child process and there is only 1 thread.
    // The state of memory was copied from parent.
    // There is no other mutators except us.
    // -> go through _afterForkList *without* locking.
    for (auto obj : _afterForkList) {
        obj->afterFork();
    }

    // reset _afterFork* state because child could want to fork again
    new (&_afterForkMu) sync::Mutex;
    _afterForkInit = false;
    _afterForkList.clear();
}

void RegisterAfterFork(IAfterFork obj) {
    _afterForkMu.lock();
    defer([&]() {
        _afterForkMu.unlock();
    });

    if (!_afterForkInit) {
        int e = pthread_atfork(/*prepare=*/nil, /*parent=*/nil, /*child=*/_runAfterFork);
        if (e != 0) {
            string estr = fmt::sprintf("pthread_atfork: %s", v(_sysErrString(e)));
            panic(v(estr));
        }
        _afterForkInit = true;
    }

    _afterForkList.push_back(obj);
}

void UnregisterAfterFork(IAfterFork obj) {
    _afterForkMu.lock();
    defer([&]() {
        _afterForkMu.unlock();
    });

    // _afterForkList.remove(obj)
    _afterForkList.erase(
        std::remove(_afterForkList.begin(), _afterForkList.end(), obj),
        _afterForkList.end());
}

// _sysErrString returns string corresponding to system error syserr.
static string _sysErrString(int syserr) {
    char ebuf[128];
@@ -141,6 +199,88 @@ static string _sysErrString(int syserr) {
} // os::
// mm::
namespace mm {
// map memory-maps f.fd[offset +size) somewhere into memory.
// prot is PROT_* from mmap(2).
// flags is MAP_* from mmap(2); MAP_FIXED must not be used.
tuple<uint8_t*, error> map(int prot, int flags, os::File f, off_t offset, size_t size) {
void *addr;
if (flags & MAP_FIXED)
panic("MAP_FIXED not allowed for map - use map_into");
addr = ::mmap(nil, size, prot, flags, f->fd(), offset);
if (addr == MAP_FAILED)
return make_tuple(nil, os::_pathError("mmap", f->name(), errno));
return make_tuple((uint8_t*)addr, nil);
}
// map_into memory-maps f.fd[offset +size) into [addr +size).
// prot is PROT_* from mmap(2).
// flags is MAP_* from mmap(2); MAP_FIXED is added automatically.
error map_into(void *addr, size_t size, int prot, int flags, os::File f, off_t offset) {
void *addr2;
addr2 = ::mmap(addr, size, prot, MAP_FIXED | flags, f->fd(), offset);
if (addr2 == MAP_FAILED)
return os::_pathError("mmap", f->name(), errno);
if (addr2 != addr)
panic("mmap(addr, MAP_FIXED): returned !addr");
return nil;
}
// unmap unmaps [addr +size) memory previously mapped with map & co.
error unmap(void *addr, size_t size) {
int err = ::munmap(addr, size);
if (err != 0)
return os::_pathError("munmap", "<memory>", errno);
return nil;
}
} // mm::
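To illustrate how `map` and `map_into` are meant to combine, here is a minimal sketch with raw POSIX calls rather than the `mm::` wrappers. It first lets the kernel choose an address for a two-page file mapping (the `map` case), then overlays the second page slot with a different file offset via `MAP_FIXED` (the `map_into` case). The function name `overlaySecondPage` and the temporary file are assumptions made for this example only.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdlib>
#include <string>

// overlaySecondPage builds a 2-page file ('A' page, then 'B' page), maps it,
// and remaps the second page slot to show file page 0 instead.
// It returns the first byte of each slot as seen through the mapping,
// or "" on error.
std::string overlaySecondPage() {
    const long page = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/mm_sketch_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return "";
    unlink(path);                         // file lives on until fd is closed

    const std::string pageA(page, 'A'), pageB(page, 'B');
    if (write(fd, pageA.data(), page) != page ||
        write(fd, pageB.data(), page) != page) {
        close(fd);
        return "";
    }

    // like mm::map: no MAP_FIXED, the kernel picks the address
    void* base = mmap(nullptr, 2 * page, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return "";
    }

    // like mm::map_into: MAP_FIXED atomically replaces what was at [slot +page)
    void* slot = static_cast<char*>(base) + page;
    void* got = mmap(slot, page, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
    if (got != slot) {
        munmap(base, 2 * page);
        close(fd);
        return "";
    }

    const char* mem = static_cast<const char*>(base);
    std::string seen{mem[0], mem[page]};  // both slots now show file page 0

    munmap(base, 2 * page);
    close(fd);
    return seen;
}
```

Before the overlay the two slots would read as 'A' and 'B'; afterwards both read as 'A', because the second slot now maps file offset 0. Replacing part of an existing mapping in place is the kind of operation `map_into` wraps.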
// io::ioutil::
namespace io {
namespace ioutil {
tuple<string, error> ReadFile(const string& path) {
// errctx is ok as returned by all calls.
os::File f;
error err;
tie(f, err) = os::open(path);
if (err != nil)
return make_tuple("", err);
string data;
vector<char> buf(4096);
while (1) {
int n;
tie(n, err) = f->read(&buf[0], buf.size());
data.append(&buf[0], n);
if (err != nil) {
if (err == io::EOF_)
err = nil;
break;
}
}
error err2 = f->close();
if (err == nil)
err = err2;
if (err != nil)
data = "";
return make_tuple(data, err);
}
}} // io::ioutil::
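For comparison, the same read-until-EOF loop can be sketched with plain POSIX calls, where end of file is signalled by `read` returning 0 rather than by an `io::EOF_` error value. The function `readFile` below and its `(data, errno)` return convention are assumptions for this illustration and are not part of the patch.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <string>
#include <utility>
#include <vector>

// readFile reads the whole file at path.
// It returns (data, 0) on success and ("", errno value) on failure.
std::pair<std::string, int> readFile(const std::string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0)
        return {"", errno};

    std::string data;
    std::vector<char> buf(4096);
    int err = 0;
    for (;;) {
        ssize_t n = read(fd, buf.data(), buf.size());
        if (n > 0) {
            data.append(buf.data(), static_cast<size_t>(n));
            continue;
        }
        if (n == 0)
            break;                        // EOF is not an error
        if (errno == EINTR)
            continue;                     // interrupted: retry the read
        err = errno;
        break;
    }

    // report a close error only if nothing failed earlier
    if (close(fd) != 0 && err == 0)
        err = errno;
    if (err != 0)
        data.clear();                     // never return partial data with an error
    return {data, err};
}
```

As in `ReadFile` above, a failure at any stage yields empty data, so callers never have to distinguish a partial read from a complete one.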
// xstrconv:: (strconv-like)
namespace xstrconv {
......
@@ -61,6 +61,9 @@ namespace golang {
// os::
namespace os {
extern global<error> ErrClosed;
// os::File mimics os.File from Go.
// its operations return error with full file context.
typedef refptr<class _File> File;
@@ -104,8 +107,43 @@ tuple<File, error> open(const string &path, int flags = O_RDONLY,
S_IRGRP | S_IWGRP | S_IXGRP |
S_IROTH | S_IWOTH | S_IXOTH);
// afterfork
// IAfterFork is the interface that objects must implement to be notified after fork.
typedef refptr<struct _IAfterFork> IAfterFork;
struct _IAfterFork : public _interface {
// afterFork is called in just forked child process for objects that
// were previously registered in parent via RegisterAfterFork.
virtual void afterFork() = 0;
};
// RegisterAfterFork registers obj so that obj.afterFork is run after fork in
// the child process.
void RegisterAfterFork(IAfterFork obj);
// UnregisterAfterFork undoes RegisterAfterFork.
// It is noop if obj was not registered.
void UnregisterAfterFork(IAfterFork obj);
} // os::
// mm::
namespace mm {
tuple<uint8_t*, error> map(int prot, int flags, os::File f, off_t offset, size_t size);
error map_into(void *addr, size_t size, int prot, int flags, os::File f, off_t offset);
error unmap(void *addr, size_t size);
} // mm::
// io::ioutil::
namespace io {
namespace ioutil {
tuple<string, error> ReadFile(const string& path);
}} // io::ioutil::
// ---- misc ----
......
@@ -63,6 +63,10 @@ pair<WatchLink, error> WCFS::_openwatch() {
wlink->rx_eof = makechan<structZ>();
os::RegisterAfterFork(newref(
static_cast<os::_IAfterFork*>( wlink._ptr() )
));
context::Context serveCtx;
tie(serveCtx, wlink->_serveCancel) = context::with_cancel(context::background());
wlink->_serveWG = sync::NewWorkGroup(serveCtx);
@@ -96,9 +100,24 @@ error _WatchLink::close() {
if (err == nil)
err = err3;
os::UnregisterAfterFork(newref(
static_cast<os::_IAfterFork*>( &wlink )
));
return E(err);
}
// afterFork detaches from wcfs in child process right after fork.
void _WatchLink::afterFork() {
_WatchLink& wlink = *this;
// in the child right after fork we are the only thread running; in
// particular _serveRX is not running. Just release the file handle that
// fork duplicated, to make sure the child cannot send anything to wcfs and
// interfere with the parent-wcfs exchange.
wlink._f->close(); // ignore err
}
// closeWrite closes send half of the link.
error _WatchLink::closeWrite() {
_WatchLink& wlink = *this;
......
@@ -70,7 +70,7 @@ static_assert(sizeof(rxPkt) == 256, "rxPkt miscompiled"); // NOTE 128 is too low
//
// It is safe to use WatchLink from multiple threads simultaneously.
typedef refptr<class _WatchLink> WatchLink;
-class _WatchLink : public object {
+class _WatchLink : public os::_IAfterFork, object {
WCFS *_wc;
os::File _f; // head/watch file handle
string _rxbuf; // buffer for data already read from _f
@@ -123,6 +123,8 @@ private:
StreamID _nextReqID();
tuple<chan<rxPkt>, error> _sendReq(context::Context ctx, StreamID stream, const string &req);
void afterFork();
friend error _twlinkwrite(WatchLink wlink, const string &pkt);
};
......
@@ -167,6 +167,21 @@ def unmap(const unsigned char[::1] mem not None):
return
# map_zero_ro creates a new read-only mapping that reads as all zeros.
# the created mapping, even after it is accessed, does not consume memory.
def map_zero_ro(size_t size):
cdef void *addr
# mmap /dev/zero with MAP_NORESERVE and MAP_SHARED
# this way the mapping will be able to be read, but no memory will be allocated to keep it.
f = open("/dev/zero", "rb")
addr = mman.mmap(NULL, size, mman.PROT_READ, mman.MAP_SHARED | mman.MAP_NORESERVE, f.fileno(), 0)
f.close()
if addr == mman.MAP_FAILED:
PyErr_SetFromErrno(OSError)
return
return <unsigned char[:size:1]>addr
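The same sequence of system calls can be exercised outside Cython. The sketch below is a C++ rendering under the assumption that `/dev/zero` is available; `mapZeroRO` is a hypothetical name, and the flags simply mirror the ones passed to `mman.mmap` above.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// mapZeroRO returns a read-only mapping of size bytes that reads as zeros,
// or nullptr on error. The caller releases it with munmap.
const unsigned char* mapZeroRO(std::size_t size) {
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return nullptr;

    void* addr = mmap(nullptr, size, PROT_READ,
                      MAP_SHARED | MAP_NORESERVE, fd, 0);
    close(fd);                            // the mapping outlives the descriptor
    if (addr == MAP_FAILED)
        return nullptr;
    return static_cast<const unsigned char*>(addr);
}
```

Closing the descriptor right after `mmap` matches the `f.close()` call above: the mapping stays valid until it is unmapped.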
# advise advises kernel about use of mem's memory.
#
@@ -180,3 +195,17 @@ def advise(const unsigned char[::1] mem not None, int advice):
PyErr_SetFromErrno(OSError)
return
# protect sets protection on a region of memory.
#
# see mprotect(2) for details.
def protect(const unsigned char[::1] mem not None, int prot):
cdef const void *addr = &mem[0]
cdef size_t size = mem.shape[0]
cdef int err = mman.mprotect(<void *>addr, size, prot)
if err:
PyErr_SetFromErrno(OSError)
return
@@ -143,6 +143,16 @@ cdef unsigned char _read_exfault(const unsigned char *p) nogil except +topyexc:
return b
def read_mustfault(const unsigned char[::1] mem not None):
try:
read_exfault_nogil(mem)
except SegmentationFault:
# ok
pass
else:
raise AssertionError("not faulted")
# --------
......
@@ -31,12 +31,18 @@
// head/bigfile/<bigfileX> which represents always latest bigfile data.
// Clients that want to get isolation guarantee should subscribe for
// invalidations and re-mmap invalidated regions to file with pinned bigfile revision for
-// the duration of their transaction. See "Isolation protocol" for details.
+// the duration of their transaction. See "Isolation protocol" for details(*).
//
// In the usual situation when bigfiles are big, and there are O(1)/δt updates,
// there should be no need for any cache besides shared kernel cache of latest
// bigfile data.
//
// --------
//
// (*) wcfs server comes accompanied by Python and C++ client packages that
// take care of isolation protocol details and provide clients with a simple
// interface similar to regular files.
//
//
// Filesystem organization
//
......
@@ -18,6 +18,9 @@
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""wcfs_test.py tests wcfs filesystem from outside as python client process.
Virtmem layer provided by wcfs client package is unit-tested by
wcfs/client/client_test.py .
"""
from __future__ import print_function, absolute_import
......