Commit 1ffd04a5 authored by Kirill Smelkov

.

parent 0ec8ce51
- doc: notes on how things are organized in wendelin.core 2

wcfs:

- SIGSEGV is used only to track writes
==============================================
Additional notes to documentation in wcfs.go
==============================================

This file contains notes additional to the usage documentation and internal
organization overview in wcfs.go.
Notes on OS pagecache control
=============================
The cache of a snapshotted bigfile can be pre-made hot if the invalidated
region was already in the pagecache of head/bigfile/file:

- we can retrieve a region from pagecache of head/file with FUSE_NOTIFY_RETRIEVE.
- we can store that retrieved data into pagecache region of @<revX>/ with FUSE_NOTIFY_STORE.
- we can invalidate a region from pagecache of head/file with FUSE_NOTIFY_INVAL_INODE.

We have to disable FUSE_AUTO_INVAL_DATA to tell the kernel we are fully
responsible for invalidating the pagecache. If we don't, the kernel will be
clearing the whole cache of head/file on e.g. its mtime change.
Note: disabling FUSE_AUTO_INVAL_DATA does not fully prevent the kernel from
automatically invalidating the pagecache - e.g. it still invalidates the whole
cache on file size changes:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/inode.c?id=e0bc833d10#n233

It was hoped that we could work around this by using writeback mode (see
!is_wb in the link above), but it turned out that in writeback mode the kernel
indeed does not invalidate the data cache on file size change, but neither
does it allow the filesystem to set the size due to an external event (see
https://git.kernel.org/linus/8373200b12 "fuse: Trust kernel i_size only").
This prevents us from using the writeback workaround, as we cannot even update
the file from being empty to having some data.
-> we did a patch for FUSE to add a proper flag with which the filesystem
server tells the kernel that it is fully responsible for invalidating the
pagecache. The patch is part of Linux 5.2:

https://git.kernel.org/linus/ad2ba64dd489
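
For illustration, here is a minimal sketch of how that pre-warm sequence could
look from the filesystem server side. It assumes go-fuse exposes the
corresponding notify calls as InodeRetrieveCache, InodeNotifyStoreCache and
InodeNotify on *fuse.Server (names as in hanwen/go-fuse v2; the exact API of
the go-fuse version used by wcfs may differ)::

    package notes

    import (
        "fmt"

        "github.com/hanwen/go-fuse/v2/fuse"
    )

    // preWarm copies the region [off, off+size) of head/file pagecache into the
    // pagecache of the snapshotted @revX/file, and then drops it from head/file.
    // headIno and revIno are the kernel inode numbers of the two files.
    func preWarm(srv *fuse.Server, headIno, revIno uint64, off int64, size int) error {
        buf := make([]byte, size)

        // FUSE_NOTIFY_RETRIEVE: fetch the region from pagecache of head/file.
        n, st := srv.InodeRetrieveCache(headIno, off, buf)
        if st != fuse.OK {
            return fmt.Errorf("retrieve cache: %v", st)
        }

        // FUSE_NOTIFY_STORE: store the retrieved data into pagecache of @revX/file.
        if st := srv.InodeNotifyStoreCache(revIno, off, buf[:n]); st != fuse.OK {
            return fmt.Errorf("store cache: %v", st)
        }

        // FUSE_NOTIFY_INVAL_INODE: invalidate the region in pagecache of head/file.
        if st := srv.InodeNotify(headIno, off, int64(n)); st != fuse.OK {
            return fmt.Errorf("inval inode: %v", st)
        }
        return nil
    }
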
Invalidations to wcfs clients are delayed until block access
============================================================
Initially it was planned that wcfs would send invalidation messages to its
clients right after receiving an invalidation message from ZODB at
transaction-boundary time. That simplifies the logic, but requires that, for a
particular file, wcfs sends to clients the whole range of where the file was
changed.

Emitting the whole δR right at transaction-boundary time requires keeping the
whole ZBigFile.blktab index in RAM. Even though from the space point of view
this is somewhat acceptable (~ 0.01% of whole-file data size, i.e. ~ 128MB of
index for ~ 1TB of data), it is not good from the time-overhead point of view:
the initial open of a file would potentially be slow, as a full blktab scan
(including Tree _and_ Bucket nodes) would be required.
-> we took the approach where we send invalidation about a block to a client
lazily, only when the block is actually accessed.
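
As a toy illustration of this lazy flow (this is not the actual wcfs.go code;
Watcher and pin are hypothetical stand-ins for the isolation-protocol
machinery, and loading of block data is elided)::

    package notes

    import "sync"

    // Watcher is a hypothetical stand-in for a client watching a file via head/watch.
    type Watcher struct{ /* ... */ }

    // pin asks the client to remap blk in its still-valid mappings before it
    // would otherwise see post-transaction data (details elided).
    func (w *Watcher) pin(blk int64) {}

    // BigFile keeps, per file, the set of blocks that changed in ZODB but about
    // which clients have not been notified yet.
    type BigFile struct {
        mu          sync.Mutex
        invalidated map[int64]struct{} // #blk changed since clients set up their mappings
        watchers    []*Watcher
    }

    // onZODBInvalidate is called at transaction-boundary time: only remember the change.
    func (f *BigFile) onZODBInvalidate(blk int64) {
        f.mu.Lock()
        defer f.mu.Unlock()
        f.invalidated[blk] = struct{}{}
    }

    // readBlk is called when a client actually accesses blk: only now are the
    // watchers notified, just before the read is served.
    func (f *BigFile) readBlk(blk int64) {
        f.mu.Lock()
        _, dirty := f.invalidated[blk]
        delete(f.invalidated, blk)
        watchers := append([]*Watcher(nil), f.watchers...)
        f.mu.Unlock()

        if dirty {
            for _, w := range watchers {
                w.pin(blk)
            }
        }
        // ... load block data and reply to the kernel ...
    }
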
XXX building δFtail lazily while serving FUSE reads within the scope of one
transaction is not trivial and creates concurrency bottlenecks if a simple
locking scheme is used, the main difficulty being to populate the δBtree
tracking set lazily. However, as a first approach we can still build the
complete tracking set for a BTree at file-open time: we need to scan through
all trees but _not_ buckets; this way we will know the oids of all tree nodes
(trees _and_ buckets), and avoiding loading the buckets is what makes this
approach practical: with default LOBTree settings (1 bucket = 60·objects,
1 tree = 500·buckets) it requires ~ 20 trees to cover 1TB of data, and we can
scan those trees very quickly even if doing so serially. For 1PB of data it
would require scanning ~ 10⁴ trees; if the RTT to load 1 object is ~ 1ms, this
becomes 10 seconds if done serially, but much less if we load all those tree
objects in parallel. Still, the number of trees to scan is linear in the
amount of data, and it would be good to later address this shortcoming of
doing a whole-file index scan.
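
For the record, the arithmetic behind those estimates; the 2MiB block size is
an assumption (ZBigFile's default block size), while the bucket/tree fan-out
numbers are the LOBTree defaults quoted above::

    package notes

    import "fmt"

    // EstimateTreeScan reproduces the "~ 20 trees per 1TB" / "~ 10⁴ trees per 1PB"
    // figures from the note above.
    func EstimateTreeScan() {
        const (
            blkSize        = 2 << 20 // bytes per ZBigFile block (assumed default)
            keysPerBucket  = 60      // 1 bucket = 60 objects
            bucketsPerTree = 500     // 1 tree = 500 buckets
        )
        perTree := int64(keysPerBucket) * bucketsPerTree * blkSize // ~58.6 GiB of data covered by one tree

        TB := int64(1) << 40
        PB := int64(1) << 50
        fmt.Println((TB + perTree - 1) / perTree) // 18    -> "~ 20 trees" for 1TB
        fmt.Println((PB + perTree - 1) / perTree) // 17896 -> "~ 10⁴ trees" for 1PB
        // at ~ 1ms RTT per object load, ~ 10⁴ serial loads take ~ 10 seconds
    }
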
Changing mmapping while under pagefault is possible
===================================================
We can change a mapping while a page from it is under pagefault:

- the kernel, upon handling a pagefault, queues a read request to the
filesystem server. As of Linux 4.20 this is done _while_ holding
client->mm->mmap_sem:
kprobe:fuse_readpages (client->mm->mmap_sem.count: 1)
fuse_readpages+1
read_pages+109
__do_page_cache_readahead+401
filemap_fault+635
__do_fault+31
__handle_mm_fault+3403
handle_mm_fault+220
__do_page_fault+598
page_fault+30
- however the read request is queued to be performed asynchronously -
the kernel does not wait for it in fuse_readpages, because of
* git.kernel.org/linus/c1aa96a5,
* git.kernel.org/linus/9cd68455,
* and go-fuse initially negotiating CAP_ASYNC_READ with the kernel.
- the kernel then _releases_ client->mm->mmap_sem and waits for the
to-read pages to become ready:
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2411
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2457
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n1301
- the filesystem server, upon receiving the read request, can manipulate the
client's address space. This requires write-locking client->mm->mmap_sem, but
we can be sure it won't deadlock because the kernel releases that lock before
waiting (see the previous point).

In practice the manipulation is done by another client thread, because on
Linux it is not possible to change the mm of another process. However the main
point here is that the manipulation is possible at all, because there will be
no deadlock on client->mm->mmap_sem.
For reference, here is how the filesystem server reply looks under trace:
kprobe:fuse_readpages_end
fuse_readpages_end+1
request_end+188
fuse_dev_do_write+1921
fuse_dev_write+78
do_iter_readv_writev+325
do_iter_write+128
vfs_writev+152
do_writev+94
do_syscall_64+85
entry_SYSCALL_64_after_hwframe+68
And here is a test program that demonstrates that it is indeed possible to
change a mmapping while a pagefault to it is in progress:

https://lab.nexedi.com/kirr/go-fuse/commit/f822c9db
Starting from Linux 5.1, mmap_sem should generally be released while doing any
IO (https://git.kernel.org/linus/6b4c9f4469), but before that the analysis
remains FUSE-specific.
The property that a mmapping can be changed while under pagefault is verified
by the wcfs testsuite in the `test_wcfs_remmap_on_pin` test.
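
For a concrete feel of what "changing the mmapping" means on the client side,
here is a sketch of the remap that a pin handler could perform from a separate
client thread while the faulting thread is still blocked inside the kernel.
The real handler lives in the wcfs client (C++/pyx); this Go version with raw
mmap(MAP_FIXED) is only an illustration and its names are made up::

    package notes

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    // remapBlk overlays the address range of block #blk inside an existing
    // mapping with the same block from fdAtRev (an opened @revX/bigfile/file).
    // vaddr is the address where blk is currently mapped; blksize is the
    // ZBigFile block size.
    func remapBlk(vaddr uintptr, fdAtRev int, blk, blksize int64) error {
        _, _, errno := unix.Syscall6(unix.SYS_MMAP,
            vaddr, // MAP_FIXED: map exactly over the old pages
            uintptr(blksize),
            uintptr(unix.PROT_READ),
            uintptr(unix.MAP_SHARED|unix.MAP_FIXED),
            uintptr(fdAtRev),
            uintptr(blk*blksize)) // offset of the block inside the file
        if errno != 0 {
            return fmt.Errorf("mmap: %v", errno)
        }
        return nil
    }
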
Client cannot be ptraced while under pagefault
==============================================
We cannot use ptrace to run code on a client thread that is under pagefault:
the kernel sends SIGSTOP to interrupt the tracee, but the signal will be
processed only when the process returns from kernel space, e.g. here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160
This way the tracer won't receive the obligatory notification that the tracee
has stopped (via wait...), and even though ptrace(ATTACH) succeeds, all other
ptrace commands will fail:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207
My original idea was to use ptrace to run code in the client process to change
its memory mappings while the triggering thread is under pagefault/read to
wcfs, and the above shows it won't work - trying to ptrace the client from
under wcfs would just block forever (the kernel would be waiting for the read
operation to finish before acting on the ptrace, and the read would first be
waiting on the ptrace stop to complete = deadlock).
Kernel locks page on read/cache store/... - we have to be careful not to deadlock
=================================================================================
The kernel, when doing FUSE operations, locks the corresponding pages. For
example it locks the page it is going to read data into before issuing the
FUSE read request. Correspondingly, on e.g. cache store, the kernel also locks
the page where the data has to be stored.

It is easy to deadlock if we don't take these locks into account. For example,
if we try to upload data into the kernel pagecache from under serving a read
request, this can deadlock.
Another case that needs care is the interaction between uploadBlk and
zwatcher: zheadMu, being an RWMutex, does not allow new RLocks to be taken
once a Lock request has been issued. Thus the following scenario is possible::

    uploadBlk         os.Read          zwatcher

                      page.Lock
    zheadMu.RLock
                                       zheadMu.Lock
    page.Lock
                      zheadMu.RLock

- zwatcher is waiting for uploadBlk to release zheadMu;
- uploadBlk is waiting for os.Read to release the page;
- os.Read is waiting for zwatcher to release zheadMu;
- deadlock.
To avoid such deadlocks, zwatcher asks OS cache uploaders to pause while it is
running, and retries taking zheadMu.Lock until all uploaders are indeed paused.
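
A sketch of one way to realize this pause handshake; the type and field names
(FS, upMu, upPaused, nUpload) are illustrative and do not match wcfs.go::

    package notes

    import (
        "sync"
        "time"
    )

    type FS struct {
        zheadMu sync.RWMutex // protects zhead and data derived from it

        upMu     sync.Mutex // protects upPaused and nUpload
        upPaused bool       // zwatcher asked uploaders to pause
        nUpload  int        // number of uploadBlk calls currently in progress
    }

    // uploadBlk uploads one block into the OS pagecache. It does not start a new
    // upload while zwatcher wants the write lock, so it cannot end up holding
    // zheadMu.RLock and waiting for a page while zheadMu.Lock is pending.
    func (fs *FS) uploadBlk(storeToPagecache func()) {
        for {
            fs.upMu.Lock()
            if !fs.upPaused {
                fs.nUpload++
                fs.upMu.Unlock()
                break
            }
            fs.upMu.Unlock()
            time.Sleep(time.Millisecond) // parked until zwatcher is done
        }
        defer func() {
            fs.upMu.Lock()
            fs.nUpload--
            fs.upMu.Unlock()
        }()

        fs.zheadMu.RLock()
        defer fs.zheadMu.RUnlock()
        storeToPagecache() // the kernel locks destination pages in here
    }

    // zwatcherLock asks uploaders to pause, then retries taking zheadMu.Lock
    // until no upload is in flight. TryLock is used so that a failed attempt
    // does not leave a pending write lock blocking new RLocks, which is what
    // creates the deadlock described above.
    func (fs *FS) zwatcherLock() {
        fs.upMu.Lock()
        fs.upPaused = true
        fs.upMu.Unlock()

        for {
            if fs.zheadMu.TryLock() {
                fs.upMu.Lock()
                n := fs.nUpload
                fs.upMu.Unlock()
                if n == 0 {
                    return // write lock held, all uploaders paused
                }
                fs.zheadMu.Unlock() // an upload is still in flight - back off
            }
            time.Sleep(time.Millisecond)
        }
    }

    func (fs *FS) zwatcherUnlock() {
        fs.zheadMu.Unlock()
        fs.upMu.Lock()
        fs.upPaused = false
        fs.upMu.Unlock()
    }
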
digraph {
subgraph {
rank=same;
ordering=in;
wcfs [label="wcfs"]
invProto [label="open/isolation\nprotocol", style=filled fillcolor=grey95]
client [label="client", style=filled fillcolor=grey97]
wcfs -> invProto;
invProto -> client [dir=back]; // XXX = invProto <- client
}
// wcfs -> wcfs_simple;
// wcfs -> Sinvtree;
// wcfs -> δR;
wcfs -> liveCacheControl;
wcfs -> autoexit [color=grey];
wcfs -> wcfsInvProcess;
wcfs -> wcfsRead;
wcfs -> wcfsGC [color=grey];
wcfsInvProcess -> ZODB_go_inv;
wcfsInvProcess -> zconnCacheGet;
wcfsInvProcess -> zobj2file;
wcfsInvProcess -> δFtail;
wcfsInvProcess -> fuseRetrieveCache;
wcfsInvProcess -> _wcfs_zhead;
ZODB_go_inv -> fs1_go_inv;
ZODB_go_inv -> zeo_go_inv;
ZODB_go_inv -> neo_go_inv;
ZODB_go_inv -> zcache_go_inv [style=dashed, color=grey]; // wcfs works without raw cache now
wcfsRead -> blktabGet;
wcfsRead -> δFtail;
wcfsRead -> setupWatch;
wcfsRead -> headWatch;
zobj2file -> zblk2file;
zobj2file -> zbtree2file;
zbtree2file -> δBTree [color=grey];
// wcfs_simple -> Btree_read;
// wcfs_simple -> ZBlk_read;
// wcfs_simple -> autoexit;
client -> wcfsRead;
client -> setupWatch;
client -> clientInvHandle;
// client -> δR;
client -> nowcfs;
// client -> zodburl;
// client -> wcfs_spawn;
clientInvHandle -> headWatch;
headWatch -> fileSock;
_wcfs_zhead -> fileSock;
// Btree_read -> ZODB_read;
// ZBlk_read -> ZODB_read;
// ZODB_read -> ogorek_persref;
// wcfs_simple [label="wcfs no\ninvalidations", style=filled fillcolor=grey95]
// wcfs_spawn [label="spawn wcfs", style=filled fillcolor=lightyellow]
nowcfs [label="!wcfs mode", style=filled fillcolor=grey95]
wcfsInvProcess [label="process\nZODB invalidations", style=filled fillcolor=grey95]
zconnCacheGet [label="zconn.\n.Cache.Get", style=filled fillcolor=lightyellow]
zobj2file [label="Z* → file/[]#blk", style=filled fillcolor=grey95]
zblk2file [label="ZBlk*\n↓\nfile/[]#blk", style=filled fillcolor=lightyellow]
zbtree2file [label="BTree/Bucket\n↓\nfile/[]#blk"]
δBTree [label="δ(BTree)", style=filled fillcolor=grey95]
fuseRetrieveCache [label="FUSE:\nretrieve cache", style=filled fillcolor=lightyellow]
_wcfs_zhead [label=".wcfs/\nzhead", style=filled fillcolor=lightyellow]
wcfsRead [label="read(#blk)", style=filled fillcolor=grey95]
blktabGet [label="blktab.Get(#blk):\nmanually + → ⌈rev(#blk)⌉", style=filled fillcolor=grey95]
δFtail [style=filled fillcolor=lightyellow]
setupWatch [label="watches:\nregister/maint", style=filled fillcolor=grey95]
clientInvHandle [label="process\n#blk invalidations", style=filled fillcolor=grey95]
headWatch [label="#blk ← head/watch", style=filled fillcolor=grey95]
fileSock [label="FileSock", style=filled fillcolor=lightyellow]
ZODB_go_inv [label="ZODB/go\ninvalidations", style=filled fillcolor=grey95]
fs1_go_inv [label="fs1/go\ninvalidations", style=filled fillcolor=lightyellow]
zeo_go_inv [label="zeo/go\ninvalidations", style=filled fillcolor=lightyellow]
neo_go_inv [label="neo/go\ninvalidations"]
zcache_go_inv [label="ZCache/go\n←watchq", color=grey, fontcolor=grey]
// Btree_read [label="BTree read", style=filled fillcolor=lightyellow]
// ZBlk_read [label="ZBigFile / ZBlk* read", style=filled fillcolor=lightyellow]
// ZODB_read [label="ZODB deserialize object", style=filled fillcolor=lightyellow]
// ogorek_persref [label="ogórek:\npersistent references", style=filled fillcolor=lightyellow];
// Sinvtree [label="server: inv. tree"]
// δR [label="δR encoding"]
// zodburl [label="zstor -> zurl", style=filled fillcolor=grey95]
wcfsGC [label="GC\n@rev/"]
liveCacheControl [label="ZODB/go\nLiveCache fix", style=filled fillcolor=grey95]
autoexit [label="autoexit\nif !activity"]
}
@@ -67,74 +67,11 @@ type zBlk interface {
// returns data and revision of ZBlk.
loadBlkData(ctx context.Context) (data []byte, rev zodb.Tid, _ error)
// inΔFtail returns pointer to struct zblkInΔFtail embedded into this ZBlk.
// inΔFtail() *zblkInΔFtail
// XXX kill - in favour of inΔFtail
/*
// bindFile associates ZBlk as being used by file to store block #blk.
//
// A ZBlk may be bound to several blocks inside one file, and to
// several files.
//
// The information is preserved even when ZBlk comes to ghost
// state, but is lost if ZBlk is garbage collected.
//
// it is safe to call multiple bindFile simultaneously.
// it is not safe to call bindFile and boundTo simultaneously.
//
// XXX link to overview.
bindFile(file *BigFile, blk int64)
// XXX unbindFile
// XXX zfile -> bind map for it
// blkBoundTo returns ZBlk association with file(s)/#blk(s).
//
// The association returned is that was previously set by bindFile.
//
// blkBoundTo must not be called simultaneously wrt bindFile.
blkBoundTo() map[*BigFile]SetI64
*/
}
var _ zBlk = (*ZBlk0)(nil)
var _ zBlk = (*ZBlk1)(nil)
// XXX kill
/*
// ---- zBlkBase ----
// zBlkBase provides common functionality to implement ZBlk* -> zfile, #blk binding.
//
// The data stored by zBlkBase is transient - it is _not_ included into
// persistent state.
type zBlkBase struct {
bindMu sync.Mutex // used only for binding to support multiple loaders
infile map[*BigFile]SetI64 // {} file -> set(#blk)
}
// bindFile implements zBlk.
func (zb *zBlkBase) bindFile(file *BigFile, blk int64) {
zb.bindMu.Lock()
defer zb.bindMu.Unlock()
blkmap, ok := zb.infile[file]
if !ok {
blkmap = make(SetI64, 1)
if zb.infile == nil {
zb.infile = make(map[*BigFile]SetI64)
}
zb.infile[file] = blkmap
}
blkmap.Add(blk)
}
// blkBoundTo implementss zBlk.
func (zb *zBlkBase) blkBoundTo() map[*BigFile]SetI64 {
return zb.infile
}
*/
// ---- ZBlk0 ----
...