Commit f34ea31d authored by Kirill Smelkov

.

parent d1b58568
==============================================
Additional notes to documentation in wcfs.go
==============================================
This file contains notes additional to usage documentation and internal
organization overview in wcfs.go .
Changing mmapping while under pagefault is possible
===================================================
We can change a mapping while a page from it is under pagefault:

- the kernel, upon handling the pagefault, queues a read request to the
  filesystem server. As of Linux 4.20 this is done _while_ holding
  client->mm->mmap_sem:

    kprobe:fuse_readpages (client->mm->mmap_sem.count: 1)
        fuse_readpages+1
        read_pages+109
        __do_page_cache_readahead+401
        filemap_fault+635
        __do_fault+31
        __handle_mm_fault+3403
        handle_mm_fault+220
        __do_page_fault+598
        page_fault+30

- however, the read request is queued to be performed asynchronously:
  the kernel does not wait for it in fuse_readpages, because of

    * git.kernel.org/linus/c1aa96a5,
    * git.kernel.org/linus/9cd68455,
    * and go-fuse initially negotiating CAP_ASYNC_READ with the kernel.

- the kernel then _releases_ client->mm->mmap_sem and only then waits
  for the to-read pages to become ready:

    * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2411
    * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2457
    * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n1301

- the filesystem server, upon receiving the read request, can manipulate
  the client's address space. This requires write-locking
  client->mm->mmap_sem, but we can be sure it won't deadlock because the
  kernel releases it before waiting (see previous point).

  In practice the manipulation is done by another client thread, because
  on Linux it is not possible to change the mm of another process. However,
  the main point here is that the manipulation is possible because
  there will be no deadlock on client->mm->mmap_sem.
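
The lock ordering in the list above can be modelled in user space. The following is only a sketch under stated assumptions, not kernel code: a `sync.RWMutex` stands in for client->mm->mmap_sem, a buffered channel stands in for the FUSE request queue, and the names `faultModel`, `pagefault`, `serve` and `run` are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// faultModel is a toy model of the pagefault path; all names are hypothetical.
type faultModel struct {
	mmapSem sync.RWMutex       // stands in for client->mm->mmap_sem
	reqq    chan chan struct{} // stands in for the FUSE request queue

	mu  sync.Mutex
	log []string
}

func (m *faultModel) note(s string) {
	m.mu.Lock()
	m.log = append(m.log, s)
	m.mu.Unlock()
}

// pagefault models the faulting client thread: it queues the read while
// holding the lock for reading, releases the lock, and only then waits.
func (m *faultModel) pagefault() {
	ready := make(chan struct{})
	m.mmapSem.RLock()
	m.note("fault: read queued with mmap_sem held")
	m.reqq <- ready // asynchronous: the queue is buffered, nobody is waited for
	m.mmapSem.RUnlock()
	<-ready // wait for the page only after the lock is released
	m.note("fault: page ready")
}

// serve models the side that changes the mapping before replying.
// Lock() would block forever if pagefault kept the read lock while waiting.
func (m *faultModel) serve() {
	ready := <-m.reqq
	m.mmapSem.Lock()
	m.note("server: mapping changed with mmap_sem write-locked")
	m.mmapSem.Unlock()
	close(ready) // reply to the read request
}

// run executes one modelled pagefault and returns the ordered event log.
func run() []string {
	m := &faultModel{reqq: make(chan chan struct{}, 1)}
	done := make(chan struct{})
	go func() { m.serve(); close(done) }()
	m.pagefault()
	<-done
	return m.log
}

func main() {
	for _, line := range run() {
		fmt.Println(line)
	}
}
```

Moving `RUnlock` after `<-ready` in this model makes `serve` block in `Lock` forever, which is the deadlock the kernel avoids by releasing mmap_sem before waiting.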
For reference, here is how the filesystem server's reply looks under trace:

    kprobe:fuse_readpages_end
        fuse_readpages_end+1
        request_end+188
        fuse_dev_do_write+1921
        fuse_dev_write+78
        do_iter_readv_writev+325
        do_iter_write+128
        vfs_writev+152
        do_writev+94
        do_syscall_64+85
        entry_SYSCALL_64_after_hwframe+68

and here is a test program that demonstrates that it is possible to change
a mmapping while a pagefault on it is in progress:

    https://lab.nexedi.com/kirr/go-fuse/commit/f822c9db

In the future mmap_sem might be released while doing any IO:

    https://lwn.net/Articles/768857

but until then the analysis remains FUSE-specific.
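
The remapping step itself, taken in isolation from FUSE and from pagefault timing, can be sketched as follows. This is not the test program linked above; it is a minimal illustration that assumes Linux on a 64-bit architecture where `syscall.SYS_MMAP` takes a byte offset (e.g. amd64), and the helper names `mkfile` and `remap` are hypothetical. It maps one page of a file and then replaces that mapping in place with a page of another file via MAP_FIXED.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// mkfile returns an unlinked temporary file holding one page filled with c.
func mkfile(c byte) (*os.File, error) {
	f, err := os.CreateTemp("", "remap-demo")
	if err != nil {
		return nil, err
	}
	os.Remove(f.Name()) // the open descriptor keeps the data alive
	page := make([]byte, os.Getpagesize())
	for i := range page {
		page[i] = c
	}
	if _, err := f.Write(page); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// remap maps one page of file A, overlays the same address range with
// file B, and returns the first byte seen before and after the overlay.
func remap() (byte, byte, error) {
	fa, err := mkfile('A')
	if err != nil {
		return 0, 0, err
	}
	defer fa.Close()
	fb, err := mkfile('B')
	if err != nil {
		return 0, 0, err
	}
	defer fb.Close()

	size := os.Getpagesize()
	mem, err := syscall.Mmap(int(fa.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Munmap(mem)
	before := mem[0]

	// MAP_FIXED replaces whatever is mapped at the given address.
	_, _, errno := syscall.Syscall6(syscall.SYS_MMAP,
		uintptr(unsafe.Pointer(&mem[0])), uintptr(size),
		syscall.PROT_READ, syscall.MAP_SHARED|syscall.MAP_FIXED,
		fb.Fd(), 0)
	if errno != 0 {
		return 0, 0, errno
	}
	after := mem[0] // same address, now backed by file B
	return before, after, nil
}

func main() {
	before, after, err := remap()
	if err != nil {
		fmt.Println("remap failed:", err)
		return
	}
	fmt.Printf("before=%c after=%c\n", before, after)
}
```

In the scenario described in these notes the overlay would be performed by another client thread while the first thread is blocked in the pagefault; the sketch performs it from a single thread only to keep the example deterministic.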
Client cannot be ptraced while under pagefault
==============================================
We cannot use ptrace to run code on a client thread that is under pagefault:

The kernel sends SIGSTOP to interrupt the tracee, but the signal will be
processed only when the process returns from kernel space, e.g. here

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160

This way the tracer won't receive the obligatory notification that the
tracee has stopped (via wait...), and even though ptrace(ATTACH) succeeds,
all other ptrace commands will fail:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207

My original idea was to use ptrace to run code in the client process to
change its memory mappings while the triggering process is under
pagefault/read to wcfs. The above shows it won't work: trying to ptrace the
client from under wcfs will just block forever (the kernel will be waiting
for the read operation to finish before the ptrace stop can take effect,
and the read will first be waiting for the ptrace stop to complete =
deadlock).
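
The circular wait described above can be written down as a small wait-for graph. This is only a model of the argument, with made-up node names; it does not call ptrace. Each entry says "the key cannot make progress until the value completes", and a cycle means deadlock. For contrast, the second graph models remapping from another client thread, which depends on nothing held by the faulting thread.

```go
package main

import "fmt"

// ptracePlan models the ptrace approach: wcfs needs the client stopped,
// the stop needs the client to leave kernel space, and the client leaves
// kernel space only after wcfs replies to the read.
var ptracePlan = map[string]string{
	"wcfs read handler":               "ptrace stop of client",
	"ptrace stop of client":           "client returns from kernel space",
	"client returns from kernel space": "wcfs read handler",
}

// remapPlan models remapping from another client thread: that thread only
// needs mmap_sem, which the kernel has released before waiting for the page.
var remapPlan = map[string]string{
	"wcfs read handler": "other client thread remaps",
}

// hasCycle follows the wait-for edges from start and reports whether
// some node is reached twice.
func hasCycle(waitsFor map[string]string, start string) bool {
	seen := map[string]bool{}
	cur := start
	for {
		if seen[cur] {
			return true
		}
		seen[cur] = true
		next, ok := waitsFor[cur]
		if !ok {
			return false
		}
		cur = next
	}
}

func main() {
	fmt.Println("ptrace plan deadlocks:", hasCycle(ptracePlan, "wcfs read handler"))
	fmt.Println("remap plan deadlocks: ", hasCycle(remapPlan, "wcfs read handler"))
}
```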
digraph {
wcfs -> wcfs_simple;
wcfs -> ZODB_go_inv;
wcfs -> Sinvtree;
wcfs -> δR;
// wcfs -> wcfs_simple;
// wcfs -> Sinvtree;
// wcfs -> δR;
wcfs -> autoexit;
wcfs_simple -> Btree_read;
wcfs_simple -> ZBlk_read;
wcfs_simple -> autoexit;
wcfs -> wcfsInvProcess;
wcfs -> wcfsRead;
client -> wcfs_spawn;
client -> δR;
wcfsInvProcess -> ZODB_go_inv;
wcfsInvProcess -> zconnCacheGet;
wcfsInvProcess -> zobj2file;
wcfsInvProcess -> δFtail;
wcfsInvProcess -> fuseRetrieveCache;
wcfsRead -> blktabGet;
wcfsRead -> δFtail;
wcfsRead -> mappingRegister;
wcfsRead -> headInv;
zobj2file -> zblk2file;
zobj2file -> zbtree2file;
zbtree2file -> δBTree;
// wcfs_simple -> Btree_read;
// wcfs_simple -> ZBlk_read;
// wcfs_simple -> autoexit;
client -> wcfsRead;
client -> mappingRegister;
client -> clientInvHandle;
// client -> δR;
client -> nowcfs;
client -> zodburl;
// client -> zodburl;
// client -> wcfs_spawn;
Btree_read -> ZODB_read;
ZBlk_read -> ZODB_read;
ZODB_read -> ZODB_binary;
ZODB_read -> ogorek_persref;
clientInvHandle -> headInv;
// Btree_read -> ZODB_read;
// ZBlk_read -> ZODB_read;
// ZODB_read -> ogorek_persref;
wcfs [label="wcfs"]
wcfs_simple [label="wcfs no\ninvalidations", style=filled fillcolor=grey95]
// wcfs_simple [label="wcfs no\ninvalidations", style=filled fillcolor=grey95]
client [label="client"]
wcfs_spawn [label="spawn wcfs", style=filled fillcolor=lightyellow]
// wcfs_spawn [label="spawn wcfs", style=filled fillcolor=lightyellow]
nowcfs [label="!wcfs mode"]
wcfsInvProcess [label="process\nZODB invalidations"]
zconnCacheGet [label="zconn.Cache.Get"]
zobj2file [label="Z* → file/[]#blk"]
zblk2file [label="ZBlk* → file/[]#blk"]
zbtree2file [label="BTree/Bucket → file/[]#blk"]
δBTree [label="δ(BTree)"]
fuseRetrieveCache [label="FUSE:\nretrieve cache"]
wcfsRead [label="read(#blk)"]
blktabGet [label="blktab.Get(#blk):\nmanually + → ⌈rev(#blk)⌉"]
mappingRegister [label="mmappings:\nregister/maint"]
clientInvHandle [label="process\n#blk invalidations"]
headInv [label="#blk ← head/inv."]
ZODB_go_inv [label="ZODB/go\ninvalidations"]
Btree_read [label="BTree read", style=filled fillcolor=lightyellow]
ZBlk_read [label="ZBigFile / ZBlk* read", style=filled fillcolor=lightyellow]
ZODB_read [label="ZODB deserialize object", style=filled fillcolor=lightyellow]
ZODB_binary [label="Adapt to zodbpickle.binary"];
ogorek_persref [label="ogórek:\npersistent references", style=filled fillcolor=lightyellow];
// Btree_read [label="BTree read", style=filled fillcolor=lightyellow]
// ZBlk_read [label="ZBigFile / ZBlk* read", style=filled fillcolor=lightyellow]
// ZODB_read [label="ZODB deserialize object", style=filled fillcolor=lightyellow]
// ogorek_persref [label="ogórek:\npersistent references", style=filled fillcolor=lightyellow];
Sinvtree [label="server: inv. tree"]
δR [label="δR encoding"]
// Sinvtree [label="server: inv. tree"]
// δR [label="δR encoding"]
test [label="? tests"]
zodburl [label="zstor -> zurl", style=filled fillcolor=grey95]
// zodburl [label="zstor -> zurl", style=filled fillcolor=grey95]
autoexit [label="autoexit\nif !activity"]
}
@@ -337,90 +337,8 @@ package main
// and a client that wants @rev data will get @rev data, even if it was this
// "old" client that triggered the pagefault(*).
//
// (*) we can change a mapping while a page from it is under pagefault:
//
// - the kernel, upon handling pagefault, queues read request to filesystem
// server. As of Linux 4.20 this is done _with_ holding client->mm->mmap_sem:
//
// kprobe:fuse_readpages (client->mm->mmap_sem.count: 1)
// fuse_readpages+1
// read_pages+109
// __do_page_cache_readahead+401
// filemap_fault+635
// __do_fault+31
// __handle_mm_fault+3403
// handle_mm_fault+220
// __do_page_fault+598
// page_fault+30
//
// - however the read request is queued to be performed asynchronously -
// the kernel does not wait for it in fuse_readpages, because
//
// * git.kernel.org/linus/c1aa96a5,
// * git.kernel.org/linus/9cd68455,
// * and go-fuse initially negotiating CAP_ASYNC_READ to the kernel.
//
// - the kernel then _releases_ client->mm->mmap_sem and then waits
// for to-read pages to become ready:
//
// * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2411
// * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2457
// * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n1301
//
// - the filesystem server upon receiving the read request can manipulate
// client's address space. This requires to write-lock client->mm->mmap_sem,
// but we can be sure it won't deadlock because the kernel releases it
// before waiting (see previous point).
//
// in practice the manipulation is done by another client thread, because
// on Linux it is not possible to change mm of another process. However
// the main point here is that the manipulation is possible because
// there will be no deadlock on client->mm->mmap_sem.
//
// For the reference here is how filesystem server reply looks under trace:
//
// kprobe:fuse_readpages_end
// fuse_readpages_end+1
// request_end+188
// fuse_dev_do_write+1921
// fuse_dev_write+78
// do_iter_readv_writev+325
// do_iter_write+128
// vfs_writev+152
// do_writev+94
// do_syscall_64+85
// entry_SYSCALL_64_after_hwframe+68
//
// and a test program that demonstrates that it is possible to change
// mmapping while under pagefault to it:
//
// https://lab.nexedi.com/kirr/go-fuse/commit/f822c9db
//
// In the future mmap_sem might be released while doing any IO:
//
// https://lwn.net/Articles/768857
//
// but before that the analysis remains FUSE-specific.
//
//
// (+) the kernel sends SIGSTOP to interrupt tracee, but the signal will be
// processed only when the process returns from kernel space, e.g. here
//
// https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160
//
// This way the tracer won't receive obligatory information that tracee
// stopped (via wait...) and even though ptrace(ATTACH) succeeds, all other
// ptrace commands will fail:
//
// https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
// https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207
//
// My original idea was to use ptrace to run code in process to change its
// memory mappings, while the triggering process is under pagefault/read
// to wcfs, and the above shows it won't work - trying to ptrace the
// client from under wcfs will just block forever (the kernel will be
// waiting for read operation to finish for ptrace, and read will be first
// waiting on ptrace stopping to complete = deadlock)
// (*) see "Changing mmapping while under pagefault is possible" in notes.txt
// (+) see "Client cannot be ptraced while under pagefault" in notes.txt
//
//
// XXX mmap(@at) open
......