wendelin.core: commit f34ea31d

Authored Nov 22, 2018 by Kirill Smelkov
Commit message: "."
Parent: d1b58568

4 changed files with 350 additions and 253 deletions (+350 -253):

    wcfs/notes.txt    +100  -0
    wcfs/todo.dot     +65   -25
    wcfs/todo.svg     +183  -144
    wcfs/wcfs.go      +2    -84
wcfs/notes.txt (new file, mode 100644)

==============================================
Additional notes to documentation in wcfs.go
==============================================
This file contains notes that supplement the usage documentation and the
internal organization overview in wcfs.go.

Changing mmapping while under pagefault is possible
===================================================

We can change a mapping while a page from it is under pagefault:

- the kernel, upon handling a pagefault, queues a read request to the
  filesystem server. As of Linux 4.20 this is done _with_ client->mm->mmap_sem
  held:

    kprobe:fuse_readpages (client->mm->mmap_sem.count: 1)
            fuse_readpages+1
            read_pages+109
            __do_page_cache_readahead+401
            filemap_fault+635
            __do_fault+31
            __handle_mm_fault+3403
            handle_mm_fault+220
            __do_page_fault+598
            page_fault+30

- however, the read request is queued to be performed asynchronously:
  the kernel does not wait for it in fuse_readpages, because of

  * git.kernel.org/linus/c1aa96a5,
  * git.kernel.org/linus/9cd68455,
  * and go-fuse initially negotiating CAP_ASYNC_READ with the kernel.

- the kernel then _releases_ client->mm->mmap_sem and only then waits
  for the pages being read to become ready:

  * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2411
  * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2457
  * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n1301

- the filesystem server, upon receiving the read request, can manipulate
  the client's address space. This requires write-locking client->mm->mmap_sem,
  but we can be sure it won't deadlock because the kernel releases it
  before waiting (see the previous point).

  In practice the manipulation is done by another client thread, because
  on Linux it is not possible to change the mm of another process. However,
  the main point here is that the manipulation is possible because
  there will be no deadlock on client->mm->mmap_sem.
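
The locking argument above can be modeled as a toy Go program. This is a simulation, not kernel, go-fuse or wcfs code: a sync.RWMutex named mmapSem stands in for client->mm->mmap_sem, a channel stands in for the FUSE request queue, and readReq, pageFault, serve and mapping are hypothetical names. The serve goroutine plays both the filesystem server and the client helper thread that actually changes the mapping. The fault path read-locks mmapSem, only queues the read, releases the lock, and then waits; the server write-locks mmapSem to change the mapping before replying. If pageFault instead waited on req.ready before RUnlock, serve's Lock would block forever.

```go
package main

import (
	"fmt"
	"sync"
)

// readReq models a FUSE read request queued by the kernel (hypothetical type).
type readReq struct {
	blk   int
	ready chan struct{} // closed by the server once the page is filled
}

var (
	mmapSem sync.RWMutex // stands in for client->mm->mmap_sem
	mapping = "@head"    // stands in for the client's mmap state
)

// pageFault models the kernel side of a fault on a FUSE-backed page.
func pageFault(blk int, queue chan<- readReq) {
	mmapSem.RLock() // the fault is handled with mmap_sem read-locked
	req := readReq{blk: blk, ready: make(chan struct{})}
	queue <- req      // the read is only queued; nobody waits for it here
	mmapSem.RUnlock() // mmap_sem is released first ...
	<-req.ready       // ... and only then does the fault wait for the page
}

// serve models the filesystem server handling one read request.
func serve(queue <-chan readReq) {
	req := <-queue
	mmapSem.Lock() // changing the client's mappings needs mmap_sem write-locked
	mapping = fmt.Sprintf("@rev for blk %d", req.blk)
	mmapSem.Unlock()
	close(req.ready) // reply to the read: the page is ready
}

func main() {
	queue := make(chan readReq, 1)
	done := make(chan struct{})
	go func() {
		serve(queue)
		close(done)
	}()
	pageFault(7, queue)
	<-done
	fmt.Println(mapping) // prints "@rev for blk 7"
}
```

Moving `<-req.ready` above `mmapSem.RUnlock()` turns the same program into the deadlock that the asynchronous read avoids.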

For reference, here is how the filesystem server's reply looks under trace:

    kprobe:fuse_readpages_end
            fuse_readpages_end+1
            request_end+188
            fuse_dev_do_write+1921
            fuse_dev_write+78
            do_iter_readv_writev+325
            do_iter_write+128
            vfs_writev+152
            do_writev+94
            do_syscall_64+85
            entry_SYSCALL_64_after_hwframe+68

and here is a test program demonstrating that it is possible to change an
mmapping while a page of it is under pagefault:

    https://lab.nexedi.com/kirr/go-fuse/commit/f822c9db

In the future mmap_sem might be released while doing any IO:

    https://lwn.net/Articles/768857

but until then the analysis remains FUSE-specific.
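
The FUSE-specific part is the asynchronous read, which is agreed on during the FUSE INIT handshake: the kernel's INIT request carries the capability flags it offers, and the server's INIT reply carries the flags it wants. FUSE_ASYNC_READ is bit 0 of those flags in the FUSE protocol (go-fuse calls it CAP_ASYNC_READ). The sketch below only illustrates that flag arithmetic; negotiate and asyncRead are hypothetical helpers, not go-fuse's actual INIT handler.

```go
package main

import "fmt"

// fuseAsyncRead is FUSE_ASYNC_READ (bit 0 of the INIT flags);
// go-fuse exposes the same bit as CAP_ASYNC_READ.
const fuseAsyncRead uint32 = 1 << 0

// negotiate (hypothetical) returns the flags the server puts into its INIT
// reply: only what the kernel offered and the server wants.
func negotiate(kernelFlags, want uint32) uint32 {
	return kernelFlags & want
}

// asyncRead reports whether the negotiated flags enable asynchronous reads,
// i.e. whether the kernel may queue readpages requests without waiting.
func asyncRead(flags uint32) bool {
	return flags&fuseAsyncRead != 0
}

func main() {
	kernel := fuseAsyncRead | 1<<5 // kernel offers ASYNC_READ plus another capability bit
	fmt.Println(asyncRead(negotiate(kernel, fuseAsyncRead))) // true
	fmt.Println(asyncRead(negotiate(kernel, 0)))             // false
}
```

A server that does not ask for the bit gets synchronous reads, and the argument above about mmap_sem being released before waiting no longer applies as stated.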

Client cannot be ptraced while under pagefault
==============================================

We cannot use ptrace to run code on a client thread that is under pagefault:
the kernel sends SIGSTOP to interrupt the tracee, but the signal is
processed only when the tracee returns from kernel space, e.g. here:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160

This way the tracer never receives the required notification that the
tracee has stopped (via wait...), and even though ptrace(ATTACH) succeeds,
all other ptrace commands fail:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207

My original idea was to use ptrace to run code in the client process to
change its memory mappings while the triggering process is under
pagefault/read to wcfs. The above shows this won't work: trying to ptrace
the client from under wcfs will just block forever. The tracee can stop for
ptrace only after the read operation finishes, while the read itself is
waiting for that ptrace stop to complete = deadlock.
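
The deadlock is a cycle in a wait-for graph. The sketch below encodes the three waits described above as edges (the node names are illustrative labels, not kernel identifiers; waitsFor and findCycle are hypothetical) and finds the cycle by following the chain of waits. Since each activity here waits for at most one other, a simple walk is enough and no general graph search is needed.

```go
package main

import "fmt"

// waitsFor maps an activity to the activity it is blocked on.
// In this model every activity waits for at most one other.
type waitsFor map[string]string

// findCycle follows the chain of waits from start and returns the cycle
// it runs into, or nil if the chain ends at an activity that is not blocked.
func findCycle(g waitsFor, start string) []string {
	index := map[string]int{} // position of each visited activity in path
	var path []string
	node := start
	for {
		if i, seen := index[node]; seen {
			return path[i:] // revisited: path[i:] is the wait cycle
		}
		index[node] = len(path)
		path = append(path, node)
		next, blocked := g[node]
		if !blocked {
			return nil // chain ends: the last activity can make progress
		}
		node = next
	}
}

func main() {
	g := waitsFor{
		"client pagefault": "wcfs read reply",  // faulting thread waits for the FUSE read
		"wcfs read reply":  "ptrace stop",      // wcfs wants to ptrace the client before replying
		"ptrace stop":      "client pagefault", // SIGSTOP is handled only on return to user space
	}
	fmt.Println(findCycle(g, "client pagefault"))
	// prints [client pagefault wcfs read reply ptrace stop]
}
```

Removing the middle edge (wcfs replying without first waiting for a ptrace stop) makes findCycle return nil, which is the situation the mmap_sem analysis relies on.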

wcfs/todo.dot

digraph {
	wcfs -> wcfs_simple;
	// wcfs -> wcfs_simple;
	wcfs -> ZODB_go_inv;
	// wcfs -> Sinvtree;
	wcfs -> Sinvtree;
	// wcfs -> δR;
	wcfs -> δR;
	wcfs -> autoexit;
	wcfs_simple -> Btree_read;
	wcfs -> wcfsInvProcess;
	wcfs_simple -> ZBlk_read;
	wcfs -> wcfsRead;
	wcfs_simple -> autoexit;
	client -> wcfs_spawn;
	wcfsInvProcess -> ZODB_go_inv;
	client -> δR;
	wcfsInvProcess -> zconnCacheGet;
	wcfsInvProcess -> zobj2file;
	wcfsInvProcess -> δFtail;
	wcfsInvProcess -> fuseRetrieveCache;
	wcfsRead -> blktabGet;
	wcfsRead -> δFtail;
	wcfsRead -> mappingRegister;
	wcfsRead -> headInv;
	zobj2file -> zblk2file;
	zobj2file -> zbtree2file;
	zbtree2file -> δBTree;
	// wcfs_simple -> Btree_read;
	// wcfs_simple -> ZBlk_read;
	// wcfs_simple -> autoexit;
	client -> wcfsRead;
	client -> mappingRegister;
	client -> clientInvHandle;
	// client -> δR;
	client -> nowcfs;
	client -> zodburl;
	// client -> zodburl;
	// client -> wcfs_spawn;
	Btree_read -> ZODB_read;
	ZBlk_read -> ZODB_read;
	clientInvHandle -> headInv;
	ZODB_read -> ZODB_binary;
	ZODB_read -> ogorek_persref;
	// Btree_read -> ZODB_read;
	// ZBlk_read -> ZODB_read;
	// ZODB_read -> ogorek_persref;
	wcfs [label="wcfs"]
	wcfs_simple [label="wcfs no\ninvalidations", style=filled fillcolor=grey95]
	// wcfs_simple [label="wcfs no\ninvalidations", style=filled fillcolor=grey95]
	client [label="client"]
	wcfs_spawn [label="spawn wcfs", style=filled fillcolor=lightyellow]
	// wcfs_spawn [label="spawn wcfs", style=filled fillcolor=lightyellow]
	nowcfs [label="!wcfs mode"]
	wcfsInvProcess [label="process\nZODB invalidations"]
	zconnCacheGet [label="zconn.Cache.Get"]
	zobj2file [label="Z* → file/[]#blk"]
	zblk2file [label="ZBlk* → file/[]#blk"]
	zbtree2file [label="BTree/Bucket → file/[]#blk"]
	δBTree [label="δ(BTree)"]
	fuseRetrieveCache [label="FUSE:\nretrieve cache"]
	wcfsRead [label="read(#blk)"]
	blktabGet [label="blktab.Get(#blk):\nmanually + → ⌈rev(#blk)⌉"]
	mappingRegister [label="mmappings:\nregister/maint"]
	clientInvHandle [label="process\n#blk invalidations"]
	headInv [label="#blk ← head/inv."]
	ZODB_go_inv [label="ZODB/go\ninvalidations"]
	Btree_read [label="BTree read", style=filled fillcolor=lightyellow]
	// Btree_read [label="BTree read", style=filled fillcolor=lightyellow]
	ZBlk_read [label="ZBigFile / ZBlk* read", style=filled fillcolor=lightyellow]
	// ZBlk_read [label="ZBigFile / ZBlk* read", style=filled fillcolor=lightyellow]
	ZODB_read [label="ZODB deserialize object", style=filled fillcolor=lightyellow]
	// ZODB_read [label="ZODB deserialize object", style=filled fillcolor=lightyellow]
	ZODB_binary [label="Adapt to zodbpickle.binary"];
	// ogorek_persref [label="ogórek:\npersistent references", style=filled fillcolor=lightyellow];
	ogorek_persref [label="ogórek:\npersistent references", style=filled fillcolor=lightyellow];
	Sinvtree [label="server: inv. tree"]
	// Sinvtree [label="server: inv. tree"]
	δR [label="δR encoding"]
	// δR [label="δR encoding"]
	test [label="? tests"]
	// zodburl [label="zstor -> zurl", style=filled fillcolor=grey95]
	zodburl [label="zstor -> zurl", style=filled fillcolor=grey95]
	autoexit [label="autoexit\nif !activity"]
}
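
The todo graph can also be processed programmatically. The sketch below assumes that an edge `a -> b` means "a builds on b" (the diagram itself does not spell out its edge semantics) and uses Kahn's algorithm to compute an order in which every item comes after the items it points to. parseEdges and order are hypothetical helpers that handle only the simple one-edge-per-line form seen above; `//` lines are skipped, and the sample input uses a few edges copied from the listing.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// parseEdges extracts "a -> b;" statements, skipping // comment lines.
// Only the simple one-edge-per-line form used in todo.dot is handled.
func parseEdges(src string) [][2]string {
	var edges [][2]string
	for _, line := range strings.Split(src, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "//") || !strings.Contains(line, "->") {
			continue
		}
		parts := strings.SplitN(strings.TrimSuffix(line, ";"), "->", 2)
		edges = append(edges, [2]string{strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1])})
	}
	return edges
}

// order returns all nodes such that for every edge a -> b, b comes before a.
// It is Kahn's algorithm run on the reversed edges; ties are broken
// alphabetically so the result is deterministic.
func order(edges [][2]string) ([]string, error) {
	pending := map[string]int{}    // a -> number of its targets not yet emitted
	users := map[string][]string{} // b -> every a with an edge a -> b
	for _, e := range edges {
		a, b := e[0], e[1]
		pending[a]++
		if _, ok := pending[b]; !ok {
			pending[b] = 0
		}
		users[b] = append(users[b], a)
	}
	var ready, out []string
	for n, k := range pending {
		if k == 0 {
			ready = append(ready, n)
		}
	}
	for len(ready) > 0 {
		sort.Strings(ready)
		n := ready[0]
		ready = ready[1:]
		out = append(out, n)
		for _, a := range users[n] {
			pending[a]--
			if pending[a] == 0 {
				ready = append(ready, a)
			}
		}
	}
	if len(out) != len(pending) {
		return nil, errors.New("todo graph has a cycle")
	}
	return out, nil
}

func main() {
	src := `
	wcfs -> wcfsInvProcess;
	wcfs -> wcfsRead;
	wcfsInvProcess -> zobj2file;
	zobj2file -> zblk2file;
	zobj2file -> zbtree2file;
	zbtree2file -> δBTree;
	wcfsRead -> blktabGet;
	// wcfs -> wcfs_simple;
	`
	nodes, err := order(parseEdges(src))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(strings.Join(nodes, " "))
}
```

Because `//` lines are skipped, commenting an edge out in todo.dot also removes it from the computed order.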

wcfs/todo.svg

(diff not shown)

wcfs/wcfs.go

@@ -337,90 +337,8 @@ package main
 // and a client that wants @rev data will get @rev data, even if it was this
 // "old" client that triggered the pagefault(*).
 //
-// (*) we can change a mapping while a page from it is under pagefault:
-//
-//   - the kernel, upon handling pagefault, queues read request to filesystem
-//     server. As of Linux 4.20 this is done _with_ holding client->mm->mmap_sem:
-//
-//       kprobe:fuse_readpages (client->mm->mmap_sem.count: 1)
-//               fuse_readpages+1
-//               read_pages+109
-//               __do_page_cache_readahead+401
-//               filemap_fault+635
-//               __do_fault+31
-//               __handle_mm_fault+3403
-//               handle_mm_fault+220
-//               __do_page_fault+598
-//               page_fault+30
-//
-//   - however the read request is queued to be performed asynchronously -
-//     the kernel does not wait for it in fuse_readpages, because
-//
-//     * git.kernel.org/linus/c1aa96a5,
-//     * git.kernel.org/linus/9cd68455,
-//     * and go-fuse initially negotiating CAP_ASYNC_READ to the kernel.
-//
-//   - the kernel then _releases_ client->mm->mmap_sem and then waits
-//     for to-read pages to become ready:
-//
-//     * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2411
-//     * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2457
-//     * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n1301
-//
-//   - the filesystem server upon receiving the read request can manipulate
-//     client's address space. This requires to write-lock client->mm->mmap_sem,
-//     but we can be sure it won't deadlock because the kernel releases it
-//     before waiting (see previous point).
-//
-//     in practice the manipulation is done by another client thread, because
-//     on Linux it is not possible to change mm of another process. However
-//     the main point here is that the manipulation is possible because
-//     there will be no deadlock on client->mm->mmap_sem.
-//
-//   For the reference here is how filesystem server reply looks under trace:
-//
-//       kprobe:fuse_readpages_end
-//               fuse_readpages_end+1
-//               request_end+188
-//               fuse_dev_do_write+1921
-//               fuse_dev_write+78
-//               do_iter_readv_writev+325
-//               do_iter_write+128
-//               vfs_writev+152
-//               do_writev+94
-//               do_syscall_64+85
-//               entry_SYSCALL_64_after_hwframe+68
-//
-//   and a test program that demonstrates that it is possible to change
-//   mmapping while under pagefault to it:
-//
-//       https://lab.nexedi.com/kirr/go-fuse/commit/f822c9db
-//
-//   In the future mmap_sem might be released while doing any IO:
-//
-//       https://lwn.net/Articles/768857
-//
-//   but before that the analysis remains FUSE-specific.
-//
-//
-// (+) the kernel sends SIGSTOP to interrupt tracee, but the signal will be
-//     processed only when the process returns from kernel space, e.g. here
-//
-//       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160
-//
-//     This way the tracer won't receive obligatory information that tracee
-//     stopped (via wait...) and even though ptrace(ATTACH) succeeds, all other
-//     ptrace commands will fail:
-//
-//       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
-//       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207
-//
-//     My original idea was to use ptrace to run code in process to change it's
-//     memory mappings, while the triggering process is under pagefault/read
-//     to wcfs, and the above shows it won't work - trying to ptrace the
-//     client from under wcfs will just block forever (the kernel will be
-//     waiting for read operation to finish for ptrace, and read will be first
-//     waiting on ptrace stopping to complete = deadlock)
+// (*) see "Changing mmapping while under pagefault is possible" in notes.txt
+// (+) see "Client cannot be ptraced while under pagefault" in notes.txt
 //
 //
 // XXX mmap(@at) open