Y wcfs: Fix and enhance `wcfs status` to be reliable
Since wcfs beginning - since e3f2ee2d (wcfs: Initial implementation of basic filesystem) `wcfs status` was implemented as just join and reporting ok if that worked. That, however, turned out to be not robust because if wcfs is deadlocked, accessing any file on the filesystem, even a simple file as .wcfs/zurl might hang and so the status could hang as well. We see lots of such hung `wcfs status` processes on current deployment. More, it might be the case that wcfs is deadlocked in another way - e.g. on zheadMu, and then accessing .wcfs/zurl will work ok, but the system is not in a good shape while `wcfs status` missed to report that. -> Rework `wcfs status` completely to try accessing different files on the filesystem and doing so in cautious way so that if wcfs is in problematic state `wcfs status` won't get hung and will report the details about wcfs server and also about filesystem clients: which files are kept open, and what is in-kernel traceback of the server and the clients in case wcfs is hung. Please see comments in the added status function for details. An example of "good" status output when everything is ok: (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/nexedi/wendelin.core$ wcfs status file://D/1.fs INFO 1204 17:04:40.154 325432 __init__.py:506] wcfs: status file://D/1.fs ... ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8 (0:39) ok - wcfs server: pid325414 kirr wcfs ok - stat mountpoint ok - read .wcfs/zurl ok - read .wcfs/stats And example of "bad" status output when wcfs was simulated to be seen in deadlocked state by trying to read from .wcfs/debug/zhead instead of .wcfs/stats: root@deca:/home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core# wcfs status file://D/1.fs INFO 1204 17:21:04.145 325658 __init__.py:506] wcfs: status file://D/1.fs ... ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8 (0:39) ok - wcfs server: pid325414 kirr wcfs ok - stat mountpoint ok - read .wcfs/zurl fail - read .wcfs/stats: timed out (wcfs might be stuck) wcfs ktraceback: pid325414 kirr wcfs tid325414 kirr wcfs [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse] [<0>] fuse_dev_read+0x79/0xb0 [fuse] [<0>] vfs_read+0x239/0x310 [<0>] ksys_read+0x6b/0xf0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325418 kirr wcfs [<0>] hrtimer_nanosleep+0xc7/0x1b0 [<0>] __x64_sys_nanosleep+0xbe/0xf0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325419 kirr wcfs [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse] [<0>] fuse_dev_read+0x79/0xb0 [fuse] [<0>] vfs_read+0x239/0x310 [<0>] ksys_read+0x6b/0xf0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325420 kirr wcfs [<0>] futex_wait_queue+0x60/0x90 [<0>] futex_wait+0x185/0x270 [<0>] do_futex+0x106/0x1b0 [<0>] __x64_sys_futex+0x8e/0x1d0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325421 kirr wcfs [<0>] do_epoll_wait+0x698/0x7d0 [<0>] do_compat_epoll_pwait.part.0+0xb/0x70 [<0>] __x64_sys_epoll_pwait+0x91/0x140 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325422 kirr wcfs [<0>] futex_wait_queue+0x60/0x90 [<0>] futex_wait+0x185/0x270 [<0>] do_futex+0x106/0x1b0 [<0>] __x64_sys_futex+0x8e/0x1d0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325423 kirr wcfs [<0>] futex_wait_queue+0x60/0x90 [<0>] futex_wait+0x185/0x270 [<0>] do_futex+0x106/0x1b0 [<0>] __x64_sys_futex+0x8e/0x1d0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325426 kirr wcfs [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse] [<0>] fuse_dev_read+0x79/0xb0 [fuse] [<0>] vfs_read+0x239/0x310 [<0>] ksys_read+0x6b/0xf0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325427 kirr wcfs [<0>] futex_wait_queue+0x60/0x90 [<0>] futex_wait+0x185/0x270 [<0>] do_futex+0x106/0x1b0 [<0>] __x64_sys_futex+0x8e/0x1d0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325428 kirr wcfs [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse] [<0>] fuse_dev_read+0x79/0xb0 [fuse] [<0>] vfs_read+0x239/0x310 [<0>] ksys_read+0x6b/0xf0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325429 kirr wcfs [<0>] futex_wait_queue+0x60/0x90 [<0>] futex_wait+0x185/0x270 [<0>] do_futex+0x106/0x1b0 [<0>] __x64_sys_futex+0x8e/0x1d0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325646 kirr wcfs [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse] [<0>] fuse_dev_read+0x79/0xb0 [fuse] [<0>] vfs_read+0x239/0x310 [<0>] ksys_read+0x6b/0xf0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce wcfs clients: pid325430 kirr bash ('bash',) cwd -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8 pid325430 kirr bash tid325430 kirr bash [<0>] do_select+0x661/0x830 [<0>] core_sys_select+0x1ba/0x3a0 [<0>] do_pselect.constprop.0+0xe9/0x180 [<0>] __x64_sys_pselect6+0x53/0x80 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce pid325637 kirr ipython3 ('/usr/bin/python3', '/usr/bin/ipython3') fd/12 -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8/.wcfs/zurl pid325637 kirr ipython3 tid325637 kirr ipython3 [<0>] do_epoll_wait+0x698/0x7d0 [<0>] __x64_sys_epoll_wait+0x6f/0x110 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325638 kirr ipython3 [<0>] futex_wait_queue+0x60/0x90 [<0>] futex_wait+0x185/0x270 [<0>] do_futex+0x106/0x1b0 [<0>] __x64_sys_futex+0x8e/0x1d0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce tid325640 kirr ipython3 [<0>] futex_wait_queue+0x60/0x90 [<0>] futex_wait+0x185/0x270 [<0>] do_futex+0x106/0x1b0 [<0>] __x64_sys_futex+0x8e/0x1d0 [<0>] do_syscall_64+0x58/0xc0 [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce Traceback (most recent call last): File "/home/kirr/src/wendelin/venv/z-dev/bin/wcfs", line 11, in <module> load_entry_point('wendelin.core', 'console_scripts', 'wcfs')() File "<decorator-gen-42>", line 2, in main File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe return f(*argv, **kw) File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 997, in main status(zurl) File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 592, in status verify("read .wcfs/stats", xos.readfile, "%s/.wcfs/debug/zhead" % mnt.point) File "<decorator-gen-43>", line 2, in verify File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe return f(*argv, **kw) File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 570, in verify fail("%s: timed out (wcfs might be stuck)" % subj) File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 512, in fail raise RuntimeError('(failed)') RuntimeError: (failed)
Showing