Commit e542211d authored by Kirill Smelkov

Y wcfs: Fix and enhance `wcfs status` to be reliable

Since the beginning of wcfs - since e3f2ee2d (wcfs: Initial implementation of
basic filesystem) - `wcfs status` was implemented as just join, reporting ok
if the join succeeded. That, however, turned out not to be robust: if wcfs is
deadlocked, accessing any file on the filesystem, even a simple file such as
.wcfs/zurl, might hang, and so the status check could hang as well. We see
lots of such hung `wcfs status` processes on current deployments.

Moreover, wcfs might be deadlocked in another way - e.g. on zheadMu - in
which case accessing .wcfs/zurl still works ok, but the system is not in
good shape, yet `wcfs status` fails to report that.

-> Rework `wcfs status` completely: try accessing several different files on
the filesystem, and do so in a cautious way, so that even if wcfs is in a
problematic state `wcfs status` does not hang itself, and reports details
about the wcfs server and also about filesystem clients: which files are
kept open, and what the in-kernel tracebacks of the server and the clients
are in case wcfs is hung.

Please see comments in the added status function for details.
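
The core idea is to run every filesystem access under a timeout, in a way
that cannot block the status process itself. Below is a minimal sketch of
such a timeout-guarded check; it is illustrative only - verify_sketch,
readfile and TIMEOUT are hypothetical names, and the real verify in
wcfs/__init__.py is organized differently:

    # sketch: run a check in a background thread so that the status process
    # itself never blocks on a possibly-deadlocked filesystem.
    # verify_sketch/readfile/TIMEOUT are illustrative, not the real wcfs API.
    import threading

    TIMEOUT = 5.0   # seconds to wait before declaring wcfs potentially stuck

    def verify_sketch(subj, f, *argv):
        done = threading.Event()
        err  = []
        def _():
            try:
                f(*argv)
            except Exception as e:
                err.append(e)
            finally:
                done.set()
        # daemon thread: even if the check stays stuck forever,
        # it does not prevent the status process from exiting
        threading.Thread(target=_, daemon=True).start()
        if not done.wait(TIMEOUT):
            print("fail - %s: timed out (wcfs might be stuck)" % subj)
        elif err:
            print("fail - %s: %s" % (subj, err[0]))
        else:
            print("ok - %s" % subj)

    def readfile(path):
        with open(path, "rb") as f:
            f.read()

    # e.g. verify_sketch("read .wcfs/zurl", readfile, mntpt + "/.wcfs/zurl")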

An example of "good" status output when everything is ok:

    (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/nexedi/wendelin.core$ wcfs status  file://D/1.fs
    INFO 1204 17:04:40.154  325432 __init__.py:506] wcfs: status file://D/1.fs ...
    ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8  (0:39)
    ok - wcfs server: pid325414 kirr wcfs
    ok - stat mountpoint
    ok - read .wcfs/zurl
    ok - read .wcfs/stats

And an example of "bad" status output, where a deadlocked wcfs state was
simulated by making the status check read from .wcfs/debug/zhead instead
of .wcfs/stats:

    root@deca:/home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core# wcfs status file://D/1.fs
    INFO 1204 17:21:04.145  325658 __init__.py:506] wcfs: status file://D/1.fs ...
    ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8  (0:39)
    ok - wcfs server: pid325414 kirr wcfs
    ok - stat mountpoint
    ok - read .wcfs/zurl
    fail - read .wcfs/stats: timed out (wcfs might be stuck)

    wcfs ktraceback:
    pid325414 kirr wcfs
    tid325414 kirr wcfs
    [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
    [<0>] fuse_dev_read+0x79/0xb0 [fuse]
    [<0>] vfs_read+0x239/0x310
    [<0>] ksys_read+0x6b/0xf0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325418 kirr wcfs
    [<0>] hrtimer_nanosleep+0xc7/0x1b0
    [<0>] __x64_sys_nanosleep+0xbe/0xf0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325419 kirr wcfs
    [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
    [<0>] fuse_dev_read+0x79/0xb0 [fuse]
    [<0>] vfs_read+0x239/0x310
    [<0>] ksys_read+0x6b/0xf0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325420 kirr wcfs
    [<0>] futex_wait_queue+0x60/0x90
    [<0>] futex_wait+0x185/0x270
    [<0>] do_futex+0x106/0x1b0
    [<0>] __x64_sys_futex+0x8e/0x1d0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325421 kirr wcfs
    [<0>] do_epoll_wait+0x698/0x7d0
    [<0>] do_compat_epoll_pwait.part.0+0xb/0x70
    [<0>] __x64_sys_epoll_pwait+0x91/0x140
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325422 kirr wcfs
    [<0>] futex_wait_queue+0x60/0x90
    [<0>] futex_wait+0x185/0x270
    [<0>] do_futex+0x106/0x1b0
    [<0>] __x64_sys_futex+0x8e/0x1d0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325423 kirr wcfs
    [<0>] futex_wait_queue+0x60/0x90
    [<0>] futex_wait+0x185/0x270
    [<0>] do_futex+0x106/0x1b0
    [<0>] __x64_sys_futex+0x8e/0x1d0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325426 kirr wcfs
    [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
    [<0>] fuse_dev_read+0x79/0xb0 [fuse]
    [<0>] vfs_read+0x239/0x310
    [<0>] ksys_read+0x6b/0xf0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325427 kirr wcfs
    [<0>] futex_wait_queue+0x60/0x90
    [<0>] futex_wait+0x185/0x270
    [<0>] do_futex+0x106/0x1b0
    [<0>] __x64_sys_futex+0x8e/0x1d0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325428 kirr wcfs
    [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
    [<0>] fuse_dev_read+0x79/0xb0 [fuse]
    [<0>] vfs_read+0x239/0x310
    [<0>] ksys_read+0x6b/0xf0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325429 kirr wcfs
    [<0>] futex_wait_queue+0x60/0x90
    [<0>] futex_wait+0x185/0x270
    [<0>] do_futex+0x106/0x1b0
    [<0>] __x64_sys_futex+0x8e/0x1d0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    tid325646 kirr wcfs
    [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
    [<0>] fuse_dev_read+0x79/0xb0 [fuse]
    [<0>] vfs_read+0x239/0x310
    [<0>] ksys_read+0x6b/0xf0
    [<0>] do_syscall_64+0x58/0xc0
    [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    wcfs clients:
      pid325430 kirr bash ('bash',)
            cwd     -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8

            pid325430 kirr bash
            tid325430 kirr bash
            [<0>] do_select+0x661/0x830
            [<0>] core_sys_select+0x1ba/0x3a0
            [<0>] do_pselect.constprop.0+0xe9/0x180
            [<0>] __x64_sys_pselect6+0x53/0x80
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

      pid325637 kirr ipython3 ('/usr/bin/python3', '/usr/bin/ipython3')
            fd/12   -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8/.wcfs/zurl

            pid325637 kirr ipython3
            tid325637 kirr ipython3
            [<0>] do_epoll_wait+0x698/0x7d0
            [<0>] __x64_sys_epoll_wait+0x6f/0x110
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

            tid325638 kirr ipython3
            [<0>] futex_wait_queue+0x60/0x90
            [<0>] futex_wait+0x185/0x270
            [<0>] do_futex+0x106/0x1b0
            [<0>] __x64_sys_futex+0x8e/0x1d0
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

            tid325640 kirr ipython3
            [<0>] futex_wait_queue+0x60/0x90
            [<0>] futex_wait+0x185/0x270
            [<0>] do_futex+0x106/0x1b0
            [<0>] __x64_sys_futex+0x8e/0x1d0
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    Traceback (most recent call last):
      File "/home/kirr/src/wendelin/venv/z-dev/bin/wcfs", line 11, in <module>
        load_entry_point('wendelin.core', 'console_scripts', 'wcfs')()
      File "<decorator-gen-42>", line 2, in main
      File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 997, in main
        status(zurl)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 592, in status
        verify("read .wcfs/stats", xos.readfile, "%s/.wcfs/debug/zhead" % mnt.point)
      File "<decorator-gen-43>", line 2, in verify
      File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 570, in verify
        fail("%s: timed out (wcfs might be stuck)" % subj)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 512, in fail
        raise RuntimeError('(failed)')
    RuntimeError: (failed)
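
The kernel-side tracebacks and the client list shown above are available
from procfs alone, so collecting them does not touch the possibly-stuck
filesystem. A rough sketch of how such information can be gathered, assuming
Linux (reading /proc/<pid>/task/<tid>/stack generally requires root); the
helper names here are illustrative, not the actual wcfs code:

    # sketch: collect in-kernel tracebacks and filesystem clients via /proc.
    # ktraceback_sketch/clients_sketch are illustrative names.
    import os

    def ktraceback_sketch(pid):
        # in-kernel stack of every thread of pid; generally needs root
        out = []
        for tid in sorted(os.listdir("/proc/%d/task" % pid), key=int):
            with open("/proc/%d/task/%s/stack" % (pid, tid)) as f:
                out.append("tid%s\n%s" % (tid, f.read()))
        return "\n".join(out)

    def clients_sketch(mntpt):
        # processes that keep the mounted filesystem in use, found by
        # scanning cwd and fd/* symlinks of every process
        found = {}
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                names = ["cwd"] + ["fd/%s" % fd
                                   for fd in os.listdir("/proc/%s/fd" % pid)]
                for name in names:
                    dst = os.readlink("/proc/%s/%s" % (pid, name))
                    if dst == mntpt or dst.startswith(mntpt + "/"):
                        found.setdefault(int(pid), []).append((name, dst))
            except OSError:
                pass    # process vanished or access denied
        return found
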
parent b0c8b32a