• Kirill Smelkov's avatar
    Y wcfs: Fix and enhance `wcfs status` to be reliable · e542211d
    Kirill Smelkov authored
    Since wcfs beginning - since e3f2ee2d (wcfs: Initial implementation of
    basic filesystem) `wcfs status` was implemented as just join and
    reporting ok if that worked. That, however, turned out to be not robust
    because if wcfs is deadlocked, accessing any file on the filesystem,
    even a simple file as .wcfs/zurl might hang and so the status could
    hang as well. We see lots of such hung `wcfs status` processes on
    current deployment.
    
    More, it might be the case that wcfs is deadlocked in another way - e.g.
    on zheadMu, and then accessing .wcfs/zurl will work ok, but the system
    is not in a good shape while `wcfs status` missed to report that.
    
    -> Rework `wcfs status` completely to try accessing different files on
    the filesystem and doing so in cautious way so that if wcfs is in
    problematic state `wcfs status` won't get hung and will report the
    details about wcfs server and also about filesystem clients: which files
    are kept open, and what is in-kernel traceback of the server and the
    clients in case wcfs is hung.
    
    Please see comments in the added status function for details.
    
    An example of "good" status output when everything is ok:
    
        (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/nexedi/wendelin.core$ wcfs status  file://D/1.fs
        INFO 1204 17:04:40.154  325432 __init__.py:506] wcfs: status file://D/1.fs ...
        ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8  (0:39)
        ok - wcfs server: pid325414 kirr wcfs
        ok - stat mountpoint
        ok - read .wcfs/zurl
        ok - read .wcfs/stats
    
    And example of "bad" status output when wcfs was simulated to be
    seen in deadlocked state by trying to read from .wcfs/debug/zhead
    instead of .wcfs/stats:
    
        root@deca:/home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core# wcfs status file://D/1.fs
        INFO 1204 17:21:04.145  325658 __init__.py:506] wcfs: status file://D/1.fs ...
        ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8  (0:39)
        ok - wcfs server: pid325414 kirr wcfs
        ok - stat mountpoint
        ok - read .wcfs/zurl
        fail - read .wcfs/stats: timed out (wcfs might be stuck)
    
        wcfs ktraceback:
        pid325414 kirr wcfs
        tid325414 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325418 kirr wcfs
        [<0>] hrtimer_nanosleep+0xc7/0x1b0
        [<0>] __x64_sys_nanosleep+0xbe/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325419 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325420 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325421 kirr wcfs
        [<0>] do_epoll_wait+0x698/0x7d0
        [<0>] do_compat_epoll_pwait.part.0+0xb/0x70
        [<0>] __x64_sys_epoll_pwait+0x91/0x140
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325422 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325423 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325426 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325427 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325428 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325429 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        tid325646 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        wcfs clients:
          pid325430 kirr bash ('bash',)
                cwd     -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8
    
                pid325430 kirr bash
                tid325430 kirr bash
                [<0>] do_select+0x661/0x830
                [<0>] core_sys_select+0x1ba/0x3a0
                [<0>] do_pselect.constprop.0+0xe9/0x180
                [<0>] __x64_sys_pselect6+0x53/0x80
                [<0>] do_syscall_64+0x58/0xc0
                [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
          pid325637 kirr ipython3 ('/usr/bin/python3', '/usr/bin/ipython3')
                fd/12   -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8/.wcfs/zurl
    
                pid325637 kirr ipython3
                tid325637 kirr ipython3
                [<0>] do_epoll_wait+0x698/0x7d0
                [<0>] __x64_sys_epoll_wait+0x6f/0x110
                [<0>] do_syscall_64+0x58/0xc0
                [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
                tid325638 kirr ipython3
                [<0>] futex_wait_queue+0x60/0x90
                [<0>] futex_wait+0x185/0x270
                [<0>] do_futex+0x106/0x1b0
                [<0>] __x64_sys_futex+0x8e/0x1d0
                [<0>] do_syscall_64+0x58/0xc0
                [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
                tid325640 kirr ipython3
                [<0>] futex_wait_queue+0x60/0x90
                [<0>] futex_wait+0x185/0x270
                [<0>] do_futex+0x106/0x1b0
                [<0>] __x64_sys_futex+0x8e/0x1d0
                [<0>] do_syscall_64+0x58/0xc0
                [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce
    
        Traceback (most recent call last):
          File "/home/kirr/src/wendelin/venv/z-dev/bin/wcfs", line 11, in <module>
            load_entry_point('wendelin.core', 'console_scripts', 'wcfs')()
          File "<decorator-gen-42>", line 2, in main
          File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe
            return f(*argv, **kw)
          File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 997, in main
            status(zurl)
          File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 592, in status
            verify("read .wcfs/stats", xos.readfile, "%s/.wcfs/debug/zhead" % mnt.point)
          File "<decorator-gen-43>", line 2, in verify
          File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe
            return f(*argv, **kw)
          File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 570, in verify
            fail("%s: timed out (wcfs might be stuck)" % subj)
          File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 512, in fail
            raise RuntimeError('(failed)')
        RuntimeError: (failed)
    e542211d
__init__.py 33.9 KB