wcfs: Fix and enhance `wcfs stop` to be more reliable

Since the beginning of wcfs - since e3f2ee2d (wcfs: Initial implementation of basic filesystem) - `wcfs stop` was implemented as just `fusermount -u`. That, however, turned out not to be robust: if wcfs is deadlocked, unmounting hangs, and if the wcfs server has crashed but there are still running client processes, unmount fails with a "Device or resource busy" error. In the deadlocked case we often see a situation where both wcfs and client zope processes are hung, kill -9 does not work on them (they still remain hung), and there is no easy way to do the unmount and restart wcfs.

-> Fix `wcfs stop` to do that by first breaking the deadlock via /sys/fs/fuse/connections/<X>/abort and making sure that:

1) wcfs.go is not running,
2) all remaining clients are terminated, and
3) the mount is also gone.

In many ways this coincides with what Server.stop was already doing, so here we teach `wcfs stop` to work via that Server.stop codepath and adjust the latter to work ok if Server._proc is not only a subprocess.Popen that the current process spawned, but also an xos.Proc that `wcfs stop` discovered. Server._proc can also be None if wcfs.go crashed by itself.

As explained in the comments, I took the decision to kill client processes instead of doing the final unmount try lazily because

    # NOTE if we do `fusermount -uz` (lazy unmount = MNT_DETACH), we will
    #      remove the mount from filesystem tree and /proc/mounts, but the
    #      clients will be left alive and using the old filesystem which is
    #      left in a bad ENOTCONN state. From this point of view restart of
    #      the clients is more preferred compared to leaving them running
    #      but actually disconnected from the data.
    #
    # TODO try to teach wcfs clients to detect filesystem in ENOTCONN state
    #      and reconnect automatically instead of being killed. Then we could
    #      use MNT_DETACH.
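
The stop order described above can be sketched roughly as follows. This is only an illustration, not the code from this commit: `fuse_abort_path` and `stop_plan` are hypothetical names, and the sketch assumes that the directory under /sys/fs/fuse/connections/ is named after the minor device number of the mountpoint's `st_dev`, which should be checked against the kernel FUSE documentation.

```python
import os

def fuse_abort_path(st_dev):
    """Return the sysfs abort file for a FUSE mount with the given st_dev.

    Assumption: the connection directory is named after the minor device
    number of the mount. Writing anything into the returned file aborts
    the FUSE connection.
    """
    return "/sys/fs/fuse/connections/%d/abort" % os.minor(st_dev)

def stop_plan(mounted, server_running, client_pids):
    """Return the ordered list of actions for a simplified stop procedure.

    Hypothetical helper: it only models the order from the commit message
    (abort first, then server, then clients, then the mount itself).
    """
    plan = []
    if mounted:
        plan.append("abort FUSE connection")
    if server_running:
        plan.append("terminate wcfs.go")
    for pid in client_pids:
        plan.append("kill client %d" % pid)
    if mounted:
        plan.append("unmount")
    return plan
```

For a real mount the argument would come from `os.stat(mountpoint).st_dev`. When wcfs.go has already crashed, `server_running` is false and the plan goes straight from the abort to the clients and the unmount.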

Commit 4095241e authored Dec 04, 2024 by Kirill Smelkov

Showing with 171 additions and 32 deletions

wcfs/__init__.py +150 -30
wcfs/wcfs_test.py +21 -2

--- a/wcfs/__init__.py
+++ b/wcfs/__init__.py
--- a/wcfs/wcfs_test.py
+++ b/wcfs/wcfs_test.py
@@ -410,9 +410,28 @@ class tWCFS(_tWCFS):
            assert not is_mountpoint(t.wc.mountpoint)
        defer(_)
        def _():
-            def onstuck():
+            def on_wcfs_stuck():
                fail("wcfs.go does not exit even after SIGKILL")
-            t.wc._wcsrv._stop(timeout(), _onstuck=onstuck)
+
+            # do not kill clients when the filesystem is still in use on stop
+            # and use -z (lazy) unmount instead because during tests it is more
+            # convenient that this last unmount unconditionally succeed and we
+            # do not care that much about file descriptors left open by a buggy
+            # test function.
+            #
+            # NOTE this behaviour is different from on-production stop behaviour
+            #      where we make sure that either stop fails or completes and there
+            #      is no more a) mount, b) wcfs.go running and c) clients using the old mount.
+            def on_fs_busy():
+                wcfs.log.warn("test: not killing clients during test run to avoid killing test driver itself")
+            def on_last_unmount_try(mntpt):
+                wcfs.log.warn("test: -> unmount -z ...")
+                wcfs._fuse_unmount(mntpt, "-z")
+
+            t.wc._wcsrv._stop(timeout(),
+                              _on_wcfs_stuck=on_wcfs_stuck,
+                              _on_fs_busy=on_fs_busy,
+                              _on_last_unmount_try=on_last_unmount_try)
        defer(_)
        defer(t.wc.close)
        assert is_mountpoint(t.wc.mountpoint)
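
The test hunk above replaces what stop does when the filesystem is still busy. A minimal sketch of that hook pattern is shown here; the function `stop` and its `try_unmount` parameter are hypothetical and only mirror the keyword names visible in the diff, since the real `Server._stop` is not shown in this hunk and very likely differs.

```python
def stop(mntpt, try_unmount, on_fs_busy, on_last_unmount_try):
    """Unmount mntpt, delegating the busy-filesystem policy to the caller.

    Hypothetical sketch: try_unmount(mntpt) returns True on success.
    If the mount is busy, on_fs_busy() decides what happens to clients
    and on_last_unmount_try(mntpt) performs the final unmount attempt.
    """
    if try_unmount(mntpt):
        return
    on_fs_busy()
    on_last_unmount_try(mntpt)

# Test-style policy: leave clients alive and fall back to a lazy unmount.
events = []
stop("/mnt/wcfs-demo",
     try_unmount=lambda mntpt: False,   # pretend the mount is busy
     on_fs_busy=lambda: events.append("busy: clients left alive"),
     on_last_unmount_try=lambda mntpt: events.append("lazy unmount " + mntpt))
```

Passing the policy in as callbacks keeps a single stop codepath while letting production kill the clients and the tests use a lazy unmount instead.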