- 05 May, 2026 1 commit
-
-
By default, FUSE restricts access to the mounted filesystem to the user who performed the mount. This prevents other users from accessing WCFS and limits multi-user deployments. FUSE's 'allow_other' option [1] enables access for all users, but that can create a security risk on systems where only some users are trusted. This patch introduces a new '-sharewith' flag that specifies an OS group with which WCFS access is shared. 'allow_other' is enabled only if this flag is set, preventing unintentional exposure.

NOTE Automatically testing this feature is difficult because it requires privileged operations. Therefore, this patch adds a manual test at 'wcfs/testprog/wcfs_verify_permissions.py'.

[1] See 'allow_other' option at https://docs.kernel.org/filesystems/fuse/fuse.html

--------

kirr:

- redevelop the test almost from scratch to be run automatically via unshare + subordinate uid/gid
- also fix mode for directories, not only for files
- activate "default...
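As a rough illustration of the group-based sharing idea, here is a minimal Python sketch of checking whether the invoking user belongs to the group named by -sharewith; the helper name is hypothetical and this is not the actual wcfs flag handling:

    import grp
    import os

    def in_sharewith_group(sharewith):
        # hypothetical helper: True if the current user may access WCFS
        # when access is shared with the OS group named by -sharewith
        g = grp.getgrnam(sharewith)   # raises KeyError if the group is absent
        return g.gr_gid == os.getgid() or g.gr_gid in os.getgroups()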
-
- 18 Mar, 2026 4 commits
-
-
Kirill Smelkov authored
After the previous patch, and with recent pygolang, wendelin.core compiles and passes tests on py3.13.

/cc @tomo
/reviewed-by @levin.zimmermann, @jerome
/reviewed-on nexedi/wendelin.core!43 + nexedi/slapos!1863 (comment 260064)
-
Kirill Smelkov authored
With Cython 3 the default for functions that return void and have no explicit `except` specification was switched from `noexcept` to `except *`, even for functions that are nogil. As a result, when Cython 3 compiles e.g. wcfs/internal/wcfs_test.pyx, it complains with

    def install_sigbus_trap():
        cdef sigaction_t act
        act.sa_sigaction = on_sigbus
                          ^
    ------------------------------------------------------------
    wcfs/internal/wcfs_test.pyx:192:23: Cannot assign type 'void (int, siginfo_t *, void *) except * nogil' to 'void (*)(int, siginfo_t *, void *) noexcept'. Exception values are incompatible. Suggest adding 'noexcept' to the type of 'on_sigbus'.

In wendelin.core, similarly to pygolang, there are many nogil functions and the perception of those is that they are unrelated to the python world unless explicitly programmed otherwise via e.g. `with gil` sections inside. So a `noexcept` specification on such nogil functions could mislead the reader into thinking that this noexcept is about e.g. the C++ part or panic.

-> Avoid this confusion by activating the "legacy" mode of having "no py except" by default when not specified explicitly. It works on both Cython 3 and Cython 0.29.x because the added directive is simply ignored by Cython 0.29.x, where the builtin behaviour is already ok.

This patch is based on and mirrors the following pygolang patch: pygolang@b5bb9f7e

/cc @tomo
/reviewed-by @levin.zimmermann, @jerome
/reviewed-on nexedi/wendelin.core!43 + nexedi/slapos!1863 (comment 260064)
-
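For context: Cython 3 exposes this "legacy" mode as the legacy_implicit_noexcept compiler directive, so the activation can be sketched as a directive comment at the top of a .pyx file (the exact placement in wendelin.core's build may differ):

    # cython: legacy_implicit_noexcept=True
    #
    # With this directive, functions without an explicit `except` clause keep
    # the pre-Cython-3 behaviour of not propagating Python exceptions.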
Kirill Smelkov authored
All tests are now passing with that python version as well.

/cc @tomo
/reviewed-by @levin.zimmermann, @jerome
/reviewed-on nexedi/wendelin.core!43 + nexedi/slapos!1863 (comment 260064)
-
Kirill Smelkov authored
bigfile/py: Starting from py3.12 current exception state is kept only in current_exception, no longer using curexc_type and curexc_traceback

curexc_type and curexc_traceback can both be derived from the current exception value. See https://github.com/python/cpython/commit/feec49c40736 and https://github.com/python/cpython/issues/101578 for details.

-> Rework the code to only use current_exception on py ≥ 3.12, similarly to how we did it for the "handled" exception in 8071e2da (bigfile/py: Starting from py3.11 exception state is kept only in exc_value no longer using exc_type and exc_traceback).

/cc @tomo
/reviewed-by @levin.zimmermann, @jerome
/reviewed-on !43 + slapos!1863 (comment 260064)
-
- 30 Jan, 2026 2 commits
-
-
On kernel 5.10.0-10-amd64, wcfs tests intermittently fail due to inconsistent FUSE error codes when the filesystem daemon is killed: some read operations receive ECONNABORTED while others receive ENOTCONN, causing test assertions to fail [1]. The inconsistency was observed on an older kernel (5.10.x) but not on a newer kernel (6.1.0-40-amd64+), suggesting the race has been fixed in later kernel versions.

[1] Traceback is:

    ______________________________ test_start_after_crash _______________________________

        @func
        def test_start_after_crash():
            zurl  = testzurl
            mntpt = testmntpt
    >       wc = start_and_crash_wcfs(zurl, mntpt)

    wcfs/wcfs_test.py:214:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    zurl = 'file:///srv/slapgrid/slappart77/tmp/testdb_fs.wYO8Fy/1.fs', mntpt = '/dev/shm/wcfs/306b2220fc0d6299336a7d2279556c07d48a868a'

        def start_and_crash_wcfs(zurl, mntpt):  # -> WCFS
            # /proc/mounts should not contain wcfs entry
            with raises(KeyError):
                procmounts_lookup_wcfs(zurl)

            # start the server with attached client
            wcsrv = wcfs.start(zurl)
            assert wcsrv.mountpoint == mntpt
            assert mntpt not in wcfs._wcregistry

            wc = wcfs.join(zurl, autostart=False)
            assert wcfs._wcregistry[mntpt] is wc
            assert wc.mountpoint == mntpt
            assert xos.readfile(mntpt + "/.wcfs/zurl") == zurl

            # /proc/mounts should now contain wcfs entry
            assert procmounts_lookup_wcfs(zurl) == mntpt

            # kill the server
            os.kill(wcsrv._proc.pid, SIGKILL)
            assert procwait_(context.background(), wcsrv._proc)

            # access to filesystem should raise "Transport endpoint not connected"
            with raises(IOError) as exc:
                xos.readfile(mntpt + "/.wcfs/zurl")
    >       assert exc.value.errno == ENOTCONN
    E       AssertionError: assert 103 == 107
    E        +  where 103 = IOError(103, 'Software caused connection abort').errno
    E        +    where IOError(103, 'Software caused connection abort') = <ExceptionInfo IOError tblen=2>.value

    wcfs/wcfs_test.py:285: AssertionError

--------

kirr:

- go through all wcfs places and adjust them to also handle ECONNABORTED where ENOTCONN was previously there. This covers more places in tests and actual production code in wcfs/__init__.py
- in tests assert that eventually we still reach ENOTCONN while ECONNABORTED is allowed to be interim.
- explain the problem in comments.
The problem turned out not to be specific to linux 5.10 and should be present on modern kernels as well: in do_exit, exit_files is called before exit_notify, which sets tsk->exit_state = EXIT_ZOMBIE:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/exit.c?h=v6.18-rc5#n897

but exit_files, even though it is called before exit_notify, releases opened file descriptors in delayed work:

5.10:

    exit_files          https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v5.10#n451
    put_files_struct    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v5.10#n427
    close_files         https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v5.10#n402
    filp_close          https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/open.c?h=v5.10#n1266
    fput_many           https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file_table.c?h=v5.10#n335

6.19-rc7:

    exit_files          https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.19-rc7#n518
    put_files_struct    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.19-rc7#n506
    close_files         https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.19-rc7#n474
    filp_close          https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/open.c?h=v6.19-rc7#n1542
    fput_close          https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file_table.c?h=v6.19-rc7#n582
    __fput_deferred     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file_table.c?h=v6.19-rc7#n518

which means those fds may be released either before or after exit_notify, depending on scheduling.

Now let's consider what happens when the wcfs server is killed: the kernel goes to do_exit for wcfs and invokes exit_files() and then exit_notify(). Once exit_notify is done, wcfs->task_state is set to Z and wait on the wcfs process becomes ready. However, if the delayed work has not been executed yet, the kernel still sees the fd for /dev/fuse, which wcfs was using for FUSE exchange with the kernel, as open. This way, when the test in question tries to access any wcfs/anyfile, it results in fuse_lookup() being called; that fuse_lookup enters fuse_lookup_name -> fuse_request_simple -> fuse_get_req, still sees fuse_conn->connected=1, because the last closure of the /dev/fuse fd has not yet happened, and so creates and queues a request to the wcfs server. But once the closure of the /dev/fuse fd actually happens, fuse_dev_release -> fuse_dev_end_requests wakes up all queued requests to be completed with ECONNABORTED:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n2712
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n2535
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n2407

That's how ECONNABORTED popped up in the problematic test run; it is not specific to the 5.10 kernel at all.

NOTE should the /dev/fuse fd closure actually happen _before_ the test process wakes up and tries to access wcfs/anyfile, then fuse_lookup -> ... -> fuse_get_req would see fuse_conn->connected=0 and return ENOTCONN outright:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n217

Original patch: levin.zimmermann/wendelin.core@a99b7979

/reviewed-by @kirr
/reviewed-on !34
-
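To make the adjusted test behaviour concrete, here is a sketch of treating ECONNABORTED as an allowed interim state while still requiring the eventual ENOTCONN; the helper name is hypothetical, the actual assertions live in wcfs/wcfs_test.py:

    import errno
    import time

    def wait_enotconn(read, timeout=10.0):
        # hypothetical helper: keep probing until ENOTCONN is reached;
        # ECONNABORTED is tolerated as an interim state of the race described above
        deadline = time.time() + timeout
        while True:
            try:
                read()
            except (IOError, OSError) as e:
                if e.errno == errno.ENOTCONN:
                    return                      # final expected state
                if e.errno != errno.ECONNABORTED:
                    raise                       # any other error is a real failure
            else:
                raise AssertionError("read unexpectedly succeeded")
            if time.time() > deadline:
                raise AssertionError("did not reach ENOTCONN in time")
            time.sleep(0.1)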
Previously, Server._proc had a dual-type design where it could be either:

1. subprocess.Popen when the server was spawned by _start()
2. xos.Proc when the server was discovered via status() or stop()

This dual-type design caused an error in _stuckdump() when trying to call .get() on Popen objects, since only xos.Proc has the .get() method for accessing process properties like ktraceback.

The distinction between spawned vs. discovered processes was never actually leveraged - all operations only care about the PID, liveness checks, sending signals, and getting process information. None of these require knowing whether we spawned the process or discovered it.

NOTE _procwait() and _proc_isalive() still support both types since they are general-purpose utilities used with subprocess.Popen objects in tests [1].

[1] https://lab.nexedi.com/nexedi/wendelin.core/-/blob/c0ffbcda/wcfs/wcfs_test.py#L1475-1477

--------

kirr:

- add test
- keep reference to spawned subprocess.Popen to avoid warnings: https://github.com/python/cpython/blob/v3.14.2-286-gde1b2cce302/Lib/subprocess.py#L1131-L1139
- assert that looked-up xos.Proc for spawned WCFS is not None
- use regular wcfs.test._procwait without assert as that function raises an error when not succeeding. Use timeout() as waiting context instead of context.background() similarly to other in-test places.

Without the fix the added test fails as

    __________ test_wcfs_stuckdump_crash_after_start ___________

        @func
        def test_wcfs_stuckdump_crash_after_start():
            wcsrv = wcfs.start(testzurl)
            defer(wcsrv.stop)

    >       wcsrv._stuckdump()  # used to AttributeError: 'Popen' object has no attribute 'get'

    wcfs/wcfs_test.py:2060:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    <decorator-gen-40>:2: in _stuckdump
        ???
    ../../tools/go/pygolang/golang/__init__.py:166: in _goframe
        return f(*argv, **kw)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    wcsrv = <wendelin.wcfs.Server instance at 0x7fe647b9af00>

        @func(Server)
        def _stuckdump(wcsrv):  # -> str
            v = []
            def emit(text=''):
                v.append(text+"\n")

            emit("\nwcfs ktraceback:")
    >       emit(wcsrv._proc.get('ktraceback', eperm="strerror", gone="strerror"))
    E       AttributeError: 'Popen' object has no attribute 'get'

    wcfs/__init__.py:754: AttributeError

Original patch: levin.zimmermann/wendelin.core@4b4e1f22

/reviewed-by @kirr
/reviewed-on nexedi/wendelin.core!34
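A sketch of the unified approach under the notes above: spawn with subprocess, keep the Popen referenced only to silence the interpreter's "subprocess is still running" warning, and do all further work through the looked-up xos.Proc. The lookup-by-pid call is a hypothetical stand-in for whatever ProcDB query the real code uses:

    import subprocess
    from wendelin.wcfs.internal import os as xos

    def _start_server(argv):
        popen = subprocess.Popen(argv)
        proc = xos.ProcDB.open().get(popen.pid)  # hypothetical lookup-by-pid API
        assert proc is not None
        # keep popen referenced so it is not garbage-collected while running,
        # but use xos.Proc for signals, liveness checks and ktraceback access
        return popen, proc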
-
- 17 Nov, 2025 6 commits
-
-
Kirill Smelkov authored
wcfs: Don't pinkill clients that become killed by OS or otherwise exit when they receive pin notification

When a block is changed by a new transaction, WCFS notifies clients that already have that block mmapped, via a pin notification over the watchlink channel, about the need to remmap the block to a particular revision. That pin notification mechanism must be collectively and cooperatively followed by both WCFS and _all_ clients for it to work reliably and for WCFS to never enter a situation where corrupt data are provided to a client. The mechanism involves acknowledgements from clients, and WCFS waits for that acknowledgement before unpausing _all_ clients that try to read the block simultaneously(*). So if a client is slow to respond to a pin notification, or does not respond at all, it creates a "progress" problem for everyone else and the system can get stuck.

To solve that problem WCFS implements protection against slow / faulty clients: such a client is killed by WCFS with SIGBUS to unblock the system and to make progress while maintaining the invariant that all alive clients are provided with correct data. That logic was implemented in c559ec1a (wcfs: Implement protection against faulty client) and is documented in (+).

It works, but there is one peculiarity: if a client, upon receiving a pin notification, is killed by a separate mechanism, e.g. via a kill signal from the OS, WCFS still wants to kill that client and logs a huge warning about that, because stuck/incorrect clients are considered abnormal.

-> Fix emission of that warning by first checking whether the client is still alive when the "bad pin reply" condition is detected, and avoiding the warning if the client is not there anymore.

For reference, here is what happens when a client gets killed by the OS:

- the OS kernel starts to close all file descriptors of the client process
- which invokes closure of opened head/watch handles
- which might trigger on the WCFS side e.g. a "peer closed its end" error when WCFS tries to send the pin notification, or "unexpected EOF" / "await canceled" when WCFS awaits the reply
- the OS kernel switches the process state to Z (Zombie) in the process table

This patch takes the following patch by Levin into account: levin.zimmermann/wendelin.core@ff8a2d1a .

(*) Isolation protocol description: https://lab.nexedi.com/nexedi/wendelin.core/-/blob/c0ffbcda/wcfs/wcfs.go#L93-183
(+) Protection against slow or faulty clients: https://lab.nexedi.com/nexedi/wendelin.core/-/blob/c0ffbcda/wcfs/wcfs.go#L186-217

Co-authored-by:
Levin Zimmermann <levin.zimmermann@nexedi.com>

/reviewed-on !33
-
When a WCFS client doesn't respond to a pin request in time, the server attempts to kill it [1]. However, there are cases where the client may stop for unrelated reasons (e.g. being restarted by another program) after the pin request is sent. In such situations, WCFS should not forcefully kill the client, as this leads to misleading logs suggesting the client was faulty, when in fact it was simply restarted.

This commit adds tests that reproduce these scenarios and verify that WCFS only kills clients when truly necessary. Note: these tests currently fail, as WCFS still kills dead or dying clients.

[1] nexedi/wendelin.core@c559ec1a

--------

kirr:

- use os._exit instead of sys.exit to simulate OS-level process kill
- add 2·pinkill sleep before verifying that the process was not pinkilled by wcfs
- mark added test with xfail
- don't duplicate code
- cosmetics

Added test currently fails as e.g.

    INFO     wcfs:__init__.py:301 starting for file:///tmp/testdb_fs.01bOZy/1.fs ...
    I1110 15:21:59.221350 2938668 wcfs.go:2754] start "/dev/shm/wcfs/5d8d6942d7f39fa05fe1024e4c8a8c21a44e1254" "file:///tmp/testdb_fs.01bOZy/1.fs"
    I1110 15:21:59.221419 2938668 wcfs.go:2760] (built with go1.25.4)
    W1110 15:21:59.221546 2938668 15:21] 9.221542 zodb: FIXME: open file:///tmp/testdb_fs.01bOZy/1.fs: raw cache is not ready for invalidations -> NoCache forced
    INFO     wcfs:__init__.py:343 started pid2938668 @ /dev/shm/wcfs/5d8d6942d7f39fa05fe1024e4c8a8c21a44e1254

    M: commit -> @at1 (0404bfc5fd8b75ee)
    M:      f<0000000000000010>     [2]
    M: commit -> @at2 (0404bfc5fdadfd00)
    M:      f<0000000000000010>     [2]

    C: setup watch f<0000000000000010> @at1 (0404bfc5fd8b75ee)
    # pinok: {2: @at1 (0404bfc5fd8b75ee)}
    E1110 15:22:00.463181 2938668 wcfs.go:1603] pid2938691: client failed to handle pin notification correctly and timely in 3s: pin #2 @0404bfc5fd8b75ee: sendReq: waiting for reply: context canceled
    E1110 15:22:00.463203 2938668 wcfs.go:1603] pid2938691: -> killing it because else 1) all other clients will remain stuck, and 2) we no longer can provide correct data to the faulty client.
    E1110 15:22:00.463209 2938668 wcfs.go:1603] pid2938691: (see "Protection against slow or faulty clients" in wcfs description for details)
    E1110 15:22:00.463217 2938668 wcfs.go:1642] pid2938691: <- SIGBUS
    E1110 15:22:00.463394 2938668 wcfs.go:1603] pid2938691: terminated
    E1110 15:22:00.463420 2938668 wcfs.go:2085] wlink 1: serve rx: unexpected EOF

    >>> Change history by file:

    f<0000000000000010>:
                                    0 1 2 3 4 5 6 7 a b c d e f g h
        @at0 (0404bfc5fca35122)
        @at1 (0404bfc5fd8b75ee)         2
        @at2 (0404bfc5fdadfd00)         2

    INFO     wcfs:__init__.py:418 unmount/stop wcfs pid2938668 @ /dev/shm/wcfs/5d8d6942d7f39fa05fe1024e4c8a8c21a44e1254
    I1110 15:22:09.606301 2938668 wcfs.go:2942] stop "/dev/shm/wcfs/5d8d6942d7f39fa05fe1024e4c8a8c21a44e1254" "file:///tmp/testdb_fs.01bOZy/1.fs"

    FAILED

    ============================================ FAILURES =============================================
    ___________________ test_wcfs_pinhfaulty_kill_on_watch[_bad_watch_stop_on_pin] ____________________

    faulty = <function _bad_watch_stop_on_pin at 0x7f45ee31aad0>, with_prompt_pintimeout = None

        @mark.parametrize('faulty', [
            _bad_watch_no_pin_read,
            _bad_watch_no_pin_reply,
            _bad_watch_stop_on_pin,
            _bad_watch_eof_pin_reply,
            _bad_watch_nak_pin_reply,
        ])
        @func
        def test_wcfs_pinhfaulty_kill_on_watch(faulty, with_prompt_pintimeout):
            t = tDB(multiproc=True); zf = t.zfile
            defer(t.close)

            at1 = t.commit(zf, {2:'c1'})
            at2 = t.commit(zf, {2:'c2'})

            f = t.open(zf)
            f.assertData(['','','c2'])

            # launch faulty process that should be killed by wcfs on problematic pin during watch setup
            p = tFaultySubProcess(t, faulty, at=at1)
            defer(p.close)
            t.assertStats({'pinkill': 0})

            # wait till faulty client issues its watch, receives pin and pauses/misbehaves
            p.send("start watch")
            if faulty != _bad_watch_no_pin_read:
                assert p.recv(t.ctx) == b"pin %s #%d @%s" % (h(zf._p_oid), 2, h(at1))

            # issue our watch request - it should be served well and without any delay
            wl = t.openwatch()
            wl.watch(zf, at1, {2:at1})

            # the faulty client must become killed by wcfs
            # but client that stops itself, must not be killed
            must_kill = (faulty != _bad_watch_stop_on_pin)
            if not must_kill:
                # give time to wcfs to detect wlink close and potentially initiate pinkill
                # do not wait on the process yet, so it remains in the OS process table in Z state
                xsleep(t.ctx, 2*t.pintimeout)
            p.join(t.ctx)
            assert p.exitcode is not None
    >       t.assertStats({'pinkill': int(must_kill)})

    wcfs_faultyprot_test.py:213:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    t = <wendelin.wcfs.wcfs_test.tDB object at 0x7f45edf9dd70>, kvok = {'pinkill': 0}

        def assertStats(t, kvok):
            # kstats loads stats subset with kvok keys.
            def kstats():
                stats = t._loadStats()
                kstats = {}
                for k in kvok.keys():
                    kstats[k] = stats.get(k, None)
                return kstats

            # wait till stats reaches expected state
            ctx = timeout()
            while 1:
                kv = kstats()
                if kv == kvok:
                    break
                if ctx.err() is not None:
    >               assert kv == kvok, "stats did not reach expected state"
    E               AssertionError: stats did not reach expected state
    E               assert {'pinkill': 1} == {'pinkill': 0}
    E               Differing items:
    E               {'pinkill': 1} != {'pinkill': 0}
    E               Full diff:
    E               - {'pinkill': 1}
    E               ?             ^
    E               + {'pinkill': 0}
    E               ?             ^

    wcfs_test.py:478: AssertionError

Original patch: levin.zimmermann/wendelin.core@8d3fa76d

/reviewed-by @kirr
/reviewed-on nexedi/wendelin.core!33
-
Kirill Smelkov authored
Each wcfs test is run under a timeout to detect e.g. that something is stuck, and to forcibly unmount the filesystem in such a case. That overall timeout is much less than the regular 30s pinkill timeout and much higher than the 3s pinkill timeout used in the faulty-protection tests.

The faulty-protection tests usually need to wait up to 2·pinkill for a kill to happen in order to reliably detect that event in the presence of surrounding OS load. That was working quite ok so far because the actual kill was usually happening around 1·pinkill after the start of the waiting. However in the next patch we will need to test that wcfs does _not_ kill an innocent client, similarly waiting for that 2·pinkill time - but here it will be a full 2·pinkill sleep without trimming, because for the negative condition (does _not_ kill) we need to predictably wait much longer than the pinkill time. With that, 6s just for the sleep plus test setup and other overhead started to trigger the overall timeout quite frequently.

-> Increase that overall timeout from 10s to 15s to cover the "need to wait longer" situation, while still maintaining the invariant that the timeout stays much less than the regular pinkill time and much more than the faultyprot tests' pinkill time.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!33
-
Kirill Smelkov authored
Both WatchLink-level and raw-level faulty-protection tests receive pin messages and send their content to the supervisor process. Tests that work at WatchLink level receive pin messages via WatchLink.recvReq that look like:

    pin <bigfileX> #<blk> @<rev>

however raw reading from an opened head/watch handle returns the same string prefixed with a message ID and suffixed with \n:

    <message-id> pin <bigfileX> #<blk> @<rev> \n

The supervisor process does not care about transport-level details and wants to observe only the semantics of which pin message was received. The high-level tests already match that by sending exactly what WatchLink.recvReq gave them, while the raw-level test needs to trim the received line to match that.

-> Add a corresponding comment to make that clear.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!33
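Given the two message formats shown above, the trimming can be sketched as follows; the helper name is hypothetical, not the code the comment was added to:

    def trim_raw_pin_line(line):
        # b"<msgid> pin <file> #<blk> @<rev> \n"  ->  b"pin <file> #<blk> @<rev>"
        # strip surrounding whitespace, then drop the leading message-ID field
        return line.strip().split(None, 1)[1]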
-
Kirill Smelkov authored
In wcfs_faultyprot_test.py we have tests that exercise wcfs behaviour against faulty clients that do not handle pin notifications well. There are several scenarios tested. The scenarios that test at WatchLink level already use common functions to set up the watchlink and perform shared actions. However the tests that exercise behaviour with a watchlink opened at raw level - with a regular open syscall instead of the WatchLink class - were duplicating code for their setup. Those tests were added in c91fb14e (wcfs: tests: Extend faulty protection tests with more kinds of faulty clients).

-> Refactor the code to remove the duplication, because soon we will need to add more tests that exercise behaviour with raw-level IO on the watchlink, so we first need to bring order and structure as a preparatory step.

Plain code refactoring without semantic change.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!33
-
Kirill Smelkov authored
/reviewed-by @levin.zimmermann
/reviewed-on !33
-
- 11 Nov, 2025 1 commit
-
-
This updates the version of go123. The new version supports Go 1.25, which is required because we want to compile wendelin.core with Go ≥ 1.24 to fix a memory leak in NEO/go [1].

[1] See kirr/neo!11 for more context.

/reviewed-by @kirr
/reviewed-on !45
-
- 26 Sep, 2025 1 commit
-
-
Kirill Smelkov authored
Kazuhiko reports that on one of his servers `from wendelin.bigarray.array_zodb import ZBigArray` fails:

    >>> from wendelin.bigarray.array_zodb import ZBigArray
    Traceback (most recent call last):
      File "<console>", line 1, in <module>
      File "/(SR)/parts/wendelin.core/bigarray/array_zodb.py", line 32, in <module>
        from wendelin.bigfile.file_zodb import ZBigFile
      File "/(SR)/parts/wendelin.core/bigfile/file_zodb.py", line 166, in <module>
        from wendelin.bigfile._file_zodb import _ZBigFile
      File "bigfile/_file_zodb.pyx", line 1, in init wendelin.bigfile._file_zodb  # -*- coding: utf-8 -*-
      File "/(SR)/parts/wendelin.core/wcfs/__init__.py", line 87, in <module>
        from wendelin.wcfs.internal import glog
      File "/(SR)/parts/wendelin.core/wcfs/internal/glog.py", line 31, in <module>
        from wendelin.wcfs.internal import os as xos
      File "/(SR)/parts/wendelin.core/wcfs/internal/os.py", line 32, in <module>
        from wendelin.wcfs.internal._os import gettid
    ImportError: /(SR)/parts/wendelin.core/wcfs/internal/_os.so: undefined symbol: gettid

That happens because gettid is available only starting from glibc 2.30 (https://www.man7.org/linux/man-pages/man2/gettid.2.html) and the system there is likely older. We already use gettid via syscall(SYS_gettid) in wcfs/client/wcfs_misc.cpp:

    https://lab.nexedi.com/nexedi/wendelin.core/-/blob/e8a00ac0/wcfs/client/wcfs_misc.cpp#L254

-> Do the same thing in wcfs/internal/_os.pyx to fix it.

/reported-by @kazuhiko
/reported-on nexedi/erp5@c8998ed0 (comment 244984)
/helped-and-reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!44
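For illustration, the syscall(SYS_gettid) approach can be sketched in pure Python via ctypes (the real fix is Cython code in wcfs/internal/_os.pyx; the syscall number below assumes x86-64):

    import ctypes

    libc = ctypes.CDLL(None, use_errno=True)  # the running process / libc
    SYS_gettid = 186                          # x86-64 value; other architectures differ

    def gettid():
        # works on any glibc, unlike the gettid() wrapper added only in glibc 2.30
        return libc.syscall(SYS_gettid)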
-
- 11 Jul, 2025 13 commits
-
-
Kirill Smelkov authored
Fix build and runtime issues discovered on py3.11. See individual patches for details.

/cc @tomo, @levin.zimmermann
/reviewed-by @jerome
/reviewed-on !37
-
Kirill Smelkov authored
All tests are now passing with that python version as well.

/reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!37
-
--------

kirr:

In py3.11 inspect.getargspec was removed:

    bigarray/array_zodb.py:40: in <module>
        _ = inspect.getargspec(BigArray.__init__)
    E   AttributeError: module 'inspect' has no attribute 'getargspec'

and https://docs.python.org/3/whatsnew/3.11.html says to replace it with inspect.getfullargspec. In fact inspect.getargspec has been deprecated since py3, so we can make the change not only for py3.11+ but for any py3 version. For our use-case getfullargspec seems to be a drop-in replacement for getargspec, so we should be ok doing that.

/reviewed-by @kirr
/reviewed-on !37
-
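A sketch of the compatibility pattern described above, assuming py2 still needs to be supported elsewhere in the codebase:

    import inspect
    import sys

    if sys.version_info[0] >= 3:
        # getargspec was deprecated on py3 and removed in py3.11;
        # for this use-case getfullargspec is a drop-in replacement
        getargspec = inspect.getfullargspec
    else:
        getargspec = inspect.getargspec

    # usage: introspect a constructor's arguments, e.g.
    #   argspec = getargspec(BigArray.__init__)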
Kirill Smelkov authored
https://docs.python.org/3/whatsnew/3.11.html says to use PyFrame_GetBack() instead of direct ->f_back access; in py3.11 that function indeed starts to create frames on the fly when ->f_back is NULL:

    https://github.com/python/cpython/commit/ae0a2b756255
    https://github.com/python/cpython/commit/68f5fa668343

-> Change all usage of ->f_back to that function so as not to skip yet-unmaterialized frames while doing our checks.

/reviewed-by @jerome
/reviewed-on !37
-
Kirill Smelkov authored
bigfile/py: PyFrameObject->{f_code,f_locals,f_localsplus} can no longer be directly accessed on py3.11

    bigfile/_bigfile.c: In function ‘pybigfile_loadblk’:
    bigfile/_bigfile.c:736:35: error: invalid use of incomplete typedef ‘PyFrameObject’ {aka ‘struct _frame’}
      736 |     fastlocals = f->f_localsplus;
          |                                   ^~
    bigfile/_bigfile.c:737:31: error: invalid use of incomplete typedef ‘PyFrameObject’ {aka ‘struct _frame’}
      737 |     for (j = f->f_code->co_nlocals; j >= 0; --j) {
          |                               ^~
    bigfile/_bigfile.c:746:26: error: invalid use of incomplete typedef ‘PyFrameObject’ {aka ‘struct _frame’}
      746 |     if (f->f_locals != NULL) {
          |                          ^~

https://docs.python.org/3/whatsnew/3.11.html says to use PyFrame_GetCode() instead of ->f_code, PyFrame_GetLocals() instead of ->f_locals, and that there is no public API for ->f_localsplus.

For ->f_code we do as suggested.

For ->f_localsplus we do direct access via frame->f_frame->localsplus. This is internal API available only under Py_BUILD_CORE, but Cython and other projects use it everywhere:

    https://github.com/cython/cython/blob/066adcb3/Cython/Utility/Exceptions.c#L849-L857
    https://sources.debian.org/src/numba/0.61.2+dfsg-1/numba/_dispatcher.cpp/?hl=35#L35
    https://sources.debian.org/src/systemtap/5.1-4.1/python/HelperSDT/_HelperSDT.c/?hl=154#L154
    ...

so we are ok to use these private bits as well.

For ->f_locals we do not use PyFrame_GetLocals(), because we want to check whether ->f_locals is NULL or not, while PyFrame_GetLocals creates a new dict and fills it when ->f_locals was NULL. This way, similarly to f_localsplus, we use direct PyFrameObject->_PyInterpreterFrame->f_locals to do the access.

/reviewed-by @jerome
/reviewed-on !37
-
Kirill Smelkov authored
bigfile/py: Starting from py3.11 exception state is kept only in exc_value, no longer using exc_type and exc_traceback

    bigfile/_bigfile.c: In function ‘pybigfile_loadblk’:
    bigfile/_bigfile.c:616:56: error: ‘_PyErr_StackItem’ {aka ‘struct _err_stackitem’} has no member named ‘exc_type’
      616 |     XINC( save_exc_type = set0(&ts->exc_state.exc_type) );
          |                                                        ^

exc_type and exc_traceback can both be derived from exc_value. See https://github.com/python/cpython/commit/396b58345f81 and https://github.com/python/cpython/issues/89874 for details.

-> Rework the code to only use exc_value on py ≥ 3.11.

/reviewed-by @jerome
/reviewed-on !37
-
Kirill Smelkov authored
    bigfile/_bigfile.c: In function ‘pybigfile_loadblk’:
    bigfile/_bigfile.c:594:25: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘frame’; did you mean ‘cframe’?
      594 |     ts_frame_orig = ts->frame;  // just for checking
          |                         ^~~~~
          |                         cframe

https://docs.python.org/3/whatsnew/3.11.html says to replace this access with PyThreadState_GetFrame().

/reviewed-by @jerome
/reviewed-on !37
-
Kirill Smelkov authored
In the next patches we will need to add compatibility code for older py3 releases - such compatibility code is no longer needed only on py2.

/reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!37
-
Kirill Smelkov authored
Those python3 versions are long EOL now (https://devguide.python.org/versions/).

/reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!37
-
Kirill Smelkov authored
After e7d161d1 (Restore Python3 support) Wendelin.core works ok on those python versions.

/reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!37
-
Kirill Smelkov authored
This way it is easier to correlate definition lines in the matrix. Whitespace changes only.

/reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!37
-
Kirill Smelkov authored
I accidentally broke the tox setup in c5e18c74 (bigfile/zodb: Teach ZBigFile backend to use WCFS) by adding }} at the tail instead of properly closing the numpy section. As a result even `tox -l` was breaking with:

    py37-ZODB5-auto-zeo-numpy116-{!wcfs} failed with ConfigError: substitution key '!wcfs' not found at
    Traceback (most recent call last):
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1299, in run
        results[name] = cur_self.make_envconfig(name, section, subs, config)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1479, in make_envconfig
        res = meth(env_attr.name, env_attr.default, replace=replace)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1701, in getpath
        path = self.getstring(name, defaultpath, replace=replace)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1818, in getstring
        x = self._replace_if_needed(x, name, replace, crossonly)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1834, in _replace_if_needed
        x = self._replace(x, name=name, crossonly=crossonly)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1861, in _replace
        replaced = Replacer(self, crossonly=crossonly).do_replace(value)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1901, in do_replace
        expanded = substitute_once(value)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1895, in substitute_once
        return self.RE_ITEM_REF.sub(self._replace_match, x)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1968, in _replace_match
        return self._replace_substitution(sub_value)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 2003, in _replace_substitution
        val = self._substitute_from_other_section(sub_key)
      File "/home/kirr/src/wendelin/venv/z-dev/lib/python2.7/site-packages/tox/config/__init__.py", line 1998, in _substitute_from_other_section
        raise tox.exception.ConfigError("substitution key {!r} not found".format(key))
    ConfigError: ConfigError: substitution key '!wcfs' not found

-> Fix that.

/reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!37
-
Kirill Smelkov authored
Else we get lots of

    DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead

during test runs on py3.

/reviewed-by @jerome
/reviewed-on nexedi/wendelin.core!37
-
- 09 Jun, 2025 2 commits
-
-
Kirill Smelkov authored
While reviewing the previous patch I noticed that Proc.fd misbehaves when there is a permission error:

    ---- 8< ---- (x.py)
    from wendelin.wcfs.internal import os as xos
    pdbc = xos.ProcDB.open()
    ---- 8< ----

    (neo) (py311.venv) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/nexedi/wendelin.core$ python x.py
    Traceback (most recent call last):
      File "/home/kirr/src/wendelin/wendelin.core/x.py", line 3, in <module>
        pdbc = xos.ProcDB.open()
               ^^^^^^^^^^^^^^^^^
      File "/home/kirr/src/wendelin/venv/py311.venv/lib/python3.11/site-packages/decorator.py", line 235, in fun
        return caller(func, *(extras + args), **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/kirr/src/tools/go/pygolang-master/golang/__init__.py", line 166, in _goframe
        return f(*argv, **kw)
               ^^^^^^^^^^^^^^
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/os.py", line 330, in open
        proc.get(name)
      File "/home/kirr/src/wendelin/venv/py311.venv/lib/python3.11/site-packages/decorator.py", line 235, in fun
        return caller(func, *(extras + args), **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/kirr/src/tools/go/pygolang-master/golang/__init__.py", line 166, in _goframe
        return f(*argv, **kw)
               ^^^^^^^^^^^^^^
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/os.py", line 590, in get
        v = eraise(v)
            ^^^^^^^^^
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/os.py", line 544, in eraise
        raise e
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/os.py", line 563, in get
        v = f(proc)
            ^^^^^^^
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/os.py", line 704, in fd
        ifd.pos = int(e.pop("pos"))
                      ^
    UnboundLocalError: cannot access local variable 'e' where it is not associated with a value

The problem happens because Proc.fd, like many other methods, catches OSError+IOError to see if it was a transient ENOENT, but forgets to reraise the exception if it was not.

-> Fix that.

I also checked all other places that do such OSError+IOError filtering and Proc.fd was the only one that missed to reraise.

Thorough tests for ProcDB and MountDB are still TODO.

Fixes 7932bac5 (wcfs: os: Add ProcDB & co)

/reviewed-by @levin.zimmermann
/reviewed-on !40 (comment 237564)
-
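The bug class is the classic catch-and-forget; a minimal sketch of the corrected filtering pattern (hypothetical helper name, not the actual Proc.fd code):

    from errno import ENOENT

    def read_proc_entry(path):
        # hypothetical helper illustrating the fixed OSError/IOError filtering
        try:
            with open(path) as f:
                return f.read()
        except (OSError, IOError) as e:
            if e.errno == ENOENT:
                return None   # transient: the process just disappeared
            raise             # the missing reraise: EACCES etc. must propagate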
As described by Kirill, Linux kernels older than 5.14 do not yet provide the 'ino' entry:

    "However after rechecking it looks like the ino entry was added only "recently" in 2021 in Linux 5.14 (https://git.kernel.org/linus/3845f256a8b5)." [1]

Therefore we need to make the fetching of the 'ino' entry optional to support older kernels.

[1] nexedi/slapos!1815 (comment 236971)

--------

kirr:

I checked the whole codebase and we do not use fd.ino anywhere yet, so it is ok to change the interface at this time.

/reviewed-by @kirr
/reviewed-on nexedi/wendelin.core!40
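A sketch of the optional handling when parsing /proc/<pid>/fdinfo/<fd> (field names follow the kernel's fdinfo format; the surrounding parsing code is assumed, not the actual interface):

    def parse_fdinfo(text):
        # /proc/<pid>/fdinfo/<fd> is "key:\tvalue" lines, e.g. "pos:\t0"
        kv = {}
        for line in text.splitlines():
            k, _, v = line.partition(':')
            kv[k] = v.strip()
        pos = int(kv.pop('pos'))        # always present
        ino = kv.pop('ino', None)       # present only on Linux >= 5.14
        return pos, (int(ino) if ino is not None else None)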
-
- 03 Jun, 2025 2 commits
-
-
Kirill Smelkov authored
Thomas reports that wcfs crashes in glog.basicConfig on py2, and indeed, checking things, it looks like this:

    wendelin.core/D$ python --version
    Python 2.7.18

    wendelin.core/D$ wcfs status file://`pwd`/1.fs
    Traceback (most recent call last):
      File "/home/kirr/src/wendelin/venv/z-dev/bin/wcfs", line 11, in <module>
        load_entry_point('wendelin.core', 'console_scripts', 'wcfs')()
      File "<decorator-gen-42>", line 2, in main
      File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 166, in _goframe
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 1108, in main
        glog.basicConfig(stream=sys.stderr, level=logging.INFO)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/glog.py", line 36, in basicConfig
        logging.setLogRecordFactory(LogRecord)
    AttributeError: 'module' object has no attribute 'setLogRecordFactory'

This happens because while py3 logging has setLogRecordFactory, py2 logging does not.

-> Fix that by changing logging.LogRecord on py2 directly.

However a fresh review of the glog.py module reveals more problems:

On py2 there is no Formatter.formatMessage and so any logging attempt crashes with

    Traceback (most recent call last):
      File "/usr/lib/python2.7/logging/__init__.py", line 868, in emit
        msg = self.format(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 741, in format
        return fmt.format(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 469, in format
        s = self._fmt % record.__dict__
    KeyError: 'levelchar'

-> Fix that by moving .levelchar initialization to the LogRecord constructor.

Another problem is that glog.basicConfig was ignoring its level argument. This way, even when user code from wcfs was invoking it with level=logging.INFO (see f9a40d36 "wcfs: py: Switch loglevel from WARNING -> INFO for wcfs.py commands"), no log messages at info level were logged.

-> Fix that by setting the root logger's level as instructed.

I'm not sure how I missed all those problems when preparing the original patch. After this patch wcfs.py logging is hopefully back to working properly on both py2 and py3.

/fixes e51bef0d (wcfs: py: Log with date and time present)
/reported-by @tomo
/reported-on nexedi/slapos!1815 (comment 236620)
/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!38
-
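A sketch of the py2/py3 record-factory compatibility described above; LogRecord stands for glog's custom record class and the helper name is hypothetical (the real code is in wcfs/internal/glog.py):

    import logging
    import sys

    def install_record_factory(LogRecord, level):
        if sys.version_info[0] >= 3:
            logging.setLogRecordFactory(LogRecord)   # py3-only API
        else:
            logging.LogRecord = LogRecord            # py2: patch the class directly
        logging.getLogger().setLevel(level)          # honour the level argument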
nexedi/wendelin.core!38 attempts to fix bugs introduced with nexedi/wendelin.core@e51bef0d. We didn't see these bugs in test results because the WCFS CLI codepath is not covered by our tests. This patch adds coverage of this codepath to increase the likelihood that such issues are quickly detected.

--------

kirr:

Rework original Levin's patch not to pollute global state of the test process with e.g. glog logging setup. Original patch is here: 3cb3872f

Added test currently fails with

    wcfs/wcfs_test.py::test_wcfs_main Exception in subprocess wcfs.wcfs_test._test_wcfs_main (pid936172):
    Traceback (most recent call last):
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/multiprocessing.py", line 84, in _start
        r = f(*argv, **kw)
      File "wcfs/wcfs_test.py", line 2047, in _test_wcfs_main
        wcfs.main()
      File "<decorator-gen-42>", line 2, in main
      File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 166, in _goframe
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 1108, in main
        glog.basicConfig(stream=sys.stderr, level=logging.INFO)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/internal/glog.py", line 36, in basicConfig
        logging.setLogRecordFactory(LogRecord)
    AttributeError: 'module' object has no attribute 'setLogRecordFactory'

    ...
        zurl = "file://abc"
        _("serve",  ("-arg0", zurl),
    >               (zurl, ("-arg0",)))
    ...
        if end.exc is not None:
    >       raise end.exc
    E       AttributeError: 'module' object has no attribute 'setLogRecordFactory'

it will be fixed by the next patch.

/reviewed-by @kirr, @levin.zimmermann
/reviewed-on nexedi/wendelin.core!38, nexedi/wendelin.core!39
-
- 21 May, 2025 8 commits
-
-
Kirill Smelkov authored
    bigfile/_bigfile.c: In function ‘pyfileh_dealloc’:
    bigfile/_bigfile.c:496:18: warning: unused variable ‘pyfile’ [-Wunused-variable]
      496 |     PyBigFile   *pyfile;
          |                  ^~~~~~
    bigfile/_bigfile.c:495:18: warning: unused variable ‘file’ [-Wunused-variable]
      495 |     BigFile     *file = fileh->file;
          |                  ^~~~

These were there from the beginning, started in 35eb95c2 (bigfile: Python wrapper around virtual memory subsystem).
-
Kirill Smelkov authored
Hello @levin.zimmermann. These are the patches from our late-2024 trial to deploy WCFS which I think are already an improvement and ok to go. Please see the patches for details.

Kirill

/reviewed-by @levin.zimmermann
/reviewed-on !36
-
Kirill Smelkov authored
Since wcfs beginning - since e3f2ee2d (wcfs: Initial implementation of basic filesystem) - `wcfs stop` was implemented as just `fusermount -u`. That, however, turned out not to be robust: if wcfs is deadlocked, unmounting hangs, and if the wcfs server has crashed but there are still running client processes, the unmount fails with a "Device or resource busy" error. For the deadlocked case we often see a situation where both wcfs and client zope processes are hung, kill -9 does not work on them (they remain hung), and there is no easy way to do the unmount and restart wcfs.

-> Fix `wcfs stop` to handle that by first breaking the deadlock via /sys/fs/fuse/connections/<X>/abort and making sure that: 1) wcfs.go is not running, 2) all left clients are terminated, and 3) the mount is also gone.

In many ways this coincides with what Server.stop was already doing, so here we teach `wcfs stop` to work via that Server.stop codepath and adjust the latter to work ok if Server._proc is not only a subprocess.Popen that the current process spawned, but also an xos.Proc that `wcfs stop` discovered - which can also be None if wcfs.go crashed by itself.

As explained in the comments, I took the decision to kill client processes instead of doing the final unmount lazily because

    # NOTE if we do `fusermount -uz` (lazy unmount = MNT_DETACH), we will
    # remove the mount from filesystem tree and /proc/mounts, but the
    # clients will be left alive and using the old filesystem which is
    # left in a bad ENOTCONN state. From this point of view restart of
    # the clients is more preferred compared to leaving them running
    # but actually disconnected from the data.
    #
    # TODO try to teach wcfs clients to detect filesystem in ENOTCONN state
    # and reconnect automatically instead of being killed. Then we could
    # use MNT_DETACH.

TODO tests.

Levin also notes at nexedi/wendelin.core!36 (comment 233312)

    It would probably indeed be nicer, if `wcfs stop` wouldn't need to kill
    clients. But since wcfs.go already needs to send signals to clients (and
    we already need to set capacities), I too don't think it's urgent to
    teach WCFS clients to detect filesystem in ENOTCONN state.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!36
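For context, breaking a FUSE deadlock through sysfs means writing to the connection's abort file. A minimal sketch, assuming the sysfs directory is named after the device minor of the mount as seen in a "0:39"-style device ID (not the actual Server.stop code):

    def abort_fuse_connection(dev):
        # dev is the mount's device ID as "major:minor", e.g. "0:39";
        # assumption: the FUSE control dir is named by the minor number
        minor = dev.split(":")[1]
        with open("/sys/fs/fuse/connections/%s/abort" % minor, "w") as f:
            f.write("1")  # any write aborts the connection, failing all pending requests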
-
Kirill Smelkov authored
Use Server._stuckdump, which we just added for `wcfs status` in the previous patch, to dump that useful information about where wcfs is stuck and which processes have a relation to it.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!36
-
Kirill Smelkov authored
Since wcfs beginning - since e3f2ee2d (wcfs: Initial implementation of basic filesystem) - `wcfs status` was implemented as just join, reporting ok if that worked. That, however, turned out not to be robust: if wcfs is deadlocked, accessing any file on the filesystem - even a simple file such as .wcfs/zurl - might hang, and so the status could hang as well. We see lots of such hung `wcfs status` processes on the current deployment. More, it might be the case that wcfs is deadlocked in another way - e.g. on zheadMu - and then accessing .wcfs/zurl will work ok, but the system is not in a good shape while `wcfs status` misses to report that.

-> Rework `wcfs status` completely: try accessing different files on the filesystem, and do so in a cautious way, so that if wcfs is in a problematic state `wcfs status` won't hang and will report the details about the wcfs server and also about filesystem clients: which files are kept open, and the in-kernel tracebacks of the server and the clients in case wcfs is hung. Please see comments in the added status function for details.

An example of "good" status output when everything is ok:

    (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/nexedi/wendelin.core$ wcfs status file://D/1.fs
    INFO 1204 17:04:40.154 325432 __init__.py:506] wcfs: status file://D/1.fs ...
    ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8 (0:39)
    ok - wcfs server: pid325414 kirr wcfs
    ok - stat mountpoint
    ok - read .wcfs/zurl
    ok - read .wcfs/stats

And an example of "bad" status output when wcfs was simulated to be seen in deadlocked state by trying to read from .wcfs/debug/zhead instead of .wcfs/stats:

    root@deca:/home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core# wcfs status file://D/1.fs
    INFO 1204 17:21:04.145 325658 __init__.py:506] wcfs: status file://D/1.fs ...
    ok - mount entry: /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8 (0:39)
    ok - wcfs server: pid325414 kirr wcfs
    ok - stat mountpoint
    ok - read .wcfs/zurl
    fail - read .wcfs/stats: timed out (wcfs might be stuck)

    wcfs ktraceback:
    pid325414 kirr wcfs
        tid325414 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325418 kirr wcfs
        [<0>] hrtimer_nanosleep+0xc7/0x1b0
        [<0>] __x64_sys_nanosleep+0xbe/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325419 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325420 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325421 kirr wcfs
        [<0>] do_epoll_wait+0x698/0x7d0
        [<0>] do_compat_epoll_pwait.part.0+0xb/0x70
        [<0>] __x64_sys_epoll_pwait+0x91/0x140
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325422 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325423 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325426 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325427 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325428 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325429 kirr wcfs
        [<0>] futex_wait_queue+0x60/0x90
        [<0>] futex_wait+0x185/0x270
        [<0>] do_futex+0x106/0x1b0
        [<0>] __x64_sys_futex+0x8e/0x1d0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

        tid325646 kirr wcfs
        [<0>] fuse_dev_do_read+0xa29/0xa50 [fuse]
        [<0>] fuse_dev_read+0x79/0xb0 [fuse]
        [<0>] vfs_read+0x239/0x310
        [<0>] ksys_read+0x6b/0xf0
        [<0>] do_syscall_64+0x58/0xc0
        [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    wcfs clients:
    pid325430 kirr bash ('bash',)
        cwd -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8

        pid325430 kirr bash
            tid325430 kirr bash
            [<0>] do_select+0x661/0x830
            [<0>] core_sys_select+0x1ba/0x3a0
            [<0>] do_pselect.constprop.0+0xe9/0x180
            [<0>] __x64_sys_pselect6+0x53/0x80
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    pid325637 kirr ipython3 ('/usr/bin/python3', '/usr/bin/ipython3')
        fd/12 -> /dev/shm/wcfs/fccdb94842958d09c69261970b8037b0e5510fb8/.wcfs/zurl

        pid325637 kirr ipython3
            tid325637 kirr ipython3
            [<0>] do_epoll_wait+0x698/0x7d0
            [<0>] __x64_sys_epoll_wait+0x6f/0x110
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

            tid325638 kirr ipython3
            [<0>] futex_wait_queue+0x60/0x90
            [<0>] futex_wait+0x185/0x270
            [<0>] do_futex+0x106/0x1b0
            [<0>] __x64_sys_futex+0x8e/0x1d0
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

            tid325640 kirr ipython3
            [<0>] futex_wait_queue+0x60/0x90
            [<0>] futex_wait+0x185/0x270
            [<0>] do_futex+0x106/0x1b0
            [<0>] __x64_sys_futex+0x8e/0x1d0
            [<0>] do_syscall_64+0x58/0xc0
            [<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

    Traceback (most recent call last):
      File "/home/kirr/src/wendelin/venv/z-dev/bin/wcfs", line 11, in <module>
        load_entry_point('wendelin.core', 'console_scripts', 'wcfs')()
      File "<decorator-gen-42>", line 2, in main
      File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 997, in main
        status(zurl)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 592, in status
        verify("read .wcfs/stats", xos.readfile, "%s/.wcfs/debug/zhead" % mnt.point)
      File "<decorator-gen-43>", line 2, in verify
      File "/home/kirr/src/tools/go/pygolang/golang/__init__.py", line 165, in _goframe
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 570, in verify
        fail("%s: timed out (wcfs might be stuck)" % subj)
      File "/home/kirr/src/wendelin/wendelin.core/wcfs/__init__.py", line 512, in fail
        raise RuntimeError('(failed)')
    RuntimeError: (failed)

TODO tests.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!36
-
Kirill Smelkov authored
Previously, on unmount error we were invoking lsof(8) to show the list of files that are still opened on the filesystem. But lsof(8) turned out to be unreliable, because it stats the filesystem, and if e.g. the wcfs server process is stopped lsof only prints

    WARNING:wcfs:# lsof /dev/shm/wcfs/1439df02dfcc41ab9dfb68e7ac4ad615f3b7d46e
    WARNING:wcfs:lsof: status error on /dev/shm/wcfs/1439df02dfcc41ab9dfb68e7ac4ad615f3b7d46e: Transport endpoint is not connected
    ...
    WARNING:wcfs:(lsof failed)

fuser(1) from psmisc works a bit better: it can show the list of still-opened files on the mounted tree even if the filesystem server has crashed. However with some version of fuser I still saw "Transport endpoint is not connected" once, and in the next patches we will also need to inspect the "using" processes more, so if we were to use fuser we would need to parse its output, which might get fragile.

-> Do our own lsof utility instead. We have all the infrastructure in place to do so in the form of MountDB and ProcDB, and as implemented Mount.lsof() emits Proc'esses which can be conveniently inspected further. For now we do not do such inspection, but for `wcfs status` and `wcfs stop` we will want to poke at kernel tracebacks of those processes.

/reviewed-by @levin.zimmermann
/reviewed-on !36
-
Kirill Smelkov authored
Add ProcDB, which represents the database of processes, with code to query it in several ways. We will need this functionality for `wcfs status`, `wcfs stop` and probably for more.

TODO tests for ProcDB & co.

/reviewed-by @levin.zimmermann
/reviewed-on !36
-
Kirill Smelkov authored
Because mount entries provide more information than just a single mountpoint string. For example, later for `wcfs status` and `wcfs stop` we will need the ID of the "device" that is attached to the mount, and also the type of the filesystem that is serving the mount.

-> Introduce internal.os.MountDB to retrieve information from the OS registry of mounted filesystems and use its entries instead of a plain mountpoint string. wcfs_test.py already had some rudimentary code to parse /proc/mounts, which we also replace with querying MountDB.

The API of MountDB might be viewed as a bit of an overkill, but it will align with the API of the upcoming ProcDB, for which it will be reasonable.

TODO tests for MountDB & co.

/reviewed-by @levin.zimmermann
/reviewed-on nexedi/wendelin.core!36
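For context, the OS registry in question is /proc/self/mountinfo; a minimal sketch of extracting the fields mentioned above (device ID, mountpoint, filesystem type) - an illustration only, not the MountDB API:

    def iter_mounts():
        # /proc/self/mountinfo fields:
        # <id> <parent> <maj:min> <root> <mountpoint> <opts> [optional...] - <fstype> <source> <super-opts>
        with open("/proc/self/mountinfo") as f:
            for line in f:
                fields = line.split()
                sep = fields.index("-")            # separates optional fields from fstype
                yield {"dev":    fields[2],         # "0:39"-style device ID
                       "point":  fields[4],
                       "fstype": fields[sep+1],
                       "source": fields[sep+2]}

    # e.g. list all FUSE mounts:
    #   fuse_mounts = [m for m in iter_mounts() if m["fstype"].startswith("fuse")]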
-