    wcfs/tests: Handle ECONNABORTED/ENOTCONN race in FUSE connection termination · ca119d6e
    Levin Zimmermann authored Nov 11, 2025 and Kirill Smelkov committed Jan 30, 2026
    On kernel 5.10.0-10-amd64, wcfs tests intermittently fail due to
    inconsistent FUSE error codes when the filesystem daemon is killed.
    Some read operations receive ECONNABORTED while others receive
    ENOTCONN, causing test assertions to fail.
    
    The inconsistency was observed on an older kernel (5.10.x)
    but not on a newer kernel (6.1.0-40-amd64+), suggesting the race
    has been fixed in later kernel versions.
    
    The traceback is:
    
        ______________________________ test_start_after_crash _______________________________
    
            @func
            def test_start_after_crash():
                zurl  = testzurl
                mntpt = testmntpt
    
        >       wc = start_and_crash_wcfs(zurl, mntpt)
    
        wcfs/wcfs_test.py:214:
        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
        zurl = 'file:///srv/slapgrid/slappart77/tmp/testdb_fs.wYO8Fy/1.fs', mntpt = '/dev/shm/wcfs/306b2220fc0d6299336a7d2279556c07d48a868a'
    
            def start_and_crash_wcfs(zurl, mntpt): # -> WCFS
                # /proc/mounts should not contain wcfs entry
                with raises(KeyError):
                    procmounts_lookup_wcfs(zurl)
    
                # start the server with attached client
                wcsrv = wcfs.start(zurl)
                assert wcsrv.mountpoint == mntpt
                assert mntpt not in wcfs._wcregistry
    
                wc = wcfs.join(zurl, autostart=False)
                assert wcfs._wcregistry[mntpt] is wc
                assert wc.mountpoint == mntpt
                assert xos.readfile(mntpt + "/.wcfs/zurl") == zurl
    
                # /proc/mounts should now contain wcfs entry
                assert procmounts_lookup_wcfs(zurl) == mntpt
    
                # kill the server
                os.kill(wcsrv._proc.pid, SIGKILL)
                assert procwait_(context.background(), wcsrv._proc)
    
                # access to filesystem should raise "Transport endpoint not connected"
                with raises(IOError) as exc:
                    xos.readfile(mntpt + "/.wcfs/zurl")
        >       assert exc.value.errno == ENOTCONN
        E       AssertionError: assert 103 == 107
        E        +  where 103 = IOError(103, 'Software caused connection abort').errno
        E        +    where IOError(103, 'Software caused connection abort') = <ExceptionInfo IOError tblen=2>.value
    
        wcfs/wcfs_test.py:285: AssertionError
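
    To read the numeric errnos in the assertion above, a small lookup can help.
    This is only an illustrative sketch: `describe_errno` is a hypothetical
    helper, not part of wendelin.core, and the numeric values of the errnos are
    platform-specific (103 and 107 are the Linux values seen in the traceback).

```python
import errno
import os

def describe_errno(code):
    """Return "NAME (message)" for a numeric errno (hypothetical helper)."""
    name = errno.errorcode.get(code, "E?")   # symbolic name, if known
    return "%s (%s)" % (name, os.strerror(code))

# Print the two errnos involved in the race, with their local numeric values.
for code in (errno.ECONNABORTED, errno.ENOTCONN):
    print(code, describe_errno(code))
```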
    
    --------
    kirr:
    
    - go through all places in wcfs and adjust them to also handle ECONNABORTED
      wherever ENOTCONN was handled before. This covers more places in tests
      and actual production code in wcfs/__init__.py
    - in tests, assert that we eventually still reach ENOTCONN, while
      ECONNABORTED is allowed as an interim result.
    - explain the problem in comments.
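
    The adjustments listed above might look roughly like the following sketch.
    The names `FUSE_DISCONNECTED`, `is_fuse_disconnected` and
    `assert_eventually_enotconn` are hypothetical and are not taken from
    wendelin.core; the actual change may be structured differently.

```python
import errno
import time

# Errnos a FUSE client may see once the server is gone: ECONNABORTED for
# requests that were already queued, ENOTCONN for requests issued afterwards.
FUSE_DISCONNECTED = (errno.ECONNABORTED, errno.ENOTCONN)

def is_fuse_disconnected(exc):
    """Report whether exc indicates that the FUSE connection is gone."""
    return isinstance(exc, OSError) and exc.errno in FUSE_DISCONNECTED

def assert_eventually_enotconn(access, timeout=10.0, poll=0.01):
    """Call access() repeatedly until it fails with ENOTCONN.

    ECONNABORTED is tolerated as an interim result; any other error,
    or a successful call, fails the assertion.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            access()
        except OSError as e:
            if e.errno == errno.ENOTCONN:
                return
            assert e.errno == errno.ECONNABORTED, e
        else:
            raise AssertionError("access unexpectedly succeeded")
        assert time.monotonic() < deadline, "ENOTCONN not reached in time"
        time.sleep(poll)
```

    In a test, `access` would be something like a read of a file under the
    mountpoint after the server has been killed.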
    
    The problem turned out not to be specific to Linux 5.10 and should be
    present on modern kernels as well: in do_exit, exit_files is called
    before exit_notify, which sets tsk->exit_state = EXIT_ZOMBIE:
    
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/exit.c?h=v6.18-rc5#n897
    
    but exit_files, even though it is called before exit_notify, releases open file descriptors via delayed work:
    
    5.10:
    
        exit_files       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v5.10#n451
        put_files_struct https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v5.10#n427
        close_files      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v5.10#n402
        filp_close       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/open.c?h=v5.10#n1266
        fput_many        https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file_table.c?h=v5.10#n335
    
    6.19-rc7:
    
        exit_files       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.19-rc7#n518
        put_files_struct https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.19-rc7#n506
        close_files      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.19-rc7#n474
        filp_close       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/open.c?h=v6.19-rc7#n1542
        fput_close       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file_table.c?h=v6.19-rc7#n582
        __fput_deferred  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file_table.c?h=v6.19-rc7#n518
    
    so the release of those fds can happen either before or after exit_notify, depending on scheduling.
    
    Now let's consider what happens when the wcfs server is killed: the kernel enters
    do_exit for wcfs and invokes exit_files() and then exit_notify(). Once exit_notify
    is done, the wcfs task state is set to Z (zombie) and a wait on the wcfs process becomes ready.
    However, if the delayed work has not yet run, the kernel still sees the fd for
    /dev/fuse, which wcfs was using for FUSE exchange with the kernel, as
    open. So when the test in question tries to access any
    wcfs/anyfile, fuse_lookup() is called; it enters
    fuse_lookup_name -> fuse_simple_request -> fuse_get_req, still sees
    fuse_conn->connected=1 because the last close of the /dev/fuse fd has not yet
    happened, and so goes on to create and queue a request to the wcfs server. But once
    the /dev/fuse fd is actually closed, fuse_dev_release ->
    fuse_dev_end_requests wakes up all queued requests and completes them with ECONNABORTED:
    
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n2712
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n2535
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n2407
    
    That's how ECONNABORTED popped up in the problematic test run, and it is not specific to the 5.10 kernel at all.
    
    NOTE should the /dev/fuse fd closure actually happen _before_ the test process wakes up and
    tries to access wcfs/anyfile, then fuse_lookup -> ... -> fuse_get_req would see
    fuse_conn->connected=0 and return ENOTCONN outright:
    
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v6.19-rc7#n217
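
    The two orderings described above can be illustrated with a toy model.
    This is a deliberately simplified sketch, not kernel code: `ToyFuseConn`
    and `lookup_outcome` are hypothetical names, and the model only mirrors
    the connected flag and the queue of pending requests.

```python
import errno

class ToyFuseConn:
    """Toy model of the FUSE connection state (not kernel code)."""

    def __init__(self):
        self.connected = True   # models fuse_conn->connected
        self.queued = []        # requests waiting for the server

    def send_request(self, name):
        # Models fuse_get_req: fail outright if the connection is already gone.
        if not self.connected:
            raise OSError(errno.ENOTCONN, "Transport endpoint is not connected")
        req = {"name": name, "error": None}
        self.queued.append(req)
        return req

    def release_dev(self):
        # Models the last close of /dev/fuse: mark the connection as gone and
        # complete everything that is still queued with ECONNABORTED.
        self.connected = False
        for req in self.queued:
            req["error"] = errno.ECONNABORTED
        self.queued = []

def lookup_outcome(release_before_lookup):
    """Return the errno a client lookup ends with for the given ordering."""
    conn = ToyFuseConn()
    if release_before_lookup:
        conn.release_dev()
    try:
        req = conn.send_request("lookup")
    except OSError as e:
        return e.errno
    conn.release_dev()   # the deferred close runs after the request was queued
    return req["error"]

print(errno.errorcode[lookup_outcome(False)])   # ECONNABORTED
print(errno.errorcode[lookup_outcome(True)])    # ENOTCONN
```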
    
    Original patch: levin.zimmermann/wendelin.core@a99b7979
    
    /reviewed-by @kirr
    /reviewed-on nexedi/wendelin.core!34