- 20 Jun, 2022 2 commits
-
-
Jérome Perrin authored
If program output after pytest's summary line, our summary function would be confused and detect 0 test run. Make this matching more robust by expecting the actual status of tests reported by pytest: (x)passed, (x)failed, skipped errors or "no test run" From Kirill's suggestion on nexedi/nxdtest!17 (comment 162070) /reviewed-by @kirr /reviewed-on nexedi/nxdtest!17
-
Jérome Perrin authored
summaryf needs to always return the count of failures, errors, skips for two reasons: - nxdtest sets error_count += 1 when the test program's returncode is not zero, so if we don't set error_count explicitly, the number of errors will be 1 when there are only failures - without error_count / failure_count, _test_result_summary does not display the global test result as pass or fail but as ? /reviewed-by @kirr /reviewed-on nexedi/nxdtest!17
-
- 17 Jun, 2022 1 commit
-
-
Jérome Perrin authored
We did not wait long enough for the tested process to set their signal handler, as we can see on test history [1], this test fails often when running on test node. It always pass when I run locally, but if I adjust the _tmasterpoll patch set to 0.01 second, it fails the same way. 1: https://erp5.nexedi.net/test_result_module/20220616-13A460D9B/2/TestResultLine_viewResultHistory /reviewed-on nexedi/nxdtest!17
-
- 27 Jan, 2022 4 commits
-
-
Kirill Smelkov authored
To detect leaked processes in the end of the run, we are first waiting for remaining test processes via procps. If the main waiting loop was canceled without main test process first completed (p.poll calls never returned !None), then the waitpid(pid=p.pid) system call will be done via procps. Which leaves further waitpid(pid=p.pid) system call invoked by subprocess to get -ECHILD error and artificially report 0 exit status: https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/subprocess.py#L1094-L1107 -> Fix it by propagating .returncode from procps to Popen instance so that it does not get lost. On sample .nxdtest with TestCase('sleep', ['sleep', '10']) Before the patch the output with CTRL+C was: $ nxdtest ... >>> sleep $ sleep 10 ^C# Interrupt ok sleep 0.604s # 1t 0e 0f 0s <-- NOTE # test run canceled # ran 1 test case: 1·ok <-- NOTE After the patch the output becomes: $ nxdtest ... >>> sleep $ sleep 100 ^C# Interrupt error sleep 1.006s # 1t 1e 0f 0s <-- NOTE # test run canceled # ran 1 test case: 1·error <-- NOTE /reviewed-by @jerome /reviewed-on nexedi/nxdtest!16
-
Kirill Smelkov authored
When test run is canceled via signal, we want to send SIGTERM to trun and let it kill/wait the tested processes. However until now we were sending SIGTERM to trun and immediately checking whether some spawned tested processes remain without giving any time for trun to complete first. Fix this by moving the code that detects/cleanups leaked processes to trun itself, and in the main nxdtest driver to only send SIGTERM to trun and let trun do the cleanup/leaked processes detection. This fixes erroneous "leaked process detected" as noticed by @jerome in the following example: (!16 (comment 150835)) $ nxdtest ... >>> sleep $ sleep 10 ^C# Interrupt # stopping due to cancel # leaked pid=188877 'sleep' ['sleep', '10'] <-- "leaked" is a bit misleading here error sleep 1.030s # 1t 1e 0f 0s # test run canceled # ran 1 test case: 1·error After the patch the output in this case becomes: $ nxdtest ... >>> sleep $ sleep 10 ^C# Interrupt ok sleep 0.604s # 1t 0e 0f 0s # test run canceled # ran 1 test case: 1·ok There is no more "leaked pid=...." emitted, but the change from "error" to "ok" status is wrong. We will fix it in the next patch. /reviewed-by @jerome /reviewed-on !16
-
Kirill Smelkov authored
First we send SIGTERM to leaked processes, and then, after timeout, intend to send SIGKILL to the leaked processes and to the main test process if it is still alive. However there is a thiko in the code: we were sending SIGKILL only to the main test process, not to all leaked ones. -> Fix it. The bug was there since 0ad45a9c (Detect if a test leaks processes and terminate them). /reviewed-by @jerome /reviewed-on !16
-
Kirill Smelkov authored
In addition to canceling test run is master tells us to do so, also cancel the run if interrupted or terminated. With the following sample .nxdtest TestCase('sleep', ['sleep', '10']) before the patch it does not react to CTRL+C: $ nxdtest ... >>> sleep $ sleep 10 ^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C# ran 0 test cases. <-- no reaction to CTRL+C, finishes after 10 seconds Traceback (most recent call last): File "/home/kirr/src/tools/go/py3.venv2/bin/nxdtest", line 33, in <module> sys.exit(load_entry_point('nxdtest', 'console_scripts', 'nxdtest')()) File "/home/kirr/src/tools/go/py3.venv2/lib/python3.9/site-packages/decorator.py", line 232, in fun return caller(func, *(extras + args), **kw) File "/home/kirr/src/tools/go/pygolang-master/golang/__init__.py", line 103, in _ return f(*argv, **kw) File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 339, in main wg.wait() KeyboardInterrupt after the patch: $ nxdtest ... >>> sleep $ sleep 10 ^C# Interrupt <-- prompt reaction to CTRL+C # stopping due to cancel # leaked pid=188877 'sleep' ['sleep', '10'] error sleep 1.030s # 1t 1e 0f 0s # test run canceled # ran 1 test case: 1·error Needs pygolang!17 to work. /reviewed-by @jerome /reviewed-on !16
-
- 23 Dec, 2021 1 commit
-
-
Kirill Smelkov authored
Even though slapos.core tests seem to mock getgrnam calls[1], disk group is being looked up in /etc/groups for real which fails, if there is no such group, e.g. as ERROR: test_not_existing (slapos.tests.test_slapgrid.TestSlapgridWithDevPermManagerDevPermEmptyLsblk) ---------------------------------------------------------------------- Traceback (most recent call last): File "/srv/slapgrid/slappart3/t/bvi/i/0/parts/slapos.core/slapos/tests/test_slapgrid.py", line 3246, in setUp self.setUpExpected() File "/srv/slapgrid/slappart3/t/bvi/i/0/parts/slapos.core/slapos/tests/test_slapgrid.py", line 3230, in setUpExpected gid = grp.getgrnam("disk").gr_gid KeyError: 'getgrnam(): name not found: disk' -> Fix it up by also creating "disk" group in our namespace environment. I'm not sure, but maybe the better long-term fix would be for slapos.core tests not to access /etc/groups for real and to instead mock access to this database completely. Amends commits e6b7993c and b42ccfa5. /reported-by @tomo /reported-at slapos!1107 (comment 148758) [1] https://lab.nexedi.com/nexedi/slapos.core/blob/1.7.1-28-g0b6bf2af4/slapos/tests/test_slapformat.py#L160-166
-
- 20 Dec, 2021 13 commits
-
-
Kirill Smelkov authored
So that if a test run is canceled in ERP5 UI, nxdtest stops its run soon, instead of after several hours in case of SlapOS.SoftwareReleases.IntegrationTest-* tests. See the main (938b5455) and other patches for details. /helped-and-reviewed-by @jerome /reviewed-on !14 * y/cancel: Add test for cancel propagation tests: Run nxdtest.main for each test in a separate thread, so that pytest.timeout generally works Propagate cancellation to spawned test jobs Raise ctx.err() if test run was cancelled Stop spawned process softly on ctx cancel
-
Jérome Perrin authored
/reviewed-by @kirr /reviewed-on !14
-
Kirill Smelkov authored
Factor-out the driver to run_nxdtest in a separate thread done in 0ad45a9c (Detect if a test leaks processes and terminate them) from test_run_procleak into run_nxdtest, so that all tests are run this way for pytest.timeout, if requested, to work universally for all tests. The logic of how we run nxdtest under tests and handle timeout will be, hopefully, soon reworked more, but it is anyway good to keep this logic in only one place. /reviewed-by @jerome /reviewed-on !14
-
Kirill Smelkov authored
A user might cancel test result in ERP5 UI if e.g. some misbehaviour is detected and a new revision is ready to be tested. This works by test_result.start() returning None - indicating that there is no more test_result_lines to exercise. Master also indicates this cancellation via test_result.isAlive() returning False, but until now we were not using that information and were always waiting for completion of current test job that is already spawned. This works well in practice if individual tests are not long, but e.g. for SlapOS.SoftwareReleases.IntegrationTest-* it is not good, because there an individual test might takes _hours_ to execute. -> Fix it by first setting global context to where we'll propagate cancellation from test_result.isAlive, and by using that context as the base for all other activities. This should terminate spawned test process if test_result is canceled. The interval to check is picked up as 5 minutes not to overload master. @jerome says that We now have 341 active test nodes, but sometimes we are using more, we did in the past to stress test some new machines. For the developer, if we reduce the waiting time from a few hours to 1 minutes or 5 minutes seems more or less equivalent. For 350 testnodes and each nxdtest checking its test_result status via isAlive query to master every 5 minutes, it results in ~ 1 isAlive request/second to master on average. Had to change time to golang.time to use time.after(). Due to that time() and sleep() are changed to time.now() and time.sleep() correspondingly. /helped-and-reviewed-by @jerome /reviewed-on !14
-
Kirill Smelkov authored
This is normal rule to return an error from a task if the task has to abort due to cancellation. We already do this in tee, but not in the function that is waiting for spawned process to complete(*). -> Fix that. Wrap corresponding wg.wait() into try/except and check if it fails due to cancellation, upon which we should not raise, but instead should continue to finish current test_result_line and only after stop test run normally (again without raise, but with regular one line log entry). (*) added in 0ad45a9c (Detect if a test leaks processes and terminate them) /reviewed-by @jerome /reviewed-on !14
-
Kirill Smelkov authored
In 0ad45a9c (Detect if a test leaks processes and terminate them) we organized waiting for spawned processes with handing ctx and sending SIGKILL to the main spawned process on ctx cancel even though other leaked processes are always first sent with SIGTERM and - only after shutdown timeout - later with SIGKILL. This is too brutal. Rework the code to first send SIGTERM to the main spawned test process too, and leverage SIGKILL only later after shutdown timeout. This will be tested in a later patch which exercises how cancel from master is propagated. /reviewed-by @jerome /reviewed-on !14
-
Kirill Smelkov authored
to detect after each test run leaked temporary files, leaked mount entries, to isolate different test runs from each other, and to provide tmpfs on /tmp for every test. The main change and description is in patch1 (a191468f); the other patches fix that up step-by-step to work for real for all our tests. /helped-by @tomo /helped-and-reviewed-by @jerome /reviewed-on !13 * y/unshare: trun: Require FUSE to be working inside user-namespaces to activate them Factor checking whether user-namespaces are available into trun.userns_available() trun: Add test for how /etc/{passwd,group} is setup for spawned job trun: Spawn user test with sole regular uid/gid in /etc/{passwd,group} database trun: Deactivate most capabilities before spawning user test Run each testcase with its own /tmp and /dev/shm
-
Kirill Smelkov authored
FUSE is needed for wendelin.core, and if we don't check that FUSE works inside and activate user namespaces on e.g. Linux 4.9 kernel, wendelin.core will fail to function. Since today wendelin.core is included into ERP5 as its base component(*) it is practical to require FUSE-in-userns support unconditionally. FUSE in user-namespaces started to work in Linux 4.18(+). Detect if it should work inside via checking whether running kernel is newer. I choosed this way for simplicity not to unroll test filesystem to try to mount inside and also not to slow-down startup time. (*) see slapos!874 (+) see https://git.kernel.org/linus/da315f6e0398 and https://git.kernel.org/linus/8cb08329b080. /reviewed-by @jerome /reviewed-on !13
-
Kirill Smelkov authored
In the next patch we'll also add detection whether FUSE works inside user namespaces. Before doing that stop duplicating related code in between trun.py and nxdtest_test.py . /reviewed-by @jerome /reviewed-on !13
-
Jérome Perrin authored
Even if we don't want to spawn full sshd to verify that ptys work, we can still minimally test that inside user/group database is setup as expected. /reviewed-by @kirr /reviewed-on !13
-
Kirill Smelkov authored
Even though libc.opentty stopped to insist on chown(group=tty) for /dev/pts/*, openssh still wants to do it and fails, preventing sshd from working. Fix it by spawning test workload with sole current user and group being present in the password database. We anyway don't have IDs for users/groups different from current uid/gid mapped from current namespace, and the kernel, when seeing a file owned by those, maps them to "nobody/nogroup" for existing files, and rejects chown to those original IDs obtained from parent's namespace /etc/{passwd,group} as EINVAL. For the same reason we don't try to mount our own /dev/pts instance, because we have only current uid/gid mapped to parent namespace and gid=5 maps to nogroup in parent. With existing /dev/pts mount entries are only listed as having nogroup, and from outside they _are_ owned by parent's tty group. If we would mount /dev/pts anew, parent won't see our /dev/pts/* at all which moves us a bit more far from desired behaviour. Still keep root and nobody/nogroup as @jerome reports that without those users Go tests fail on Debian9: slapos!1095 (comment 147177) slapos!1095 (comment 147201) See added comment about all this for more details. This patch fixes sshd to work under tryn.py(*) No test because libc.openpty works both with and without this patch, and it would need too spawn whole sshd under test to verify this. (*) the diff for sshd 1) running successfully under regular user, and 2) previously failing under trun.py without this patch is below: diff --git a/regular-nopam.txt b/trun-nopam.txt index 378ccb6..5b96c08 100644 --- a/regular-nopam.txt +++ b/trun-nopam.txt @@ -1,4 +1,4 @@ -(neo) (z-dev) (g.env) kirr@deca:~/tmp/trashme/sshd$ /sbin/sshd -d -p 2222 -h `pwd`/ssh_host_rsa_key -o UsePAM=no +kirr@deca:~/tmp/trashme/sshd$ /sbin/sshd -d -p 2222 -h `pwd`/ssh_host_rsa_key -o UsePAM=no debug1: sshd version OpenSSH_8.4, OpenSSL 1.1.1k 25 Mar 2021 debug1: private host key #0: ssh-rsa SHA256:y+ujVDqqFBXTclDM2NLy4GME7wReutLcUYOWAeriXdc debug1: setgroups() failed: Operation not permitted @@ -91,35 +91,13 @@ debug1: session_input_channel_req: session 0 req pty-req debug1: Allocating pty. debug1: session_new: session 0 debug1: SELinux support disabled -Attempt to write login records by non-root user (aborting) -debug1: session_pty_req: session 0 alloc /dev/pts/2 -debug1: server_input_channel_req: channel 0 request env reply 0 -debug1: session_by_channel: session 0 channel 0 -debug1: session_input_channel_req: session 0 req env -debug1: server_input_channel_req: channel 0 request shell reply 1 -debug1: session_by_channel: session 0 channel 0 -debug1: session_input_channel_req: session 0 req shell -Starting session: shell on pts/2 for kirr from 127.0.0.1 port 44106 id 0 -debug1: Setting controlling tty using TIOCSCTTY. - -debug1: Received SIGCHLD. -debug1: session_by_pid: pid 693948 -debug1: session_exit_message: session 0 channel 0 pid 693948 -debug1: session_exit_message: release channel 0 -debug1: session_by_tty: session 0 tty /dev/pts/2 -debug1: session_pty_cleanup2: session 0 release /dev/pts/2 -Attempt to write login records by non-root user (aborting) -debug1: session_by_channel: session 0 channel 0 -debug1: session_close_by_channel: channel 0 child 0 -Close session: user kirr from 127.0.0.1 port 44106 id 0 -debug1: channel 0: free: server-session, nchannels 1 -Received disconnect from 127.0.0.1 port 44106:11: disconnected by user -Disconnected from user kirr 127.0.0.1 port 44106 +chown(/dev/pts/2, 1000, 5) failed: Invalid argument debug1: do_cleanup debug1: temporarily_use_uid: 1000/1000 (e=1000/1000) debug1: restore_uid: (unprivileged) +debug1: session_pty_cleanup2: session 0 release /dev/pts/2 +Attempt to write login records by non-root user (aborting) +debug1: audit_event: unhandled event 12 debug1: do_cleanup debug1: temporarily_use_uid: 1000/1000 (e=1000/1000) debug1: restore_uid: (unprivileged) -debug1: audit_event: unhandled event 12 (see slapos!1095 (comment 147018)) /helped-and-reviewed-by @jerome /reviewed-on !13
-
Kirill Smelkov authored
In the previous patch we asked unshare to keep capabilities so that FUSE mounting works from under regular user. However full set of capabilities is too much, and in particular if cap_dac_override is present(*), it means that writes to files that have read-only permission, is not rejected by kernel. -> Adjust trun to retain only those capabilities that we actually need = CAP_SYS_ADMIN to mount things. This should fix the following Go build failure: --- FAIL: TestReadOnlyWriteFile (0.00s) ioutil_test.go:90: Expected an error when writing to read-only file /tmp/TestReadOnlyWriteFile3940340549/blurp.txt FAIL FAIL io/ioutil 0.053s P.S. And if we would unshare to root instead (unshare -Umr) it should be still a good idea to drop extra capabilities, as we still want to reject writes to read-only files. (*) see https://man7.org/linux/man-pages/man7/capabilities.7.html /helped-and-reviewed-by @jerome /reviewed-on !13
-
Kirill Smelkov authored
and detect leaked temporary files and mount entries after each test run. Background Currently we have several testing-related problems that are all connected to /tmp and similar directories: Problem 1: many tests create temporary files for each run. Usually tests are careful to remove them on teardown, but due to bugs, many kind of tests, test processes being hard-killed (SIGKILL, or SIGSEGV) and other reasons, in practice this cleanup does not work 100% reliably and there is steady growth of files leaked on /tmp on testnodes. Problem 2: due to using shared /tmp and /dev/shm, the isolation in between different test runs of potentially different users is not strong. For example @jerome reports that due to leakage of faketime's shared segments separate test runs affect each other and fail: https://erp5.nexedi.net/bug_module/20211125-1C8FE17 Problem 3: many tests depend on /tmp being a tmpfs instance. This are for example wendelin.core tests which are intensively writing to database, and, if /tmp is resided on disk, timeout due to disk IO stalls in fsync on every commit. The stalls are as much as >30s and lead to ~2.5x overall slowdown for test runs. However the main problem is spike of increased latency which, with close to 100% probability, always render some test as missing its deadline. This topic is covered in https://erp5.com/group_section/forum/Using-tmpfs-for--tmp-on-testnodes-JTocCtJjOd -------- There are many ways to try to address each problem separately, but they all come with limitations and drawbacks. We discussed things with @tomo and @jerome, and it looks like that all those problems can be addressed in one go if we run tests under user namespaces with private mounts for /tmp and /dev/shm. Even though namespaces is generally no-go in Nexedi, they seem to be ok to use in tests. For example they are already used via private_tmpfs option in SlapOS: https://lab.nexedi.com/nexedi/slapos/blob/1876c150/slapos/recipe/librecipe/execute.py#L87-103 https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo-input-schema.json#L121-124 https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L11-16 https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L30-34 https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L170-177 ... https://lab.nexedi.com/nexedi/slapos/blob/1876c150/stack/erp5/instance-zope.cfg.in#L227-230 Thomas says that using private tmpfs for each test would be a better solution than implementing tmpfs for whole /tmp on testnodes. He also reports that @jp is OK to use namespaces for test as long as there is a fallback if namespaces aren't available. -> So let's do that: teach nxdtest to run each test case in its own private environment with privately-mounted /tmp and /dev/shm if we can detect that user namespaces are available. In an environment where user namespaces are indeed available this addresses all 3 problems because isolation and being-tmpfs are there by design, and even if some files will leak, the kernel will free everything when test terminates and the filesystem is automatically unmounted. We also detect such leakage and report a warning so that such problems do not go completely unnoticed. Implementation We leverage unshare(1) for simplicity. I decided to preserve uid/gid instead of becoming uid=0 (= `unshare -Umr`) for better traceability, so that it is clear from test output under which real slapuser a test is run(*). Not changing uid requires to activate ambient capabilities so that mounting filesystems, including FUSE-based needed by wendelin.core, continue to work under regular non-zero uid. Please see https://git.kernel.org/linus/58319057b784 for details on this topic. And please refer to added trun.py for details on how per-test namespace is setup. Using FUSE inside user namespaces requires Linux >= 4.18 (see https://git.kernel.org/linus/da315f6e0398 and https://git.kernel.org/linus/8cb08329b080), so if we are really to use this patch we'll have to upgrade kernel on our testnodes, at least where wendelin.core is used in tests. "no namespaces" detection is implemented via first running `unshare ... true` with the same unshare options that are going to be used to create and enter new user namespace for real. If that fails, we fallback into "no namespaces" mode where no private /tmp and /dev/shm are mounted(%). (*) for example nxdtest logs information about the system on startup: date: Mon, 29 Nov 2021 17:27:04 MSK xnode: slapuserX@test.node ... (%) Here is how nxdtest is run in fallback mode on my Debian 11 with user namespaces disabled via `sysctl kernel.unprivileged_userns_clone=0` (neo) (z-dev) (g.env) kirr@deca:~/src/wendelin/nxdtest$ nxdtest date: Thu, 02 Dec 2021 14:04:30 MSK xnode: kirr@deca.navytux.spb.ru uname: Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz >>> pytest $ python -m pytest # user namespaces not available. isolation and many checks will be deactivated. <--- NOTE ===================== test session starts ====================== platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.10.0, pluggy-0.13.1 rootdir: /home/kirr/src/wendelin/nxdtest plugins: timeout-1.4.2 collected 23 items nxdtest/nxdtest_pylint_test.py .... [ 17%] nxdtest/nxdtest_pytest_test.py ... [ 30%] nxdtest/nxdtest_test.py ......xx [ 65%] nxdtest/nxdtest_unittest_test.py ........ [100%] ============= 21 passed, 2 xfailed in 2.67 seconds ============= ok pytest 3.062s # 23t 0e 0f 0s # ran 1 test case: 1·ok /helped-by @tomo /helped-and-reviewed-by @jerome /reviewed-on !13
-
- 09 Dec, 2021 2 commits
-
-
Kirill Smelkov authored
Also log if master told us that we have nothing to do, and if the mode to run is local. This should make it a bit more clear what is going on just by looking at nxdtest log. See previous patch for more details and context. For the reference: here is how updated output looks like in the normal case: date: Thu, 09 Dec 2021 04:20:37 CET xnode: slapuser7@rapidspace-testnode-001 uname: Linux rapidspace-testnode-001 4.9.0-16-amd64 #1 SMP Debian 4.9.272-1 (2021-06-21) x86_64 cpu: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz # running for test_result_module/20211209-170FD3998 >>> pytest $ python -m pytest ============================= test session starts ============================== platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.9.0, pluggy-0.13.1 rootdir: /srv/slapgrid/slappart7/t/dfp/soft/47cc86af27d234f0464630f2a0d22a6f/parts/zodbtools-dev collected 46 items zodbtools/test/test_analyze.py . [ 2%] zodbtools/test/test_commit.py .. [ 6%] zodbtools/test/test_dump.py ... [ 13%] zodbtools/test/test_restore.py .. [ 17%] zodbtools/test/test_tidrange.py ............................. [ 80%] zodbtools/test/test_zodb.py ......... [100%] ========================== 46 passed in 9.15 seconds =========================== ok pytest 12.433s # 46t 0e 0f 0s # ran 1 test case: 1·ok /reviewed-by @jerome /reviewed-on !15
-
Kirill Smelkov authored
Nxdtest logs system information (bd91f6f1 "Include system information into log output") and run summary at the end (9f413221 "Emit run summary at the end"). However all that information currently is printed only if master is successfully connected and actually tells to run the tests. This behaviour is not very useful, because if nxdtest log output on testnode is empty, it is not clear - whether it was "we have to do nothing", or nxdtest stuck somewhere or something else. For example https://nexedijs.erp5.net/#/test_result_module/20211208-47D165B7/12 is currently marked as Running for many long hours already. And the log on testnode regarding nxdtest run is just: 2021-12-08 15:42:17,314 INFO $ PATH=/srv/slapgrid/slappart13/srv/slapos/soft/2956f419073cb2249ed953507fa6b173/bin:/opt/slapos/parts/bison/bin:/opt/slapos/parts/bzip2/bin:/opt/slapos/parts/gettext/bin:/opt/slapos/parts/glib/bin:/opt/slapos/parts/libxml2/bin:/opt/slapos/parts/libxslt/bin:/opt/slapos/parts/m4/bin:/opt/slapos/parts/ncurses/bin:/opt/slapos/parts/openssl/bin:/opt/slapos/parts/pkgconfig/bin:/opt/slapos/parts/python2.7/bin:/opt/slapos/parts/readline/bin:/opt/slapos/parts/sqlite3/bin:/opt/slapos/parts/swig/bin:/opt/slapos/bin:/opt/slapos/parts/patch/bin:/opt/slapos/parts/socat/bin:/usr/bin:/usr/sbin:/sbin:/bin SLAPOS_TEST_LOG_DIRECTORY=/srv/slapgrid/slappart13/var/log/testnode/dgd-xStX9safSG SLAPOS_TEST_SHARED_PART_LIST=/srv/slapgrid/slappart13/srv/shared:/srv/slapgrid/slappart13/t/dgd/shared /bin/sh /srv/slapgrid/slappart13/t/dgd/i/0/bin/runTestSuite --master_url $DISTRIBUTOR_URL --revision slapos=13977-ec686a708633f689382426063c21efbe3b2eab04,slapos.core=8698-91edab77ed36c160da8017cfdc1673fe7a8e10de --test_node_title rapidspace-testnode-008-3Nodes-DEPLOYTASK0 --test_suite SLAPOS-SR-TEST --test_suite_title SlapOS.SoftwareReleases.IntegrationTest-kirr.Python2 --project_title 'Rapid.Space Project' without anything else. With this patch nxdtest would print system information and report how many tests it had run, if its invocation did not stuck. In this patch we only move code that calls system_info and defer summary log before code that connects to master. In the following patch we'll add more logging around connecting to master. /reviewed-by @jerome /reviewed-on !15
-
- 11 Nov, 2021 1 commit
-
-
Kirill Smelkov authored
Sometimes there is zero testcases to be executed on testnodes, and log output from nxdtest is just date: Wed, 10 Nov 2021 12:31:50 MSK xnode: kirr@deca.navytux.spb.ru uname: Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz it is not clear from such output did the run ended or the test got stuck. After this patch it becomes date: Wed, 10 Nov 2021 12:31:50 MSK xnode: kirr@deca.navytux.spb.ru uname: Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz # ran 0 test cases. And in general, when there are several testcases to be run, it is helpful to indicate end of such run and to print brief summary of result status for all ran test cases. Example output: wendelin.core$ nxdtest -k test.wcfs date: Wed, 10 Nov 2021 12:35:34 MSK xnode: kirr@deca.navytux.spb.ru uname: Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz >>> test.wcfs/fs:1 ... ok test.wcfs/fs:1 25.035s # 35t 0e 0f 0s >>> test.wcfs/fs:2 ... ok test.wcfs/fs:2 21.033s # 35t 0e 0f 0s >>> test.wcfs/fs: ... ok test.wcfs/fs: 21.056s # 35t 0e 0f 0s # ran 3 test cases: 3·ok /reviewed-by @jerome /reviewed-on !12
-
- 13 Aug, 2021 2 commits
-
-
Jérome Perrin authored
This parses pylint output with a simple regexp and counts one failure per message reported. This has not been tested yet, but we decided to apply this commit already. /acked-by @kirr /reviewed-on !11
-
Jérome Perrin authored
When using compile with the actual file path, we can have better tracebacks in case of errors. before: Traceback (most recent call last): File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/bin/nxdtest", line 22, in <module> sys.exit(nxdtest.main()) File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 142, in main tenv = loadNXDTestFile('.nxdtest') File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 75, in loadNXDTestFile six.exec_(src, g) File "<string>", line 77, in <module> NameError: name 'Pylint' is not defined after: Traceback (most recent call last): File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/bin/nxdtest", line 22, in <module> sys.exit(nxdtest.main()) File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 142, in main tenv = loadNXDTestFile('.nxdtest') File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 75, in loadNXDTestFile six.exec_(compile(src, os.path.realpath(path), 'exec'), g) File "/srv/slapgrid/slappart3/srv/runner/instance/slappart8/var/nxdtest/.nxdtest", line 77, in <module> summaryf=Pylint.summary, NameError: name 'Pylint' is not defined /reviewed-by @kirr /reviewed-in !10
-
- 12 Aug, 2021 1 commit
-
-
Kirill Smelkov authored
For every TestCase nxdtest spawns test process to run with stdout/stderr redirected to pipes that nxdtest reads. Nxdtest, in turn, tees those pipes to its stdout/stderr until the pipes become EOF. If the test process, in turn, spawns other processes, those other processes will inherit opened pipes, and so the pipes won't become EOF untill _all_ spawned test processes (main test process + other processes that it spawns) exit. Thus, if there will be any process, that the main test process spawned, but did not terminated upon its own exit, nxdtest will get stuck waiting for pipes to become EOF which won't happen at all if a spawned test subprocess persists not to terminate. I hit this problem for real on a Wendelin.core 2 test - there the main test processes was segfaulting and so did not instructed other spawned processes (ZEO, WCFS, ...) to terminate. As the result the whole test was becoming stuck instead of being promptly reported as failed: runTestSuite: Makefile:175: recipe for target 'test.wcfs' failed runTestSuite: make: *** [test.wcfs] Segmentation fault runTestSuite: wcfs: 2021/08/09 17:32:09 zlink [::1]:52052 - [::1]:23386: recvPkt: EOF runTestSuite: E0809 17:32:09.376800 38082 wcfs.go:2574] zwatch zeo://localhost:23386: zlink [::1]:52052 - [::1]:23386: recvPkt: EOF runTestSuite: E0809 17:32:09.377431 38082 wcfs.go:2575] zwatcher failed -> switching filesystem to EIO mode (TODO) <LONG WAIT> runTestSuite: PROCESS TOO LONG OR DEAD, GOING TO BE TERMINATED -> Fix it. /reviewed-by @jerome /reviewed-on !9
-
- 01 Dec, 2020 1 commit
-
-
Jérome Perrin authored
re.match only find matches where the pattern appears at the beginning of the string, whereas re.search matches if the pattern appears anywhere in the string. This is behavior is consistent with pytest, go test and ERP5's runUnitTest For more details, see the discussion from !6 (comment 121409) /reviewed-on: !8
-
- 26 Nov, 2020 1 commit
-
-
Kirill Smelkov authored
The reason is that with threading.Thread if exception happens in that spawned thread, this error is not propagated to main driver, while with sync.WorkGroup an exception from any spawned worker is propagated back to main. For example with the following injected error --- a/nxdtest/__init__.py +++ b/nxdtest/__init__.py @@ -267,6 +267,7 @@ def main(): # tee, similar to tee(1) utility, copies data from fin to fout appending them to buf. def tee(ctx, fin, fout, buf): + 1/0 while 1: before this patch nxdtest behaves like ... (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ nxdtest date: Tue, 24 Nov 2020 14:55:08 MSK xnode: kirr@deco.navytux.spb.ru uname: Linux deco 5.9.0-2-amd64 #1 SMP Debian 5.9.6-1 (2020-11-08) x86_64 cpu: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz >>> pytest $ python -m pytest Exception in thread Thread-2: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 754, in run self.__target(*self.__args, **self.__kwargs) File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee 1/0 ZeroDivisionError: integer division or modulo by zero Exception in thread Thread-1: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 754, in run self.__target(*self.__args, **self.__kwargs) File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee 1/0 ZeroDivisionError: integer division or modulo by zero error pytest 0.583s # 1t 1e 0f 0s (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ echo $? 0 Here the error in another thread is only printed, but nxdtest is not aborted. Above it reported "error", but e.g. when testing pygolang/py3 and raising an error in tee it even reported it was succeeding ( !6 (comment 121393) ): slapuser34@vifibcloud-rapidspace-hosting-007:~/srv/runner/instance/slappart0$ ./bin/runTestSuite date: Tue, 24 Nov 2020 12:51:23 MSK xnode: slapuser34@vifibcloud-rapidspace-hosting-007 uname: Linux vifibcloud-rapidspace-hosting-007 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 cpu: Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz >>> thread $ python -m pytest Exception in thread Thread-1: Traceback (most recent call last): File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/srv/slapgrid/slappart34/srv/runner/software/44fe7dd3f13ecd100894c6368a35c055/parts/nxdtest/nxdtest/__init__.py", line 268, in tee fout.write(data) TypeError: write() argument must be str, not bytes ok thread 9.145s # 1t 0e 0f 0s >>> gevent $ gpython -m pytest Exception in thread Thread-3: Traceback (most recent call last): File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/srv/slapgrid/slappart34/srv/runner/software/44fe7dd3f13ecd100894c6368a35c055/parts/nxdtest/nxdtest/__init__.py", line 268, in tee fout.write(data) TypeError: write() argument must be str, not bytes ok gevent 21.980s # 1t 0e 0f 0s After this patch nxdtest correctly handles and propagates an error originating in spawned thread back to main driver: (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ nxdtest date: Tue, 24 Nov 2020 14:54:19 MSK xnode: kirr@deco.navytux.spb.ru uname: Linux deco 5.9.0-2-amd64 #1 SMP Debian 5.9.6-1 (2020-11-08) x86_64 cpu: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz >>> pytest $ python -m pytest Traceback (most recent call last): File "/home/kirr/src/wendelin/venv/z4-dev/bin/nxdtest", line 11, in <module> load_entry_point('nxdtest', 'console_scripts', 'nxdtest')() File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 230, in main wg.wait() File "golang/_sync.pyx", line 237, in golang._sync.PyWorkGroup.wait pyerr_reraise(pyerr) File "golang/_sync.pyx", line 217, in golang._sync.PyWorkGroup.go.pyrunf f(pywg._pyctx, *argv, **kw) File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee 1/0 ZeroDivisionError: integer division or modulo by zero (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ echo $? 1 NOTE sync.WorkGroup requires every worker to handle context cancellation, so that whenever there is an error, all other workers are canceled. We add such cancellation handling to tee but only lightly: before going to block in read/write syscalls we check for whether ctx is canceled or not. However the proper handling would be to switch file descriptors into non-block mode and to select at every IO point on both potential IO events and potential cancellation. This is left as TODO for the future. /reviewed-on !7
-
- 24 Nov, 2020 5 commits
-
-
Jérome Perrin authored
These are the necessary changes to run `SlapOS.Eggs.UnitTest-*` and `SlapOS.SoftwareReleases.IntegrationTest-*` using nxdtest See merge request !6 /reviewed-by @kirr
-
Jérome Perrin authored
If test program output on stderr (which is unbuffered), the output of the test program will appear before output from nxdtest advertising the program that is about to be executed, because nxdtest stdout is buffered (testnode does not set PYTHONUNBUFFERED, and eventhough nxdtest sets PYTHONUNBUFFERED in its own environ, this only applies to sub processes)
-
Jérome Perrin authored
-
Jérome Perrin authored
While pytest sends everything in stdout, some other programs send on stderr.
-
Jérome Perrin authored
While treating output as text would not really be impossible, treating it as bytes seems a better choice because: - we don't have to make assumptions about what output encoding the test program is using for output - `tee` can just read stream output bytes by bytes without having to worry about multi-bytes characters - testnode protocol uses xmlrpc.client.Binary, which uses bytes. Because using bufsize=1 implies reading subprocess output as text, we use bufsize=0 instead in the subprocess.Popen call, to prevent buffering. To make manipulation of strings and bytes easier, we add a dependency on pygolang, so that we can use its strings utility functions. Also add a few tests to verify general functionality.
-
- 09 Nov, 2020 1 commit
-
-
Kirill Smelkov authored
For example wendelin.core 2 uses pytest_unconfigure to unmount wcfs servers that it spawned: https://lab.nexedi.com/kirr/wendelin.core/blob/1be4d730/conftest.py#L27-53 This usually leads to some test as shown in added test. -> Skip that text backwards and actually detect pytest summary line instead of returning empty dict, because empty dict forces Nexedi ERP5 to treat such test run to have UNKNOWN status, e.g. https://erp5.nexedi.net/test_result_module/20201108-C170850/11 /cc @jerome /reviewed-on !5
-
- 02 Nov, 2020 1 commit
-
-
Kirill Smelkov authored
This allows to establish tests for Nxdtest itself under Nexedi testing infrastructure. /cc @jerome /reviewed-on !4
-
- 30 Oct, 2020 2 commits
-
-
Kirill Smelkov authored
lab.nexedi.com/kirr/nxdtest -> lab.nexedi.com/nexedi/nxdtest slapos!839 (comment 119120) slapos!839 (comment 119142) /cc @jerome /reviewed-on !3
-
Kirill Smelkov authored
/cc @jerome /reviewed-on !1
-
- 28 Oct, 2020 1 commit
-
-
Kirill Smelkov authored
Verify pytest summary-parsing functionality.
-
- 20 Oct, 2020 1 commit
-
-
Kirill Smelkov authored
Create real nxdtest module, so that things could be imported from there. As the result switch `nxdtest` program from scripts to entry_point.
-