nexedi issues: https://lab.nexedi.com/groups/nexedi/-/issues

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/9
Inconsistency in behaviour between numpy.ndarray and ZBigArray
2022-11-10, Levin Zimmermann

Hello,
I encountered an unexpected behavior when using `wendelin.bigarray.array_zodb.ZBigArray`.
When a `ZBigArray` is initialized with a list for the `shape` parameter, the `append` method raises an error:
```python
>>> from wendelin.bigarray.array_zodb import ZBigArray
>>> zbig_array = ZBigArray(shape=[1], dtype=float)
>>> zbig_array.append([1])
ValueError: all the input array dimensions except for the concatenation axis must match exactly
```
This error is raised because the check comes down to `[] != ()` [here](https://lab.nexedi.com/nexedi/wendelin.core/blob/master/bigarray/__init__.py#L232):
`zbig_array.shape` returns a list, but the `shape` property of the [ad-hoc created array which is supposed to be appended](https://lab.nexedi.com/nexedi/wendelin.core/blob/master/bigarray/__init__.py#L228) returns a tuple.
It's fair to say this is actually a user error, because the [numpy documentation clearly specifies the `shape` parameter as a tuple of ints](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).
Nevertheless, recent versions of `numpy.ndarray` *still do accept* a list as valid input for `shape`:
```python
>>> import numpy
>>> numpy.__version__
1.23.1
>>> numpy.ndarray(shape=[2]).shape
(2,)
>>> numpy.ndarray(shape=[2]).shape == numpy.ndarray(shape=(2,)).shape
True
```
What is the preferred way to handle this difference in behavior between `numpy.ndarray` and `wendelin.bigarray.array_zodb.ZBigArray`?
I've thought about:
1. Simply wrapping `shape` in a `tuple()` call [here](https://lab.nexedi.com/nexedi/wendelin.core/blob/master/bigarray/__init__.py#L95) (e.g. `self._shape = tuple(shape)`; see the sketch after this list).
2. Somehow improving the error message. It's perhaps not a big deal, but the failure can be confusing and time-consuming to track down when the described `append` scenario is hit with a bad `shape` parameter.
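To illustrate option 1, here is a minimal, self-contained sketch of the normalization idea (the helper name is mine and not wendelin.core code); it only shows that coercing the list to a tuple makes the stored shape comparable with numpy's tuple shapes:
```python
import numpy as np

def normalize_shape(shape):
    # Coerce whatever iterable the caller passed (list, tuple, ...) into the
    # tuple-of-ints form that numpy itself exposes via ndarray.shape.
    return tuple(int(n) for n in shape)

assert normalize_shape([1]) == (1,)
assert normalize_shape([1]) == np.empty([1]).shape   # comparable with numpy shapes
```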
Best, Levin
/cc @kirr

https://lab.nexedi.com/nexedi/slapos.core/-/issues/1
slapos.cfg: Field names in the [networkcache] section
2022-01-27, Boxiang Sun

While setting up the shacache upload info, I found a couple of issues that may need clarification.
1. When running the command `/opt/slapos/bin/generate-signature-key /etc/opt/slapos/slapos.cfg`,
the command `generate-signature-key` tries to find the field names `signature-certificate-file` and `signature-private-key-file`: https://lab.nexedi.com/nexedi/slapos.libnetworkcache/blob/master/slapos/signature.py#L48
But the field names in `slapos.cfg` are `signature_private_key_file` and `signature_certificate_file`: https://lab.nexedi.com/nexedi/slapos.core/blob/master/slapos.cfg.example#L59
I suppose it is just a naming-style mismatch, but shouldn't we make it consistent? For now I just edited my `/etc/opt/slapos/slapos.cfg` manually and moved forward (a tolerant lookup is sketched after this list).
2. The files that Julian gave me are `ca.crt` and `boxiang.crt`, but `/etc/opt/slapos/slapos.cfg` uses `*.cert` as the extension. Maybe we should unify that too?
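For illustration only, here is a hypothetical tolerant lookup (not the actual slapos code) that accepts both the dashed spelling used by `slapos.libnetworkcache` and the underscored spelling from `slapos.cfg.example`:
```python
from configparser import ConfigParser

def get_option(cfg, section, name):
    # Try the requested name plus its dash/underscore variants.
    for candidate in (name, name.replace('-', '_'), name.replace('_', '-')):
        if cfg.has_option(section, candidate):
            return cfg.get(section, candidate)
    raise KeyError("%s:%s" % (section, name))

cfg = ConfigParser()
cfg.read_string("[networkcache]\n"
                "signature_certificate_file = /etc/opt/slapos/signature.cert\n")
print(get_option(cfg, "networkcache", "signature-certificate-file"))
```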
/cc @romain @jm

https://lab.nexedi.com/nexedi/slapos.package/-/issues/1
Missing option: python:executable
2020-06-29, Leo Le Bouter

Hello!
I am trying to port slapos to ppc64le (or ppc64el, as Debian calls it) and, after applying a [few modifications](https://gitlab.com/slapos-ppc64le/slapos/-/merge_requests/1) to the slapos tree, I need to create re6st and slapos packaging based on this new tree.
I started with re6st, as it is the packaging that is non-trivial to modify in place, considering it works with the download-cache and extends-cache.
I ran:
```
git clone https://lab.nexedi.com/nexedi/slapos.package.git
# commit hash at time of writing: 722f0e8d03f093148379d1d18647e27ad193fe3c
cd slapos.package/obs/re6st
./make
```
And I was met with the following error, with not much of an idea what caused it:
```
# Processing buildout: sdist -> build/opt/re6st/.installed.cfg
While:
Installing.
Loading extensions.
Error: Missing option: python:executable
Traceback (most recent call last):
File "./make", line 323, in <module>
sys.exit(main())
File "./make", line 232, in main
tasks[t](args.dry_run)
File "./make", line 56, in __call__
self._run = run = run(dry_run)
File "./make", line 162, in _run
deps.append((dep, dep(dry_run)))
File "./make", line 56, in __call__
self._run = run = run(dry_run)
File "./make", line 162, in _run
deps.append((dep, dep(dry_run)))
File "./make", line 56, in __call__
self._run = run = run(dry_run)
File "./make", line 186, in _run
self.run(self)
File "make.py", line 139, in buildout
check_call(("bin/buildout",), cwd=BUILD)
File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('bin/buildout',)' returned non-zero exit status 1
```
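For what it's worth, `Missing option: python:executable` is zc.buildout's generic `section:option` complaint, i.e. the section that the `python` option of `[buildout]` points at defines no `executable`. A hypothetical way to see what the generated config actually defines (the path below is an assumption based on the build layout mentioned above, not something confirmed by the repository):
```python
import sys

# Print the [python] section of the generated buildout config, if any.
section = None
with open("build/opt/re6st/buildout.cfg") as f:
    for line in f:
        if line.startswith("["):
            section = line.strip()
        if section == "[python]":
            sys.stdout.write(line)
```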
I tried deleting everything and redoing it all a few times, with not much luck.
Any pointers welcome!
Thank you!

https://lab.nexedi.com/nexedi/ebulk/-/issues/1
HTTP ERROR in POST requests: handshake_failure
2020-06-09, Roque

The HTTP POST request made in wendelin_client::handleRequest (at ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/wendelin_client.rb) fails with the error: **Received fatal alert: handshake_failure**
The code of the request:
```ruby
req = Net::HTTP::Post.new(uri)
res = Net::HTTP.start(uri.hostname, uri.port,
  :use_ssl => (uri.scheme == 'https'),
  :verify_mode => OpenSSL::SSL::VERIFY_NONE,
  :ssl_timeout => 300, :open_timeout => 300, :read_timeout => 300,
) do |http|
  http.request(req)
end
```
The handshake_failure could be caused by an incompatibility with the server (certificates, SSL/TLS versions, etc.).
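As a debugging aid (my addition, not part of ebulk), one way to narrow this down is to probe which TLS protocol versions the server accepts. The sketch below uses Python's `ssl` module; the host and port are placeholders for the actual wendelin endpoint:
```python
import socket, ssl

host, port = "wendelin.example.com", 443   # placeholder endpoint

for proto in ("TLSv1_2", "TLSv1_1", "TLSv1"):
    try:
        ctx = ssl.SSLContext(getattr(ssl, "PROTOCOL_" + proto))
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE      # mirror VERIFY_NONE above
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                print(proto, "ok:", tls.version(), tls.cipher())
    except Exception as exc:
        print(proto, "failed:", exc)
```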

https://lab.nexedi.com/nexedi/neoppod/-/issues/1
Pip installation failed with neoppod 1.10 and python 3.6
2020-03-17, Ghost User

Hi. With python 3.6.7 I am getting the following error just by installing the pip package:
```
$ pip install neoppod
Collecting neoppod
Downloading https://files.pythonhosted.org/packages/02/c4/fc58a50ff0aec5ffe3c3c8834b1620224118d7492291732f0294e50ea04e/neoppod-1.10.tar.gz (320kB)
100% |████████████████████████████████| 327kB 5.5MB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-m6hdctki/neoppod/setup.py", line 52, in <module>
_get_long_description = DistributionMetadata.get_long_description.im_func
AttributeError: 'function' object has no attribute 'im_func'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-m6hdctki/neoppod/
```
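For context, `im_func` is an attribute that unbound methods have on Python 2 only; on Python 3 it is gone, which is why this release's `setup.py` fails there. A tiny, self-contained illustration (not neoppod code):
```python
class Meta(object):
    def get_long_description(self):
        return "..."

# On Python 2 this prints True (unbound methods expose im_func);
# on Python 3 it prints False, and accessing .im_func raises AttributeError
# just like in the pip output above.
print(hasattr(Meta.get_long_description, "im_func"))
```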

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/8
Protocol error when committing transaction in an existing ZODB in VM shared directory
2020-03-21, Ghost User

The following code raises an ```OSError: [Errno 71] Protocol error``` at the **second transaction commit** when run with the shared directory of my VM as working directory. The wendelin version is 0.8.
The first set of instructions creates the database properly.
When we re-open it, the content is still available and correct.
But when the second set of instructions tries to add a new array and commit it, the error is raised.
I could not reproduce the error in another folder.
```python
import numpy as np
import transaction
from wendelin.bigarray.array_zodb import ZBigArray
from wendelin.lib.zodb import dbopen, dbclose
# Create the database
r = dbopen('test')
r['arr1'] = ZBigArray((30, 30), np.float)
transaction.commit()
dbclose(r)
# Re-open the same database
r = dbopen('test')
r['arr2'] = ZBigArray((30, 30), np.float)
transaction.commit() # OSError is raised here
dbclose(r)
```
Here is the stacktrace :boom:
```
Traceback (most recent call last):
File "/home/efn509/.PyCharmCE2016.3/config/scratches/scratch.py", line 9, in <module>
transaction.commit()
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/transaction/_manager.py", line 124, in commit
return self.get().commit()
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/transaction/_transaction.py", line 311, in commit
reraise(t, v, tb)
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/transaction/_compat.py", line 56, in reraise
raise value
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/transaction/_transaction.py", line 302, in commit
self._commitResources()
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/transaction/_transaction.py", line 447, in _commitResources
reraise(t, v, tb)
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/transaction/_compat.py", line 56, in reraise
raise value
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/transaction/_transaction.py", line 424, in _commitResources
rm.tpc_vote(self)
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/ZODB/Connection.py", line 786, in tpc_vote
s = vote(transaction)
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/ZODB/FileStorage/FileStorage.py", line 720, in tpc_vote
cp(self._tfile, self._file, dlen)
File "/home/efn509/anaconda3/envs/wendelin/lib/python3.5/site-packages/ZODB/utils.py", line 141, in cp
data = read(n)
OSError: [Errno 71] Protocol error
```
Here are the directory permissions:
```shell
abc999@debian8-abc999:/media/sf_shared$ ls -ld /media/sf_shared/
drwxrwx--- 1 root vboxsf 4096 Jan 4 15:04 /media/sf_shared/
```
And the user belongs to the ```vboxsf``` group and thus has all the permissions.
Also, here are the details of the newly created ZODB:
```shell
-rwxrwx--- 1 root vboxsf 898 Jan 4 15:25 test
-rwxrwx--- 1 root vboxsf 56 Jan 4 15:25 test.index
-rwxrwx--- 1 root vboxsf 7 Jan 4 15:25 test.lock
-rwxrwx--- 1 root vboxsf 727 Jan 4 15:25 test.tmp
```
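As a possible way to isolate the problem (my suggestion, not from the report): the traceback fails inside `ZODB.utils.cp()`, which copies the temporary transaction file into the main storage file during `tpc_vote`. A small stand-alone write-then-read probe on the vboxsf mount can show whether the `EPROTO` comes from the shared-folder filesystem itself rather than from wendelin.core; the file name below is made up:
```python
import os

path = "/media/sf_shared/protocol_error_probe.bin"   # hypothetical test file

# Write an initial chunk, then append and read back across the boundary,
# roughly mimicking the append-to-Data.fs pattern that fails above.
with open(path, "wb") as f:
    f.write(b"x" * 1024)
with open(path, "rb+") as f:
    f.seek(0, os.SEEK_END)
    f.write(b"y" * 1024)
    f.seek(512)
    data = f.read(1024)
print("read %d bytes back" % len(data))
os.remove(path)
```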

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/7
wendelin.core can kill a zope process
2020-03-21, Ivan Tyagov

I was able to kill a zope process by doing the following at the same time:
1. Append to ZBigArray
2. Read from ZBigArray
The following assertion failure was shown in stderr:
python2.7: bigfile/_bigfile.c:549: pybigfile_loadblk: Assertion `!(pybuf->ob_refcnt != 1)' failed.

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/6
storeblk() vs DB lock deadlock
2020-03-21, Kirill Smelkov

The following was seen on a project: a Zope thread was stuck for a very long time with memory usage growing. This is probably similar to the loadblk() vs DB lock deadlock case we already solved in f49c11a3. Details:
In slappart9/zope-0 we currently have:
```
.../slappart9/var/log$ grep -v '^$' longrequest_logger_zope-0.log |grep -o 'Thread .*:' |sort |uniq
Thread 140164638697216:
Thread 140164647089920:
Thread 140164991964928:
Thread 140165000357632:
```
-> 4 long running threads. Here is latest info about them:
```
.../slappart9/var/log$ grep -v '^$' longrequest_logger_zope-0.log |tail -4
2016-10-21 09:24:44,515 - Thread 140165000357632: Started on 1476162420.8; Running for 879463.7 secs; Same.
2016-10-21 09:24:44,515 - Thread 140164647089920: Started on 1476162461.5; Running for 879423.0 secs; Same.
2016-10-21 09:24:44,516 - Thread 140164991964928: Started on 1476162461.5; Running for 879423.0 secs; Same.
2016-10-21 09:24:44,516 - Thread 140164638697216: Started on 1476162461.7; Running for 879422.8 secs; Same.
```
so yes, "Running for 879463.7 secs" means a thread ate 244 hours of CPU time, and
"Started on 1476162420.8" means it was started on:
```
In [1]: import time
In [2]: time.ctime(1476162420.8)
Out[2]: 'Tue Oct 11 08:07:00 2016'
```
-> On Oct 11 morning.
Now I've gone to slappart9/srv/backup/logrotate and tracked those thread origins in
longrequest_logger_zope-0.log-*.gz (along the way verifying that nothing new
long-running started or stopped in between), so here is the original stuck-thread info:
```
2016-10-11 05:07:41,917 - Thread 140165000357632: Started on 1476162420.8; Running for 41.1 secs; request: GET http://:/Control_Panel/timer_service/process_timer
retry count: 0
form: {}
other: {'ACTUAL_URL': 'http://:/Control_Panel/timer_service/process_timer',
'AUTHENTICATED_USER': <SpecialUser 'Anonymous User'>,
'AUTHENTICATION_PATH': '',
'AcceptLanguage': {},
'PARENTS': [<Activity Tool at /erp5/portal_activities>, <ERP5Site at /erp5>],
'PUBLISHED': <bound method TimerService.process_timer of <TimerService at /timer_service used for /Control_Panel>>,
'RESPONSE': TimerResponse(''),
'SERVER_URL': 'http://:',
'TraversalRequestNameStack': [],
'URL': 'http://:/Control_Panel/timer_service/process_timer',
'interval': 1,
'method': 'GET'}
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 138, in publish
request, bind=1)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/mapply.py", line 77, in mapply
if debug is not None: return debug(object,args,context)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 48, in call_object
result=apply(object,args) # Type s<cr> to step into published object.
File ".../parts/erp5/product/TimerService/TimerService.py", line 102, in process_timer
DateTime(prev_tick), DateTime(next_tick))
File ".../parts/erp5/product/CMFActivity/ActivityTool.py", line 1040, in process_timer
self.tic(processing_node_list.index(currentNode) + 1)
File ".../parts/erp5/product/CMFActivity/ActivityTool.py", line 1095, in tic
if not last.dequeueMessage(inner_self, processing_node):
File ".../parts/erp5/product/CMFActivity/Activity/SQLBase.py", line 509, in dequeueMessage
transaction.commit()
File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_manager.py", line 89, in commit
return self.get().commit()
File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 329, in commit
self._commitResources()
File ".../eggs/Products.TIDStorage-5.4.9-py2.7.egg/Products/TIDStorage/transaction_transaction.py", line 261, in _commitResources
result = original__commitResources(self, *args, **kw)
File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 443, in _commitResources
rm.commit(self)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 716, in commit
self.zfileh.dirty_writeout(WRITEOUT_STORE)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 406, in storeblk
def storeblk(self, blk, buf): return self.zself.storeblk(blk, buf)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 496, in storeblk
blkchanged = zblk.setblkdata(buf)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 311, in setblkdata
if (start % CHUNKSIZE) or len(chunk.data) > CHUNKSIZE:
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 860, in setstate
self._setstate(obj)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 901, in _setstate
p, serial = self._storage.load(obj._p_oid, '')
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/ClientStorage.py", line 833, in load
data, tid = self._server.loadEx(oid)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/ServerStub.py", line 176, in loadEx
return self.rpc.call("loadEx", oid)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 763, in call
r_args = self.wait(msgid)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 791, in wait
self.replies_cond.wait()
File ".../parts/python2.7/lib/python2.7/threading.py", line 340, in wait
waiter.acquire()
```
( 140165000357632 seems to be an activity executor thread triggered to
run under /timer_service/process_timer; some activity involving
wendelin.core has been run. Then there is a commit, which involves
wendelin.core writeout activity. wendelin.core wants to store an
object to ZEO and this in turn triggers another load from ZEO,
and that load is stuck forever for some reason )
```
2016-10-11 05:07:45,739 - Thread 140164638697216: Started on 1476162461.7; Running for 4.0 secs; [No request]
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
object=request.traverse(path, validated_hook=validated_hook)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
object=object.__bobo_traverse__(request)
File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
conn = db.open()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 725, in open
self._a()
File ".../parts/python2.7/lib/python2.7/threading.py", line 174, in acquire
rc = self.__block.acquire(blocking)
```
```
2016-10-11 05:07:45,537 - Thread 140164991964928: Started on 1476162461.5; Running for 4.0 secs; [No request]
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
object=request.traverse(path, validated_hook=validated_hook)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
object=object.__bobo_traverse__(request)
File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
conn = db.open()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 725, in open
self._a()
File ".../parts/python2.7/lib/python2.7/threading.py", line 174, in acquire
rc = self.__block.acquire(blocking)
```
( Above are just 2 std zope threads - they want to open the database
first but are stuck on the lock being taken )
```
2016-10-11 05:07:45,474 - Thread 140164647089920: Started on 1476162461.5; Running for 4.0 secs; [No request]
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
object=request.traverse(path, validated_hook=validated_hook)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
object=object.__bobo_traverse__(request)
File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
conn = db.open()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 751, in open
result.open(transaction_manager)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 534, in Connection_open
orig_Connection_open(self, transaction_manager, delegate)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 1046, in open
self._flush_invalidations()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 539, in _flush_invalidations
self._cache.invalidate(invalidated)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 152, in _p_invalidate
self._v_zfile.invalidateblk(self._v_blk)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 505, in invalidateblk
fileh.invalidate_page(blk) # XXX assumes blksize == pagesize
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 635, in invalidate_page
return self.zfileh.invalidate_page(pgoffset)
```
( this seems to be another worker thread in between transactions which
received an invalidation from the DB and this way also triggers wendelin.core
invalidation, which then wants to acquire the virtmem lock and waits
forever because that lock is already held by the writeout code )
All this looks similar to the deadlock we already solved between loadblk()
and the ZEO lock (https://lab.nexedi.com/nexedi/wendelin.core/commit/f49c11a3).
Here we probably have some kind of storeblk() vs ZEO lock deadlock (I've
checked that storeblk() is still run from under the virtmem lock), though it
is not clear off-hand why the memory usage is growing.
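To make the suspected lock inversion easier to see, here is a pure-Python sketch (mine, not wendelin.core code) of the ordering conflict described above: one thread takes the virtmem lock and then needs the DB/ZEO side, while another holds the DB side and then needs the virtmem lock:
```python
import threading, time

virtmem = threading.Lock()   # stands in for wendelin.core's virtmem lock
dbside  = threading.Lock()   # stands in for the ZODB/ZEO-side lock

def writeout():              # ~ commit -> storeblk() -> load from ZEO
    with virtmem:
        time.sleep(0.1)
        with dbside:
            pass

def invalidation():          # ~ Connection.open -> invalidateblk()
    with dbside:
        time.sleep(0.1)
        with virtmem:
            pass

t1 = threading.Thread(target=writeout);     t1.daemon = True
t2 = threading.Thread(target=invalidation); t2.daemon = True
t1.start(); t2.start()
t1.join(2); t2.join(2)
print("deadlocked" if (t1.is_alive() or t2.is_alive()) else "finished")
```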
/cc @Tyagov, @klaus

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/5
Memory is managed per-process not per whole-system
2022-11-27, Kirill Smelkov

We had a case where several processes using wendelin.core work together at the same time; if the first process starts with some head start and allocates most of the memory, those that come later frequently run into an OOM condition.
Here is what's happening:
- wendelin.core inside every process allocates RAM from /dev/shm
- when a page cannot be allocated wendelin.core tries to free some
already allocated memory and then retries allocation
- if wendelin.core sees it cannot free any memory, then it reports out of
memory (OOM) happened.
https://lab.nexedi.com/nexedi/wendelin.core/blob/e73e22ea/bigfile/virtmem.c#L566
- the problem is: under memory pressure different processes do
not feel the pressure caused by each other. If process A is
tight on memory it only tries to reclaim memory that A itself allocated, and there
is no message to a second process B to reclaim some memory.
- so what happens is: first process runs, after some time allocates almost all
memory from /dev/shm:
```
root@debian-iliya:/srv/slapgrid/slappart17/srv/runner/CollatzConjecture# df -h /dev/shm/
Filesystem Size Used Avail Use% Mounted on
tmpfs 1.5G 1.5G 51M 97% /dev/shm
root@debian-iliya:/srv/slapgrid/slappart17/srv/runner/CollatzConjecture# lsof /dev/shm/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 21681 root DEL REG 0,17 3314022 /dev/shm/ramh.SeNAWD
python 21681 root 9u REG 0,17 1072080879616 3302369 /dev/shm/ramh.ANROVP (deleted)
python 21681 root 10u REG 0,17 1099442421760 3314022 /dev/shm/ramh.SeNAWD (deleted)
```
- when the second process is run it has only a very small pool of free RAM in
/dev/shm (51M in the above example), and needs to reclaim constantly.
- the first process also continues to allocate memory and do periodic reclaims,
so at some point in time there can be a situation where process 2 really has no
memory left to allocate from /dev/shm.
The only solution is to make the reclaim part of the virtual memory manager work not
per-process but system-wide. This is what wendelin.core v2 is supposed to do
by moving the virtual memory manager back into the kernel.
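As a small side note (my addition, not from the report), the shared pool that all these per-process allocators compete for can be watched with a few lines of Python:
```python
import os

st = os.statvfs("/dev/shm")
free  = st.f_bavail * st.f_frsize
total = st.f_blocks * st.f_frsize
print("free %.1f MiB of %.1f MiB (%.0f%% used)"
      % (free / 2.0**20, total / 2.0**20, 100.0 * (1 - free / float(total))))
```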
/cc @Tyagov, @Thetechguy, @klaus

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/4
fallocate problem on kernel 3.2.60-1+deb7u3
2021-11-23, Douglas

While trying to use wendelin.core (through ERP5's DataArray) I found the following error:
```
bigfile/ram_shmfs.c:129 shmfs_alloc_page WARN: fallocate() not supported
bigfile/virtmem.c:930 OOM BUG!
```
I checked the source a little bit and realized that /dev/shm is mounted as a tmpfs partition. A quick search suggests that tmpfs doesn't support fallocate. I'm not really sure about this, because I also found patches adding fallocate support to tmpfs, but no information on whether they were merged upstream.
My machine is running Debian 7.6 with kernel 3.2.60-1+deb7u3.
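For reference (my addition, not from the report): tmpfs only gained fallocate() support in later kernels (3.5, if I remember correctly), so a Debian 3.2 kernel would be expected to lack it. A quick hypothetical probe, calling the raw fallocate(2) through ctypes on a file in /dev/shm, shows whether the running kernel supports it:
```python
import ctypes, ctypes.util, os, tempfile

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_long, ctypes.c_long]

fd, path = tempfile.mkstemp(dir="/dev/shm")
try:
    if libc.fallocate(fd, 0, 0, 1 << 20) != 0:       # ask for 1 MiB
        print("fallocate failed: %s" % os.strerror(ctypes.get_errno()))
    else:
        print("fallocate supported")
finally:
    os.close(fd)
    os.unlink(path)
```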

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/3
NEO / w.c test -> RAM leak
2021-11-23, Kirill Smelkov

(reported by @Tyagov)
Hi,
I let my quick feed into a NEO w.c. array run overnight.
Now it is simply out of space and the array shape is
(23379050506,)
Here's my script:
This script will consume as much memory as the machine has and eventually it will
be killed by the kernel.
So it's worth investigating why we have this leak and why garbage collection
is not working with the w.c. array (my wild guess for now).
```python
from wendelin.bigarray.array_zodb import ZBigArray
from wendelin.lib.zodb import dbopen, dbclose
import transaction
import numpy as np
root = dbopen('neo://neo-standalone-comp-2477@[2001:67c:1254:YY::ZZZZ]:2051')
#root['B'] = A = ZBigArray((10,), np.float64)
#transaction.commit()
A = root['B']
slice_index = 10000
shape = A.shape[0]
print shape
# iterate over array
for i in xrange(0, shape, slice_index):
    np.average(A[i:i+slice_index])
#for i in range(20000):
# A.append( np.random.rand(4*1024*1024) ) #np.arange(4*1024*1024) )
# print i, A.shape
# transaction.commit()
```
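One hypothetical aid for the investigation (my addition, not part of the script above): track the process resident set size while the loop runs, so that it is visible whether memory grows with the amount of data read. A self-contained helper could look like this:
```python
import resource

def rss_mib():
    # ru_maxrss is reported in KiB on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

if __name__ == "__main__":
    print("max RSS so far: %.1f MiB" % rss_mib())
```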
To simulate:
1. ssh root@2001:67c:1254:XX::1
2. su ivan; cd /home/ivan
3. . test/bin/activate
4. ipython feed.py
Regards
Ivan

https://lab.nexedi.com/nexedi/wendelin.core/-/issues/1
wendelin.core cannot be built on Python 3.5.1
2020-03-21, Kirill Smelkov

( reported by @kazuhiko )
Hi !
When I was preparing a demonstration of wendelin.core with the latest
anaconda, based on Python 3.5.1, I found that wendelin.core could not be built
on Python 3.5.1, because _PyThreadState_Current is now hidden.
Here are some links that might be helpful.
https://bugs.python.org/issue25150
https://bugs.python.org/issue26154
https://github.com/vmprof/vmprof-python/commit/28479e39639bd5338f637a8179c9889ad25a437a
https://github.com/vmprof/vmprof-python/commit/b8a237cc8bdfa06f9e2c09cdf1740865f950d82e
Can you please check this issue ?
Thanks in advance !
Kazuhiko

https://lab.nexedi.com/nexedi/dream/-/issues/3
Are machines with variable processingTime possible?
2020-04-01, Ghost User

I'm trying to model a single machine that can be reconfigured to manufacture numerous different parts. Each configuration of the machine creates a different part at a specific rate for that part. The machine only operates in one configuration at any given time. There is change-over/setup time between the configurations, and once the machine is in the new configuration its processing time is different for the new parts it is producing. When the machine is configured to run PartA, a part might take 1 minute to process, whereas when configured to run PartB a part takes 3 minutes, etc.
Think of it like having a machine that can make 3 different donuts: [A] Old Fashioned [B] Jelly [C] French Cruller
![doughnuts](/uploads/d4f9c87f9912e2be5efa184c7f51ea06/doughnuts.jpg)
To make each different type of doughnut, the extruder mechanics on the machine are changed to match the desired doughnut, and based on the type of dough and other factors, their output rates are different.
How would you suggest handling this so that only one type of doughnut can be created at any given time, but the type of doughnut dictates the speed of the machine? FYI - we have some machines that can make 100's or 1000's of different parts depending on their configuration.
I'm not sure we'd model them all, but suffice to say they are very flexible machines :).
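As an illustration of the scheduling idea only, here is a generic SimPy sketch (not DREAM's object model; part types, times, and the job sequence are made-up placeholders): one machine, a per-part-type processing time, and a changeover delay whenever the configuration switches.
```python
import simpy

PROC_TIME = {"A": 1.0, "B": 3.0, "C": 2.0}   # minutes per doughnut, by type
CHANGEOVER = 10.0                            # minutes to swap the extruder

def machine(env, jobs):
    current = None
    for part_type in jobs:
        if part_type != current:             # only one configuration at a time
            yield env.timeout(CHANGEOVER)
            current = part_type
        yield env.timeout(PROC_TIME[part_type])
        print("%6.1f  finished one %s" % (env.now, part_type))

env = simpy.Environment()
env.process(machine(env, ["A", "A", "B", "C", "C", "C"]))
env.run()
```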

https://lab.nexedi.com/nexedi/dream/-/issues/2
How to create a multi-threaded, non-blocking machine?
2020-04-01, Ghost User

I'm trying to create a machine that's multi-threaded and can process many (say hundreds or thousands of) entities for their fixed duration without blocking and running single-threaded. Up to 500 entities can be in the machine at any given time, but each one must stay in the machine for, say, 2 hours. Creating a single queue to feed hundreds or thousands of instances of the same single-threaded machine seems like a bad way to approach it.
Example:
I'm thinking something along the lines of a continuous oven used for baking bread, etc. or anything that is a conveyor system loop. See this video to reference what I'm suggesting: https://www.youtube.com/watch?v=3UjUWfwWAC4&t=120
If the output speed is 500/hr but it takes 1 hour to traverse the continuous oven, then I don't think it's accurate to say any one item takes 1/500th of the time and use a regular machine with 500/hr. That might accurately model the output rate of the machine, but not the capacity (i.e. number of parts simultaneously held within) of the machine nor does it reflect the total time it takes an entity (loaf of bread) to pass through the system. Every entity takes the entire duration required to traverse the loop, but the machine (oven) can handle multiple items at the same time.
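A generic SimPy sketch of that oven idea (again not DREAM's API; capacity, traversal time, and feed rate are placeholders): the oven is a resource with a large capacity, every loaf occupies a slot for the full traversal time, and throughput is limited by the feed rate rather than by the residence time.
```python
import simpy

TRAVERSE = 60.0     # minutes for one loaf to pass through the oven
CAPACITY = 500      # loaves inside the oven at any one time

def loaf(env, oven, name):
    with oven.request() as slot:
        yield slot
        yield env.timeout(TRAVERSE)
        print("%6.1f  %s leaves the oven" % (env.now, name))

def feeder(env, oven, n=10, rate_per_hour=500):
    for i in range(n):
        env.process(loaf(env, oven, "loaf-%d" % i))
        yield env.timeout(60.0 / rate_per_hour)

env = simpy.Environment()
oven = simpy.Resource(env, capacity=CAPACITY)
env.process(feeder(env, oven))
env.run()
```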

https://lab.nexedi.com/nexedi/dream/-/issues/1
Overview question: What's the order of methods that are called in a core object?
2020-04-01, Ghost User

Say I want to execute some code just after a machine receives an entity, but before any internal machine processing has taken place. Similarly, how would I run some code after the machine has completed and just before hand-off to another object?
Is there a quick function rundown or diagram that easily shows what happens (and at what times) within each machine so I know where best to attach custom code to operate on entities passing through the system?
Here's a high level of how I'd see it working:
![high_level_machine](/uploads/27e2e13e4a42585eb7c02250995bca9c/high_level_machine.jpg)