storeblk() vs DB lock deadlock

Closed · Created Oct 31, 2016 by Kirill Smelkov (@kirr)

The following was seen on a project: a Zope thread was stuck for a very long time while memory usage kept growing. This is probably similar to the loadblk() vs DB lock deadlock case we already solved in f49c11a3. Details:

In slappart9/zope-0 we currently have:

.../slappart9/var/log$ grep -v '^$' longrequest_logger_zope-0.log |grep -o 'Thread .*:' |sort |uniq
Thread 140164638697216:
Thread 140164647089920:
Thread 140164991964928:
Thread 140165000357632:

-> 4 long-running threads. Here is the latest info about them:

.../slappart9/var/log$ grep -v '^$' longrequest_logger_zope-0.log |tail -4
2016-10-21 09:24:44,515 - Thread 140165000357632: Started on 1476162420.8; Running for 879463.7 secs; Same.
2016-10-21 09:24:44,515 - Thread 140164647089920: Started on 1476162461.5; Running for 879423.0 secs; Same.
2016-10-21 09:24:44,516 - Thread 140164991964928: Started on 1476162461.5; Running for 879423.0 secs; Same.
2016-10-21 09:24:44,516 - Thread 140164638697216: Started on 1476162461.7; Running for 879422.8 secs; Same.

so yes "Running for 879463.7 secs" means a thread ate 244 hour CPU time, but "Started on 1476162420.8" means it was start on:

    In [1]: import time
    
    In [2]: time.ctime(1476162420.8)
    Out[2]: 'Tue Oct 11 08:07:00 2016'

-> On Oct 11 morning.
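
Double-checking the arithmetic: 879463.7 seconds is ~244 hours, i.e. ~10 days, which indeed matches the Oct 11 -> Oct 21 gap between the thread start and the log entry above:

    In [3]: round(879463.7 / 3600, 1)    # hours
    Out[3]: 244.3

    In [4]: round(879463.7 / 86400, 1)   # days
    Out[4]: 10.2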

Now I went to slappart9/srv/backup/logrotate and tracked those threads back to their origins in longrequest_logger_zope-0.log-*.gz (verifying along the way that nothing else long-running started or stopped in between), so here is the info from when they originally got stuck:

2016-10-11 05:07:41,917 - Thread 140165000357632: Started on 1476162420.8; Running for 41.1 secs; request: GET http://:/Control_Panel/timer_service/process_timer
retry count: 0
form: {}
other: {'ACTUAL_URL': 'http://:/Control_Panel/timer_service/process_timer',
 'AUTHENTICATED_USER': <SpecialUser 'Anonymous User'>,
 'AUTHENTICATION_PATH': '',
 'AcceptLanguage': {},
 'PARENTS': [<Activity Tool at /erp5/portal_activities>, <ERP5Site at /erp5>],
 'PUBLISHED': <bound method TimerService.process_timer of <TimerService at /timer_service used for /Control_Panel>>,
 'RESPONSE': TimerResponse(''),
 'SERVER_URL': 'http://:',
 'TraversalRequestNameStack': [],
 'URL': 'http://:/Control_Panel/timer_service/process_timer',
 'interval': 1,
 'method': 'GET'}
Traceback:
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
    response=b)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
    environ, debug, request, response)
  File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
    return publish_module_standard.original(*args, **kw)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
    response = publish(request, module_name, after_list, debug=debug)
  File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
    x = zope_publish(request, module_name, after_list, debug)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 138, in publish
    request, bind=1)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/mapply.py", line 77, in mapply
    if debug is not None: return debug(object,args,context)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 48, in call_object
    result=apply(object,args) # Type s<cr> to step into published object.
  File ".../parts/erp5/product/TimerService/TimerService.py", line 102, in process_timer
    DateTime(prev_tick), DateTime(next_tick))
  File ".../parts/erp5/product/CMFActivity/ActivityTool.py", line 1040, in process_timer
    self.tic(processing_node_list.index(currentNode) + 1)
  File ".../parts/erp5/product/CMFActivity/ActivityTool.py", line 1095, in tic
    if not last.dequeueMessage(inner_self, processing_node):
  File ".../parts/erp5/product/CMFActivity/Activity/SQLBase.py", line 509, in dequeueMessage
    transaction.commit()
  File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_manager.py", line 89, in commit
    return self.get().commit()
  File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 329, in commit
    self._commitResources()
  File ".../eggs/Products.TIDStorage-5.4.9-py2.7.egg/Products/TIDStorage/transaction_transaction.py", line 261, in _commitResources
    result = original__commitResources(self, *args, **kw)
  File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 443, in _commitResources
    rm.commit(self)
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 716, in commit
    self.zfileh.dirty_writeout(WRITEOUT_STORE)
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 406, in storeblk
    def storeblk(self, blk, buf):   return self.zself.storeblk(blk, buf)
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 496, in storeblk
    blkchanged = zblk.setblkdata(buf)
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 311, in setblkdata
    if (start % CHUNKSIZE) or len(chunk.data) > CHUNKSIZE:
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 860, in setstate
    self._setstate(obj)
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 901, in _setstate
    p, serial = self._storage.load(obj._p_oid, '')
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/ClientStorage.py", line 833, in load
    data, tid = self._server.loadEx(oid)
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/ServerStub.py", line 176, in loadEx
    return self.rpc.call("loadEx", oid)
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 763, in call
    r_args = self.wait(msgid)
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 791, in wait
    self.replies_cond.wait()
  File ".../parts/python2.7/lib/python2.7/threading.py", line 340, in wait
    waiter.acquire()

( 140165000357632 seems to be an activity-executor thread triggered to run under /timer_service/process_timer; some activity involving wendelin.core has been run. Then comes the commit, which involves wendelin.core writeout. wendelin.core wants to store an object to ZEO, which in turn triggers another load from ZEO, and that load is stuck forever for some reason. )
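
The detail worth highlighting is that a store ends up doing a load: setblkdata() (file_zodb.py:311 in the traceback) reads chunk.data of already-stored chunk objects, and if such a chunk is currently a ghost, that attribute access alone makes ZODB fetch its state from the storage. A minimal standalone illustration of that behaviour, using a local MappingStorage instead of ZEO (the Chunk class below is made up for the demo, it is not wendelin.core code):

    import transaction
    import ZODB
    from ZODB.MappingStorage import MappingStorage
    from persistent import Persistent

    class Chunk(Persistent):            # made-up stand-in for a chunk object setblkdata() iterates
        def __init__(self, data):
            self.data = data

    db   = ZODB.DB(MappingStorage())
    conn = db.open()
    conn.root()['chunk'] = chunk = Chunk('x' * 100)
    transaction.commit()

    conn.cacheMinimize()                # turn cached objects back into ghosts
    print(chunk._p_status)              # -> 'ghost'
    len(chunk.data)                     # attribute access -> Connection.setstate -> storage load
    print(chunk._p_status)              # -> 'saved': the load happened implicitly

In the real setup that implicit load goes to ZEO over the network, which is where the thread above is stuck.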

2016-10-11 05:07:45,739 - Thread 140164638697216: Started on 1476162461.7; Running for 4.0 secs; [No request]
Traceback:
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
    response=b)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
    environ, debug, request, response)
  File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
    return publish_module_standard.original(*args, **kw)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
    response = publish(request, module_name, after_list, debug=debug)
  File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
    x = zope_publish(request, module_name, after_list, debug)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
    object=request.traverse(path, validated_hook=validated_hook)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
    object=object.__bobo_traverse__(request)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
    conn = db.open()
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 725, in open
    self._a()
  File ".../parts/python2.7/lib/python2.7/threading.py", line 174, in acquire
    rc = self.__block.acquire(blocking)
2016-10-11 05:07:45,537 - Thread 140164991964928: Started on 1476162461.5; Running for 4.0 secs; [No request]
Traceback:
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
    response=b)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
    environ, debug, request, response)
  File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
    return publish_module_standard.original(*args, **kw)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
    response = publish(request, module_name, after_list, debug=debug)
  File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
    x = zope_publish(request, module_name, after_list, debug)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
    object=request.traverse(path, validated_hook=validated_hook)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
    object=object.__bobo_traverse__(request)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
    conn = db.open()
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 725, in open
    self._a()
  File ".../parts/python2.7/lib/python2.7/threading.py", line 174, in acquire
    rc = self.__block.acquire(blocking)

( The above are just 2 standard Zope threads - they want to open the database first, but are stuck because the lock is already taken. )
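
For reference on which lock that is: in ZODB3 3.10, DB.open() is serialized on a per-DB lock, and self._a / self._r are the bound acquire / release of that lock. Roughly (paraphrased from that version's ZODB/DB.py with irrelevant parts elided - please double-check against the egg):

    # ZODB3-3.10 ZODB/DB.py, paraphrased:
    class DB:
        def __init__(self, storage, ...):
            x = threading.RLock()
            self._a = x.acquire
            self._r = x.release
            ...

        def open(self, transaction_manager=None, ...):
            self._a()                               # <- DB.py:725, where the 2 threads above wait
            try:
                ...
                result.open(transaction_manager)    # <- DB.py:751, where the next thread below is
                return result
            finally:
                self._r()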

2016-10-11 05:07:45,474 - Thread 140164647089920: Started on 1476162461.5; Running for 4.0 secs; [No request]
Traceback:
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
    response=b)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
    environ, debug, request, response)
  File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
    return publish_module_standard.original(*args, **kw)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
    response = publish(request, module_name, after_list, debug=debug)
  File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
    x = zope_publish(request, module_name, after_list, debug)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
    object=request.traverse(path, validated_hook=validated_hook)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
    object=object.__bobo_traverse__(request)
  File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
    conn = db.open()
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 751, in open
    result.open(transaction_manager)
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 534, in Connection_open
    orig_Connection_open(self, transaction_manager, delegate)
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 1046, in open
    self._flush_invalidations()
  File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 539, in _flush_invalidations
    self._cache.invalidate(invalidated)
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 152, in _p_invalidate
    self._v_zfile.invalidateblk(self._v_blk)
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 505, in invalidateblk
    fileh.invalidate_page(blk)  # XXX assumes blksize == pagesize
  File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 635, in invalidate_page
    return self.zfileh.invalidate_page(pgoffset)

( This seems to be another worker thread, in between transactions, which received invalidations from the DB; that also triggers wendelin.core invalidations, which then want to acquire the virtmem lock and wait forever, because that lock is already held by the writeout code. )

All this looks similar to the deadlock between loadblk() and the ZEO lock that we already solved (f49c11a3). Here we probably have some kind of storeblk() vs ZEO lock deadlock (I checked: storeblk() is still being run from under the virtmem lock), though it is not clear off-hand why the memory usage is growing.
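
To summarize the observed state of the four threads in one runnable sketch (all names below are made-up stand-ins, not the real locks; in particular, why the ZEO loadEx reply never arrives is exactly the open question, so it is modelled here as an event that is simply never set, with a timeout only so the script terminates):

    import threading, time

    virtmem_lock = threading.Lock()    # stands in for wendelin.core's virtmem lock (C level)
    dbpool_lock  = threading.Lock()    # stands in for the lock taken by ZODB DB.open()
    zeo_reply    = threading.Event()   # stands in for the ZEO loadEx reply - never delivered

    def T_writeout():                  # thread 140165000357632
        with virtmem_lock:             # dirty_writeout(WRITEOUT_STORE): storeblk() under virtmem lock
            zeo_reply.wait(5)          # setblkdata -> ZODB load -> ZEO loadEx: stuck

    def T_open_invalidate():           # thread 140164647089920
        with dbpool_lock:              # DB.open()
            with virtmem_lock:         # _flush_invalidations -> invalidateblk(): stuck behind writeout
                pass

    def T_open():                      # threads 140164638697216 and 140164991964928
        with dbpool_lock:              # DB.open(): stuck behind T_open_invalidate
            pass

    threads = [threading.Thread(target=f)
               for f in (T_writeout, T_open_invalidate, T_open, T_open)]
    for t in threads:
        t.start()
        time.sleep(0.2)                # stagger so the locks are taken in the order seen in the logs

    time.sleep(1)
    print([t.is_alive() for t in threads])   # -> [True, True, True, True]: nobody makes progress

Whatever keeps the ZEO reply from arriving is the missing edge; everything else in the pile-up follows from storeblk() being run from under the virtmem lock.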

/cc @Tyagov, @klaus
