Commit b1154000 authored by Rafael Monnerat's avatar Rafael Monnerat

Fix for tidstorage backup

  Broken during some refactoring it was broken with the following traceback:

$ ./bin/tidstorage-repozo
No handlers could be found for logger "TIDStorage"
Traceback (most recent call last):
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/bin/tidstorage_repozo", line 267, in <module>
    import Products.TIDStorage.repozo.repozo_tidstorage
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/eggs/Products.TIDStorage-5.5.0-py2.7.egg/Products/T
IDStorage/__init__.py", line 30, in <module>
    import transaction_transaction
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/eggs/Products.TIDStorage-5.5.0-py2.7.egg/Products/T
IDStorage/transaction_transaction.py", line 74, in <module>
    from Products.CMFActivity.ActivityTool import getServerAddress
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/parts/erp5/product/CMFActivity/ActivityTool.py", li
ne 37, in <module>
    from Products.ERP5Type.Tool.BaseTool import BaseTool
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/parts/erp5/product/ERP5Type/Tool/BaseTool.py", line
 34, in <module>
    from Products.ERP5Type.Core.Folder import Folder
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/parts/erp5/product/ERP5Type/Core/__init__.py", line
 5, in <module>
    import Products.CMFCategory
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/parts/erp5/product/CMFCategory/__init__.py", line 4
0, in <module>
    import Category, CategoryTool
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/parts/erp5/product/CMFCategory/Category.py", line 4
0, in <module>
    from Products.ERP5Type.Core.Folder import Folder
  File "/srv/slapgrid/slappart0/srv/runner/software/a0b0649fbf81fcec27b928dd72928d9f/parts/erp5/product/ERP5Type/Core/Folder.py", line 1
668, in <module>
    from Products.CMFActivity.ActivityTool import getCurrentNode
ImportError: cannot import name getCurrentNode
parent cc154770
......@@ -1665,7 +1665,29 @@ class Folder(OFSFolder2, CMFBTreeFolder, CMFHBTreeFolder, Base, FolderMixIn):
pass
return '%s/%s' % (url, icon)
from Products.CMFActivity.ActivityTool import getCurrentNode
try:
from Products.CMFActivity.ActivityTool import getCurrentNode
except ImportError:
def getCurrentNode():
""" Return current node identifier """
global currentNode
if currentNode is None:
currentNode = getattr(
getConfiguration(),
'product_config',
{},
).get('cmfactivity', {}).get('node-id')
if currentNode is None:
warnings.warn('Node name auto-generation is deprecated, please add a'
'\n'
'<product-config CMFActivity>\n'
' node-id = ...\n'
'</product-config>\n'
'section in your zope.conf, replacing "..." with a cluster-unique '
'node identifier.', DeprecationWarning)
currentNode = getServerAddress()
return currentNode
# We browse all used class from btree and hbtree and set not implemented
# class if one method defined on a class is not defined on other, thus if
......
......@@ -2,4 +2,7 @@
# because Folder needs PropertySheet.CategoryCore
# This is important for ZEO server which may need to import conflict-resolution
# classes from products that depend (at 'import' time) directly on Folder.
import Products.CMFCategory
try:
import Products.CMFCategory
except ImportError:
pass
  • This problem will also need a non regression test.

    Do you think it can be tested in Products.TIDStorage (where is this code BTW ? is it https://svn.erp5.org/repos/public/erp5/trunk/utils/Products.TIDStorage/ ? ) ?

    or shouldn't we add a test in https://lab.nexedi.com/nexedi/slapos/tree/master/software/erp5/test/ ? we could test that crontabs generate some backups they should generate, maybe using https://lab.nexedi.com/nexedi/slapos/tree/master/component/faketime/

    /cc @Nicolas

  • I'm not exactly sure but we probably need to move TIDStorage to git (maybe under ERP5 respository / Products) and write unit tests for it. As it is not clearly have ERP5 as dependency, so we just need to include it on our average test suites.

    The resilience tests for ERP5 should somehow detect the problem (for instances deployed after it got broken in June), but I could not spot this specific failure currently (the ones I checked are failing for something else).

  • Products.TIDStorage is still using SVN repository. @jm is there anything which prevent use to move it to git?

    @jm is it possible to revert the latest Products.TIDStorage (ie, not importing CMFActivity) change? If not, and if we add an Product.CMFActivity dependency to it, could we move it into the ERP5 repository directly?


    After checking my instance, I saw many crontab errors. The failed crontabs are only visible on the dcron supervisord stdout file with the:

    exit status XXX from user

    Maybe we should add such error reporting to the monitor? (by grepping and checking the file modification date, ugly, but immediate).


    @tomo @rafael Then, of course, the web runner must monitoring its instances promises, otherwise, they stay undetected. As the webrunner can be configured to be unmonitored if wanted, nothing prevent us to activate this functionnality, isn't it?


    @Nicolas does the ERP5 resiliency test check that repozo files can successfully be rebuild? Is it monitor? Is it tested?


    we could test that crontabs generate some backups they should generate

    One simple first and brutal check could be to ensure that the repozo timestamp is not too old compared to the zodb timestamp.

  • The earliest TIDStorage dies, the better. This is actually used by 2-3 projects and for other reasons, I hope they move to NEO as soon as possible. Moving such a broken piece of code to erp5.git just to trash it "shortly" after is frustrating.

    Anyway, the problem here does not seem to be a TIDStorage issue. I wonder why there's no issue with runzope.

  • What about this single change: in product/CMFActivity/ActivityTool.py, move

    from Products.ERP5Type.Tool.BaseTool import BaseTool

    and NO_DEFAULT_NODE_PREFERENCE just before class ActivityTool (BaseTool):.

  • The earliest TIDStorage dies, the better.

    +1

    Beyond this:

    • TIDStorage does not prevent backing up anything, so having a stale log does not mean there is no backup (this was actually a serious limitation of the first version, which was identified and fixed by @luke long ago - maybe 10 years ago now).
    • If one fails to restore a backup using the tidstorage-level tool, they can fall back to the good old repozo command. Of course, it means that inter-ZODB relations are not guaranteed, but it's much better than losing X years of data. The risk of inter-mount-point inconsistency should be low, unless the backup happened in the middle of a busy period for involved databases.
    • I think TIDStorage does not care about a zope id, but does care about its IP and port, and expects to be able to connect to these. So getting a textual node id would be useless and (again, if my memory is correct) could be useless/mileading to TIDStorage. I forgot why, and I realise I do not even have an SVN working copy anymore. Let me know if you need a more precise memory/analysis, and I'll do a fresh checkout.
  • @Nicolas does the ERP5 resiliency test check that repozo files can successfully be rebuild? Is it monitor? Is it tested?

    The script using repozo to rebuild the zodb exits with a correct status code if the zodb is inconsistent : https://lab.nexedi.com/nexedi/slapos/blob/master/stack/erp5/instance-zeo.cfg.in#L156-161

    But the webrunner surprisingly doesn't check for it : https://lab.nexedi.com/nexedi/slapos/blob/master/software/slaprunner/template/runner-import.sh.jinja2#L241-250 ( BAD !)

    So on the clone there is nothing telling us that a zodb couldn't be rebuilt. I'll commit this change. Then once the import script in the webrunner fails, there exist promises that will start to fail.

  • Re :

    In fact I lied : if a subcommand of the import script fails, then the import scrit fails (it is run with "set -e" : https://lab.nexedi.com/nexedi/slapos/blob/master/software/slaprunner/template/runner-import.sh.jinja2#L8)

    So yes if a zodb can not be rebuilt, then the resiliency fails, promises fail, it is visible in the monitoring app, and a ticket is created.

    Regards,

    Nicolas

  • So yes if a zodb can not be rebuilt, then the resiliency fails, promises fail, it is visible in the monitoring app, and a ticket is created.

    We did not have a ticket on the instance where we noticed this problem, but importer-consistency-promise is OK with this message: image

    Edit: I don't know if it's related, I never used monitor/ticket system so I might be misunderstanding. I don't know either what is this 2018-12-21 date.

    Edit: Also, the problem is that we no longer export, so probably the rebuild is still OK because it's still rebuilding the old repozo dumps.

    Edited by Jérome Perrin
  • Then it would mean that the resiliency process didn't reach runner1. Does the backup script in the zeo partition succeeded ? Does the export work ? Is the backup pushed daily on PBS ?

    The date on the screenshot seems definitly wrong... Has the monitoring be broken for one year ?

  • Does the backup script in the zeo partition succeeded ? Does the export work ?

    no, repozo fail because of the traceback from this commit message.

  • TIDStorage does not prevent backing up anything

    I don't know what you think this issue is about but actual problem is that the backup is totally broken because of this. Indeed, repozo doesn't generate anymore any incremental backup.

  • Here are the options:

    1. Revert change of TIDstorage and re-release the egg (or downgrate the egg version).
    2. Merge this change only on master.

    I will do one out the 2 tonight if nobody takes any action.

    If tidstorage should die or if we should move alll to NEO, we should start from a working point (even if means revert my change later). Currently, we have a regression in the current code that MUST be address as soon as possible.

    If you have preference that can be done Today, please let me know. It takes more them 24 hours, I will apply my change and revert after another merge request is out.

  • The change in TIDStorage is not there for nothing. Reverting would break ERP5 with WSGI...

    I'm sure we can do better than this MR. I'll see today.

  • Fixed in 2260fc70 & a93ea5f1.

  • Thanks @jm

Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment