Commit 9b0b7a8b authored by Julien Muchembled

erp5: disable broken promises

parent 58eec840
@@ -23,7 +23,7 @@ md5sum = cfe6ab8ae54a521ecb269e9d9762cbeb
 [template-mariadb]
 filename = instance-mariadb.cfg.in
-md5sum = 5c11abdef0c84cdef91f999490786244
+md5sum = 99450bb6426770bd0c27f9dbec9ca542
 [template-kumofs]
 filename = instance-kumofs.cfg.in
@@ -87,7 +87,7 @@ md5sum = 27d26c6380883cf3bd7b2f003f7888d8
 [template-balancer]
 filename = instance-balancer.cfg.in
-md5sum = a93c7fae7402db91142cc73e5af55c6c
+md5sum = c80648d5918ae865bc02cb8fccfc56e3
 [template-haproxy-cfg]
 filename = haproxy.cfg.in

@@ -299,6 +299,7 @@ promise-threshold = {{ slapparameter_dict['apachedex-promise-threshold'] }}
 [{{ section('monitor-promise-apachedex-result') }}]
 recipe = slapos.cookbook:wrapper
+recipe =
 wrapper-path = ${directory:promise}/check-apachedex-result
 command-line = "{{ parameter_dict['promise-check-apachedex-result'] }}" --apachedex_file "${directory:apachedex}" --status_file ${monitor-directory:private}/apachedex.report.json --threshold "${apachedex-parameters:promise-threshold}"

@@ -296,6 +296,7 @@ slowest_queries_threshold = {{ slapparameter_dict['slowest-query-threshold'] }}
 [{{ section('monitor-promise-slowquery-result') }}]
 recipe = slapos.cookbook:wrapper
+recipe =
 wrapper-path = ${directory:promise}/check-slow-query-pt-digest-result
 command-line = "{{ parameter_dict['promise-check-slow-queries-digest-result'] }}" --ptdigest_file "${monitor-directory:private}/slowquery_digest.txt" --status_file ${monitor-directory:private}/mariadb_slow_query.report.json --max_queries_threshold "${slow-query-digest-parameters:max_queries_threshold}" --slowest_query_threshold "${slow-query-digest-parameters:slowest_queries_threshold}"
  • @nexedi Sorry, but slapos gets mad whenever there's a bug in a promise, so it's impossible to maintain an instance. I could not quickly find what's wrong, hence this commit. (Yeah, I'm in quite a bad mood over the lost time and the noise on the machine distorting benchmark results. Maybe not the only issue, if services restarted unexpectedly.)

  • @jm, I didn't know that removing the recipe would work. If we took this attitude every time we spend time debugging something that was not introduced by the person doing the debugging, I would remove/disable a lot of code. Maybe @hjuneja can help you debug, as he was the one who introduced these promises.

    @seb it would be helpful if the erp5testnodes failed when the instance fails to run some promise (this would ensure ERP5 works as it should). @hjuneja could take this task along with the ones he already has.

  • @jm: Could you elaborate on how slapos causes issues when promises fail, so I (and likely others) can compare with our experience and keep an eye out for the same symptoms?

  • @rafael. I'm fine with the idea of making promise failures visible in test results. We just need someone to do it, with some attention to the current status of what is running, to avoid making all tests fail because of something we were not watching before.

  • Whenever a promise fails while processing a partition, node instance does not consider the partition fully instantiated, so it is processed again at each invocation. At the very least, this creates significant load. A failing promise in the root partition can be worse with a complex instantiation like NEO: it happened to me 3 weeks ago and the NEO parts were uninstalled/reinstalled all the time, plus there was even an unexpected restart of mariadb.
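
    A minimal, self-contained sketch of the behaviour described above (assumed semantics, not the actual slapos code; all names are illustrative):

        import subprocess

        def run_promises(promise_scripts):
            # Return True only if every promise script exits with 0.
            return all(subprocess.call([s]) == 0 for s in promise_scripts)

        def process_partition(partition):
            # (Re)run buildout for the partition, then all its promises.
            subprocess.check_call(partition['buildout_cmd'])
            # Only a fully green promise run marks the partition as done.
            partition['done'] = run_promises(partition['promises'])

        def node_instance(partitions):
            # A partition with a failing promise is never 'done', so it is
            # reprocessed on every invocation: hence the constant load.
            for partition in partitions:
                if not partition.get('done'):
                    process_partition(partition)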

  • For me, there are things that should not be checked during node instance, like the presence, freshness, or integrity of data produced by services/crons. If you set up a new instance, or resume one after a prolonged stop, you're likely to get failures for a while. You could special-case such situations, but that would make monitoring less useful.

    For the promises disabled here, Hardik can't fix them without knowing what we want them to do; in particular, they should be monitor promises, as before.

  • @seb, I think we should implement this as optional first, or enable it on a specific test node, fix the problems, then roll it out to the rest. Otherwise (if we don't ensure promises work), we are just hiding bugs and allowing slapos instantiation to eventually break.

    I'm not sure why there is so much resistance to (or lack of interest in) adding/reviewing/improving promises (or, if you prefer, "consistency checks" in ERP5 language).

    @jm, @jp asked to remove "monitor promises", specifically these ones. He asked to have a single type of promise (he refused to discuss this topic any further, 😄: only one type of promise, and if it fails, the instance is reprocessed).

       If you set up a new instance, or resume one after a prolonged stop, you're likely to get failures for a while.

    If the promise is poorly implemented, sure... otherwise, it should handle this case.

  •  A failing promise in the root partition can be worse with a complex instantiation like NEO: it happened to me 3 weeks ago and the NEO parts were uninstalled/reinstalled all the time, plus there was even an unexpected restart of mariadb.

    I already explained this to you, but I'll do it again:

    • You added local changes somewhere (which is forbidden in theory)
    • Tristan implemented a promise specifically to detect local changes and fail; such a failure would indicate that the instance is running some old code and was not restarted.
    • The way the root partition requests NEO is wrong from the SlapOS point of view (it works, but it has consequences), as it makes 2 requests, editing the parameters in both cases. As the parameters change between the 2 requests, the NEO instance WILL run buildout... this is the design defined 4 years ago.
    • Nobody looked at the promises, tickets or monitor UI for 3 weeks (which is bad)

    So, given the way things were written... I think everything behaved as expected.

    Edited by Rafael Monnerat
  • On a similar subject, monitor-frontend-promise has been failing for some time (a few months) inside WebRunners on the master branch. Hopefully no resilient webrunners have been deployed in the meantime, or they would also be failing due to instantiation failing on the clones.

    • Tristan implemented a promise specifically to detect local changes and fail; such a failure would indicate that the instance is running some old code and was not restarted.

    I don't deny that, for this promise, the mistake was mine: I didn't know about the promise and wrongly thought its failure was a bug. My point was only to give an example of the consequences when promises fail, and promises can be buggy.

    • You added local changes somewhere (which is forbidden in theory)

    I'd have things to say, but that would be off-topic, like probably the "unexpected restart of mariadb" mentioned above (I guess a daily full 'node instance' would have led to the same result).

    • Nobody looked at the promises, tickets or monitor UI for 3 weeks (which is bad)

    I don't know what you're referring to.

    The way the root partition requests NEO is wrong from the SlapOS point of view (it works, but it has consequences), as it makes 2 requests, editing the parameters in both cases. As the parameters change between the 2 requests, the NEO instance WILL run buildout... this is the design defined 4 years ago.

    Declaring it wrong won't make the need disappear magically. I have always wanted something simpler but I found no solution.

    For NEO, a kind of DNS service inside the root partition may be a working replacement.

  • If you set up a new instance, or resume one after a prolonged stop, you're likely to get failures for a while.

    If the promise is poorly implemented, sure... otherwise, it should handle this case.

    Can you provide full specs of promise-apachedex-result before discussing further?

  • @jm indeed, I think the promise is badly implemented, so I take the blame for this one. While I am at it, I will enable "instance-fail-if-promise-fail" on the erp5testnodes as well, so it can be tested properly.

    And as we discussed this morning, I will try to loosen the promise strictness a bit (wherever possible) while thinking about and discussing more edge cases we can cover to avoid such situations.

  • mentioned in commit fa09269a

  • Can you provide full specs of promise-apachedex-result before discussing further?

    It would be better for me to write a general document (or guideline) on how to write promises. I will do so as soon as I can.

    In short, the promise should (from what I remember):

    • Assert that the latest apachedex result is better than the threshold (I forget whether that means above or below the value)
    • Promise must not fail if the instance was recently deployed (less than 36 hours ago), considering t0 + 24 hours + the expected execution time to generate the report
    • Promise must fail if the report was not generated within the last 36 hours, also considering t0 + 24 hours + the expected execution time to generate the report
    • Promise should NOT generate the report itself, as a promise MUST have a short execution time (0-20 seconds at worst)
    • Promise can indeed assert other aspects, to tell, for example, why the last report failed
    • Promise can optionally tell the component responsible for generating the report (by flagging or editing a file) that a new report is needed (this is possible but probably neither requested nor implemented)

    If the promise is failing, you can:

    • Change the threshold to an absurd value that will never fail (bad, but why not...)
    • Fix (with a patch, I hope) the condition that prevents the report from being produced.

    Slow queries follow the same principle; see the sketch below for the timing rule.
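
    A minimal sketch of the timing rule above, assuming a hypothetical stamp file whose mtime records the deployment time t0 (the 36-hour window comes from the bullets; paths and names are illustrative):

        import os
        import sys
        import time

        WINDOW = 36 * 3600  # t0 + 24h + expected generation time, rounded up

        def check_report_fresh(report_path, t0_stamp_path):
            # Return the promise exit status: 0 = OK, 1 = failure.
            now = time.time()
            t0 = os.path.getmtime(t0_stamp_path)  # deployment time
            if not os.path.exists(report_path):
                # Recently deployed: the first daily report may not exist yet.
                return 0 if now - t0 < WINDOW else 1
            # Otherwise the report must have been refreshed within the window.
            return 0 if now - os.path.getmtime(report_path) < WINDOW else 1

        if __name__ == '__main__':
            sys.exit(check_report_fresh(sys.argv[1], sys.argv[2]))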

  •  Promise must not fail if the instance was recently deployed (less than 36 hours ago), considering t0 + 24 hours + the expected execution time to generate the report

    How do you do that? In particular, you seem to forget the case of an upgrade of an SR. I mean:

    • a partition is already deployed for a version that does not generate reports
    • a new version of the SR is available, bringing new reports among other changes
    • what date do you take when you instantiate the already existing partition for the new version?

       must not fail if the instance was recently deployed (less than 36 hours ago), considering t0 + 24 hours + the expected execution time to generate the report

    For the moment, Hardik has code that checks the mtime of the folder containing the reports when this folder is empty. That's not ideal, because it does not detect bugs that would modify this mtime without actually leaving a report. (BTW, there's ongoing work to add file creation time to Linux.)

    Apart from that, what about the case of a promise that is executed while the report is being written to disk, in such a way that this report is considered corrupted? Synchronizing crons (report generation and promise) is not enough, because the promise may be executed at any time by buildout.

    Edited by Julien Muchembled
  • Good questions....

    How do you do that?

    My initial suggestion... the same as logrotate? The first time the promise runs, it checks for var/run/promise_couscous_date; if the file is missing or empty, it writes the creation date, else it checks whether the date is newer than the limit... Incompatible upgrades that require a new instantiation change the filename... (see the sketch below)

    There might be other clever ways...
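
    A minimal sketch of that stamp-file idea (the var/run/promise_couscous_date path comes from the comment above; everything else is illustrative):

        import os
        import sys
        import time

        LIMIT = 36 * 3600  # grace period counted from the first promise run
        STAMP = 'var/run/promise_couscous_date'

        def first_run_time(stamp_path):
            # Record the first run time in the stamp file, then return it.
            if not os.path.exists(stamp_path) or not os.path.getsize(stamp_path):
                with open(stamp_path, 'w') as f:
                    f.write(str(time.time()))
            with open(stamp_path) as f:
                return float(f.read())

        if __name__ == '__main__':
            # An incompatible upgrade would use a new filename, restarting
            # the grace period, as suggested above.
            if time.time() - first_run_time(STAMP) < LIMIT:
                sys.exit(0)  # still in the grace period: do not fail yet
            # ...past the grace period, run the real check here...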

    what about the case of a promise that is executed while the report is being written to disk, in such a way that this report is considered corrupted?

    If the time to generate the file is short, it will fail once, then work afterwards (I don't like random failures, so I would prefer to avoid them, but sometimes there is no way)... To prevent race conditions, you can use a lock, but you would time out after 20 to 30 seconds... (probably enough to write the file to disk). See the lock sketch below.
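
    A minimal sketch of such a lock with a deadline, assuming the report generator and the promise both take the same lock around their I/O (all names are illustrative):

        import errno
        import fcntl
        import time

        def acquire_lock(lock_file, timeout=20):
            # Poll for an exclusive lock until the deadline expires.
            deadline = time.time() + timeout
            while True:
                try:
                    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    return True
                except IOError as e:
                    if e.errno not in (errno.EACCES, errno.EAGAIN):
                        raise
                    if time.time() >= deadline:
                        return False  # give up: the promise fails this run
                    time.sleep(0.5)

        with open('report.lock', 'w') as lock_file:
            if acquire_lock(lock_file):
                try:
                    pass  # read (or write) the report safely here
                finally:
                    fcntl.flock(lock_file, fcntl.LOCK_UN)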

    Edited by Rafael Monnerat
  • what about the case of a promise that is executed while the report is being written to disk, in such a way that this report is considered corrupted?

    And similarly, I'll let you contemplate this beautiful error while running node instance:

    Updating promise-zope-1.
    While:
      Updating promise-zope-1.
    
    An internal error occurred due to a bug in either zc.buildout or in a
    recipe being used:
    Traceback (most recent call last):
      File ".../eggs/zc.buildout-2.5.2+slapos010-py2.7.egg/zc/buildout/buildout.py", line 2224, in main
        getattr(buildout, command)(args)
      File ".../eggs/zc.buildout-2.5.2+slapos010-py2.7.egg/zc/buildout/buildout.py", line 638, in install
        self._install_parts(install_args)
      File ".../eggs/zc.buildout-2.5.2+slapos010-py2.7.egg/zc/buildout/buildout.py", line 788, in _install_parts
        updated_files = self[part]._call(update)
      File ".../eggs/zc.buildout-2.5.2+slapos010-py2.7.egg/zc/buildout/buildout.py", line 1588, in _call
        return f()
      File ".../eggs/slapos.cookbook-1.0.53-py2.7.egg/slapos/recipe/librecipe/generic.py", line 68, in update
        return self.install()
      File ".../eggs/slapos.cookbook-1.0.53-py2.7.egg/slapos/recipe/check_port_listening.py", line 57, in install
        'port': self.options['port'],
      File ".../eggs/slapos.cookbook-1.0.53-py2.7.egg/slapos/recipe/librecipe/generic.py", line 96, in createExecutable
        return self.createFile(name, content, mode)
      File ".../eggs/slapos.cookbook-1.0.53-py2.7.egg/slapos/recipe/librecipe/generic.py", line 90, in createFile
        with open(name, 'w') as fileobject:
    IOError: [Errno 26] Text file busy: '/srv/slapgrid/slappart6/etc/promise/zope-1'
    While:
      Updating switch-softwaretype.
  • Never seen it before (it seems unrelated to the recently added promises); check_port_listening seems to be used everywhere and the code is quite old...

    Edited by Rafael Monnerat
  • I was commenting on race conditions, which seem to have often been ignored during slapos development. I explain this ETXTBSY error as follows:

    • the monitor is starting etc/promise/zope-1 and, at the very beginning of its execution, is somewhat busy with it, maybe about to read its beginning to see what kind of executable it is (shebang?)
    • at the same time, buildout wants to truncate it

    BTW, catching ETXTBSY would be wrong. Modifying a file in place is always subject to race conditions (e.g. while the file is being read by the interpreter), so the only solution is to write a new file next to the target and rename it at the end.
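
    A minimal sketch of that write-then-rename pattern (not the slapos.cookbook code; names are illustrative):

        import os

        def create_file_atomically(path, content, mode=0o700):
            # Write a sibling temp file, then atomically replace the target.
            # rename() is atomic on POSIX: a process executing the old file
            # keeps its old inode, nothing is rewritten in place, and no
            # ETXTBSY can be raised by the update.
            tmp = path + '.tmp'
            with open(tmp, 'w') as f:
                f.write(content)
            os.chmod(tmp, mode)
            os.rename(tmp, path)  # readers see either the old or new file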
