More python3 compatibility
@jerome, I was trying to make zodbtools work with Python3, and along that road picked up some bits of your work from !12. At present the migration to Python3 is not complete, and even though I now have an answer for how to handle strings in both python2/3 in a compatible and reasonable way (I can share details if you are interested), I have to put that work on hold for some time and use https://pypi.org/project/pep3134 directly in wcfs tests, since getting all the string details right, even after figuring out how to do it, will take time. Anyway, the bits presented here should be ready for master and could be merged now. Could you please have a look?
The changes were merged into master. The source branch has been removed.
Thanks a lot for that @kirr , I also believe all this is ready to go on master.
I noticed that there are a few type annotations in the patch; no problem for me, but I thought you preferred to leave them out.
The way of handling strings you're talking about is surely interesting. I can't imagine what the relation with https://pypi.org/project/pep3134 is (if it's related at all); in any case, if you can explain the idea a bit, I would appreciate it.
@jerome, thanks for the feedback. I will clarify my rationale: in wcfs tests I was frequently hitting situations where an exception raised by an assert was hidden by another exception raised in a further generic teardown check. For example, wcfs tests check that wcfs is unmounted after every test run: https://lab.nexedi.com/kirr/wendelin.core/blob/49e73a6d/wcfs/wcfs_test.py#L70, and if that check fails it was hiding problems raised by an assert. As a result I was constantly guessing and adding code like https://lab.nexedi.com/kirr/wendelin.core/blob/49e73a6d/wcfs/wcfs_test.py#L853-857 to find what was actually breaking. This is fixed in Python3, where an exception raised by a second raise keeps information about the first exception and the first traceback (see https://www.python.org/dev/peps/pep-3134/ for details). And so, as zodbtools is used in wcfs tests, this fact was my motivation to port zodbtools to Python3. For Python2 exception chaining can be imitated by https://pypi.org/project/pep3134, which is why I mentioned it.
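To make the masking problem concrete, here is a small self-contained sketch (my illustration, not code from the patch): on Python 3 the assertion failure survives as `__context__` of the teardown error, while on Python 2 it would simply be lost.

```python
def run_test():
    # an assert fails, then a generic teardown step fails too
    try:
        assert False, "the real problem"
    finally:
        raise RuntimeError("teardown failed")   # masks the assert on Python 2

def masked_exception():
    try:
        run_test()
    except RuntimeError as e:
        # Python 3 keeps the masked exception in __context__ (PEP 3134);
        # on Python 2 only "teardown failed" would remain visible
        return e.__context__

print(type(masked_exception()).__name__)  # -> AssertionError
```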
About strings. My work on porting zodbtools and everything else to Python3 is on hold now because it will take time, and there is no time for it now. However, I believe I can come up with a 2/3 strings solution which is both reasonable and backward compatible (I was already seeing a mess of .encode() / .decode() everywhere on wip-py3 ERP5 branches, in slapos.core and elsewhere, and still it was failing for people from time to time). In short:
- Pygolang will add its own `golang.str` and `golang.unicode` types;
- `golang.str` will be based on `unicode` and will know how to automatically convert to both `golang.unicode` and `bytes`;
- `golang.unicode` will know how to automatically convert to `bytes`;
- Operations in between those types and in between std `unicode` will coerce to `golang.X` types;
- When run under gpython, `golang.str` will be used as `builtin.str`, similarly for `unicode`;
- By using AST transformation we can change string literals to produce the `golang.str` type instead of builtin `unicode`. The AST transformation can be hooked into Python very deeply (i.e. not only as an import hook), thus covering all code compilation paths (not only import, but e.g. `compile` and `eval` called from C level too).

I have drafts for everything mentioned above (see e.g. the merged commit list).
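As a rough illustration of the AST-transformation point, the sketch below rewrites string literals into calls to a custom string type. `gstr` and `StrLiteralRewriter` are hypothetical names standing in for `golang.str` and pygolang's real transformer, which may differ; this only shows the mechanism (Python 3 syntax for brevity).

```python
import ast

class gstr(str):
    """Toy stand-in for golang.str (assumption, not pygolang's code)."""

class StrLiteralRewriter(ast.NodeTransformer):
    # replace every str literal 'x' with a call gstr('x')
    def visit_Constant(self, node):
        if isinstance(node.value, str):
            call = ast.Call(func=ast.Name(id='gstr', ctx=ast.Load()),
                            args=[ast.Constant(node.value)], keywords=[])
            return ast.copy_location(call, node)
        return node

def compile_with_gstr(src):
    tree = StrLiteralRewriter().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    return compile(tree, '<src>', 'exec')

ns = {'gstr': gstr}
exec(compile_with_gstr("s = 'hello'"), ns)
print(type(ns['s']).__name__)  # -> gstr
```

Hooking such a transformer into all compilation paths (import, `compile`, `eval` triggered from C) is the deeper part referred to above; it is not shown here.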
About type annotations: I thought that it is better to have them at least in some form, because they help humans understand the code. We are not pinning ourselves to a particular type annotation form, as that could be changed later.
Hope strings work could be useful for Nexedi in the long run...
Thanks, I think I understand about the exceptions. I first wanted to point out that with unittest if we have:
```python
import unittest

class X(unittest.TestCase):
    def test_errors(self):
        raise ValueError("in the test")

    def tearDown(self):
        raise ValueError("in teardown")
```
It will count as two errors:
```
$ python -m unittest test_repro.py
EE
======================================================================
ERROR: test_errors (test_repro.X)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/test_repro.py", line 5, in test_errors
    raise ValueError("in the test")
ValueError: in the test

======================================================================
ERROR: test_errors (test_repro.X)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/test_repro.py", line 7, in tearDown
    raise ValueError("in teardown")
ValueError: in teardown

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=2)
```
A similar pattern in pytest is also reported as two errors, with both tracebacks.
```python
import pytest

@pytest.fixture
def teardown():
    yield
    raise ValueError("in teardown")

def test_x(teardown):
    raise ValueError("in test")
```
```
$ pytest test_repro_pytest.py
================================================================ test session starts =================================================================
platform linux -- Python 3.6.6, pytest-4.2.0, py-1.7.0, pluggy-0.8.1
rootdir: /srv/slapgrid/slappart8/srv/runner/project/zodbtools, inifile:
collected 1 item

test_repro_pytest.py FE                                                                                                                       [100%]

======================================================================= ERRORS =======================================================================
____________________________________________________________ ERROR at teardown of test_x ____________________________________________________________

    @pytest.fixture
    def teardown():
        yield
>       raise ValueError("in teardown")
E       ValueError: in teardown

test_repro_pytest.py:6: ValueError
====================================================================== FAILURES ======================================================================
_______________________________________________________________________ test_x _______________________________________________________________________

teardown = None

    def test_x(teardown):
>       raise ValueError("in test")
E       ValueError: in test

test_repro_pytest.py:9: ValueError
========================================================= 1 failed, 1 error in 0.06 seconds ==========================================================
```
but this is about `golang.defer`. I tried:
```python
import golang

def cleanup():
    raise ValueError("in cleanup")

@golang.func
def f():
    golang.defer(cleanup)
    raise ValueError("in test")

f()
```
and it's true that even though on python3 we get the full tracebacks for both exceptions, on python2 there's only the traceback for the last error, the one from cleanup.
I tried a bit https://pypi.org/project/pep3134/ (and also http://python-future.org/compatible_idioms.html#raising-exceptions), but on python2 I don't think it's possible to get the "During handling of the above exception, another exception occurred: ..." output, except by raising an exception which includes this message in its `__str__`, something like (this code is just the result of quick experimentation):
```python
import golang
import sys, inspect, traceback, textwrap
from six import reraise as raise_

# xdefer is like defer, but makes sure the exception raised before the
# deferred function is called is not lost.
#
# If the deferred function raises itself - it reports the previous
# exception as part of the raised error.
#
# XXX xdefer is a workaround for Python2 not having exception chaining
# (PEP 3134), without which, if e.g. tDB.close() raises an exception, it
# prevents to see whether and which assert in the test failed.
#
# XXX merge into defer?
def xdefer(f):
    # hack - imitate as if defer called from under xdefer was called
    # directly by the caller function
    fgo = inspect.currentframe().f_back.f_back
    __goframe__ = fgo.f_locals['__goframe__']
    _xdefer(f)

def _xdefer(f):
    def _():
        # call f, but keep information about the previous exception if f raises
        exc_type, exc_value, exc_traceback = sys.exc_info()
        try:
            f()
        except Exception as e:
            err = e
            if exc_type:
                tb = sys.exc_info()[2]   # just the traceback, not the 3-tuple
                class DeferError(exc_type):
                    def __str__(self):
                        return textwrap.dedent("""
                            This error occurred in defer:
                            %s
                            Also, this error occurred while executing function:
                            %s
                            """) % (
                                "".join(traceback.format_exception_only(err.__class__, err)),
                                "".join(traceback.format_exception(exc_type, exc_value, exc_traceback)),
                            )
                raise_(DeferError, e, tb)
        else:
            # if defer did not raise but func raised,
            # re-raise the exception raised in func
            raise
    golang.defer(_)


def cleanup():
    another_cleanup()

def another_cleanup():
    raise ValueError("in cleanup")

@golang.func
def main_func():
    xdefer(cleanup)
    another_func()

def another_func():
    raise ValueError("in another_func")

main_func()
```
Then it would fail with a "not too bad" traceback:
```
$ ./.envpygolang_py2/bin/gpython test_repro_pygolang.py
Traceback (most recent call last):
  File "./.envpygolang_py2/bin/gpython", line 10, in <module>
    sys.exit(main())
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/gpython/__init__.py", line 189, in main
    pymain(sys.argv[1:])
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/gpython/__init__.py", line 108, in pymain
    _execfile(filepath, g)
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/gpython/__init__.py", line 118, in _execfile
    six.exec_(code, globals, locals)
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/six.py", line 709, in exec_
    exec("""exec _code_ in _globs_, _locs_""")
  File "<string>", line 1, in <module>
  File "test_repro_pygolang.py", line 68, in <module>
    main_func()
  File "</srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/decorator.pyc:decorator-gen-1>", line 2, in main_func
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/golang/__init__.py", line 111, in _
    return f(*argv, **kw)
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/golang/__init__.py", line 141, in __exit__
    d()
  File "test_repro_pygolang.py", line 48, in _
    raise_(DeferError, e, tb)
  File "test_repro_pygolang.py", line 28, in _
    f()
  File "test_repro_pygolang.py", line 56, in cleanup
    another_cleanup()
  File "test_repro_pygolang.py", line 58, in another_cleanup
    raise ValueError("in cleanup")
__main__.DeferError:
This error occurred in defer:
ValueError: in cleanup

Also, this error occurred while executing function:
Traceback (most recent call last):
  File "/srv/slapgrid/slappart8/srv/runner/project/zodbtools/.envpygolang_py2/lib/python2.7/site-packages/golang/__init__.py", line 111, in _
    return f(*argv, **kw)
  File "test_repro_pygolang.py", line 63, in main_func
    another_func()
  File "test_repro_pygolang.py", line 66, in another_func
    raise ValueError("in another_func")
ValueError: in another_func
```
So it could be `defer`'s responsibility to produce a helpful traceback?
About the strings: if I understand, the idea is, at evaluation ("compile") time, to replace string literals by a kind of "magic" object that behaves in a way compatible with python2. It seems optimistic, but maybe it can work :). I would not be surprised if we ran into crazy things.
My feeling on all this is that if an application is developed from the beginning by separating bytes and text, then the python3 approach is better, because we have to think "is this bytes or text?" while writing the code, and then there are no surprises later, whereas in python2 it just works until some input is non-ASCII. On the other hand, when migrating old code from python2 to python3, it is extremely hard to figure out whether something was bytes or text (and sometimes it's kind of "both"). So for new code (and that's just my opinion, I don't know Nexedi's position on that), if we can write it with separate bytes/str it seems better; the problem is that we would have to change all the old code.
About this idea, did you have in mind a transition that we could control on a per-file basis (a bit like `__future__` imports)? In a way that we could have a "compatibility mode" for all legacy code and a "normal python3 mode" for new (or already updated) code.
Some random notes on this topic:
Some migration hacks are based on a `coding:` declaration (see https://github.com/asottile/future-fstrings), but this looks crazy; also, you already explained why it's better to target a lower level.
You probably know that zope uses something a bit similar when executing code from in-ZODB python scripts (https://github.com/zopefoundation/RestrictedPython), which basically rewrites implicit getattr into calls to `guarded_getattr`, which checks if the current user is allowed to access the attribute (which has been `declareProtected` before).
@jerome, thanks for the feedback. Your analysis about errors/exceptions is right, with one amendment: defer, at present, is just a kind of syntactic sugar for try/finally. A test could thus be doing something like:
```python
try:
    assert1
    assert2
    ...
finally:
    cleanup()   # this can raise itself
```
and then there will be the same problem as with the current defer - on py2 the exception raised by cleanup will override the exception raised by an assert. It is true that, as you write, since we control defer we can amend the raised exception. Go123 already has some code on this which could likely be reused one way or another.
Maybe another way to try is: in defer, inject `__cause__` and similar attributes into the raised exception, and patch the code that prints exceptions in the stdlib to take that into account. This way has the advantage that it does not change the type of the raised exception, and thus should not break code which catches exceptions by type. We are already doing something similar for …
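A minimal sketch of this idea (my illustration; an eventual defer implementation may differ, and the stdlib printing patch is not shown): run a cleanup function and, if it raises while another exception is in flight, attach the in-flight exception via a `__context__`-style attribute so a printer can walk the chain.

```python
import sys

def call_with_context(cleanup):
    # Run cleanup(); if it raises while another exception is already in
    # flight, attach that exception (PEP 3134 style) instead of losing it.
    # Python 3 does this implicitly; on Python 2 the explicit attribute
    # assignment would carry the information.
    prev_type, prev_value, prev_tb = sys.exc_info()
    try:
        cleanup()
    except Exception as e:
        if prev_value is not None:
            e.__context__ = prev_value   # emulate implicit chaining
        raise

def exception_chain(exc):
    # walk the __context__ chain the way Python 3's traceback printer does
    chain = []
    while exc is not None:
        chain.append(type(exc).__name__)
        exc = getattr(exc, '__context__', None)
    return chain

def demo():
    def failing_cleanup():
        raise RuntimeError("in cleanup")
    try:
        try:
            raise ValueError("in test")
        except ValueError:
            call_with_context(failing_cleanup)
    except RuntimeError as e:
        return exception_chain(e)

print(demo())  # -> ['RuntimeError', 'ValueError']
```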
In general I agree we should consider amending `defer` itself to integrate the cause-adding functionality on python2. This way defer, in addition to being syntactically nicer, will also gain a slight semantic advantage over try/finally.
About strings: I'm trying to adopt UTF-8 based unicode byte-strings for 2 reasons:
- practical backward compatibility with python2,
- easier to work with compared to split bytes/unicode.
You seem to agree with "1" but not with "2". I will try to explain (please forgive me if I'm not good at explaining, since I'm short on time; we can return to this topic later): UTF-8 was specifically introduced as the way to handle unicode strings in a manner backward compatible with how 1-byte ASCII strings were handled before. UTF-8 is thus not just another encoding among many, but plays the central role in string processing in the era of unicode. In a sense it is the next ASCII. Many string algorithms don't require random access - they usually need to iterate over the string, compare a prefix or suffix, find a substring or similar. For such cases keeping the string as UTF-8, even though characters have a variable-size encoding, is enough. There are of course cases where one needs random access to unicode characters, and for those it is required to represent the string as UCS (an array of code points). But once again those cases are rare and don't dominate string usage.
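For illustration, the following shows prefix/suffix checks and substring search operating directly on UTF-8 encoded bytes with no decoding; because UTF-8 is self-synchronizing, a byte-level match can never start in the middle of a multi-byte character.

```python
# -*- coding: utf-8 -*-
s = u"привет мир".encode('utf-8')                # keep the string as UTF-8 bytes

print(s.startswith(u"привет".encode('utf-8')))   # -> True
print(s.endswith(u"мир".encode('utf-8')))        # -> True
print(u"мир".encode('utf-8') in s)               # -> True  (byte-level search)
```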
The Python3 approach, despite the fact that many string operations don't require random character access, is to represent strings as UCS. Whenever we get a byte stream we have to decode it explicitly, even though UTF-8 is hardcoded as the default encoding. Similarly, even though UTF-8 is the default, automatic conversion of UCS -> bytes does not work, and strings have to be explicitly encoded. Besides using more memory for strings than is actually necessary, the constant `.encode()` / `.decode()` creates noise and makes programs harder to read.
Another approach to string processing could be: don't use UCS for strings by default, and use UCS explicitly only for the cases that actually need unicode-level random character access. For such cases we can explicitly convert a string to unicode and work with it there. Since the cases where UCS is needed are rare, the default string representation can remain an array of bytes. What we can do to treat those arrays of bytes as unicode strings is to make their iterator return not single bytes, but the next decoded UTF-8 character. Thus algorithms that process strings will be unicode - not ASCII - aware, and with an extended version of the iterator we can still support non-UTF-8 encoded strings: if the next byte(s) are invalid according to UTF-8, the extended iterator returns "Invalid Unicode Character", and the caller knows it can consume the next byte and process it as it sees fit.
Please see https://blog.golang.org/strings for overview of such approach.
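The character-by-character iterator described above can be sketched as follows (my approximation, not pygolang's code): it decodes the next UTF-8 character from a byte-string, and yields U+FFFD for bytes that are not valid UTF-8 so the caller can handle them explicitly.

```python
def iter_uchars(bstr):
    # yield unicode characters decoded from UTF-8 bytes;
    # invalid bytes are reported as U+FFFD one byte at a time
    i, n = 0, len(bstr)
    while i < n:
        for width in (1, 2, 3, 4):          # UTF-8 chars are 1..4 bytes
            try:
                c = bstr[i:i+width].decode('utf-8')
            except UnicodeDecodeError:
                continue
            if len(c) == 1:
                yield c
                i += width
                break
        else:
            yield u'\ufffd'                  # "invalid unicode character"
            i += 1

print(list(iter_uchars(u"мир".encode('utf-8'))))   # -> ['м', 'и', 'р']
print(list(iter_uchars(b'a\x80b')))
```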
From the above point of view, Python `str` can remain bytes-based, under the assumption that it is likely UTF-8 encoded, offloading the programmer from the constant stream of encode/decode. Only when unicode-level random access is needed should a string be explicitly converted to unicode. Since unicode -> UTF-8 conversion always works, it can be either explicit or implicit, as it is unambiguous.
This is what I'm trying to do with `golang.unicode`. Transforming the string literals turned out to be required because, for e.g.
```python
if s in "abc": ...
```
if s is bytes- (not unicode-) based, python3 will complain that one cannot check for a non-string in a string, and we cannot overload this operation by changing `s`'s methods in any way. This case is present in the python stdlib, and imports stop working if it is not supported once builtin str is changed to be golang.str. The literals change is also logical if we are trying to use our string type universally.
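Concretely, the complaint looks like this on Python 3 (illustration):

```python
s = b"ab"          # a bytes-based string
try:
    s in "abc"     # dispatched to str.__contains__, which rejects bytes;
                   # nothing on s can overload this
except TypeError as e:
    print("TypeError:", e)
```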
`__future__`: technically, since we have the whole module AST when transforming it, we can indeed check for some kind of future flags to adjust the transformer behaviour. I have not thought about it in all details yet, but my feeling is that even for new code the constant `.decode()` dance is noisy, error prone, and makes the code unnecessarily hard to read. So even though separate unicode and bytes are theoretically cleaner, in practice it is not quite so. I'm thus contemplating not adding such a `__future__` check. I have not made a decision on this and am open to thinking and arguments.
(and that's just my opinion, I don't know Nexedi's opinion on that)
I suggest to read https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018 section "Python 3 Losses: Nexedi Perspective" to get some idea on this.
Replies to random notes:
Some migration hacks are based on a coding: ( see https://github.com/asottile/future-fstrings ), but this looks crazy, also in your explanation you explained why it's good to target a lower level.
Thanks, I was not aware of this hack. Still, as you say, working at the lowest possible level is better because it should cover all cases (once again, compared to an import hook, hooking into the AST in general also works under compile/eval & friends triggered from C code).
Yes, I know this. In my experience, even though the task of conversion is technically easy, it creates a significant gap for technology adoption. It is much easier if something is available out of the box, compared to even a small but required additional translation step, especially if a build system was not used before. I also have very bad experience with the autoconf/automake & friends family, which works by generating scripts/makefiles from macros - especially gettext, which heavily relies on several levels of code generation. In my experience, the more generation layers there are, the harder it becomes to understand the system. And often those generations are unnecessary - for example, in my build system gettext .po processing was fast and light, while under autoconf/automake/gettext it is slow and hard to understand. In general I try to avoid extra generation layers unless there is no way to do things without them.
You probably know that zope uses something a bit similar when executing code from in-ZODB python scripts (https://github.com/zopefoundation/RestrictedPython), which basically rewrites implicit getattr into calls to guarded_getattr, which checks if the current user is allowed to access the attribute (which has been declareProtected before).
Yes, I'm aware of how RestrictedPython works.
Thanks a lot for taking the time to write this, it's a good point and an interesting read.
Quite often I see bugs in production because of string encoding (`UnicodeError`, garbled text), typically in code that has been running fine for years but now fails because it receives some non-ASCII string for the first time. I only considered this aspect of the problem, because I experienced it so many times (as "it does not work" from customers). It was interesting to read about other aspects that I did not consider much. Separate bytes/str looks like a way to have this failure earlier, so this seemed interesting to me. But to be honest I don't have much experience writing complex python3 programs, so maybe in practice it would not prevent me from writing code with such bugs... anyway this can only help when writing new code, not with an existing code base.
Thanks again for sharing your thoughts.
@jerome you are welcome, thanks for the feedback. And I understand the frustration of getting `UnicodeError` on non-ASCII input and the temptation to fix it with split `unicode`. It is just that the solution, contrary to the Python3 path taken, can be backward compatible while still having the same robustness as Python3: if one gets non-UTF-8 input, the now-omnipresent `input.decode()` will fail the same way as thoughtlessly using `input` in my UTF-8-bytes string scheme without explicitly caring that it might contain invalid UTF-8. So in the end, to me the conclusion is that Python3 made programs more noisy and broke backward compatibility for no good reason, giving nothing more than what could be achieved in a backward-compatible way.
I'm not claiming I understand the issue fully 100%, so I might be missing something. However, I've tried to explain my arguments and rationale. In the meantime, as I've put this on hold and am not actively working on the topic now, we can continue this discussion in the background to distill what our approach should be.
Quite often I see bugs in production because of string encoding ( UnicodeError, garbled text ) typically on some code that have been running fine for years but this time fail because it receive some non ASCII string for the first time.
@jerome, by the way, today's gpython release already sets the default encoding to always be UTF-8, even on Python2, so by running your programs via gpython, even over python2, you can avoid this kind of problem automatically. Please compare:
```
(neo) (z-dev) (g.env) kirr@deco:~$ python
Python 2.7.15+ (default, Feb  3 2019, 13:13:16)
[GCC 8.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "мир"
>>> unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
```
```
(neo) (z-dev) (g.env) kirr@deco:~$ gpython
Python 2.7.15+ (default, Feb  3 2019, 13:13:16)
[GCC 8.2.0] [GPython 0.0.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> s = "мир"
>>> unicode(s)
u'\u043c\u0438\u0440'
```
True, I see that thanks to this we also have:
```
>>> print "jé" + "rôme"
jérôme
>>> print u"jé" + "rôme"
jérôme
>>> print "jé" + u"rôme"
jérôme
>>> print "jé%s" % u"rôme"
jérôme
>>> print u"jé%s" % "rôme"
jérôme
```
(of course) this does not do what one might expect:
```
>>> print "jérôme"[:2]
j�
```
but that's the same as plain python2, and this works:
```
>>> print unicode("jérôme")[:2]
jé
```
Yes, the last example - where you want to extract the first two unicode characters - is related to what I was talking about with "There are of course cases when one needs to have random access to unicode-characters, and for such cases it is required to represent the string as UCS (array of UCS). But once again those cases are rare and don't dominate strings usage scenario." (Related, because the extraction is not random access: if one only needs the first two characters, that use case can also be served by iterating twice over the byte-string with an iterator that returns unicode characters.)
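A small sketch of that iterating approach (my illustration): take the first two characters of a UTF-8 byte-string by feeding bytes to an incremental decoder until two characters have been produced, with no random access and no full decode.

```python
# -*- coding: utf-8 -*-
import codecs

def first_n_chars(bstr, n):
    # feed the UTF-8 bytes one by one until n characters come out
    dec = codecs.getincrementaldecoder('utf-8')()
    out = u''
    for i in range(len(bstr)):
        out += dec.decode(bstr[i:i+1])
        if len(out) == n:
            break
    return out

print(first_n_chars(u"jérôme".encode('utf-8'), 2))   # -> jé
```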
In the end, it seems, what gives the most to our systems and customers is to set `sys.default_encoding` to always be `UTF-8` while still using Python2. /cc @jm
Good! The way the ipython/pytest integration is done is nice.