- 25 Oct, 2022 1 commit
-
-
Kirill Smelkov authored
In the patch "golang_str: bstr/ustr index access" we added a __getitem__ implementation for bstr/ustr along with thorough tests covering all access cases: [i], [i:j] and [i:j:k]. The tests, however, are run via pytest, which does AST rewriting and, as it turns out, always invokes __getitem__, even for the [i:j] case, even on py2. This differs from plain Python2 behaviour, which invokes __getslice__ for the [i:j] case if the __getslice__ slot is present. Since on py2 both str and unicode provide a __getslice__ implementation, and bstr/ustr inherit from those types, they also inherit __getslice__. And oops, on py2 e.g. bstr[i:j] was returning str instead of bstr:

    In [1]: bs = b('αβγ')
    In [2]: bs
    Out[2]: b('αβγ')
    In [3]: bs[0]
    Out[3]: b(b'\xce')
    In [4]: bs[0:1]
    Out[4]: '\xce'       <-- NOTE not b(...)
    In [5]: type(_)
    Out[5]: str          <-- NOTE not bstr

-> Fix it by explicitly whiting out the __getslice__ slot for bstr and ustr.
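A minimal sketch of the whiteout idea (illustration only; the actual patch works at the C slot level):

    class bstr(bytes):
        # ... __getitem__ returning 1-byte bstr / bstr slices, as above ...

        def __getslice__(self, i, j):        # py2-only slot
            # shadow the __getslice__ inherited from str/unicode so that
            # bs[i:j] goes through our __getitem__ and stays bstr
            return self.__getitem__(slice(i, j))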
-
- 09 Oct, 2022 24 commits
-
-
Kirill Smelkov authored
-
Kirill Smelkov authored
bstr and ustr currently claim that:

- bstr → ustr → bstr is always identity even if the bytes data is not valid UTF-8, and
- ustr → bstr → ustr is always identity even if the unicode data is not well-formed (e.g. contains lone surrogates).

The first is indeed true for any bytes data. But for some (incorrect) unicode, the conversion ustr → bstr might currently fail, as the following example demonstrates:

    # py3
    In [1]: x = u'\udc00'
    In [2]: x.encode('utf-8')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
    In [3]: x.encode('utf-8', 'surrogateescape')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

I know how to fix this by adjusting the UTF-8b(*) encoding process a bit, but currently lack the time to do it. -> Let's place a corresponding todo entry. Please note, once again, that for arbitrary bytes input the conversion bstr → ustr → bstr always succeeds and already works ok. And it is this particular conversion that is most relevant in practice.

(*) aka surrogateescape in Python speak. See http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html for the original explanation from 2000.
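For contrast, a small py3 illustration (standard Python behaviour, not pygolang-specific) of why the bytes-side roundtrip is always safe: surrogateescape can decode arbitrary bytes and re-encode them back losslessly; it is only directly-constructed surrogate unicode, as above, that trips the encoder:

    data = b'\xff\xfe\xce'                               # not valid UTF-8
    s = data.decode('utf-8', 'surrogateescape')          # '\udcff\udcfe\udcce'
    assert s.encode('utf-8', 'surrogateescape') == data  # roundtrip is identity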
-
Kirill Smelkov authored
So far we have overridden almost all string methods that bstr/ustr inherited from bytes and unicode. However 2 of the methods remained intact until now: unicode.encode() and bytes.decode(). Let's override them too for completeness:

- we want ustr.encode() to follow the signature of unicode.encode and for ustr.encode('utf-8') to return bstr;
- for consistency we also want ustr.encode() to return the same type regardless of which encoding/errors pair is used in the arguments;
- => ustr.encode() always returns bstr;
- we want bstr.decode() to follow the signature of bytes.decode and for bstr.decode('utf-8') to return ustr;
- for consistency we also want bstr.decode() to return the same type regardless of which encoding/errors pair is used in the arguments;
- => bstr.decode() always returns ustr.

So ustr.encode() -> bstr and bstr.decode() -> ustr. Let's implement this by carrying out the encoding/decoding process internally, similarly to regular bytes and unicode, and wrapping the result into the corresponding pygolang type at the end, as sketched below.
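A minimal sketch of the wrapping approach (py3 flavour; pyb/pyu are the internal bstr/ustr-wrapping helpers mentioned in a later patch of this series, everything else here is illustrative):

    class ustr(str):                                     # illustration only
        def encode(self, encoding='utf-8', errors='strict'):
            data = str.encode(self, encoding, errors)    # real encoding work
            return pyb(data)                             # wrap result -> bstr

    class bstr(bytes):
        def decode(self, encoding='utf-8', errors='strict'):
            text = bytes.decode(self, encoding, errors)
            return pyu(text)                             # wrap result -> ustr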
-
Kirill Smelkov authored
Similarly to %-formatting, let's add support for .format(). This is easier to do because we can leverage string.Formatter and hook into the process via proper subclassing. We do not need to implement parsing ourselves and only need to customize handling of the 's' and 'r' specifiers. For testing we mostly reuse the existing tests for %-formatting, amending them a bit to exercise both %-formatting and format-formatting at the same time: by converting a %-format specification into the corresponding {}-format specification and verifying that the formatting result is as expected. Some explicit tests for {}-style .format() are also added.
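A sketch of the hook point (class name is hypothetical; _bstringify/_bstringify_repr are the stringification helpers described in the %-formatting patch below):

    import string

    class _BFormatter(string.Formatter):             # hypothetical name
        def convert_field(self, value, conversion):
            # customize only 's', 'r' and the default conversion so that
            # bytes inside the argument are treated as UTF-8 strings
            if conversion in ('s', None):
                return _bstringify(value)
            if conversion == 'r':
                return _bstringify_repr(value)
            return string.Formatter.convert_field(self, value, conversion)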
-
Kirill Smelkov authored
Teach bstr/ustr to do %-formatting similarly to how unicode does, but treating bytes as UTF8-encoded strings - all in line with the general idea for bstr/ustr to treat bytes as strings. The following approach is used to implement this:

1. both bstr and ustr format via the bytes-based _bprintf;
2. we parse the format string and handle every formatting specifier separately;
3. for formats besides %s/%r we use bytes.__mod__ directly;
4. for %s we stringify the corresponding argument specially, with all, potentially internal, bytes instances treated as UTF8-encoded strings:

       '%s' % b'\xce\xb2'    ->  "β"
       '%s' % [b'\xce\xb2']  ->  "['β']"

5. for %r, similarly to %s, we prepare the repr of the corresponding argument specially, with all, potentially internal, bytes instances also treated as UTF8-encoded strings:

       '%r' % b'\xce\xb2'    ->  "b'β'"
       '%r' % [b'\xce\xb2']  ->  "[b'β']"

For "2" we implement %-format parsing ourselves. test_strings_mod has good coverage for this phase, to make sure we get it right and behave exactly the same way as standard Python does.

For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called from under bstr.__mod__(). See _bstringify for details.

For "5", similarly to "4", we rely on adjustments to bytes.__repr__. See _bstringify_repr for details.

I initially tried to avoid parsing the format specification myself and wanted to reuse the original bytes.__mod__ and just adjust its behaviour a bit somehow. That did not work quite right, as the following comment explains:

    # Rejected alternative: try to format; if we get "TypeError: %b requires a
    # bytes-like object ..." retry with that argument converted to bstr.
    #
    # Rejected because e.g. for `'%(x)s %(x)r' % {'x': obj}` we need to use
    # access number instead of key 'x' to determine which accesses to
    # bstringify. We could do that, but unfortunately on Python2 the access
    # number is not easily predictable because the string could be upgraded to
    # unicode in the midst of being formatted, and so some access keys will be
    # accessed more than once.
    #
    # Another reason for rejection: b'%r' and u'%r' handle arguments
    # differently - on b, %r is aliased to %a.

That's why full %-format parsing and handling is implemented in this patch. Once again, to make sure its behaviour really matches Python's builtin %-formatting, we have good test coverage for both %-format parsing itself and for actual formatting of many various cases. See test_strings_mod for details.
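For illustration, the expected behaviour when the format string itself is a bstr (outputs per the examples above):

    from golang import b

    print(b('x = %s') % (b'\xce\xb2',))     # -> b('x = β')
    print(b('x = %r') % ([b'\xce\xb2'],))   # -> b("x = [b'β']")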
-
Kirill Smelkov authored
bstr/ustr constructors either convert or stringify their argument. For example bstr(u'α') gives b('α') while bstr(1) gives b('1'). And if the argument is bytes, bstr treats it as a UTF-8 encoded bytestring:

    >>> x = u'β'.encode()
    >>> x
    b'\xce\xb2'
    >>> bstr(x)
    b('β')

however, if that same bytes argument is placed inside a container - e.g. inside a list - currently it is not stringified as a bytestring:

    >>> bstr([x])
    b("[b'\\xce\\xb2']")      <-- NOTE not b("['β']")

which is not consistent with our intended approach that bstr/ustr treat bytes in their arguments as UTF-8 encoded strings. This happens because when a list is stringified, the list.__str__ implementation goes through its elements and invokes __repr__ of each. And in general a container might be arbitrarily deep, e.g. dict -> list -> list -> bytes, and even when stringifying that deep dict we want to handle the leaf bytes as a UTF-8 encoded string.

There are many containers in Python - lists, tuples, dicts, collections.OrderedDict, collections.UserDict, collections.namedtuple, collections.defaultdict, etc. - and also many user-defined containers, including ones implemented at C level, which we cannot even know in advance. It means we cannot do some, probably deep/recursive, typechecking inside bstringify and implement a kind of parallel stringification of an arbitrarily complex structure with adjusted stringification of bytes.

We also cannot create an object clone - for stringification - with bytes instances replaced by str (e.g. via DeepReplacer - see the recent previous patch), and then stringify the clone. That would generally be incorrect, because in this approach we cannot know whether an object is being stringified as it is, or whether it is used internally for data storage and is not stringified directly. In the latter case, if we replace bytes with unicode, it might break an internal invariant of the custom container class and break its logic.

What we can do, however, is to hook into the bytes.__repr__ implementation and detect whether it is called from under bstringify: if it is, we know we should adjust it and treat the bytes as a bytestring; else, use the original bytes.__repr__ implementation. This way we can handle arbitrarily complex data structures.

Hereby patch implements that approach for bytes, for unicode on py2, and for bytearray. See the added comments that start with "# patch bytes.{__repr__,__str__} and ..." for details. After this patch, stringification of bytes inside containers treats them as UTF-8 bytestrings:

    >>> bstr([x])
    b("['β']")
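A sketch of the detection idea via a thread-local nesting counter (names and details hypothetical; the actual patch works at C level):

    import threading

    _tls = threading.local()   # _tls.inbstringify > 0 while _bstringify runs

    def _bstringify(obj):
        _tls.inbstringify = getattr(_tls, 'inbstringify', 0) + 1
        try:
            return str(obj)    # runs the container __str__/__repr__ chain
        finally:
            _tls.inbstringify -= 1

    _bytes_repr_orig = bytes.__repr__

    def _bytes_repr(self):
        if getattr(_tls, 'inbstringify', 0):
            # called from under _bstringify: show as UTF-8 bytestring,
            # e.g. b'\xce\xb2' -> 'β'
            return "'" + self.decode('utf-8', 'surrogateescape') + "'"
        return _bytes_repr_orig(self)    # normal bytes repr otherwise

    # installing _bytes_repr in place of bytes.__repr__ is not possible in
    # pure Python; the actual patch does it at C level (cf. the slot-patching
    # helper _patch_slot further below in this series)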
-
Kirill Smelkov authored
Take all str/unicode methods, such as .capitalize(), .split(), .join(), etc., and implement them for bstr/ustr. For example bstr.split() behaves like unicode.split(), but returns a list of bstr instead of a list of unicode. And similarly for all other methods. Testing is organized by verifying every method's behaviour on unicode and on bstr/ustr: if the results match, modulo deep replacement of unicode with bstr/ustr, everything is ok.
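For illustration, the expected behaviour per the description above:

    from golang import b

    parts = b('α β γ').split()
    # a list of bstr, not of unicode
    assert parts == [b('α'), b('β'), b('γ')]
    assert all(type(p) is type(b('')) for p in parts)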
-
Kirill Smelkov authored
deepReplace returns an object's clone with all internal objects selected by a predicate replaced via the provided replacement function. We will use this functionality in the following patches to organize testing of bstr/ustr methods: a method is first invoked on a regular str, then on bstr/ustr, and the results are compared against each other. The results are usually different, because e.g. u'a b c'.split() returns [u'a', u'b', u'c'] while b('a b c').split() should return [b('a'), b('b'), b('c')]. We want to make sure that the second result is exactly the first result with all instances of unicode replaced by bstr. That's where the deep replacer will be used. The deep replacement itself is implemented via the pickle reduce/rebuild protocol: we disassemble and reconstruct objects, and while an object is disassembled we apply the replacement recursively. Since this is not so trivial a functionality, it also comes with a test of its own.
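A hypothetical sketch of the reduce/rebuild idea (signature and details assumed; the actual pygolang implementation differs):

    def deep_replace(obj, pred, repl):
        if pred(obj):
            return repl(obj)
        recur = lambda x: deep_replace(x, pred, repl)
        cls = type(obj)
        # builtin containers: rebuild directly from replaced elements
        if cls in (list, tuple, set, frozenset):
            return cls(recur(x) for x in obj)
        if cls is dict:
            return {recur(k): recur(v) for k, v in obj.items()}
        if isinstance(obj, type) or cls.__module__ == 'builtins':
            return obj                      # types and other builtin atoms
        # generic object: disassemble via the pickle reduce protocol,
        # replace inside constructor args and state, then rebuild
        r = obj.__reduce_ex__(2)
        func, args = r[0], r[1]
        state = r[2] if len(r) > 2 else None
        new = func(*[recur(a) for a in args])
        if state is not None:
            state = recur(state)
            if hasattr(new, '__setstate__'):
                new.__setstate__(state)
            elif isinstance(state, dict):
                new.__dict__.update(state)
        return new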
-
Kirill Smelkov authored
On py2 objects are printed via their .tp_print slot with flags=0 (contrary to Py_PRINT_RAW, which requests to print str - https://docs.python.org/2.7/c-api/object.html#c.PyObject_Print). We were not handling repr'ing inside our tp_print implementation, and as a result e.g. b('мир') was printed on the interactive console as '\xd0\xbc\xd0\xb8\xd1\x80' instead of b('мир'). Fix it.
-
Kirill Smelkov authored
Teach bstr/ustr to provide repr of themselves: it goes as b(...) and u(...), where the inside is a human-readable repr of the contained data. Human-readable means that non-ascii printable unicode characters are shown as-is instead of being escaped, for example:

    >>> x = u'αβγ'
    >>> x
    'αβγ'
    >>> y = b(x)
    >>> y
    b('αβγ')          <-- NOTE not b(b'\xce\xb1\xce\xb2\xce\xb3')
    >>> x.encode('utf-8')
    b'\xce\xb1\xce\xb2\xce\xb3'
-
Kirill Smelkov authored
bstr is becoming the default pygolang string type, and it can be mixed freely with bytes/unicode and ustr. Previously e.g. strconv.quote was checking which kind of type its input was and was trying to return a result of the same type. Now this becomes unnecessary, since bstr is intended to be used universally and to interoperate with all other string types.
-
Kirill Smelkov authored
Add support for the +, *, += and *= operators to bstr and ustr. For *, rhs should be an integer and the result, similarly to std strings, is the string repeated rhs times. For +, the other argument can be any supported string - bstr/ustr / unicode/bytes/bytearray - and the result is always bstr or ustr:

    u()  + *        ->  u()
    b()  + *        ->  b()
    u''  + u()/b()  ->  u()
    u''  + u''      ->  u''
    b''  + u()/b()  ->  b()
    b''  + b''      ->  b''
    barr + u()/b()  ->  barr

In particular, if lhs is bstr or ustr, the result remains exactly of the original lhs type. This should be handy when one has e.g. a bstr at hand and wants to incrementally append something to it. And if lhs is bytes/unicode, but we append bstr/ustr to it, we "upgrade" the result to bstr/ustr correspondingly. Only if lhs is bytearray does it stay that way, because it is logical for the appended-to object to remain mutable if it was mutable in the beginning. As before, bytearray.__add__ and friends need to be patched a bit for bytearray not to reject ustr. A usage illustration follows below.
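For illustration of the rules above:

    from golang import b, u

    s = b('hello')
    s += ' world'                        # lhs is bstr -> result stays bstr
    assert s == b('hello world')
    assert type(s) is type(b(''))

    assert b('ab') * 3 == b('ababab')    # repetition

    t = u'x = ' + b('β')                 # unicode + bstr "upgrades" ...
    assert type(t) is type(u(''))        # ... the result to ustr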
-
Kirill Smelkov authored
Without explicitly overriding __reduce_ex__, pickling was failing for protocols < 2:

    _________________________ test_strings_pickle __________________________

        def test_strings_pickle():
            bs = b("мир")
            us = u("май")
            #from pickletools import dis
            for proto in range(0, pickle.HIGHEST_PROTOCOL):
    >           p_bs = pickle.dumps(bs, proto)

    golang/golang_str_test.py:282:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    self = b'\xd0\xbc\xd0\xb8\xd1\x80', proto = 0

        def _reduce_ex(self, proto):
    >       assert proto < 2
    E       RecursionError: maximum recursion depth exceeded in comparison

    /usr/lib/python3.9/copyreg.py:56: RecursionError

See added comments for details.
-
Kirill Smelkov authored
Even though bstr is semantically an array of bytes, while ustr is an array of unicode characters, iterating them _both_ yields unicode characters. This goes in line with the Go approach described in "Strings, bytes, runes and characters in Go"[1] and allows both ustr _and_ bstr to be used as strings in a unicode world. Even though this diverges (just a bit) from py2 str behaviour, and diverges more from py3 bytes behaviour, I have not hit any problem in practice due to this divergence. In other words, the semantics of bytestrings used in Go - to iterate them as unicode characters - is sound. For reference, it is the authors of Go who originally invented UTF-8 - see [2] for details. See also [3] for our discussion with Jérome on this topic.

[1] https://blog.golang.org/strings
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
[3] nexedi/zodbtools!13 (comment 81646)
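An illustration of the expected iteration semantics (assuming len stays byte-wise for bstr, since bstr inherits from bytes):

    from golang import b, u

    bs = b('мир')             # 6 bytes, 3 unicode characters
    assert len(bs) == 6       # length is still measured in bytes
    # iteration yields 1-character unicode strings, not integer bytes
    assert list(bs) == [u('м'), u('и'), u('р')]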
-
Kirill Smelkov authored
Implement access to bstr/ustr by [index] and by slice. The result of such [index] access - similarly to standard str - is the same bstr/ustr type with one character:

- ustr[i] returns a ustr with one unicode character taken from the i'th character of the original string, while
- bstr[i] returns a bstr with one byte taken from the i'th byte of the original bytestring.

This follows str/unicode semantics on both py2/py3 and bytes semantics on py2, but diverges from bytes semantics on py3. I originally tried to follow the bytes/py3 semantic - for bstr to return an integer instead of a 1-byte character - but later found several compatibility breakages due to it. I contemplated this divergence for a long time and finally took the decision to follow string semantics for both ustr and bstr. This preserves backward compatibility with Python2 and also allows bstr to be a practically drop-in replacement for the str type.

To get the ordinal corresponding to a retrieved character, one can use the standard `ord`, e.g. as in `ord(bstr[i])`. This always returns an integer for all of bstr/ustr/str/unicode. Similarly to the standard `chr` and `unichr`, we also provide two utility functions - `uchr` and `bbyte` - to create a 1-character ustr and a 1-byte bstr correspondingly. See the illustration below.
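For illustration (0xce is the first byte of UTF-8 'α'):

    from golang import b, u, bbyte, uchr

    bs = b('αβγ')
    assert type(bs[0]) is type(b(''))   # 1-byte bstr, not an int
    assert ord(bs[0]) == 0xce           # standard ord works on bstr
    assert ord(u('α')[0]) == 0x3b1      # ... and on ustr

    assert uchr(0x3b1) == u('α')        # 1-character ustr
    assert bbyte(0xce) == bs[0]         # 1-byte bstr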
-
Kirill Smelkov authored
Verify that it works as expected, and that memoryview(ustr) is rejected, because ustr is semantically an array of unicode characters, not bytes. No change to the code - just add tests for the current status, which already works as expected.
-
Kirill Smelkov authored
And to convert them to bstr/ustr, decoding buffer data as if it were bytes. This is needed when e.g. we have data in an mmap or a numpy.ndarray and want to convert that data to a string. The conversion is always explicit, via an explicit call to b/u. For the bstr/ustr constructors we preserve their behaviour matching the unicode constructor: not to convert automatically, but to stringify the object instead, e.g. as shown below:

    In [1]: bdata = b'hello 123'
    In [2]: mview = memoryview(bdata)
    In [3]: str(mview)
    Out[3]: '<memory at 0x7fb226b26700>'    # NOTE _not_ b'hello 123'
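The explicit conversion path, per the description above:

    from golang import b, u

    mview = memoryview(b'hello 123')
    assert b(mview) == b('hello 123')   # explicit b() decodes buffer data
    assert u(mview) == u('hello 123')   # ... and so does explicit u()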
-
Kirill Smelkov authored
bytearray was introduced in Python as a mutable version of bytes. It has all the string methods (e.g. .capitalize(), .islower(), etc.) and also supports %-formatting. In other words it has all the attributes of being a byte-string, the only difference from bytes being that bytearray is mutable. This makes bytearray handy when a string is constructed incrementally, step by step, without hitting the overhead of many bytes objects being created and destroyed.

So, since bytearray is also a bytestring, similarly to bytes, let's add support to bstr and ustr to interoperate with bytearray:

- b/u and bstr/ustr now accept bytearray as an argument and treat it as a bytestring;
- the bytearray() constructor, similarly to the bytes() and unicode() constructors, now also accepts bstr/ustr and creates a bytearray object corresponding to the byte-stream of the input.

For the latter point to work we need to patch bytearray.__init__() a bit since, contrary to bytes.__init__(), it does not pay attention to whether the provided argument has a __bytes__ method or not.
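For illustration of the intended interoperability:

    from golang import b, u

    # bytearray accepted as bytestring input ...
    assert b(bytearray(b'\xce\xb2')) == b('β')
    # ... and bstr/ustr convert to bytearray as their byte-stream
    assert bytearray(u('β')) == bytearray(b'\xce\xb2')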
-
Kirill Smelkov authored
Both bstr and ustr constructors mimic the constructor of unicode (= str on py3): an object is either stringified, or decoded if it provides the buffer interface, or the constructor is invoked with optional encoding and errors arguments:

    # py2
    class unicode(basestring)
     |  unicode(object='') -> unicode object
     |  unicode(string[, encoding[, errors]]) -> unicode object

    # py3
    class str(object)
     |  str(object='') -> str
     |  str(bytes_or_buffer[, encoding[, errors]]) -> str

Stringification of all bstr/ustr / unicode/bytes is handled automatically, with the meaning of converting to the created type via b or u. We follow unicode semantics for both ustr _and_ bstr, because bstr/ustr are intended to be used as strings.
-
Kirill Smelkov authored
So that e.g. `bstr == <any string type>` works. We want `bstr == ustr` to work because we intend those types to be interoperable. We also want e.g. `bstr == "a_string"` to work because we want bstr to be interoperable with standard strings. In general we want full automatic interoperability with all string types, so that e.g. `bstr == X` works for X being any of bstr, ustr, unicode, bytes (and later bytearray).

For now we add support only for comparison operators. Later we will be adding support for e.g. +, string methods, etc., and in all those operations we will follow the same approach: automatic interoperability with all string types out of the box. The text added to README reflects this.

The patch to unicode.tp_richcompare on py2 illustrates our approach of adjusting builtin types when absolutely needed. In this particular case the original builtin unicode.__eq__(unicode, bstr) always returns False for non-ASCII bstr, even despite bstr having a .__unicode__() method. Our adjustment is non-intrusive: we adjust unicode behaviour only wrt bstr, and it stays exactly the same as before wrt all other types. We anyway do that with care and add a test verifying that the behaviour of what we patched stays unaffected when used outside of the bstr/ustr context.
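The intended semantics, illustrated:

    from golang import b, u

    assert b('мир') == u('мир')     # bstr == ustr
    assert b('мир') == u'мир'       # bstr == unicode (on py2 this needs
                                    # the unicode.tp_richcompare patch)
    assert b('мир') == b'\xd0\xbc\xd0\xb8\xd1\x80'   # bstr == bytes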
-
Kirill Smelkov authored
_patch_slot(typ, slotname, func) installs func into typ's dict[slotname]. For example, in the next patch we will need to adjust unicode.__eq__ on py2 not to reject bstr by always assuming that `unicode == bstr` is False. We will do it by patching unicode.__eq__ to first check whether rhs is bstr and handle that case with our code, while falling back to the original unicode.__eq__ for all other types.
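A pure-Python sketch of one known CPython technique for this (not necessarily what the patch does at the Cython level): reach the real dict behind the read-only type mappingproxy, then invalidate the type's method cache:

    import ctypes, gc

    def _patch_slot(typ, slotname, func):
        # type.__dict__ is a read-only mappingproxy; the dict behind it
        # is reachable as the proxy's only referent
        typdict = gc.get_referents(typ.__dict__)[0]
        assert isinstance(typdict, dict)
        typdict[slotname] = func
        # drop cached slot lookups so the new entry takes effect
        ctypes.pythonapi.PyType_Modified(ctypes.py_object(typ))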
-
Kirill Smelkov authored
Document explicitly which types b/u accept and how they are handled. Change bstr/ustr docstrings to also be more explicit. Documentation changes only.
-
Kirill Smelkov authored
In other words, casting to bytes/unicode preserves a pygolang string as a pygolang string. Without the changes to bstr/ustr, the added test fails as e.g.

    >       assert bytes (bs) is bs
    E       AssertionError: assert b'\xd0\xbc\xd0\xb8\xd1\x80' is b'\xd0\xbc\xd0\xb8\xd1\x80'
    E        +  where b'\xd0\xbc\xd0\xb8\xd1\x80' = bytes(b'\xd0\xbc\xd0\xb8\xd1\x80')

in other words, bytes(bstr) was creating a copy and changing the type to plain bytes.
-
Kirill Smelkov authored
Extend the current coverage of b/u tests: more explicitly verify the resulting type (`type(·) is ...` instead of `isinstance(·, ...)`), verify unicode(bstr)->ustr and bytes(ustr)->bstr, and str() of both bstr and ustr. Move the check for "no custom attributes" from test_qq to the generic test_strings_basic, because the verified string types are now publicly accessible, not only via qq. Small cosmetics in benchmarks by reusing the hereby-introduced xbytes() utility. No change to the code itself - the tests just add verification of the current status.
-
- 08 Oct, 2022 1 commit
-
-
Kirill Smelkov authored
In 2020, in edc7aaab (golang: Teach qq to be usable with both bytes and str format whatever type qq argument is), I added custom bytes- and unicode-like types for qq to return instead of str, with the idea for qq's result to be interoperable with both bytes and unicode. Citing that patch:

    qq is used to quote strings or byte-strings. The following example
    illustrates the problem we are currently hitting in zodbtools with
    Python3:

        >>> "hello %s" % qq("мир")
        'hello "мир"'

        >>> b"hello %s" % qq("мир")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

        >>> "hello %s" % qq(b("мир"))
        'hello "мир"'

        >>> b"hello %s" % qq(b("мир"))
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

    i.e. one way or another, if the type of the format string and of what qq
    returns do not match, it creates a TypeError. We want qq(obj) to be
    usable with both string and bytestring formats.

    For that, let's teach qq to return special str- and bytes-derived types
    that know how to automatically convert str->bytes and bytes->str via b/u
    correspondingly. This way formatting works whatever the combination of
    types was for the format and for qq, and the whole result has the same
    type as the format.

    For now we teach only qq to use the new types and don't generally expose
    _str and _unicode to be returned by b and u yet. However we might do so
    in the future after incrementally gaining a bit more experience.

Two years later I have gained that experience and found that having a string type that can interoperate with both bytes and unicode is generally useful. It is useful for practical backward compatibility with Python2 and for simplicity of programming, avoiding a constant stream of encode/decode noise. Thus the day to expose pygolang string types for general use has come.

This patch does the first small step: it exposes the bytes- and unicode-like types (now named bstr and ustr) publicly, and switches b and u to return bstr and ustr correspondingly instead of bytes and unicode. This is a change in behaviour, but hopefully it should not break anything: there are not many b/u users currently, and bstr and ustr are intended to be drop-in replacements for the standard string types. The next patches will enhance bstr/ustr step by step to actually become such drop-in replacements for real.

See nexedi/zodbtools!13 (comment 81646) for the preliminary discussion from 2019. See also "Python 3 Losses: Nexedi Perspective"[1] and the associated "cost overview"[2] for a related presentation by Jean-Paul from 2018.

[1] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[2] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
-
- 05 Oct, 2022 2 commits
-
-
Kirill Smelkov authored
Since the beginning (9bf03d9c "py.bench: New command to benchmark python code similarly to `go test -bench`"), py.bench was automatically discovering benchmarks in bench_*.py files only. This was inherited from wendelin.core, which keeps its benchmarks in such files. However in pygolang, following the Go convention(*), we already have several benchmarks that reside together with tests in the same *_test.py files, and currently just running py.bench does not discover them.

-> Let's fix this and teach py.bench to automatically discover benchmarks in the test files by default as well. Pytest's default is to look for tests in test_*.py and *_test.py (+). Add those patterns and also keep bench_*.py for backward compatibility.

Before this patch, running py.bench inside the pygolang repository does not run any benchmark at all. After the patch, py.bench runs all the benchmarks by default:

    (z-dev) kirr@deca:~/src/tools/go/pygolang$ py.bench
    ========================= test session starts ==========================
    platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
    rootdir: /home/kirr/src/tools/go/pygolang
    plugins: timeout-1.4.2, profiling-1.7.0, mock-2.0.0
    collected 18 items

    pymod: golang/golang_str_test.py
    Benchmarkstddecode          2000000    0.756 µs/op
    Benchmarkudecode              20000   74.359 µs/op
    Benchmarkstdencode          3000000    0.327 µs/op
    Benchmarkbencode              40000   32.613 µs/op

    pymod: golang/golang_test.py
    Benchmarkpyx_select_nogil    500000    2.051 µs/op
    Benchmarkpyx_go_nogil         90000   12.177 µs/op
    Benchmarkpyx_chan_nogil      600000    1.826 µs/op
    Benchmarkgo                   80000   13.267 µs/op
    Benchmarkchan                500000    2.076 µs/op
    Benchmarkselect              300000    3.835 µs/op
    Benchmarkdef               30000000    0.035 µs/op
    Benchmarkfunc_def             40000   29.387 µs/op
    Benchmarkcall              30000000    0.043 µs/op
    Benchmarkfunc_call          2000000    0.819 µs/op
    Benchmarktry_finally       20000000    0.096 µs/op
    Benchmarkdefer               600000    1.755 µs/op

    pymod: golang/sync_test.py
    Benchmarkworkgroup_empty      40000   25.807 µs/op
    Benchmarkworkgroup_raise      40000   31.637 µs/op          [100%]

    =========================== warnings summary ===========================

(*) see https://pkg.go.dev/cmd/go#hdr-Test_packages
(+) see https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_files

/reviewed-by @jerome
/reviewed-on !20
-
Kirill Smelkov authored
We recently moved our custom UTF-8 encoding/decoding routines to Cython. Now we can start taking advantage of the speedup at C level to make our own UTF-8 decoder a bit less horribly slow on py2:

    name       old time/op  new time/op  delta
    stddecode   752ns ± 0%   743ns ± 0%   -1.19%  (p=0.000 n=9+10)
    udecode     216µs ± 0%    75µs ± 0%  -65.19%  (p=0.000 n=9+10)
    stdencode   328ns ± 2%   327ns ± 1%      ~    (p=0.252 n=10+9)
    bencode    34.1µs ± 1%  32.1µs ± 1%   -5.92%  (p=0.000 n=10+10)

So it is a ~3x speedup for u(), which is still significantly slower compared to std unicode.decode('utf-8'). We take only the low-hanging fruit here: making _utf8_decode_rune a bit faster, since it sits in the innermost loop. In the future _utf8_decode_surrogateescape might be reworked as well, to avoid constructing the resulting unicode via a py-level list of py-unicode character objects. And similarly for _utf8_encode_surrogateescape. On py3 the performance of std and u/b decode/encode is approximately the same.

/trusted-by @jerome
/reviewed-on !19
-
- 04 Oct, 2022 4 commits
-
-
Kirill Smelkov authored
The error rune (U+FFFD) is returned by _utf8_decode_rune to indicate an error in decoding. But the error rune itself is a valid unicode codepoint:

    >>> x = u"�"
    >>> x
    u'\ufffd'
    >>> x.encode('utf-8')
    '\xef\xbf\xbd'

This way only (r=_rune_error, size=1) should be treated by the caller as a UTF-8 decoding error. But e.g. strconv.quote was not careful to also inspect the size, and this way was quoting � into just "\xef" instead of "\xef\xbf\xbd". _utf8_decode_surrogateescape was subject to a similar error.

-> Fix it. Without the fix, e.g. the added test for strconv.quote fails as

    >       assert quote(tin) == tquoted
    E       assert '"\xef"' == '"�"'
    E       - "\xef"
    E       + "�"

/reviewed-by @jerome
/reviewed-at !18
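The caller-side check, sketched (the (rune, size) result shape of _utf8_decode_rune is as used in the text above):

    _rune_error = 0xFFFD

    def _is_decode_error(r, size):
        # only (U+FFFD, size=1) indicates a UTF-8 decoding error;
        # a literal U+FFFD present in valid input decodes with size == 3
        return (r == _rune_error and size == 1)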
-
Kirill Smelkov authored
So that those routines can just be called and do what is expected, without the caller caring whether it is py2 or py3. We will soon need to use those routines from several call sites, and having that py2/py3 conditioning spread over all usage places would be inconvenient.

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18
-
Kirill Smelkov authored
- Move _utf8_decode_rune, _utf8_decode_surrogateescape and _utf8_encode_surrogateescape out from strconv into _golang_str.
- Factor _bstr/_ustr code into pyb/pyu. _bstr/_ustr become plain wrappers over pyb/pyu.
- Work around the emerged golang ↔ strconv dependency with an at-runtime import.

The moved routines belong to the main part of golang strings processing -> their home should be in _golang_str.pyx.

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18
-
Kirill Smelkov authored
We are going to significantly extend py-strings related functionality soon - to the point where the amount of strings-related code will be approximately the same as the amount of all other python-related code inside the golang module. -> First move everything related to py strings to a dedicated _golang_str.pyx as a preparatory step. Keep that new file included from _golang.pyx instead of being a real new module, because we want the strings functionality to be provided by the golang main namespace itself, and to ease internal code interdependencies. Plain code movement.

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18
-
- 26 Jan, 2022 8 commits
-
-
Kirill Smelkov authored
-
Kirill Smelkov authored
On Python2, without .tp_print, printing _pystr crashes as:

    pygolang$ ./golang/testprog/golang_test_str.py
    Traceback (most recent call last):
      File "./golang/testprog/golang_test_str.py", line 39, in <module>
        main()
      File "./golang/testprog/golang_test_str.py", line 34, in main
        print("print(qq(b)):", qq(sb))
    RuntimeError: print recursion

See added comments for details.
-
Kirill Smelkov authored
Add a convenient utility to read a whole file and return its content, similarly to Go. The code is taken from wendelin.core:

https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.cpp#L246-281
-
Kirill Smelkov authored
Provide an os/signal package that can be used to set up signal delivery to nogil channels. This way, for user code, signal handling becomes regular handling of a signalling channel instead of being something special or limited to the main python thread only. A usage sketch follows below. The rationale for why we need this:

There are several problems with the regular python stdlib signal module:

1. Python2 does not call a signal handler from under a blocked lock.acquire. This means that if the main thread is blocked waiting on a semaphore, signal delivery will be delayed indefinitely, similarly to e.g. the problem described in nxdtest!14 (comment 147527), where raising KeyboardInterrupt is delayed after SIGINT for many, potentially unbounded, seconds until the ~semaphore wait finishes. Note that Python3 does not have this problem wrt stdlib locks and semaphores, but read below for the next point.

2. All pygolang communication operations (channel send/recv, sync.Mutex, sync.RWMutex, sync.Sema, sync.WaitGroup, sync.WorkGroup, ...) run with the GIL released, but, if blocked, do not handle EINTR and do not schedule the python signal handler to run (on the main thread). Even if we could theoretically adjust this behaviour of pygolang at the python level to match Python3, there are also the C++ and pyx/nogil worlds. And we want the gil and nogil worlds to interoperate (see https://pypi.org/project/pygolang/#cython-nogil-api), so that e.g. if completely nogil code happens to run on the main thread, signal handling is still possible, even if that signal handling was set up at python level.

With signals delivered to nogil channels, both the nogil world and the python world can set up signal handlers and be notified of them regardless of whether the main python thread is currently blocked in a nogil wait or not.

/reviewed-on !17
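A hypothetical usage sketch, assuming a Go-like notify entry point in the new package (the exact python-level API name is an assumption here):

    from golang import chan
    from golang.os import signal as gsignal      # the new package
    from signal import SIGINT, SIGTERM

    ch = chan(2)                         # buffered channel receiving signals
    gsignal.notify(ch, SIGINT, SIGTERM)  # hypothetical Go-like notify
    sig = ch.recv()                      # regular channel receive; works the
                                         # same whether the waiter is a python
                                         # or a nogil goroutine
    print('got signal', sig)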
-
Kirill Smelkov authored
To convert an object to the str type of the current python. It will be handy to use __pystr when implementing __str__ methods.

/reviewed-on nexedi/pygolang!17
-
Kirill Smelkov authored
Provide a C++ package "os" with File, Pipe, etc., similarly to what is provided on the Go side. The package works through IO methods provided by runtimes. We need this IO facility because the os/signal package will need to use a pipe in cooperative IO mode in its receiving-loop goroutine.

os.h and os.cpp are based on drafts from wendelin.core:

https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.h
https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.cpp

/reviewed-on nexedi/pygolang!17
-
Kirill Smelkov authored
Else, as https://github.com/python-greenlet/greenlet/pull/285 demonstrates, there can be segmentation faults and crashes due to exceptions from one greenlet propagating to the C stack of another greenlet. No test here: I tried to write one, but with gevent (contrary to plain greenlets) spawning a new task only schedules the corresponding greenlet to run at the end of the current event-loop cycle instead of switching to the created greenlet immediately. With this delaying it was hard for me to develop a corresponding test in a reasonable time. Hopefully having the test I've done for greenlet itself + hereby protection is good enough.

/reviewed-on !17
-
Kirill Smelkov authored
This package provides a special kind of atomic that is automatically reset to zero after fork in the child. This kind of atomic will be used in the os package to implement IO that does not deadlock in Close after fork.

/reviewed-on nexedi/pygolang!17
-