Commit 8d76276c authored by Kirill Smelkov

golang_str: Fix iter(bstr) to yield byte instead of unicode character

In a72c1c1a (golang_str: bstr/ustr iteration) things were initially
implemented to follow Go semantics exactly, with bytestring iteration
yielding unicode characters as explained in
https://blog.golang.org/strings. However this means bstr is not a 100%
drop-in replacement for std str under py2, and even though my initial
testing suggested that this would not affect programs in practice, that
turned out not to be the case.

For example, with bstr.__iter__ yielding unicode characters, running
gpython on py2 with builtin str patched to be bstr sometimes breaks
when importing uuid.

There, uuid reads 16 bytes from /dev/random, iterates those 16 bytes as
single bytes, and expects the length of the resulting sequence to be
exactly 16:

     int = long(('%02x'*16) % tuple(map(ord, bytes)), 16)

     ( https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/uuid.py#L147 )

which breaks if some of the read bytes are higher than 0x7f.
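
The failure mode can be reproduced without pygolang. The following is a
minimal Python 3 sketch, not the py2 code path itself: it assumes that
"iterating as unicode characters" can be approximated by decoding the
bytes as UTF-8, and `to_hex` is a hypothetical helper that mirrors the
formatting pattern quoted above.

```python
# 8 Cyrillic letters encode to 16 bytes in UTF-8 (2 bytes each), all above 0x7f.
data = "приветик".encode("utf-8")
assert len(data) == 16

def to_hex(items):
    # Same formatting pattern as in uuid.py: exactly 16 values are required.
    return ('%02x' * 16) % tuple(items)

# Byte-wise iteration: 16 items, so formatting succeeds.
hex_by_byte = to_hex(data)      # iterating bytes on py3 yields ints

# Character-wise iteration: only 8 items, so formatting fails.
chars = data.decode("utf-8")
try:
    to_hex(ord(c) for c in chars)
    char_iteration_failed = False
except TypeError:               # not enough arguments for format string
    char_iteration_failed = True
```

With byte-wise iteration the 16 format slots are all filled; with
character-wise iteration the sequence is shorter than 16 and the `%`
formatting raises.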

Even though this particular problem could be worked around by
patching uuid, nothing guarantees that similar problems, potentially
many, will not show up later.

-> So adjust bstr semantics instead to follow the semantics of str under
   py2 and introduce uiter() primitive to still be able to iterate
   bytestrings as unicode characters.

This hopefully makes bstr fully compatible with str on py2 while
still providing a reasonably good way to process strings the Go way
when needed.

Add biter as well for symmetry.
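
The intended split between the two iteration modes can be illustrated
with a rough pure-Python 3 sketch. `biter_sketch` and `uiter_sketch` are
hypothetical stand-ins, not the pygolang implementation, and the
`surrogateescape` error handler is only an assumption used here to keep
decoding from failing on invalid UTF-8.

```python
def biter_sketch(obj):
    """Yield obj one byte at a time, each as a length-1 bytes object."""
    if isinstance(obj, str):
        data = obj.encode("utf-8", "surrogateescape")
    else:
        data = bytes(obj)               # accepts bytes and bytearray
    for i in range(len(data)):
        yield data[i:i + 1]

def uiter_sketch(obj):
    """Yield obj one unicode character at a time."""
    if isinstance(obj, str):
        text = obj
    else:
        text = bytes(obj).decode("utf-8", "surrogateescape")
    return iter(text)

s = "миру мир"
as_chars = list(uiter_sketch(s.encode("utf-8")))   # from bytes to characters
as_bytes = list(biter_sketch(s))                   # from text to single bytes
```

Both helpers accept either representation, which mirrors the idea that
the caller picks the iteration mode explicitly instead of relying on the
type of the string.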

See

    nexedi/pygolang!21 (comment 170754)
    nexedi/pygolang!21 (comment 170782)
    ...

and

    nexedi/pygolang!21 (comment 206044)

for discussion on iter(bstr) topic.
parent a11cb5dc
@@ -240,12 +240,16 @@ The conversion, in both encoding and decoding, never fails and never looses
 information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
 even if bytes data is not valid UTF-8.
+Both `bstr` and `ustr` represent stings. They are two different *representations* of the same entity.
 Semantically `bstr` is array of bytes, while `ustr` is array of
-unicode-characters. Accessing their elements by `[index]` yields byte and
-unicode character correspondingly [*]_. Iterating them, however, yields unicode
-characters for both `bstr` and `ustr`. In practice `bstr` is enough 99% of the
-time, and `ustr` only needs to be used for random access to string characters.
-See `Strings, bytes, runes and characters in Go`__ for overview of this approach.
+unicode-characters. Accessing their elements by `[index]` and iterating them yield byte and
+unicode character correspondingly [*]_. However it is possible to yield unicode
+character when iterating `bstr` via `uiter`, and to yield byte character when
+iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of
+the time, and `ustr` only needs to be used for random access to string
+characters. See `Strings, bytes, runes and characters in Go`__ for overview of
+this approach.
 __ https://blog.golang.org/strings
@@ -266,7 +270,7 @@ Usage example::
     s = b('привет') # s is bstr corresponding to UTF-8 encoding of 'привет'.
     s += ' мир' # s is b('привет мир')
-    for c in s: # c will iterate through
+    for c in uiter(s): # c will iterate through
         ... # [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
     # the following gives b('привет мир труд май')
......
@@ -24,7 +24,7 @@
 - `func` allows to define methods separate from class.
 - `defer` allows to schedule a cleanup from the main control flow.
 - `error` and package `errors` provide error chaining.
-- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
+- `b`, `u`, `bstr`/`ustr` and `biter`/`uiter` provide uniform UTF8-based approach to strings.
 - `gimport` allows to import python modules by full path in a Go workspace.
 See README for thorough overview.
@@ -36,7 +36,8 @@ from __future__ import print_function, absolute_import
 __version__ = "0.1"
 __all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
-           'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'bbyte', 'uchr', 'gimport']
+           'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'biter', 'uiter', 'bbyte', 'uchr',
+           'gimport']
 from golang._gopath import gimport # make gimport available from golang
 import inspect, sys
@@ -373,4 +374,6 @@ from ._golang import \
     pybbyte as bbyte, \
     pyu as u, \
     pyustr as ustr, \
-    pyuchr as uchr
+    pyuchr as uchr, \
+    pybiter as biter, \
+    pyuiter as uiter
 # -*- coding: utf-8 -*-
-# Copyright (C) 2018-2023 Nexedi SA and Contributors.
+# Copyright (C) 2018-2024 Nexedi SA and Contributors.
 # Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -111,7 +111,7 @@ cpdef pyb(s): # -> bstr
     b(u(bytes_input)) is bstr with the same data as bytes_input.
-    See also: u, bstr/ustr.
+    See also: u, bstr/ustr, biter/uiter.
     """
     bs = _pyb(pybstr, s)
     if bs is None:
@@ -134,7 +134,7 @@ cpdef pyu(s): # -> ustr
     u(b(unicode_input)) is ustr with the same data as unicode_input.
-    See also: b, bstr/ustr.
+    See also: b, bstr/ustr, biter/uiter.
     """
     us = _pyu(pyustr, s)
     if us is None:
@@ -270,11 +270,11 @@ cdef class _pybstr(bytes): # https://github.com/cython/cython/issues/711
     is always identity even if bytes data is not valid UTF-8.
-    Semantically bstr is array of bytes. Accessing its elements by [index]
-    yields byte character. Iterating through bstr, however, yields unicode
-    characters. In practice bstr is enough 99% of the time, and ustr only
-    needs to be used for random access to string characters. See
-    https://blog.golang.org/strings for overview of this approach.
+    Semantically bstr is array of bytes. Accessing its elements by [index] and
+    iterating it yield byte character. However it is possible to yield unicode
+    character when iterating bstr via uiter. In practice bstr + uiter is enough
+    99% of the time, and ustr only needs to be used for random access to string
+    characters. See https://blog.golang.org/strings for overview of this approach.
     Operations in between bstr and ustr/unicode / bytes/bytearray coerce to bstr.
     When the coercion happens, bytes and bytearray, similarly to bstr, are also
@@ -289,7 +289,7 @@ cdef class _pybstr(bytes): # https://github.com/cython/cython/issues/711
     to bstr. See b for details.
     - otherwise bstr will have string representation of the object.
-    See also: b, ustr/u.
+    See also: b, ustr/u, biter/uiter.
     """
     # XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
@@ -381,10 +381,13 @@ cdef class _pybstr(bytes): # https://github.com/cython/cython/issues/711
         else:
             return pyb(x)
-    # __iter__ - yields unicode characters
+    # __iter__
     def __iter__(self):
-        # TODO iterate without converting self to u
-        return pyu(self).__iter__()
+        if PY_MAJOR_VERSION >= 3:
+            return _pybstrIter(zbytes.__iter__(self))
+        else:
+            # on python 2 str does not have .__iter__
+            return PySeqIter_New(self)
     # __contains__
@@ -608,8 +611,8 @@ cdef class _pyustr(unicode):
     elements by [index] yields unicode characters.
     ustr complements bstr and is meant to be used only in situations when
-    random access to string characters is needed. Otherwise bstr is more
-    preferable and should be enough 99% of the time.
+    random access to string characters is needed. Otherwise bstr + uiter is
+    more preferable and should be enough 99% of the time.
     Operations in between ustr and bstr/bytes/bytearray / unicode coerce to ustr.
     When the coercion happens, bytes and bytearray, similarly to bstr, are also
@@ -618,7 +621,7 @@ cdef class _pyustr(unicode):
     ustr constructor, similarly to the one in bstr, accepts arbitrary objects
     and stringify them. Please refer to bstr and u documentation for details.
-    See also: u, bstr/b.
+    See also: u, bstr/b, biter/uiter.
     """
     # XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
@@ -948,18 +951,44 @@ cdef PyObject* _pyustr_tp_new(PyTypeObject* _cls, PyObject* _argv, PyObject* _kw
 assert sizeof(_pyustr) == sizeof(PyUnicodeObject)
-# _pyustrIter wraps unicode iterator to return pyustr for each yielded character.
+# _pybstrIter wraps bytes iterator to return pybstr for each yielded byte.
+cdef class _pybstrIter:
+    cdef object zbiter
+    def __init__(self, zbiter):
+        self.zbiter = zbiter
+    def __iter__(self):
+        return self
+    def __next__(self):
+        x = next(self.zbiter)
+        if PY_MAJOR_VERSION >= 3:
+            return pybbyte(x)
+        else:
+            return pyb(x)
+# _pyustrIter wraps zunicode iterator to return pyustr for each yielded character.
 cdef class _pyustrIter:
-    cdef object uiter
-    def __init__(self, uiter):
-        self.uiter = uiter
+    cdef object zuiter
+    def __init__(self, zuiter):
+        self.zuiter = zuiter
     def __iter__(self):
         return self
     def __next__(self):
-        x = next(self.uiter)
+        x = next(self.zuiter)
         return pyu(x)
+def pybiter(obj):
+    """biter(obj) is like iter(b(obj)) but TODO: iterates object incrementally
+    without doing full convertion to bstr."""
+    return iter(pyb(obj)) # TODO iterate obj directly
+def pyuiter(obj):
+    """uiter(obj) is like iter(u(obj)) but TODO: iterates object incrementally
+    without doing full convertion to ustr."""
+    return iter(pyu(obj)) # TODO iterate obj directly
 # _bdata/_udata retrieve raw data from bytes/unicode.
 def _bdata(obj): # -> bytes
     assert isinstance(obj, bytes)
......
 # -*- coding: utf-8 -*-
-# Copyright (C) 2018-2023 Nexedi SA and Contributors.
+# Copyright (C) 2018-2024 Nexedi SA and Contributors.
 # Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -21,7 +21,7 @@
 from __future__ import print_function, absolute_import
 import golang
-from golang import b, u, bstr, ustr, bbyte, uchr, func, defer, panic
+from golang import b, u, bstr, ustr, biter, uiter, bbyte, uchr, func, defer, panic
 from golang._golang import _udata, _bdata
 from golang.gcompat import qq
 from golang.strconv_test import byterange
@@ -612,34 +612,38 @@ def test_strings_index2():
 # verify strings iteration.
 def test_strings_iter():
+    # iter(u/unicode) + uiter(*) -> iterate unicode characters
+    # iter(b/bytes) + biter(*) -> iterate byte characters
     us = u("миру мир"); u_ = u"миру мир"
-    bs = b("миру мир")
-    # iter( b/u/unicode ) -> iterate unicode characters
-    # NOTE that iter(b) too yields unicode characters - not integers or bytes
-    bi = iter(bs)
-    ui = iter(us)
-    ui_ = iter(u_)
+    bs = b("миру мир"); b_ = xbytes("миру мир"); a_ = xbytearray(b_)
+    # XIter verifies that going through all given iterators produces the same type and results.
+    missing=object()
     class XIter:
+        def __init__(self, typok, *viter):
+            self.typok = typok
+            self.viter = viter
         def __iter__(self):
             return self
-        def __next__(self, missing=object):
-            x = next(bi, missing)
-            y = next(ui, missing)
-            z = next(ui_, missing)
-            assert type(x) is type(y)
-            if x is not missing:
-                assert type(x) is ustr
-            if z is not missing:
-                assert type(z) is unicode
-            assert x == y
-            assert y == z
-            if x is missing:
+        def __next__(self):
+            vnext = []
+            for it in self.viter:
+                obj = next(it, missing)
+                vnext.append(obj)
+            if missing in vnext:
+                assert vnext == [missing]*len(self.viter)
                 raise StopIteration
-            return x
+            for obj in vnext:
+                assert type(obj) is self.typok
+                assert obj == vnext[0]
+            return vnext[0]
         next = __next__ # py2
-    assert list(XIter()) == ['м','и','р','у',' ','м','и','р']
+    assert list(XIter(ustr, iter(us), uiter(us), uiter(u_), uiter(bs), uiter(b_), uiter(a_))) == \
+        ['м','и','р','у',' ','м','и','р']
+    assert list(XIter(bstr, iter(bs), biter(us), biter(u_), biter(bs), biter(b_), biter(a_))) == \
+        [b'\xd0',b'\xbc',b'\xd0',b'\xb8',b'\xd1',b'\x80',b'\xd1',b'\x83',b' ',
+         b'\xd0',b'\xbc',b'\xd0',b'\xb8',b'\xd1',b'\x80']
 # verify .encode/.decode .
......
@@ -73,6 +73,8 @@ def test_golang_builtins():
     assert u is golang.u
     assert bstr is golang.bstr
     assert ustr is golang.ustr
+    assert biter is golang.biter
+    assert uiter is golang.uiter
     assert bbyte is golang.bbyte
     assert uchr is golang.uchr
......