Commit 8d76276c authored by Kirill Smelkov

golang_str: Fix iter(bstr) to yield byte instead of unicode character

In a72c1c1a (golang_str: bstr/ustr iteration) things were initially
implemented to follow Go semantics exactly, with bytestring iteration
yielding unicode characters as explained in
https://blog.golang.org/strings. However, this makes bstr not a 100%
drop-in replacement for std str under py2, and even though my initial
testing suggested that this change does not affect programs in
practice, that turned out not to be the case.

For example, with bstr.__iter__ yielding unicode characters, running
gpython on py2 with builtin str patched to be bstr will sometimes break
when importing uuid.

There, uuid reads 16 bytes from /dev/random, iterates those 16 bytes as
single bytes, and expects the length of the resulting sequence to be
exactly 16:

     int = long(('%02x'*16) % tuple(map(ord, bytes)), 16)

     ( https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/uuid.py#L147 )

which breaks if some of the read bytes are higher than 0x7f.
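
The mismatch can be reproduced without pygolang. The following is a
minimal Python 3 sketch, not the py2 uuid code itself: `format_hex` is a
hypothetical helper that mirrors the format expression quoted above, and
the 4-byte `raw` value is made up for illustration.

```python
# Illustrative only: shows why iterating by unicode character instead of
# by byte breaks code that expects one item per byte.

raw = bytes([0x41, 0xd0, 0xbc, 0x7f])   # 4 bytes; d0 bc is UTF-8 for 'м'

# byte-wise iteration: one 1-byte item per input byte
per_byte = [raw[i:i+1] for i in range(len(raw))]

# character-wise iteration: the two bytes d0 bc collapse into one character
per_char = list(raw.decode('utf-8'))


def format_hex(items, n):
    """Hypothetical helper mirroring ('%02x'*16) % tuple(map(ord, bytes))."""
    return ('%02x' * n) % tuple(map(ord, items))


print(len(per_byte), len(per_char))     # 4 3
print(format_hex(per_byte, 4))          # 41d0bc7f

try:
    format_hex(per_char, 4)             # only 3 items for 4 format slots
except TypeError as e:
    print("TypeError:", e)
```

With character-wise iteration the sequence is shorter than the number of
bytes read whenever a multi-byte UTF-8 sequence happens to occur, so the
format expression runs out of arguments.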

Even though this particular problem could be worked around by patching
uuid, there is no evidence that similar problems will not show up
later, and there could be many of them.

-> So adjust bstr semantics instead to follow the semantics of str under
   py2, and introduce the uiter() primitive to still be able to iterate
   bytestrings as unicode characters.

This makes bstr, hopefully, fully compatible with str on py2 while
still providing a reasonably good approach to processing strings the
Go way when needed.

Add biter as well for symmetry.
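
The intended semantics of the two primitives can be sketched in plain
Python 3 as follows. This is only an approximation under stated
assumptions: `biter_sketch` and `uiter_sketch` are hypothetical names,
they return plain bytes/str items rather than bstr/ustr, and
`surrogateescape` merely stands in for pygolang's own never-failing
conversion, which may treat invalid UTF-8 differently.

```python
# Approximate model of biter/uiter using only the standard library.
# Not the pygolang implementation.

def biter_sketch(obj):
    """Iterate obj as single bytes, encoding text to UTF-8 first."""
    if isinstance(obj, str):
        data = obj.encode('utf-8', 'surrogateescape')
    else:
        data = bytes(obj)               # accepts bytes and bytearray
    return (data[i:i+1] for i in range(len(data)))


def uiter_sketch(obj):
    """Iterate obj as unicode characters, decoding bytes as UTF-8 first."""
    if isinstance(obj, str):
        text = obj
    else:
        # assumption: surrogateescape approximates a decode that never fails
        text = bytes(obj).decode('utf-8', 'surrogateescape')
    return iter(text)


print(list(uiter_sketch('миру мир'.encode('utf-8'))))
# ['м', 'и', 'р', 'у', ' ', 'м', 'и', 'р']

print(list(biter_sketch('мир')))
# [b'\xd0', b'\xbc', b'\xd0', b'\xb8', b'\xd1', b'\x80']
```

Both functions accept either representation, which is what makes them
symmetric: the caller chooses the iteration unit, not the string type.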

See

    nexedi/pygolang!21 (comment 170754)
    nexedi/pygolang!21 (comment 170782)
    ...

and

    nexedi/pygolang!21 (comment 206044)

for discussion on iter(bstr) topic.
parent a11cb5dc
......@@ -240,12 +240,16 @@ The conversion, in both encoding and decoding, never fails and never looses
information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
even if bytes data is not valid UTF-8.
Both `bstr` and `ustr` represent stings. They are two different *representations* of the same entity.
Semantically `bstr` is array of bytes, while `ustr` is array of
unicode-characters. Accessing their elements by `[index]` yields byte and
unicode character correspondingly [*]_. Iterating them, however, yields unicode
characters for both `bstr` and `ustr`. In practice `bstr` is enough 99% of the
time, and `ustr` only needs to be used for random access to string characters.
See `Strings, bytes, runes and characters in Go`__ for overview of this approach.
unicode-characters. Accessing their elements by `[index]` and iterating them yield byte and
unicode character correspondingly [*]_. However it is possible to yield unicode
character when iterating `bstr` via `uiter`, and to yield byte character when
iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of
the time, and `ustr` only needs to be used for random access to string
characters. See `Strings, bytes, runes and characters in Go`__ for overview of
this approach.
__ https://blog.golang.org/strings
......@@ -266,7 +270,7 @@ Usage example::
s = b('привет') # s is bstr corresponding to UTF-8 encoding of 'привет'.
s += ' мир' # s is b('привет мир')
for c in s: # c will iterate through
for c in uiter(s): # c will iterate through
... # [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
# the following gives b('привет мир труд май')
......
......@@ -24,7 +24,7 @@
- `func` allows to define methods separate from class.
- `defer` allows to schedule a cleanup from the main control flow.
- `error` and package `errors` provide error chaining.
- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
- `b`, `u`, `bstr`/`ustr` and `biter`/`uiter` provide uniform UTF8-based approach to strings.
- `gimport` allows to import python modules by full path in a Go workspace.
See README for thorough overview.
......@@ -36,7 +36,8 @@ from __future__ import print_function, absolute_import
__version__ = "0.1"
__all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'bbyte', 'uchr', 'gimport']
'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'biter', 'uiter', 'bbyte', 'uchr',
'gimport']
from golang._gopath import gimport # make gimport available from golang
import inspect, sys
......@@ -373,4 +374,6 @@ from ._golang import \
pybbyte as bbyte, \
pyu as u, \
pyustr as ustr, \
pyuchr as uchr
pyuchr as uchr, \
pybiter as biter, \
pyuiter as uiter
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Copyright (C) 2018-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -111,7 +111,7 @@ cpdef pyb(s): # -> bstr
b(u(bytes_input)) is bstr with the same data as bytes_input.
See also: u, bstr/ustr.
See also: u, bstr/ustr, biter/uiter.
"""
bs = _pyb(pybstr, s)
if bs is None:
......@@ -134,7 +134,7 @@ cpdef pyu(s): # -> ustr
u(b(unicode_input)) is ustr with the same data as unicode_input.
See also: b, bstr/ustr.
See also: b, bstr/ustr, biter/uiter.
"""
us = _pyu(pyustr, s)
if us is None:
......@@ -270,11 +270,11 @@ cdef class _pybstr(bytes): # https://github.com/cython/cython/issues/711
is always identity even if bytes data is not valid UTF-8.
Semantically bstr is array of bytes. Accessing its elements by [index]
yields byte character. Iterating through bstr, however, yields unicode
characters. In practice bstr is enough 99% of the time, and ustr only
needs to be used for random access to string characters. See
https://blog.golang.org/strings for overview of this approach.
Semantically bstr is array of bytes. Accessing its elements by [index] and
iterating it yield byte character. However it is possible to yield unicode
character when iterating bstr via uiter. In practice bstr + uiter is enough
99% of the time, and ustr only needs to be used for random access to string
characters. See https://blog.golang.org/strings for overview of this approach.
Operations in between bstr and ustr/unicode / bytes/bytearray coerce to bstr.
When the coercion happens, bytes and bytearray, similarly to bstr, are also
......@@ -289,7 +289,7 @@ cdef class _pybstr(bytes): # https://github.com/cython/cython/issues/711
to bstr. See b for details.
- otherwise bstr will have string representation of the object.
See also: b, ustr/u.
See also: b, ustr/u, biter/uiter.
"""
# XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
......@@ -381,10 +381,13 @@ cdef class _pybstr(bytes): # https://github.com/cython/cython/issues/711
else:
return pyb(x)
# __iter__ - yields unicode characters
# __iter__
def __iter__(self):
# TODO iterate without converting self to u
return pyu(self).__iter__()
if PY_MAJOR_VERSION >= 3:
return _pybstrIter(zbytes.__iter__(self))
else:
# on python 2 str does not have .__iter__
return PySeqIter_New(self)
# __contains__
......@@ -608,8 +611,8 @@ cdef class _pyustr(unicode):
elements by [index] yields unicode characters.
ustr complements bstr and is meant to be used only in situations when
random access to string characters is needed. Otherwise bstr is more
preferable and should be enough 99% of the time.
random access to string characters is needed. Otherwise bstr + uiter is
more preferable and should be enough 99% of the time.
Operations in between ustr and bstr/bytes/bytearray / unicode coerce to ustr.
When the coercion happens, bytes and bytearray, similarly to bstr, are also
......@@ -618,7 +621,7 @@ cdef class _pyustr(unicode):
ustr constructor, similarly to the one in bstr, accepts arbitrary objects
and stringify them. Please refer to bstr and u documentation for details.
See also: u, bstr/b.
See also: u, bstr/b, biter/uiter.
"""
# XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
......@@ -948,18 +951,44 @@ cdef PyObject* _pyustr_tp_new(PyTypeObject* _cls, PyObject* _argv, PyObject* _kw
assert sizeof(_pyustr) == sizeof(PyUnicodeObject)
# _pyustrIter wraps unicode iterator to return pyustr for each yielded character.
# _pybstrIter wraps bytes iterator to return pybstr for each yielded byte.
cdef class _pybstrIter:
cdef object zbiter
def __init__(self, zbiter):
self.zbiter = zbiter
def __iter__(self):
return self
def __next__(self):
x = next(self.zbiter)
if PY_MAJOR_VERSION >= 3:
return pybbyte(x)
else:
return pyb(x)
# _pyustrIter wraps zunicode iterator to return pyustr for each yielded character.
cdef class _pyustrIter:
cdef object uiter
def __init__(self, uiter):
self.uiter = uiter
cdef object zuiter
def __init__(self, zuiter):
self.zuiter = zuiter
def __iter__(self):
return self
def __next__(self):
x = next(self.uiter)
x = next(self.zuiter)
return pyu(x)
def pybiter(obj):
"""biter(obj) is like iter(b(obj)) but TODO: iterates object incrementally
without doing full convertion to bstr."""
return iter(pyb(obj)) # TODO iterate obj directly
def pyuiter(obj):
"""uiter(obj) is like iter(u(obj)) but TODO: iterates object incrementally
without doing full convertion to ustr."""
return iter(pyu(obj)) # TODO iterate obj directly
# _bdata/_udata retrieve raw data from bytes/unicode.
def _bdata(obj): # -> bytes
assert isinstance(obj, bytes)
......
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Copyright (C) 2018-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -21,7 +21,7 @@
from __future__ import print_function, absolute_import
import golang
from golang import b, u, bstr, ustr, bbyte, uchr, func, defer, panic
from golang import b, u, bstr, ustr, biter, uiter, bbyte, uchr, func, defer, panic
from golang._golang import _udata, _bdata
from golang.gcompat import qq
from golang.strconv_test import byterange
......@@ -612,34 +612,38 @@ def test_strings_index2():
# verify strings iteration.
def test_strings_iter():
# iter(u/unicode) + uiter(*) -> iterate unicode characters
# iter(b/bytes) + biter(*) -> iterate byte characters
us = u("миру мир"); u_ = u"миру мир"
bs = b("миру мир")
bs = b("миру мир"); b_ = xbytes("миру мир"); a_ = xbytearray(b_)
# iter( b/u/unicode ) -> iterate unicode characters
# NOTE that iter(b) too yields unicode characters - not integers or bytes
bi = iter(bs)
ui = iter(us)
ui_ = iter(u_)
# XIter verifies that going through all given iterators produces the same type and results.
missing=object()
class XIter:
def __init__(self, typok, *viter):
self.typok = typok
self.viter = viter
def __iter__(self):
return self
def __next__(self, missing=object):
x = next(bi, missing)
y = next(ui, missing)
z = next(ui_, missing)
assert type(x) is type(y)
if x is not missing:
assert type(x) is ustr
if z is not missing:
assert type(z) is unicode
assert x == y
assert y == z
if x is missing:
def __next__(self):
vnext = []
for it in self.viter:
obj = next(it, missing)
vnext.append(obj)
if missing in vnext:
assert vnext == [missing]*len(self.viter)
raise StopIteration
return x
for obj in vnext:
assert type(obj) is self.typok
assert obj == vnext[0]
return vnext[0]
next = __next__ # py2
assert list(XIter()) == ['м','и','р','у',' ','м','и','р']
assert list(XIter(ustr, iter(us), uiter(us), uiter(u_), uiter(bs), uiter(b_), uiter(a_))) == \
['м','и','р','у',' ','м','и','р']
assert list(XIter(bstr, iter(bs), biter(us), biter(u_), biter(bs), biter(b_), biter(a_))) == \
[b'\xd0',b'\xbc',b'\xd0',b'\xb8',b'\xd1',b'\x80',b'\xd1',b'\x83',b' ',
b'\xd0',b'\xbc',b'\xd0',b'\xb8',b'\xd1',b'\x80']
# verify .encode/.decode .
......
......@@ -73,6 +73,8 @@ def test_golang_builtins():
assert u is golang.u
assert bstr is golang.bstr
assert ustr is golang.ustr
assert biter is golang.biter
assert uiter is golang.uiter
assert bbyte is golang.bbyte
assert uchr is golang.uchr
......