Uniform UTF8-based approach to strings (!21) · Merge requests · nexedi / pygolang

Uniform UTF8-based approach to strings

Context: together with Jérome we have been struggling to port Zodbtools to Python3 for several years. Despite several incremental attempts[1,2,3] we are not there yet, the main difficulty being the backward compatibility breakage that Python3 introduced for bytes and unicode. This spring I tried once again to finish the port and could not reach a satisfactory result, so I finally decided to address the problem at its root: at the level of strings, where backward compatibility was broken, with the idea to fix everything once and for all.

In 2018, in "Python 3 Losses: Nexedi Perspective"[4] and the associated "cost overview"[5], Jean-Paul highlighted the strings backward compatibility breakage introduced by Python 3 as the major problem.

In 2019 Jérome and I had some conversations about this topic as well[6,7].

In 2020 I started to approach it with b and u, which provide always-working conversion between bytes and unicode[8], and via limited usage of custom bytes- and unicode-like types that are interoperable with both bytes and unicode simultaneously[9].

Today, with this work, I'm finally exposing those types for general usage, so that the bytes/unicode problem can be handled automatically. An overview of the functionality is provided below:

---- 8< ----

Pygolang, similarly to Go, provides a uniform UTF8-based approach to strings, with the idea to make working with byte- and unicode-strings easy and transparently interoperable:

bstr is byte-string: it is based on bytes and can automatically convert to/from unicode (*).
ustr is unicode-string: it is based on unicode and can automatically convert to/from bytes.

The conversion, in both encoding and decoding, never fails and never loses information: bstr→ustr→bstr and ustr→bstr→ustr are always identity, even if the bytes data is not valid UTF-8.
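
The round-trip property for bytes that are not valid UTF-8 can be illustrated with the Python 3 standard library alone. The sketch below uses the stdlib `surrogateescape` error handler as a stand-in. That choice is an assumption made for illustration and says nothing about how bstr and ustr are implemented. The sketch also covers only the bytes→unicode→bytes direction.

```python
# Illustration only: stdlib surrogateescape as a stand-in for lossless decoding.
data = b'\xd0\xbf\xd1\x80\xff\xfe'    # 'пр' in UTF-8 followed by two invalid bytes

# A strict decode rejects the invalid tail.
try:
    data.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate in
# U+DC80..U+DCFF, so encoding with the same handler restores the original bytes.
text = data.decode('utf-8', 'surrogateescape')
roundtrip = text.encode('utf-8', 'surrogateescape')
```

Here `roundtrip` equals `data` although `data` is not valid UTF-8, which is the kind of identity the paragraph above describes.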

Both bstr and ustr represent strings. They are two different representations of the same entity.

Semantically bstr is an array of bytes, while ustr is an array of unicode characters. Accessing their elements by [index] and iterating them yields a byte and a unicode character respectively (+). ~~Iterating them, however, yields unicode characters for both bstr and ustr~~. However, it is possible to yield unicode characters when iterating bstr via uiter, and to yield bytes when iterating ustr via biter. In practice bstr + uiter is enough 99% of the time, and ustr only needs to be used for random access to string characters. See "Strings, bytes, runes and characters in Go" for an overview of this approach.
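
The difference between the byte view and the character view of the same string can be seen with plain Python 3 types. The sketch below does not use bstr or ustr, and `uiter_sketch` is a hypothetical helper written only for this illustration. Note that iterating plain bytes on Python 3 yields integers, which is a property of the builtin type and not a description of bstr.

```python
# Illustration only: plain Python 3 types, not bstr/ustr.
raw = 'мир'.encode('utf-8')     # 6 bytes: each Cyrillic letter takes 2 bytes in UTF-8

# Byte view: indexing and iteration work on individual bytes.
byte_ords = list(raw)           # iterating builtin bytes yields integers

# Character view: decode first, then index and iterate by unicode character.
chars = list(raw.decode('utf-8'))

def uiter_sketch(data):
    """Hypothetical helper: yield unicode characters from UTF-8 encoded bytes."""
    for ch in data.decode('utf-8', 'surrogateescape'):
        yield ch
```

With this, `list(uiter_sketch(raw))` gives three characters while `len(raw)` is six, which is why random access by character is the case where a unicode-based representation is needed.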

Operations between bstr and ustr/unicode / bytes/bytearray coerce to bstr, while operations between ustr and bstr/bytes/bytearray / unicode coerce to ustr. When the coercion happens, bytes and bytearray, similarly to bstr, are also treated as UTF8-encoded strings.
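
The coercion rule can be sketched with a small bytes subclass. `BytesSketch` below is hypothetical and is not pygolang's bstr. It covers only concatenation from the byte-string side, and it assumes Python 3, where `str` is the unicode type.

```python
class BytesSketch(bytes):
    """Hypothetical bytes subclass that coerces str/bytes/bytearray operands."""

    @staticmethod
    def _to_bytes(x):
        # unicode is encoded as UTF-8; bytes-like operands are taken as-is,
        # i.e. they are treated as already UTF-8 encoded strings.
        if isinstance(x, str):
            return x.encode('utf-8')
        if isinstance(x, (bytes, bytearray)):
            return bytes(x)
        return None

    def __add__(self, other):
        raw = self._to_bytes(other)
        if raw is None:
            return NotImplemented
        return BytesSketch(bytes(self) + raw)

    def __radd__(self, other):
        raw = self._to_bytes(other)
        if raw is None:
            return NotImplemented
        return BytesSketch(raw + bytes(self))


left = BytesSketch('привет'.encode('utf-8')) + ' мир'              # bytes-like + unicode
right = 'привет ' + BytesSketch('мир'.encode('utf-8'))             # unicode + bytes-like
mixed = BytesSketch(b'a') + bytearray(b'b')                        # bytes-like + bytearray
```

In all three cases the result stays a `BytesSketch`, mirroring the statement that mixed operations involving bstr coerce to bstr.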

bstr and ustr are meant to be drop-in replacements for the standard str/unicode classes. They support all methods of str/unicode and, in particular, their constructors accept arbitrary objects and either convert or stringify them. For cases when no stringification is desired, and one only wants to convert bstr/ustr / unicode/bytes/bytearray, or an object with buffer interface (%), to a Pygolang string, b and u provide a way to make sure an object is either bstr or ustr respectively.

Usage example:

   from golang import b, u, uiter

   s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
   s += ' мир'          # s is b('привет мир')
   for c in uiter(s):   # c will iterate through
        ...             #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]

   # the following gives b('привет мир труд май')
   b('привет %s %s %s') % (u'мир',                  # raw unicode
                           u'труд'.encode('utf-8'), # raw bytes
                           u('май'))                # ustr

   def f(s):
      s = u(s)          # make sure s is ustr, decoding as UTF-8(^) if it was bstr, bytes, bytearray or buffer.
      ...               # (^) the decoding never fails and never loses information.

(*) unicode on Python2, str on Python3.
(+) ordinal of such byte and unicode character can be obtained via regular ord.
For completeness bbyte and uchr are also provided for constructing 1-byte bstr and 1-character ustr from ordinal.
(%) data in buffer, similarly to bytes and bytearray, is treated as UTF8-encoded string.
Notice that only explicit conversion through b and u accepts objects with buffer interface. Automatic coercion does not.
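
The distinction between a stringifying constructor and a converting helper, including the buffer case from footnote (%), can be sketched with plain Python 3. `b_sketch` below is a hypothetical function and not pygolang's b. In particular, rejecting non-string objects with `TypeError` is an assumption of this sketch.

```python
def b_sketch(x):
    """Hypothetical strict converter: accept only string-like input, never stringify."""
    if isinstance(x, bytes):
        return x
    if isinstance(x, str):
        return x.encode('utf-8', 'surrogateescape')
    if isinstance(x, (bytearray, memoryview)):
        # bytearray and buffer data are treated as UTF-8 encoded strings
        return bytes(x)
    raise TypeError('cannot convert %s to bytes' % type(x).__name__)


stringified = str(123)                 # a constructor stringifies arbitrary objects
converted = b_sketch('мир')            # a converter only accepts string-like input

try:
    b_sketch(123)
    rejected = False
except TypeError:
    rejected = True
```

A function that wants to guard its input, like `f(s)` in the usage example above, would call the converter and not the constructor, so that passing a non-string by mistake is not silently turned into text.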

---- 8< ----

With this, zodbtools, for example, is finally and easily ported to Python3[10].

One note is that we change b and u to return bstr/ustr instead of bytes/unicode. This is a change in behaviour, but I hope it won't break anything. The reason for this is that the now-returned bstr and ustr are meant to be drop-in replacements for the standard string types, and that there are not many existing b and u users. We just need to make sure that the places that already use b and u continue to work. Those include Zodbtools, Nxdtest[11], and lonet[12], which should continue to work ok.

@klaus, you once said that you use b and u somewhere as well. Please do not hesitate to let me know if this change causes any issues for you, and we will try to find a solution.

Kirill

/cc @jerome, @klaus, @kazuhiko, @vpelletier, @yusei, @tatuya

[1] zodbtools!12 (closed)
[2] zodbtools!13 (merged)
[3] zodbtools!16 (merged)
[4] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[5] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
[6] zodbtools!8 (comment 73726)
[7] zodbtools!13 (comment 81646)
[8] bcb95cd5
[9] edc7aaab
[10] zodbtools@9861c136
[11] https://lab.nexedi.com/nexedi/nxdtest
[12] https://lab.nexedi.com/kirr/go123/blob/master/xnet/lonet/__init__.py

EDIT 2024-05-07: Adjusted iter(bstr) to yield bytes instead of unicode characters as explained in !21 (comment 206044).

Edited Feb 19, 2025 by Kirill Smelkov