golang_str: bstr/ustr %-formatting (390fd810) · Commits · Kirill Smelkov / pygolang

Commit 390fd810 authored Oct 09, 2022 by Kirill Smelkov

golang_str: bstr/ustr %-formatting

Teach bstr/ustr to do %-formatting the way unicode does, but treating
bytes as UTF-8-encoded strings, in line with the general idea of
bstr/ustr to treat bytes as strings.

The following approach is used to implement this:

1. both bstr and ustr format via bytes-based _bprintf.
2. we parse the format string and handle every formatting specifier separately:
3. for formats besides %s/%r we use bytes.__mod__ directly.

4. for %s we stringify corresponding argument specially with all, potentially
   internal, bytes instances treated as UTF8-encoded strings:

      '%s' % b'\xce\xb2'      ->  "β"
      '%s' % [b'\xce\xb2']    ->  "['β']"

5. for %r, similarly to %s, we prepare repr of corresponding argument
   specially with all, potentially internal, bytes instances also treated as
   UTF8-encoded strings:

      '%r' % b'\xce\xb2'      ->  "b'β'"
      '%r' % [b'\xce\xb2']    ->  "[b'β']"
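
The steps above can be sketched in plain Python 3. This is only an illustration under stated assumptions, not the patch's _bprintf: the name bprintf_sketch and its helpers are hypothetical, only positional arguments and a reduced specifier grammar are supported, and bytes nested inside containers are not handled here.

```python
import re

# Simplified specifier grammar: %[flags][width][.precision]type
# (assumption: no mapping keys, no '*', no length modifiers).
_SPEC = re.compile(rb'%([-+ #0]*)(\d*)((?:\.\d+)?)([a-zA-Z%])')

def _text(obj):
    """Decode bytes as UTF-8 (lossless via surrogateescape); pass others through."""
    if isinstance(obj, (bytes, bytearray)):
        return bytes(obj).decode('utf-8', 'surrogateescape')
    return obj

def _encode(s):
    return s.encode('utf-8', 'surrogateescape')

def bprintf_sketch(fmt, args):
    """Hypothetical, much-reduced bytes-based printf; positional arguments only."""
    if not isinstance(args, tuple):
        args = (args,)
    out, pos, argi = [], 0, 0
    for m in _SPEC.finditer(fmt):
        out.append(fmt[pos:m.start()])          # literal bytes before the specifier
        pos = m.end()
        flags, width, prec, conv = m.groups()
        if conv == b'%':                        # '%%' is a literal percent sign
            out.append(b'%')
            continue
        arg = args[argi]
        argi += 1
        if conv == b's':                        # %s: stringify, bytes taken as UTF-8
            data = _encode(str(_text(arg)))
        elif conv == b'r':                      # %r: repr, bytes taken as UTF-8
            if isinstance(arg, (bytes, bytearray)):
                data = _encode('b' + repr(_text(arg)))
            else:
                data = _encode(repr(arg))
        else:                                   # other formats: bytes.__mod__ directly
            out.append(m.group(0) % (arg,))
            continue
        # Apply flags/width/precision to the prepared bytes via a plain %s
        # (note: precision truncates by bytes in this sketch).
        out.append((b'%' + flags + width + prec + b's') % (data,))
    out.append(fmt[pos:])
    if argi != len(args):
        raise TypeError('not all arguments converted during bytes formatting')
    return b''.join(out)

print(bprintf_sketch(b'%s = %d', ('\u03b2', 5)))
```

Unlike plain bytes.__mod__, which rejects a str argument for %s, the sketch accepts both str and bytes and always produces UTF-8 encoded bytes.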

For "2" we implement %-format parsing ourselves. test_strings_mod
has good coverage for this phase to make sure we get it right and behave
exactly the same way as standard Python does.
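
As a rough illustration of what such parsing and cross-checking involve, the sketch below tokenizes a format string and compares piecewise formatting against the builtin % operator. The names tokenize and format_piecewise are hypothetical, the code works on str rather than bytes for readability, and nested parentheses in mapping keys are not handled; it is not the parser from this patch.

```python
import re

# Layout of a conversion specifier as documented for printf-style formatting:
# %[(key)][flags][width][.precision][length]type
_CONV = re.compile(
    r'%(?:\((?P<key>[^)]*)\))?'
    r'(?P<flags>[-+ #0]*)'
    r'(?P<width>\*|\d+)?'
    r'(?:\.(?P<prec>\*|\d*))?'
    r'(?P<len>[hlL])?'
    r'(?P<type>.)', re.S)

_TYPES = 'diouxXeEfFgGcrsab%'

def tokenize(fmt):
    """Split fmt into ('lit', text, None) and ('conv', spec_text, fields) tokens."""
    tokens, pos = [], 0
    while True:
        i = fmt.find('%', pos)
        if i < 0:
            break
        m = _CONV.match(fmt, i)
        if m is None or m['type'] not in _TYPES:
            raise ValueError('incomplete or unsupported format at index %d' % i)
        if i > pos:
            tokens.append(('lit', fmt[pos:i], None))
        tokens.append(('conv', m.group(0), m.groupdict()))
        pos = m.end()
    if pos < len(fmt):
        tokens.append(('lit', fmt[pos:], None))
    return tokens

def format_piecewise(fmt, args):
    """Format every conversion on its own and join (positional, no '*')."""
    out, it = [], iter(args)
    for kind, text, fields in tokenize(fmt):
        if kind == 'lit':
            out.append(text)
        elif fields['type'] == '%':
            out.append('%')
        else:
            out.append(text % (next(it),))
    return ''.join(out)

fmt = 'x=%d y=%+08.3f z=%-5s|%%|%x'
args = (7, 3.14159, 'ab', 255)
# The piecewise result should match what the builtin operator produces.
print(format_piecewise(fmt, args) == fmt % args)
```

Comparing against `fmt % args` for many format strings is one simple way to gain confidence that a hand-written parser agrees with the builtin behaviour.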

For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
from under bstr.__mod__(). See _bstringify for details.

For "5", similarly to "4", we rely on adjustments to bytes.__repr__ .
See _bstringify_repr for details.
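
Pure Python cannot replace bytes.__repr__ (assigning to it raises TypeError), so the sketch below swaps in a different technique: an explicit recursive walk over plain lists, tuples and dicts. It only reproduces the four examples listed above; the names stringify_s and stringify_r are hypothetical and this is not how _bstringify or _bstringify_repr are implemented.

```python
def _decode(b):
    # surrogateescape keeps bytes that are not valid UTF-8 from raising
    return bytes(b).decode('utf-8', 'surrogateescape')

def _repr(obj, bprefix):
    """repr(obj) with bytes shown as UTF-8 text; bprefix controls the b'' marker."""
    if isinstance(obj, bytes):
        r = repr(_decode(obj))
        return 'b' + r if bprefix else r
    if type(obj) is list:
        return '[' + ', '.join(_repr(x, bprefix) for x in obj) + ']'
    if type(obj) is tuple:
        inner = ', '.join(_repr(x, bprefix) for x in obj)
        return '(' + inner + (',' if len(obj) == 1 else '') + ')'
    if type(obj) is dict:
        items = (_repr(k, bprefix) + ': ' + _repr(v, bprefix)
                 for k, v in obj.items())
        return '{' + ', '.join(items) + '}'
    return repr(obj)            # anything else: standard repr

def stringify_s(obj):
    """What %s would show: bytes become text, nested bytes lose the b prefix."""
    if isinstance(obj, bytes):
        return _decode(obj)
    if type(obj) in (list, tuple, dict):
        return _repr(obj, bprefix=False)
    return str(obj)

def stringify_r(obj):
    """What %r would show: bytes keep the b prefix but read as UTF-8 text."""
    return _repr(obj, bprefix=True)

print(stringify_s([b'\xce\xb2']))
print(stringify_r([b'\xce\xb2']))
```

An explicit walk like this only covers the container types it knows about, whereas adjusting bytes.__repr__ itself also affects bytes that are reached through the repr of arbitrary objects.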

I initially tried to avoid parsing the format specification myself and
wanted to reuse the original bytes.__mod__, adjusting its behaviour
only slightly. This did not work quite right, as the following comment
explains:

    # Rejected alternative: try to format; if we get "TypeError: %b requires a
    # bytes-like object ..." retry with that argument converted to bstr.
    #
    # Rejected because e.g. for  `'%(x)s %(x)r' % {'x': obj}`  we need to use
    # access number instead of key 'x' to determine which accesses to
    # bstringify. We could do that, but unfortunately on Python2 the access
    # number is not easily predictable because string could be upgraded to
    # unicode in the midst of being formatted and so some access keys will be
    # accessed more than once.
    #
    # Another reason for rejection: b'%r' and u'%r' handle arguments
    # differently - on b %r is aliased to %a.
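
The builtin behaviour that the comment refers to can be observed on Python 3 with the standard library alone. The snippet below is only a demonstration; LoggingDict is a hypothetical helper used to show that the same key is looked up once per specifier.

```python
def try_format(fmt, arg):
    """Return the formatted result, or the TypeError raised by formatting."""
    try:
        return fmt % (arg,)
    except TypeError as e:
        return e

# bytes %s rejects str arguments: this is the error a retry scheme would catch.
err = try_format(b'%s', 'x')

# On bytes, %r is an alias of %a and escapes non-ASCII characters,
# while on str, %r keeps printable non-ASCII characters as they are.
bytes_r = b'%r' % ('\u03b2',)
bytes_a = b'%a' % ('\u03b2',)
text_r = '%r' % ('\u03b2',)

class LoggingDict(dict):
    """Hypothetical helper: records every key lookup made by %-formatting."""
    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        self.accesses = []
    def __getitem__(self, key):
        self.accesses.append(key)
        return super().__getitem__(key)

d = LoggingDict(x=1)
result = '%(x)s %(x)r' % d      # key 'x' is looked up once per specifier

print(type(err).__name__, bytes_r, text_r, d.accesses)
```

Since %s and %r here fetch the same key, a retry scheme would have to tell the two lookups apart by their order rather than by the key.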

That's why this patch implements full %-format parsing and handling. To
make sure its behaviour really matches Python's builtin %-formatting, we
have good test coverage both for %-format parsing itself and for actual
formatting in many different cases.

See test_strings_mod for details.

parent ddf6958b


@@ -269,6 +269,11 @@ Usage example::
    for c in s:          # c will iterate through
         ...             #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
    # the following gives b('привет мир труд май')
    b('привет %s %s %s') % (u'мир',                  # raw unicode
                            u'труд'.encode('utf-8'), # raw bytes
                            u('май'))                # ustr
    def f(s):
       s = u(s)          # make sure s is ustr, decoding as UTF-8(*) if it was bstr, bytes, bytearray or buffer.
       ...               # (*) the decoding never fails nor loses information.
-...
