golang_str: bstr/ustr %-formatting
Teach bstr/ustr to do % formatting similarly to how unicode does, but
treating bytes as UTF8-encoded strings - all in line with the general idea
for bstr/ustr to treat bytes as strings.

The following approach is used to implement this:

1. both bstr and ustr format via bytes-based _bprintf.
2. we parse the format string and handle every formatting specifier
   separately:
3. for formats besides %s/%r we use bytes.__mod__ directly.
4. for %s we stringify the corresponding argument specially, with all,
   potentially internal, bytes instances treated as UTF8-encoded strings:

       '%s' % b'\xce\xb2'      ->  "β"
       '%s' % [b'\xce\xb2']    ->  "['β']"

5. for %r, similarly to %s, we prepare the repr of the corresponding
   argument specially, with all, potentially internal, bytes instances also
   treated as UTF8-encoded strings:

       '%r' % b'\xce\xb2'      ->  "b'β'"
       '%r' % [b'\xce\xb2']    ->  "[b'β']"

For "2" we implement %-format parsing ourselves. test_strings_mod has good
coverage for this phase to make sure we get it right and that it behaves
exactly the same way as standard Python does.

For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
from under bstr.__mod__(). See _bstringify for details.

For "5", similarly to "4", we rely on adjustments to bytes.__repr__ .
See _bstringify_repr for details.

I initially tried to avoid parsing the format specification myself and wanted
to reuse the original bytes.__mod__ and just adjust its behaviour a bit
somehow. This did not work quite right, as the following comment explains:

    # Rejected alternative: try to format; if we get "TypeError: %b requires a
    # bytes-like object ..." retry with that argument converted to bstr.
    #
    # Rejected because e.g. for `'%(x)s %(x)r' % {'x': obj}` we need to use
    # access number instead of key 'x' to determine which accesses to
    # bstringify. We could do that, but unfortunately on Python2 the access
    # number is not easily predictable because string could be upgraded to
    # unicode in the midst of being formatted and so some access keys will be
    # accessed more than once.
    #
    # Another reason for rejection: b'%r' and u'%r' handle arguments
    # differently - on b %r is aliased to %a.

That's why full %-format parsing and handling is implemented in this patch.
Once again, to make sure its behaviour is really the same as Python's builtin
%-formatting, we have good test coverage both for %-format parsing itself and
for actual formatting of many various cases. See test_strings_mod for
details.
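
For illustration only, the parse-and-dispatch idea above can be sketched in
plain Python 3 roughly as follows. This is not the code of this patch: the
names sketch_bprintf, _stringify, _repr_utf8 and _SPEC are hypothetical, only
positional arguments and list containers are handled, mapping keys and `*`
width/precision are ignored, and the bytes.__repr__ monkey-patching is
replaced by an explicit recursive helper. Invalid UTF-8 is decoded with
errors='replace', which is an assumption of the sketch as well.

```python
import re

# Hypothetical, simplified specifier pattern: flags, width, precision and
# conversion type. Mapping keys and `*` are deliberately not supported.
_SPEC = re.compile(rb'%[-#0 +]*\d*(?:\.\d+)?([a-zA-Z%])')


def _repr_utf8(obj, bprefix):
    """repr(obj) with bytes, including bytes inside lists, shown as UTF-8 text."""
    if isinstance(obj, bytes):
        r = repr(obj.decode('utf-8', 'replace'))
        return 'b' + r if bprefix else r
    if isinstance(obj, list):
        return '[' + ', '.join(_repr_utf8(x, bprefix) for x in obj) + ']'
    return repr(obj)


def _stringify(obj):
    """str(obj) with bytes, including bytes inside lists, treated as UTF-8 text."""
    if isinstance(obj, bytes):
        return obj.decode('utf-8', 'replace')
    if isinstance(obj, list):
        return _repr_utf8(obj, bprefix=False)
    return str(obj)


def sketch_bprintf(fmt, args):
    """Format bytes `fmt` with tuple `args`, one specifier at a time."""
    argv = iter(args)
    out = []
    pos = 0
    for m in _SPEC.finditer(fmt):
        out.append(fmt[pos:m.start()])      # literal text before the specifier
        pos = m.end()
        conv = m.group(1)
        if conv == b'%':                    # '%%' -> literal '%'
            out.append(b'%')
            continue
        arg = next(argv)
        if conv in (b's', b'r'):
            # %s / %r: prepare the text ourselves, then let bytes.__mod__
            # apply flags and width via an equivalent %s specifier.
            # Note: in this sketch width/precision count bytes, not characters.
            text = _stringify(arg) if conv == b's' else _repr_utf8(arg, True)
            out.append((m.group(0)[:-1] + b's') % (text.encode('utf-8'),))
        else:
            # everything else goes to bytes.__mod__ unchanged
            out.append(m.group(0) % (arg,))
    out.append(fmt[pos:])                   # trailing literal text
    return b''.join(out)


print(sketch_bprintf(b'%s', ([b'\xce\xb2'],)).decode('utf-8'))
print(sketch_bprintf(b'%r', (b'\xce\xb2',)).decode('utf-8'))
```

The two calls at the end correspond to the list example for %s and the plain
bytes example for %r given above. The sketch processes specifiers strictly
left to right, which is why it does not need the "access number" bookkeeping
that made the retry-on-TypeError alternative impractical.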