golang/strconv.py · 4f28dddfd5048cbcbd733e141bc337ae8dc9d7cd · nexedi / pygolang

strconv: Fix b & friends on macos/windows · 0561926a

Kirill Smelkov authored Feb 28, 2020

On macos and windows, Python2 is built with --enable-unicode=ucs2, which
makes it use UTF-16 encoding for unicode characters, and so for
characters at U+10000 and above it uses surrogate encoding with _2_
unicode code units, for example:

        >>> import sys
        >>> sys.maxunicode
        65535                       <-- NOTE indicates UCS2 build
        >>> s = u'\U00012345'
        >>> s
        u'\U00012345'
        >>> s.encode('utf-8')
        '\xf0\x92\x8d\x85'
        >>> len(s)
        2                           <-- NOTE _not_ 1
        >>> s[0]
        u'\ud808'
        >>> s[1]
        u'\udf45'
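
The two values above follow from the standard UTF-16 surrogate arithmetic.
A minimal sketch of that split (the helper name `to_surrogate_pair` is
hypothetical and is not part of pygolang):

```python
def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000            # 20-bit offset into the supplementary planes
    hi = 0xD800 + (v >> 10)     # top 10 bits    -> high surrogate
    lo = 0xDC00 + (v & 0x3FF)   # bottom 10 bits -> low surrogate
    return hi, lo

hi, lo = to_surrogate_pair(0x12345)
print(hex(hi), hex(lo))         # 0xd808 0xdf45
```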

This surrogate representation leads to e.g. b tests failing for

    # tbytes                        tunicode
    (b"\xf0\x90\x8c\xbc",           u'\U0001033c'),     # Valid 4 Octet Sequence '𐌼'

    >           assert b(tunicode) == tbytes
    E           AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
    E             - \xed\xa0\x80\xed\xbc\xbc
    E             + \xf0\x90\x8c\xbc

because on a UCS2 python build u'\U0001033c' is represented as 2 surrogate
code units, and each of them is encoded to UTF-8 separately:

    >>> s = u'\U0001033c'
    >>> len(s)
    2
    >>> s[0]
    u'\ud800'
    >>> s[1]
    u'\udf3c'
    >>> s[0].encode('utf-8')
    '\xed\xa0\x80'
    >>> s[1].encode('utf-8')
    '\xed\xbc\xbc'
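
The byte sequences above can be reproduced by applying the generic UTF-8
bit layout to each value on its own. The sketch below is only an
illustration; `utf8_encode_cp` is a hypothetical helper that deliberately
does no surrogate check, and it is not the code from strconv.py:

```python
def utf8_encode_cp(cp):
    """Encode one integer code point with the plain UTF-8 bit layout."""
    if cp < 0x80:
        out = [cp]
    elif cp < 0x800:
        out = [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    elif cp < 0x10000:
        # 3-octet form: this is where a lone surrogate ends up
        out = [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    else:
        # 4-octet form: what the combined code point should produce
        out = [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return bytes(bytearray(out))

wrong = utf8_encode_cp(0xD800) + utf8_encode_cp(0xDF3C)  # surrogates encoded one by one
right = utf8_encode_cp(0x1033C)                          # whole code point encoded once
```

Here `wrong` is the 6-byte value from the failing assertion and `right`
is the expected 4-byte sequence.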

-> Fix it by detecting a UCS2 build and working around it by manually
combining such surrogate unicode pairs appropriately.
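
A rough sketch of the idea, not the actual strconv.py change: the build
check uses `sys.maxunicode` as in the session above, while
`iter_code_points` and `UCS2_BUILD` are hypothetical names, and the input
is assumed to be a sequence of integer code units:

```python
import sys

# True only on narrow (UCS2) Python2 builds; False elsewhere.
UCS2_BUILD = (sys.maxunicode == 0xFFFF)

def iter_code_points(units):
    """Yield code points from UTF-16 code units, merging valid surrogate pairs."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # high + low surrogate -> one code point in U+10000..U+10FFFF
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character, or a lone surrogate that is passed through unchanged
            yield u
            i += 1

print([hex(cp) for cp in iter_code_points([0xD800, 0xDF3C])])  # ['0x1033c']
```

On a UCS2 build the code units would come from something like
`[ord(c) for c in s]`; on other builds no merging is needed because each
element of the string is already a full code point.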

A reference on the subject:

https://matthew-brett.github.io/pydagogue/python_unicode.html#utf-16-ucs2-builds-of-python-and-32-bit-unicode-code-points
