    strconv: Fix b & friends on macos/windows · 0561926a
    Kirill Smelkov authored
    On macOS and Windows, Python2 is built with --enable-unicode=ucs2, which
    makes it store unicode strings as UTF-16 code units. Characters at
    U+10000 and above are therefore represented as a surrogate pair, i.e.
    as _2_ unicode points, for example:
    
            >>> import sys
            >>> sys.maxunicode
            65535                       <-- NOTE indicates UCS2 build
            >>> s = u'\U00012345'
            >>> s
            u'\U00012345'
            >>> s.encode('utf-8')
            '\xf0\x92\x8d\x85'
            >>> len(s)
            2                           <-- NOTE _not_ 1
            >>> s[0]
            u'\ud808'
            >>> s[1]
            u'\udf45'
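
    The two values above follow from the UTF-16 surrogate arithmetic. A
    minimal sketch of that arithmetic, independent of the build width;
    `to_surrogate_pair` is a hypothetical helper name used only for
    illustration and is not part of strconv:

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes: %#x" % cp)
    v = cp - 0x10000                  # 20-bit offset into the supplementary planes
    hi = 0xD800 + (v >> 10)           # high surrogate carries the top 10 bits
    lo = 0xDC00 + (v & 0x3FF)         # low surrogate carries the bottom 10 bits
    return hi, lo

hi, lo = to_surrogate_pair(0x12345)
print("%04x %04x" % (hi, lo))         # d808 df45, matching s[0] and s[1] above
```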
    
    This leads to, e.g., the tests for b failing on:
    
        # tbytes                        tunicode
        (b"\xf0\x90\x8c\xbc",           u'\U0001033c'),     # Valid 4 Octet Sequence '𐌼'
    
        >           assert b(tunicode) == tbytes
        E           AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
        E             - \xed\xa0\x80\xed\xbc\xbc
        E             + \xf0\x90\x8c\xbc
    
    because on a UCS2 Python build u'\U0001033c' is represented as 2 unicode
    points (a surrogate pair), and each of them gets encoded to UTF-8 separately:
    
        >>> s = u'\U0001033c'
        >>> len(s)
        2
        >>> s[0]
        u'\ud800'
        >>> s[1]
        u'\udf3c'
        >>> s[0].encode('utf-8')
        '\xed\xa0\x80'
        >>> s[1].encode('utf-8')
        '\xed\xbc\xbc'
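
    The same byte sequences can be reproduced on Python 3, where lone
    surrogates can be encoded only with the `surrogatepass` error handler.
    This is just an illustration of the mismatch, assuming a Python 3
    interpreter, and is not code from strconv:

```python
# Python 3: encode the two surrogate halves one by one, mirroring
# s[0].encode('utf-8') and s[1].encode('utf-8') above.
halves = '\ud800\udf3c'
split = halves.encode('utf-8', 'surrogatepass')

# Encode the real character U+1033C as a single code point.
whole = '\U0001033c'.encode('utf-8')

print(split)   # 6 bytes: one 3-byte sequence per surrogate
print(whole)   # 4 bytes: the valid UTF-8 encoding of U+1033C
```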
    
    -> Fix it by detecting a UCS2 build and working around it by manually
    combining each such surrogate pair into a single code point.
    
    A reference on the subject:
    
    https://matthew-brett.github.io/pydagogue/python_unicode.html#utf-16-ucs2-builds-of-python-and-32-bit-unicode-code-points