Commit 410eee56 authored by Ezio Melotti

#4153: update the Unicode howto.

parent 663a9e2f
...@@ -44,7 +44,7 @@ machines assigned values between 128 and 255 to accented characters. Different ...@@ -44,7 +44,7 @@ machines assigned values between 128 and 255 to accented characters. Different
machines had different codes, however, which led to problems exchanging files. machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128--255 range emerged. Eventually various commonly used sets of values for the 128--255 range emerged.
Some were true standards, defined by the International Standards Organization, Some were true standards, defined by the International Standards Organization,
and some were **de facto** conventions that were invented by one company or and some were *de facto* conventions that were invented by one company or
another and managed to catch on. another and managed to catch on.
255 characters aren't very many. For example, you can't fit both the accented 255 characters aren't very many. For example, you can't fit both the accented
...@@ -62,8 +62,8 @@ bits means you have 2^16 = 65,536 distinct values available, making it possible ...@@ -62,8 +62,8 @@ bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language. goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0 through 1,114,111 (0x10ffff Unicode specification uses a wider range of codes, 0 through 1,114,111 (
in base 16). ``0x10FFFF`` in base 16).
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1 originally separate efforts, but the specifications were merged with the 1.1
...@@ -87,9 +87,11 @@ meanings. ...@@ -87,9 +87,11 @@ meanings.
The Unicode standard describes how characters are represented by **code The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation U+12ca to mean the standard, a code point is written using the notation ``U+12CA`` to mean the
character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot character with value ``0x12ca`` (4,810 decimal). The Unicode standard contains
of tables listing characters and their corresponding code points:: a lot of tables listing characters and their corresponding code points:
.. code-block:: none
0061 'a'; LATIN SMALL LETTER A 0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B 0062 'b'; LATIN SMALL LETTER B
...@@ -98,7 +100,7 @@ of tables listing characters and their corresponding code points:: ...@@ -98,7 +100,7 @@ of tables listing characters and their corresponding code points::
007B '{'; LEFT CURLY BRACKET 007B '{'; LEFT CURLY BRACKET
Strictly, these definitions imply that it's meaningless to say 'this is Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular character ``U+12CA``'. ``U+12CA`` is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will informal contexts, this distinction between code points and characters will
sometimes be forgotten. sometimes be forgotten.
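The distinction between a code point (an integer) and the character it represents can be sketched in Python. This is only an illustration; the variable names are chosen for the example and are not part of the howto:

```python
import unicodedata

# The character from the text above, written with a \u escape.
ch = "\u12ca"

# ord() gives the code point as an integer.
code_point = ord(ch)

# Format the integer in the U+XXXX notation used by the standard.
notation = f"U+{code_point:04X}"

# unicodedata.name() looks up the character name assigned to the code point.
name = unicodedata.name(ch)

# chr() goes the other way: integer code point to one-character string.
round_trip = chr(code_point)

print(notation, code_point, name)
```

The integer and the ``U+`` notation are two spellings of the same code point; the name comes from the Unicode character database.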
...@@ -115,13 +117,15 @@ Encodings ...@@ -115,13 +117,15 @@ Encodings
--------- ---------
To summarize the previous section: a Unicode string is a sequence of code To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 through 0x10ffff (1,114,111 decimal). This points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal). This
sequence needs to be represented as a set of bytes (meaning, values sequence needs to be represented as a set of bytes (meaning, values
from 0 through 255) in memory. The rules for translating a Unicode string from 0 through 255) in memory. The rules for translating a Unicode string
into a sequence of bytes are called an **encoding**. into a sequence of bytes are called an **encoding**.
The first encoding you might think of is an array of 32-bit integers. In this The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this:: representation, the string "Python" would look like this:
.. code-block:: none
P y t h o n P y t h o n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00 0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
...@@ -133,10 +137,10 @@ problems. ...@@ -133,10 +137,10 @@ problems.
1. It's not portable; different processors order the bytes differently. 1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points 2. It's very wasteful of space. In most texts, the majority of the code points
are less than 127, or less than 255, so a lot of space is occupied by zero are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
bytes. The above string takes 24 bytes compared to the 6 bytes needed for an bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
ASCII representation. Increased RAM usage doesn't matter too much (desktop ASCII representation. Increased RAM usage doesn't matter too much (desktop
computers have megabytes of RAM, and strings aren't usually that large), but computers have gigabytes of RAM, and strings aren't usually that large), but
expanding our usage of disk and network bandwidth by a factor of 4 is expanding our usage of disk and network bandwidth by a factor of 4 is
intolerable. intolerable.
...@@ -175,14 +179,12 @@ internal detail. ...@@ -175,14 +179,12 @@ internal detail.
UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There's also a UTF-16 encoding, but it's less frequently used than encoding. (There are also UTF-16 and UTF-32 encodings, but they are less
UTF-8.) UTF-8 uses the following rules: frequently used than UTF-8.) UTF-8 uses the following rules:
1. If the code point is <128, it's represented by the corresponding byte value. 1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values 2. If the code point is >= 128, it's turned into a sequence of two, three, or
between 128 and 255. four bytes, where each byte of the sequence is between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties: UTF-8 has several convenient properties:
...@@ -192,8 +194,8 @@ UTF-8 has several convenient properties: ...@@ -192,8 +194,8 @@ UTF-8 has several convenient properties:
processed by C functions such as ``strcpy()`` and sent through protocols that processed by C functions such as ``strcpy()`` and sent through protocols that
can't handle zero bytes. can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text. 3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of code points are turned into two 4. UTF-8 is fairly compact; the majority of commonly used characters can be
bytes, and values less than 128 occupy only a single byte. represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the 5. If bytes are corrupted or lost, it's possible to determine the start of the
next UTF-8-encoded code point and resynchronize. It's also unlikely that next UTF-8-encoded code point and resynchronize. It's also unlikely that
random 8-bit data will look like valid UTF-8. random 8-bit data will look like valid UTF-8.
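A few of these properties can be checked directly. The following is a minimal sketch; the sample characters are arbitrary picks for illustration, and ``utf-32-le`` stands in for the "array of 32-bit integers" representation discussed earlier:

```python
# One sample character for each UTF-8 sequence length (1 to 4 bytes).
samples = ["a", "\u00e9", "\u12ca", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# ASCII text is already valid UTF-8, byte for byte.
text = "Python"
utf8 = text.encode("utf-8")
same_as_ascii = utf8 == text.encode("ascii")

# The fixed-width 32-bit form needs four bytes per code point.
# 'utf-32-le' is used so that no byte-order mark is prepended.
utf32 = text.encode("utf-32-le")

# Multi-byte UTF-8 sequences use only bytes in the range 128-255,
# so they never contain an embedded zero byte.
high_bytes_only = all(b >= 128 for b in "\u12ca".encode("utf-8"))

print(lengths, len(utf8), len(utf32))
```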
...@@ -203,25 +205,25 @@ UTF-8 has several convenient properties: ...@@ -203,25 +205,25 @@ UTF-8 has several convenient properties:
References References
---------- ----------
The Unicode Consortium site at <http://www.unicode.org> has character charts, a The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. <http://www.unicode.org/history/> is a chronology of the difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode. origin and development of Unicode is also available on the site.
To help understand the standard, Jukka Korpela has written an introductory guide To help understand the standard, Jukka Korpela has written `an introductory
to reading the Unicode character tables, available at guide <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>`_ to reading the
<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>. Unicode character tables.
Another good introductory article was written by Joel Spolsky Another `good introductory article <http://www.joelonsoftware.com/articles/Unicode.html>`_
<http://www.joelonsoftware.com/articles/Unicode.html>. was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try reading this If this introduction didn't make things clear to you, you should try reading this
alternate article before continuing. alternate article before continuing.
.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken .. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken
Wikipedia entries are often helpful; see the entries for "character encoding" Wikipedia entries are often helpful; see the entries for "`character encoding
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8 <http://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example. <http://en.wikipedia.org/wiki/UTF-8>`_, for example.
Python's Unicode Support Python's Unicode Support
...@@ -233,11 +235,11 @@ Unicode features. ...@@ -233,11 +235,11 @@ Unicode features.
The String Type The String Type
--------------- ---------------
Since Python 3.0, the language features a ``str`` type that contains Unicode Since Python 3.0, the language features a :class:`str` type that contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode. rocks!'``, or the triple-quoted string syntax is stored as Unicode.
To insert a Unicode character that is not part ASCII, e.g., any letters with To insert a non-ASCII Unicode character, e.g., any letters with
accents, one can use escape sequences in their string literals as such:: accents, one can use escape sequences in their string literals as such::
>>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name >>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name
...@@ -247,15 +249,16 @@ accents, one can use escape sequences in their string literals as such:: ...@@ -247,15 +249,16 @@ accents, one can use escape sequences in their string literals as such::
>>> "\U00000394" # Using a 32-bit hex value >>> "\U00000394" # Using a 32-bit hex value
'\u0394' '\u0394'
In addition, one can create a string using the :func:`decode` method of In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`. This method takes an encoding, such as UTF-8, and, optionally, :class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
an *errors* argument. and optionally, an *errors* argument.
The *errors* argument specifies the response when the input string can't be The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are converted according to the encoding's rules. Legal values for this argument are
'strict' (raise a :exc:`UnicodeDecodeError` exception), 'replace' (use U+FFFD, ``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the ``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the
Unicode result). The following examples show the differences:: character out of the Unicode result).
The following examples show the differences::
>>> b'\x80abc'.decode("utf-8", "strict") #doctest: +NORMALIZE_WHITESPACE >>> b'\x80abc'.decode("utf-8", "strict") #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last): Traceback (most recent call last):
...@@ -273,8 +276,8 @@ a question mark because it may not be displayed on some systems.) ...@@ -273,8 +276,8 @@ a question mark because it may not be displayed on some systems.)
Encodings are specified as strings containing the encoding's name. Python 3.2 Encodings are specified as strings containing the encoding's name. Python 3.2
comes with roughly 100 different encodings; see the Python Library Reference at comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for :ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
encoding. the same encoding.
One-character Unicode strings can also be created with the :func:`chr` One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1 built-in function, which takes integers and returns a Unicode string of length 1
...@@ -290,13 +293,14 @@ returns the code point value:: ...@@ -290,13 +293,14 @@ returns the code point value::
Converting to Bytes Converting to Bytes
------------------- -------------------
Another important str method is ``.encode([encoding], [errors='strict'])``, The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a ``bytes`` representation of the Unicode string, encoded in the which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested encoding. The ``errors`` parameter is the same as the parameter of requested *encoding*. The *errors* parameter is the same as the parameter of
the :meth:`decode` method, with one additional possibility; as well as 'strict', the :meth:`~bytes.decode` method, with one additional possibility; as well as
'ignore', and 'replace' (which in this case inserts a question mark instead of ``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case inserts a
the unencodable character), you can also pass 'xmlcharrefreplace' which uses question mark instead of the unencodable character), you can also pass
XML's character references. The following example shows the different results:: ``'xmlcharrefreplace'`` which uses XML's character references.
The following example shows the different results::
>>> u = chr(40960) + 'abcd' + chr(1972) >>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8') >>> u.encode('utf-8')
...@@ -313,6 +317,8 @@ XML's character references. The following example shows the different results:: ...@@ -313,6 +317,8 @@ XML's character references. The following example shows the different results::
>>> u.encode('ascii', 'xmlcharrefreplace') >>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;' b'&#40960;abcd&#1972;'
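As a compact recap of the *errors* argument in both directions, here is a small sketch. The sample string ``"caf\u00e9"`` is an assumption made for the example; the damaged byte string is the one used in the decoding examples:

```python
text = "caf\u00e9"  # 'é' has no ASCII equivalent

# 'strict' is the default and raises on the first unencodable character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The other handlers trade correctness for a result.
ignored = text.encode("ascii", "ignore")              # drops the character
replaced = text.encode("ascii", "replace")            # substitutes '?'
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # XML character reference

# Decoding accepts the same handler names, except 'xmlcharrefreplace'.
damaged = b"\x80abc"  # 0x80 cannot start a UTF-8 sequence
decoded_replace = damaged.decode("utf-8", "replace")  # inserts U+FFFD
decoded_ignore = damaged.decode("utf-8", "ignore")    # drops the bad byte
```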
.. XXX mention the surrogate* error handlers
The low-level routines for registering and accessing the available encodings are The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module. However, the encoding and decoding functions found in the :mod:`codecs` module. However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm returned by this module are usually more low-level than is comfortable, so I'm
...@@ -365,14 +371,14 @@ they have no significance to Python but are a convention. Python looks for ...@@ -365,14 +371,14 @@ they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment. ``coding: name`` or ``coding=name`` in the comment.
If you don't include such a comment, the default encoding used will be UTF-8 as If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. already mentioned. See also :pep:`263` for more information.
Unicode Properties Unicode Properties
------------------ ------------------
The Unicode specification includes a database of information about code points. The Unicode specification includes a database of information about code points.
For each code point that's defined, the information includes the character's For each defined code point, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in four-fifths). There are also properties related to the code point's use in
...@@ -392,7 +398,9 @@ prints the numeric value of one particular character:: ...@@ -392,7 +398,9 @@ prints the numeric value of one particular character::
# Get numeric value of second character # Get numeric value of second character
print(unicodedata.numeric(u[1])) print(unicodedata.numeric(u[1]))
When run, this prints:: When run, this prints:
.. code-block:: none
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE 0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND 1 0bf2 No TAMIL NUMBER ONE THOUSAND
...@@ -413,7 +421,7 @@ list of category codes. ...@@ -413,7 +421,7 @@ list of category codes.
References References
---------- ----------
The ``str`` type is described in the Python library reference at The :class:`str` type is described in the Python library reference at
:ref:`typesseq`. :ref:`typesseq`.
The documentation for the :mod:`unicodedata` module. The documentation for the :mod:`unicodedata` module.
...@@ -443,16 +451,16 @@ columns and can return Unicode values from an SQL query. ...@@ -443,16 +451,16 @@ columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit byte string from it, and convert the string yourself: open a file, read an 8-bit bytes object from it, and convert the string
with ``str(bytes, encoding)``. However, the manual approach is not recommended. with ``bytes.decode(encoding)``. However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1K or 4K), you need to write error-handling code to catch the case chunks (say, 1k or 4k), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2Gb file, you need 2Gb of RAM. are extremely large; if you need to read a 2GB file, you need 2GB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded (More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.) string and its Unicode version in memory.)
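The chunk-boundary problem can be reproduced in a few lines. This is a sketch under stated assumptions: the sample text and the 4-byte chunk size are arbitrary, and the incremental decoder from the :mod:`codecs` module is shown as one way to handle a split sequence, not as the recommended way to read files:

```python
import codecs

# Ten bytes of UTF-8; the 'é' occupies bytes 3-4 and the '€' bytes 6-8.
data = "caf\u00e9 \u20ac5".encode("utf-8")

# Split into 4-byte chunks, which cuts both multi-byte characters in half.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

# Decoding a chunk on its own fails when it ends mid-character.
try:
    chunks[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder buffers the incomplete trailing bytes
# until the rest of the sequence arrives in the next chunk.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))  # flush; raises if bytes remain
result = "".join(pieces)

print(naive_ok, repr(result))
```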
...@@ -460,9 +468,9 @@ The solution would be to use the low-level decoding interface to catch the case ...@@ -460,9 +468,9 @@ The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been of partial coding sequences. The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as ``.read()`` and ``.write()``. This works through parameters for methods such as :meth:`read` and :meth:`write`. This works through
:func:`open`\'s *encoding* and *errors* parameters which are interpreted just :func:`open`\'s *encoding* and *errors* parameters which are interpreted just
like those in string objects' :meth:`encode` and :meth:`decode` methods. like those in :meth:`str.encode` and :meth:`bytes.decode`.
Reading Unicode from a file is therefore simple:: Reading Unicode from a file is therefore simple::
...@@ -478,7 +486,7 @@ writing:: ...@@ -478,7 +486,7 @@ writing::
f.seek(0) f.seek(0)
print(repr(f.readline()[:1])) print(repr(f.readline()[:1]))
The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be present at the start of a file; when such an encoding is used, the BOM will be
...@@ -520,12 +528,12 @@ Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unico ...@@ -520,12 +528,12 @@ Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unico
filenames. filenames.
Function :func:`os.listdir`, which returns filenames, raises an issue: should it return Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
the Unicode version of filenames, or should it return byte strings containing the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as a byte string or a Unicode string. If you pass a provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte encoding and a list of Unicode strings will be returned, while passing a byte
path will return the byte string versions of the filenames. For example, path will return the bytes versions of the filenames. For example,
assuming the default filesystem encoding is UTF-8, running the following assuming the default filesystem encoding is UTF-8, running the following
program:: program::
...@@ -559,13 +567,13 @@ Unicode. ...@@ -559,13 +567,13 @@ Unicode.
The most important tip is: The most important tip is:
Software should only work with Unicode strings internally, converting to a Software should only work with Unicode strings internally, decoding the input
particular encoding on output. data as soon as possible and encoding the output only at the end.
If you attempt to write processing functions that accept both Unicode and byte If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding if two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` is raised for this expression. you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
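A short sketch of that failure and of the explicit fix; the variable names and sample data are made up for the illustration:

```python
label = "price: "
raw = b"\xe2\x82\xac5"  # UTF-8 bytes, e.g. as read from a socket

# Mixing the two types is rejected outright; nothing is decoded implicitly.
try:
    combined = label + raw
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Decode at the boundary, work with str, encode again only on output.
combined = label + raw.decode("utf-8")
output = combined.encode("utf-8")
```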
When using data coming from a web browser or some other untrusted source, a When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the common technique is to check for illegal characters in a string before using the