Commit 836aa220 authored by Stefan Behnel

added section on source code encodings to string handling chapter

parent 0b0dd28c
@@ -21,6 +21,18 @@ original C string. It can be safely passed around in Python code, and
 will be garbage collected when the last reference to it goes out of
 scope.

+Note that the creation of the Python bytes string can fail with an
+exception, e.g. due to insufficient memory. If you need to ``free()``
+the string after the conversion, you should wrap the assignment in a
+try-finally construct::
+
+    cimport stdlib
+    cdef char* c_string = c_call_returning_a_c_string()
+    try:
+        py_string = c_string
+    finally:
+        stdlib.free(c_string)
+
 To convert the byte string back into a C ``char*``, use the opposite
 assignment::
@@ -40,11 +52,11 @@ management is required.
 Decoding bytes to text
 ----------------------

-The above way of passing and receiving C strings is as simple that
-that, as long as we only deal with binary data in the strings. When
-we deal with encoded text, however, it is best practice to decode the C byte
-strings to Python Unicode strings on reception, and to encode Python
-Unicode strings to C byte strings on the way out.
+The initially presented way of passing and receiving C strings is
+sufficient if your code only deals with binary data in the strings.
+When we deal with encoded text, however, it is best practice to decode
+the C byte strings to Python Unicode strings on reception, and to
+encode Python Unicode strings to C byte strings on the way out.

 With a Python byte string object, you would normally just call the
 ``.decode()`` method to decode it into a Unicode string::
@@ -74,7 +86,7 @@ number of bytes by slicing the C string::
     ustring = c_string[:length].decode('UTF-8')

 The same can be used when the string contains null bytes, e.g. when it
-uses an encoding like UCS-2, where each character is encoded as two
+uses an encoding like UCS-4, where each character is encoded in four
 bytes.
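
The effect of embedded null bytes can be sketched in plain Python 3, without Cython. This is only an illustration under stated assumptions: ``buffer`` and ``length`` are hypothetical stand-ins for a C ``char*`` and its separately known size, and ``UTF-32LE`` is used as a concrete four-bytes-per-character encoding.

```python
# Plain-Python sketch: `buffer` and `length` are hypothetical stand-ins
# for a C char* and the byte count reported alongside it.
text = "abc"
buffer = text.encode("UTF-32LE")   # four bytes per character
length = len(buffer)

# What a NUL-terminated scan would see: it stops at the first zero byte.
nul_terminated_view = buffer.split(b"\x00", 1)[0]

# Slicing by the known length keeps the whole payload, null bytes included.
decoded = buffer[:length].decode("UTF-32LE")

print(length)               # 12
print(nul_terminated_view)  # b'a'
print(decoded)              # abc
```

Treating the data as a zero-terminated string would cut it off after the first byte, which is why the slice with an explicit length is needed for such encodings.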
 It is common practice to wrap string conversions (and non-trivial type
@@ -117,15 +129,69 @@ a memory managed byte string::
 As noted before, this takes the pointer to the byte buffer of the
 Python byte string. Trying to do the same without keeping a reference
-to the intermediate byte string will fail with a compile error::
+to the Python byte string will fail with a compile error::

     # this will not compile !
     cdef char* c_string = py_unicode_string.encode('UTF-8')

 Here, the Cython compiler notices that the code takes a pointer to a
 temporary string result that will be garbage collected after the
-assignment. Later access to the invalidated pointer will most likely
-result in a crash. Cython will therefore refuse to compile this code.
+assignment. Later access to the invalidated pointer will read invalid
+memory and likely result in a segfault. Cython will therefore refuse
+to compile this code.
+
+Source code encoding
+--------------------
+
+When string literals appear in the code, the source code encoding is
+important. It determines the byte sequence that Cython will store in
+the C code for bytes literals, and the Unicode code points that Cython
+builds for unicode literals when parsing the byte encoded source file.
+
+Following `PEP 263`_, Cython supports the explicit declaration of
+source file encodings. For example, putting the following comment at
+the top of an ``ISO-8859-15`` (Latin-9) encoded source file (into the
+first or second line) is required to enable ``ISO-8859-15`` decoding
+in the parser::
+
+    # -*- coding: ISO-8859-15 -*-
+
+When no explicit encoding declaration is provided, the source code is
+parsed as UTF-8 encoded text, as specified by `PEP 3120`_. `UTF-8`_
+is a very common encoding that can represent the entire Unicode set of
+characters and is compatible with plain ASCII encoded text that it
+encodes efficiently. This makes it a very good choice for source code
+files which usually consist mostly of ASCII characters.
+
+.. _`PEP 263`: http://www.python.org/dev/peps/pep-0263/
+.. _`PEP 3120`: http://www.python.org/dev/peps/pep-3120/
+.. _`UTF-8`: http://en.wikipedia.org/wiki/UTF-8
+
+As an example, putting the following line into a UTF-8 encoded source
+file will print ``5``, as UTF-8 encodes the letter ``'ö'`` in the two
+byte sequence ``'\xc3\xb6'``::
+
+    print( len(b'abcö') )
+
+whereas the following ``ISO-8859-15`` encoded source file will print
+``4``, as the encoding uses only 1 byte for this letter::
+
+    # -*- coding: ISO-8859-15 -*-
+    print( len(b'abcö') )
+
+Note that the unicode literal ``u'abcö'`` is a correctly decoded four
+character Unicode string in both cases, whereas the unprefixed Python
+``str`` literal ``'abcö'`` will become a byte string in Python 2 (thus
+having length 4 or 5 in the examples above), and a 4 character Unicode
+string in Python 3. If you are not familiar with encodings, this may
+not appear obvious at first read. See `CEP 108`_ for details.
+
+As a rule of thumb, it is best to avoid unprefixed non-ASCII ``str``
+literals and to use unicode string literals for all text. Cython also
+supports the ``__future__`` import ``unicode_literals`` that instructs
+the parser to read all unprefixed ``str`` literals in a source file as
+unicode string literals.
+
+.. _`CEP 108`: http://wiki.cython.org/enhancements/stringliterals
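
The byte counts quoted in the source code encoding section can be reproduced in plain Python 3. This is a minimal sketch rather than Cython code: Python 3 rejects non-ASCII characters in bytes literals, so it encodes a text string explicitly instead of relying on the encoding of the source file. The variable names are illustrative only.

```python
# 'ö' is written as an escape so that the result does not depend on
# the encoding of the file this snippet is saved in.
text = "abc\u00f6"

utf8_bytes = text.encode("UTF-8")          # 'ö' becomes two bytes
latin9_bytes = text.encode("ISO-8859-15")  # 'ö' becomes one byte

print(len(text))          # 4
print(len(utf8_bytes))    # 5
print(utf8_bytes[3:])     # b'\xc3\xb6'
print(len(latin9_bytes))  # 4
```

The text string always has four characters. Only the encoded byte sequence changes length, which is the difference the two ``len(b'abcö')`` examples show for bytes literals.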
 Single bytes and characters
 ---------------------------
@@ -136,15 +202,18 @@ code point value, i.e. a single Unicode character. Since version
 0.13, Cython supports the latter natively, which is either defined as
 an unsigned 2-byte or 4-byte integer, or as ``wchar_t``, depending on
 the platform. The exact type is a compile time option in the build of
-the CPython interpreter.
+the CPython interpreter and extension modules inherit this definition
+at C compile time.

 In Cython, the ``char`` and ``Py_UNICODE`` types behave differently
 when coercing to Python objects. Similar to the behaviour of the
 bytes type in Python 3, the ``char`` type coerces to a Python integer
 value by default, so that the following prints 65 and not ``A``::

+    # -*- coding: ASCII -*-
+
     cdef char char_val = 'A'
-    assert char_val == 65 # 'A'
+    assert char_val == 65 # ASCII encoded byte value of 'A'
     print( char_val )

 If you want a Python bytes string instead, you have to request it
@@ -154,9 +223,9 @@ explicitly, and the following will print ``A`` (or ``b'A'`` in Python
     print( <bytes>char_val )

 The explicit coercion works for any C integer type. Values outside of
-the range of a ``char`` will raise an ``OverflowError``. Coercion
-will also happen automatically when assigning to a typed variable,
-e.g.::
+the range of a ``char`` or ``unsigned char`` will raise an
+``OverflowError``. Coercion will also happen automatically when
+assigning to a typed variable, e.g.::

     cdef bytes py_byte_string = char_val
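
The text compares the ``char`` coercion with the bytes type in Python 3. That Python-side behaviour can be checked without Cython, as in the sketch below. It only shows the pure-Python analogue, not the Cython coercion itself. Note that plain Python raises ``ValueError`` for an out-of-range value, whereas the text above describes an ``OverflowError`` for the Cython coercion.

```python
data = b"A"

# Indexing a bytes object yields an integer in Python 3.
int_value = data[0]

# Going back to a one-byte string has to be requested explicitly.
one_byte = bytes([int_value])

# Values that do not fit into a single byte are rejected.
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(int_value)              # 65
print(one_byte)               # b'A'
print(out_of_range_rejected)  # True
```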
@@ -188,6 +257,8 @@ Cython 0.13 supports efficient iteration over ``char*``, bytes and
 unicode strings, as long as the loop variable is appropriately typed.
 So the following will generate the expected C code::

+    # -*- coding: ASCII -*-
+
     cdef char* c_string = c_call_returning_a_c_string()
     cdef char c
......