Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Support
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
C
cython
Project overview
Project overview
Details
Activity
Releases
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Issues
0
Issues
0
List
Boards
Labels
Milestones
Merge Requests
0
Merge Requests
0
Analytics
Analytics
Repository
Value Stream
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Create a new issue
Commits
Issue Boards
Open sidebar
Boxiang Sun
cython
Commits
836aa220
Commit
836aa220
authored
May 02, 2010
by
Stefan Behnel
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
added section on source code encodings to string handling chapter
parent
0b0dd28c
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
85 additions
and
14 deletions
+85
-14
src/tutorial/strings.rst
src/tutorial/strings.rst
+85
-14
No files found.
src/tutorial/strings.rst
View file @
836aa220
...
...
@@ -21,6 +21,18 @@ original C string. It can be safely passed around in Python code, and
will be garbage collected when the last reference to it goes out of
scope.
Note that the creation of the Python bytes string can fail with an
exception, e.g. due to insufficient memory. If you need to ``free()``
the string after the conversion, you should wrap the assignment in a
try-finally construct::
cimport stdlib
cdef char* c_string = c_call_returning_a_c_string()
try:
py_string = c_string
finally:
stdlib.free(c_string)
To convert the byte string back into a C ``char*``, use the opposite
assignment::
...
...
@@ -40,11 +52,11 @@ management is required.
Decoding bytes to text
----------------------
The
above way of passing and receiving C strings is as simple that
that, as long as we only deal with binary data in the strings. When
we deal with encoded text, however, it is best practice to decode the C byt
e
strings to Python Unicode strings on reception, and to encode Python
Unicode strings to C byte strings on the way out.
The
initially presented way of passing and receiving C strings is
sufficient if your code only deals with binary data in the strings.
When we deal with encoded text, however, it is best practice to decod
e
the C byte strings to Python Unicode strings on reception, and to
encode Python
Unicode strings to C byte strings on the way out.
With a Python byte string object, you would normally just call the
``.decode()`` method to decode it into a Unicode string::
...
...
@@ -74,7 +86,7 @@ number of bytes by slicing the C string::
ustring = c_string[:length].decode('UTF-8')
The same can be used when the string contains null bytes, e.g. when it
uses an encoding like UCS-
2, where each character is encoded as two
uses an encoding like UCS-
4, where each character is encoded in four
bytes.
It is common practice to wrap string conversions (and non-trivial type
...
...
@@ -117,15 +129,69 @@ a memory managed byte string::
As noted before, this takes the pointer to the byte buffer of the
Python byte string. Trying to do the same without keeping a reference
to the
intermediate
byte string will fail with a compile error::
to the
Python
byte string will fail with a compile error::
# this will not compile !
cdef char* c_string = py_unicode_string.encode('UTF-8')
Here, the Cython compiler notices that the code takes a pointer to a
temporary string result that will be garbage collected after the
assignment. Later access to the invalidated pointer will most likely
result in a crash. Cython will therefore refuse to compile this code.
assignment. Later access to the invalidated pointer will read invalid
memory and likely result in a segfault. Cython will therefore refuse
to compile this code.
Source code encoding
--------------------
When string literals appear in the code, the source code encoding is
important. It determines the byte sequence that Cython will store in
the C code for bytes literals, and the Unicode code points that Cython
builds for unicode literals when parsing the byte encoded source file.
Following `PEP 263`_, Cython supports the explicit declaration of
source file encodings. For example, putting the following comment at
the top of an ``ISO-8859-15`` (Latin-9) encoded source file (into the
first or second line) is required to enable ``ISO-8859-15`` decoding
in the parser::
# -*- coding: ISO-8859-15 -*-
When no explicit encoding declaration is provided, the source code is
parsed as UTF-8 encoded text, as specified by `PEP 3120`_. `UTF-8`_
is a very common encoding that can represent the entire Unicode set of
characters and is compatible with plain ASCII encoded text that it
encodes efficiently. This makes it a very good choice for source code
files which usually consist mostly of ASCII characters.
.. _`PEP 263`: http://www.python.org/dev/peps/pep-0263/
.. _`PEP 3120`: http://www.python.org/dev/peps/pep-3120/
.. _`UTF-8`: http://en.wikipedia.org/wiki/UTF-8
As an example, putting the following line into a UTF-8 encoded source
file will print ``5``, as UTF-8 encodes the letter ``'ö'`` in the two
byte sequence ``'\xc3\xb6'``::
print( len(b'abcö') )
whereas the following ``ISO-8859-15`` encoded source file will print
``4``, as the encoding uses only 1 byte for this letter::
# -*- coding: ISO-8859-15 -*-
print( len(b'abcö') )
Note that the unicode literal ``u'abcö'`` is a correctly decoded four
character Unicode string in both cases, whereas the unprefixed Python
``str`` literal ``'abcö'`` will become a byte string in Python 2 (thus
having length 4 or 5 in the examples above), and a 4 character Unicode
string in Python 3. If you are not familiar with encodings, this may
not appear obvious at first read. See `CEP 108`_ for details.
As a rule of thumb, it is best to avoid unprefixed non-ASCII ``str``
literals and to use unicode string literals for all text. Cython also
supports the ``__future__`` import ``unicode_literals`` that instructs
the parser to read all unprefixed ``str`` literals in a source file as
unicode string literals.
.. _`CEP 108`: http://wiki.cython.org/enhancements/stringliterals
Single bytes and characters
---------------------------
...
...
@@ -136,15 +202,18 @@ code point value, i.e. a single Unicode character. Since version
0.13, Cython supports the latter natively, which is either defined as
an unsigned 2-byte or 4-byte integer, or as ``wchar_t``, depending on
the platform. The exact type is a compile time option in the build of
the CPython interpreter.
the CPython interpreter and extension modules inherit this definition
at C compile time.
In Cython, the ``char`` and ``Py_UNICODE`` types behave differently
when coercing to Python objects. Similar to the behaviour of the
bytes type in Python 3, the ``char`` type coerces to a Python integer
value by default, so that the following prints 65 and not ``A``::
# -*- coding: ASCII -*-
cdef char char_val = 'A'
assert char_val == 65 # 'A'
assert char_val == 65 #
ASCII encoded byte value of
'A'
print( char_val )
If you want a Python bytes string instead, you have to request it
...
...
@@ -154,9 +223,9 @@ explicitly, and the following will print ``A`` (or ``b'A'`` in Python
print( <bytes>char_val )
The explicit coercion works for any C integer type. Values outside of
the range of a ``char``
will raise an ``OverflowError``. Coercio
n
will also happen automatically when assigning to a typed variable,
e.g.::
the range of a ``char``
or ``unsigned char`` will raise a
n
``OverflowError``. Coercion will also happen automatically when
assigning to a typed variable,
e.g.::
cdef bytes py_byte_string = char_val
...
...
@@ -188,6 +257,8 @@ Cython 0.13 supports efficient iteration over ``char*``, bytes and
unicode strings, as long as the loop variable is appropriately typed.
So the following will generate the expected C code::
# -*- coding: ASCII -*-
cdef char* c_string = c_call_returning_a_c_string()
cdef char c
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment