Commit 6722061c authored by Stefan Behnel

improve some Sphinx markup

parent a9963a76
...@@ -16,46 +16,46 @@ implicitly insert these encoding/decoding steps. ...@@ -16,46 +16,46 @@ implicitly insert these encoding/decoding steps.
Python string types in Cython code Python string types in Cython code
---------------------------------- ----------------------------------
Cython supports four Python string types: ``bytes``, ``str``, Cython supports four Python string types: :obj:`bytes`, :obj:`str`,
``unicode`` and ``basestring``. The ``bytes`` and ``unicode`` types :obj:`unicode` and :obj:`basestring`. The :obj:`bytes` and :obj:`unicode` types
are the specific types known from normal Python 2.x (named ``bytes`` are the specific types known from normal Python 2.x (named :obj:`bytes`
and ``str`` in Python 3). Additionally, Cython also supports the and :obj:`str` in Python 3). Additionally, Cython also supports the
``bytearray`` type starting with Python 2.6. It behaves like the :obj:`bytearray` type starting with Python 2.6. It behaves like the
``bytes`` type, except that it is mutable. :obj:`bytes` type, except that it is mutable.
The ``str`` type is special in that it is the byte string in Python 2 The :obj:`str` type is special in that it is the byte string in Python 2
and the Unicode string in Python 3 (for Cython code compiled with and the Unicode string in Python 3 (for Cython code compiled with
language level 2, i.e. the default). Meaning, it always corresponds language level 2, i.e. the default). Meaning, it always corresponds
exactly with the type that the Python runtime itself calls ``str``. exactly with the type that the Python runtime itself calls :obj:`str`.
Thus, in Python 2, both ``bytes`` and ``str`` represent the byte string Thus, in Python 2, both :obj:`bytes` and :obj:`str` represent the byte string
type, whereas in Python 3, both ``str`` and ``unicode`` represent the type, whereas in Python 3, both :obj:`str` and :obj:`unicode` represent the
Python Unicode string type. The switch is made at C compile time, the Python Unicode string type. The switch is made at C compile time, the
Python version that is used to run Cython is not relevant. Python version that is used to run Cython is not relevant.
When compiling Cython code with language level 3, the ``str`` type is When compiling Cython code with language level 3, the :obj:`str` type is
identified with exactly the Unicode string type at Cython compile time, identified with exactly the Unicode string type at Cython compile time,
i.e. it does not identify with ``bytes`` when running in Python 2. i.e. it does not identify with :obj:`bytes` when running in Python 2.
Note that the ``str`` type is not compatible with the ``unicode`` Note that the :obj:`str` type is not compatible with the :obj:`unicode`
type in Python 2, i.e. you cannot assign a Unicode string to a variable type in Python 2, i.e. you cannot assign a Unicode string to a variable
or argument that is typed ``str``. The attempt will result in either or argument that is typed :obj:`str`. The attempt will result in either
a compile time error (if detectable) or a ``TypeError`` exception at a compile time error (if detectable) or a :obj:`TypeError` exception at
runtime. You should therefore be careful when you statically type a runtime. You should therefore be careful when you statically type a
string variable in code that must be compatible with Python 2, as this string variable in code that must be compatible with Python 2, as this
Python version allows a mix of byte strings and unicode strings for data Python version allows a mix of byte strings and unicode strings for data
and users normally expect code to be able to work with both. Code that and users normally expect code to be able to work with both. Code that
only targets Python 3 can safely type variables and arguments as either only targets Python 3 can safely type variables and arguments as either
``bytes`` or ``unicode``. :obj:`bytes` or :obj:`unicode`.
The ``basestring`` type represents both the types ``str`` and ``unicode``, The :obj:`basestring` type represents both the types :obj:`str` and :obj:`unicode`,
i.e. all Python text string types in Python 2 and Python 3. This can be i.e. all Python text string types in Python 2 and Python 3. This can be
used for typing text variables that normally contain Unicode text (at used for typing text variables that normally contain Unicode text (at
least in Python 3) but must additionally accept the ``str`` type in least in Python 3) but must additionally accept the :obj:`str` type in
Python 2 for backwards compatibility reasons. It is not compatible with Python 2 for backwards compatibility reasons. It is not compatible with
the ``bytes`` type. Its usage should be rare in normal Cython code as the :obj:`bytes` type. Its usage should be rare in normal Cython code as
the generic ``object`` type (i.e. untyped code) will normally be good the generic :obj:`object` type (i.e. untyped code) will normally be good
enough and has the additional advantage of supporting the assignment of enough and has the additional advantage of supporting the assignment of
string subtypes. Support for the ``basestring`` type is new in Cython string subtypes. Support for the :obj:`basestring` type is new in Cython
0.20. 0.20.
...@@ -100,7 +100,7 @@ Python variable:: ...@@ -100,7 +100,7 @@ Python variable::
cdef char* c_string = c_call_returning_a_c_string() cdef char* c_string = c_call_returning_a_c_string()
cdef bytes py_string = c_string cdef bytes py_string = c_string
A type cast to ``object`` or ``bytes`` will do the same thing:: A type cast to :obj:`object` or :obj:`bytes` will do the same thing::
py_string = <bytes> c_string py_string = <bytes> c_string
...@@ -163,8 +163,8 @@ however, when the C function stores the pointer for later use. Apart ...@@ -163,8 +163,8 @@ however, when the C function stores the pointer for later use. Apart
from keeping a Python reference to the string object, no manual memory from keeping a Python reference to the string object, no manual memory
management is required. management is required.
Starting with Cython 0.20, the ``bytearray`` type is supported and Starting with Cython 0.20, the :obj:`bytearray` type is supported and
coerces in the same way as the ``bytes`` type. However, when using it coerces in the same way as the :obj:`bytes` type. However, when using it
in a C context, special care must be taken not to grow or shrink the in a C context, special care must be taken not to grow or shrink the
object buffer after converting it to a C string pointer. These object buffer after converting it to a C string pointer. These
modifications can change the internal buffer address, which will make modifications can change the internal buffer address, which will make
...@@ -224,6 +224,7 @@ In Cython 0.18, these standard declarations have been changed to ...@@ -224,6 +224,7 @@ In Cython 0.18, these standard declarations have been changed to
use the correct ``const`` modifier, so your code will automatically use the correct ``const`` modifier, so your code will automatically
benefit from the new ``const`` support if it uses them. benefit from the new ``const`` support if it uses them.
Decoding bytes to text Decoding bytes to text
---------------------- ----------------------
...@@ -234,7 +235,7 @@ the C byte strings to Python Unicode strings on reception, and to ...@@ -234,7 +235,7 @@ the C byte strings to Python Unicode strings on reception, and to
encode Python Unicode strings to C byte strings on the way out. encode Python Unicode strings to C byte strings on the way out.
With a Python byte string object, you would normally just call the With a Python byte string object, you would normally just call the
``.decode()`` method to decode it into a Unicode string:: ``bytes.decode()`` method to decode it into a Unicode string::
ustring = byte_string.decode('UTF-8') ustring = byte_string.decode('UTF-8')
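As a minimal sketch of the decoding step in plain Python 3 (not Cython-specific; the sample bytes are an assumption chosen for illustration), decoding and its inverse might look like this:

```python
# Illustrative input: the UTF-8 encoding of the text "abcö".
byte_string = b'abc\xc3\xb6'

# Decode the byte string into a Unicode string.
ustring = byte_string.decode('UTF-8')

# Encoding with the same codec gives back the original bytes.
round_trip = ustring.encode('UTF-8')

print(len(byte_string), len(ustring))
```

The byte string is one item longer than the decoded text because the ``ö`` occupies two bytes in UTF-8.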
...@@ -318,6 +319,7 @@ assignment. Later access to the invalidated pointer will read invalid ...@@ -318,6 +319,7 @@ assignment. Later access to the invalidated pointer will read invalid
memory and likely result in a segfault. Cython will therefore refuse memory and likely result in a segfault. Cython will therefore refuse
to compile this code. to compile this code.
C++ strings C++ strings
----------- -----------
...@@ -375,7 +377,7 @@ There are two use cases where this is inconvenient. First, if all ...@@ -375,7 +377,7 @@ There are two use cases where this is inconvenient. First, if all
C strings that are being processed (or the large majority) contain C strings that are being processed (or the large majority) contain
text, automatic encoding and decoding from and to Python unicode text, automatic encoding and decoding from and to Python unicode
objects can reduce the code overhead a little. In this case, you objects can reduce the code overhead a little. In this case, you
can set the ``c_string_type`` directive in your module to ``unicode`` can set the ``c_string_type`` directive in your module to :obj:`unicode`
and the ``c_string_encoding`` to the encoding that your C code uses, and the ``c_string_encoding`` to the encoding that your C code uses,
for example:: for example::
...@@ -393,7 +395,7 @@ The second use case is when all C strings that are being processed ...@@ -393,7 +395,7 @@ The second use case is when all C strings that are being processed
only contain ASCII encodable characters (e.g. numbers) and you want only contain ASCII encodable characters (e.g. numbers) and you want
your code to use the native legacy string type in Python 2 for them, your code to use the native legacy string type in Python 2 for them,
instead of always using Unicode. In this case, you can set the instead of always using Unicode. In this case, you can set the
string type to ``str``:: string type to :obj:`str`::
# cython: c_string_type=str, c_string_encoding=ascii # cython: c_string_type=str, c_string_encoding=ascii
...@@ -472,15 +474,15 @@ whereas the following ``ISO-8859-15`` encoded source file will print ...@@ -472,15 +474,15 @@ whereas the following ``ISO-8859-15`` encoded source file will print
Note that the unicode literal ``u'abcö'`` is a correctly decoded four Note that the unicode literal ``u'abcö'`` is a correctly decoded four
character Unicode string in both cases, whereas the unprefixed Python character Unicode string in both cases, whereas the unprefixed Python
``str`` literal ``'abcö'`` will become a byte string in Python 2 (thus :obj:`str` literal ``'abcö'`` will become a byte string in Python 2 (thus
having length 4 or 5 in the examples above), and a 4 character Unicode having length 4 or 5 in the examples above), and a 4 character Unicode
string in Python 3. If you are not familiar with encodings, this may string in Python 3. If you are not familiar with encodings, this may
not appear obvious at first read. See `CEP 108`_ for details. not appear obvious at first read. See `CEP 108`_ for details.
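The "length 4 or 5" remark can be checked with a short plain-Python sketch. It does not compile any Cython source; it only encodes the same text with the two encodings mentioned above, which is assumed to be equivalent to what the byte string literal would contain in each source file:

```python
text = u'abc\u00f6'  # the four-character text "abcö"

# ISO-8859-15 uses one byte per character here,
# while UTF-8 needs two bytes for the "ö".
latin_bytes = text.encode('ISO-8859-15')
utf8_bytes = text.encode('UTF-8')

print(len(text), len(latin_bytes), len(utf8_bytes))
```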
As a rule of thumb, it is best to avoid unprefixed non-ASCII ``str`` As a rule of thumb, it is best to avoid unprefixed non-ASCII :obj:`str`
literals and to use unicode string literals for all text. Cython also literals and to use unicode string literals for all text. Cython also
supports the ``__future__`` import ``unicode_literals`` that instructs supports the ``__future__`` import ``unicode_literals`` that instructs
the parser to read all unprefixed ``str`` literals in a source file as the parser to read all unprefixed :obj:`str` literals in a source file as
unicode string literals, just like Python 3. unicode string literals, just like Python 3.
.. _`CEP 108`: http://wiki.cython.org/enhancements/stringliterals .. _`CEP 108`: http://wiki.cython.org/enhancements/stringliterals
...@@ -522,7 +524,7 @@ explicitly, and the following will print ``A`` (or ``b'A'`` in Python ...@@ -522,7 +524,7 @@ explicitly, and the following will print ``A`` (or ``b'A'`` in Python
The explicit coercion works for any C integer type. Values outside of The explicit coercion works for any C integer type. Values outside of
the range of a :c:type:`char` or :c:type:`unsigned char` will raise an the range of a :c:type:`char` or :c:type:`unsigned char` will raise an
``OverflowError`` at runtime. Coercion will also happen automatically :obj:`OverflowError` at runtime. Coercion will also happen automatically
when assigning to a typed variable, e.g.:: when assigning to a typed variable, e.g.::
cdef bytes py_byte_string cdef bytes py_byte_string
...@@ -544,10 +546,10 @@ The following will print 65:: ...@@ -544,10 +546,10 @@ The following will print 65::
cdef Py_UCS4 uchar_val = u'A' cdef Py_UCS4 uchar_val = u'A'
print( <long>uchar_val ) print( <long>uchar_val )
Note that casting to a C ``long`` (or ``unsigned long``) will work Note that casting to a C :c:type:`long` (or :c:type:`unsigned long`) will work
just fine, as the maximum code point value that a Unicode character just fine, as the maximum code point value that a Unicode character
can have is 1114111 (``0x10FFFF``). On platforms with 32bit or more, can have is 1114111 (``0x10FFFF``). On platforms with 32bit or more,
``int`` is just as good. :c:type:`int` is just as good.
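A rough plain-Python check of the numbers quoted above, assuming a Python 3.3+ interpreter (where ``sys.maxunicode`` is always ``0x10FFFF``); the constant name is made up for this sketch:

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # 1114111, the highest Unicode code point

# ord() returns the code point of a one-character string.
code_point = ord(u'A')

# Every code point fits into 21 bits, and therefore into a 32-bit C int.
bits_needed = MAX_CODE_POINT.bit_length()

print(code_point, bits_needed, sys.maxunicode == MAX_CODE_POINT)
```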
Narrow Unicode builds Narrow Unicode builds
...@@ -682,15 +684,15 @@ zero-terminated UTF-16 encoded :c:type:`wchar_t*` strings, so called ...@@ -682,15 +684,15 @@ zero-terminated UTF-16 encoded :c:type:`wchar_t*` strings, so called
"wide strings". "wide strings".
By default, Windows builds of CPython define :c:type:`Py_UNICODE` as By default, Windows builds of CPython define :c:type:`Py_UNICODE` as
a synonym for :c:type:`wchar_t`. This makes internal ``unicode`` a synonym for :c:type:`wchar_t`. This makes internal :obj:`unicode`
representation compatible with UTF-16 and allows for efficient zero-copy representation compatible with UTF-16 and allows for efficient zero-copy
conversions. This also means that Windows builds are always conversions. This also means that Windows builds are always
`Narrow Unicode builds`_ with all the caveats. `Narrow Unicode builds`_ with all the caveats.
To aid interoperation with Windows APIs, Cython 0.19 supports wide To aid interoperation with Windows APIs, Cython 0.19 supports wide
strings (in the form of :c:type:`Py_UNICODE*`) and implicitly converts strings (in the form of :c:type:`Py_UNICODE*`) and implicitly converts
them to and from ``unicode`` string objects. These conversions behave the them to and from :obj:`unicode` string objects. These conversions behave the
same way as they do for :c:type:`char*` and ``bytes`` as described in same way as they do for :c:type:`char*` and :obj:`bytes` as described in
`Passing byte strings`_. `Passing byte strings`_.
In addition to automatic conversion, unicode literals that appear In addition to automatic conversion, unicode literals that appear
...@@ -722,7 +724,7 @@ Here is an example of how one would call a Unicode API on Windows:: ...@@ -722,7 +724,7 @@ Here is an example of how one would call a Unicode API on Windows::
APIs deprecated and inefficient. APIs deprecated and inefficient.
One consequence of CPython 3.3 changes is that :py:func:`len` of One consequence of CPython 3.3 changes is that :py:func:`len` of
``unicode`` strings is always measured in *code points* ("characters"), :obj:`unicode` strings is always measured in *code points* ("characters"),
while Windows API expect the number of UTF-16 *code units* while Windows API expect the number of UTF-16 *code units*
(where each surrogate is counted individually). To always get the number (where each surrogate is counted individually). To always get the number
of code units, call :c:func:`PyUnicode_GetSize` directly. of code units, call :c:func:`PyUnicode_GetSize` directly.
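The difference between code points and UTF-16 code units can be illustrated in plain Python 3.3+. This is only a sketch: the helper ``utf16_code_units`` is hypothetical and stands in for the C-API call, it is not part of Cython or CPython:

```python
def utf16_code_units(text):
    """Count UTF-16 code units; each surrogate counts individually."""
    # 'utf-16-le' emits no byte order mark; every code unit is two bytes.
    return len(text.encode('utf-16-le')) // 2


sample = u'a\U0001F600'  # one BMP character plus one non-BMP character
print(len(sample), utf16_code_units(sample))
```

The non-BMP character counts as one code point for ``len()`` but as two code units (a surrogate pair) in UTF-16, which is the count that Windows APIs expect.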