Add markup to new section in codecs docs

Commit 131e4f71 authored Jan 23, 2006 by Georg Brandl

Showing with 42 additions and 41 deletions

Doc/lib/libcodecs.tex +42 -41

--- a/Doc/lib/libcodecs.tex
+++ b/Doc/lib/libcodecs.tex
@@ -525,9 +525,10 @@ all other methods and attribute from the underlying stream.
 \subsection{Encodings and Unicode\label{encodings-overview}}
 Unicode strings are stored internally as sequences of codepoints (to
-be precise as Py_UNICODE arrays). Depending on the way Python is
+be precise as \ctype{Py_UNICODE} arrays). Depending on the way Python is
-compiled (either via --enable-unicode=ucs2 or --enable-unicode=ucs4,
+compiled (either via \longprogramopt{enable-unicode=ucs2} or 
-with the former being the default) Py_UNICODE is either a 16-bit or
+\longprogramopt{enable-unicode=ucs4}, with the former being the default)
+\ctype{Py_UNICODE} is either a 16-bit or
 32-bit data type. Once a Unicode object is used outside of CPU and
 memory, CPU endianness and how these arrays are stored as bytes become
 an issue. Transforming a unicode object into a sequence of bytes is
@@ -535,20 +536,20 @@ called encoding and recreating the unicode object from the sequence of
 bytes is known as decoding. There are many different methods how this
 transformation can be done (these methods are also called encodings).
 The simplest method is to map the codepoints 0-255 to the bytes
-0x0-0xff. This means that a unicode object that contains codepoints
+\code{0x0}-\code{0xff}. This means that a unicode object that contains 
-above U+00FF can't be encoded with this method (which is called
+codepoints above \code{U+00FF} can't be encoded with this method (which 
-'latin-1' or 'iso-8859-1'). unicode.encode() will raise a
+is called \code{'latin-1'} or \code{'iso-8859-1'}). unicode.encode() will 
-UnicodeEncodeError that looks like this: UnicodeEncodeError: 'latin-1'
+raise a UnicodeEncodeError that looks like this: \samp{UnicodeEncodeError:
-codec can't encode character u'\u1234' in position 3: ordinal not in
+'latin-1' codec can't encode character u'\e u1234' in position 3: ordinal
-range(256)
+not in range(256)}.
 There's another group of encodings (the so called charmap encodings)
 that choose a different subset of all unicode code points and how
-these codepoints are mapped to the bytes 0x0-0xff. To see how this is
+these codepoints are mapped to the bytes \code{0x0}-\code{0xff}.
-done simply open e.g.  encodings/cp1252.py (which is an encoding that
+To see how this is done simply open e.g. \file{encodings/cp1252.py}
-is used primarily on Windows).  There's a string constant with 256
+(which is an encoding that is used primarily on Windows).
-characters that shows you which character is mapped to which byte
+There's a string constant with 256 characters that shows you which 
-value.
+character is mapped to which byte value.
 All of these encodings can only encode 256 of the 65536 (or 1114111)
 codepoints defined in unicode. A simple and straightforward way that
@@ -562,20 +563,20 @@ Bytes will always be in natural endianness. When these bytes are read
 by a CPU with a different endianness, then bytes have to be swapped
 though. To be able to detect the endianness of a UTF-16 byte sequence,
 there's the so called BOM (the "Byte Order Mark"). This is the Unicode
-character U+FEFF. This character will be prepended to every UTF-16
+character \code{U+FEFF}. This character will be prepended to every UTF-16
-byte sequence. The byte swapped version of this character (0xFFFE) is
+byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
 an illegal character that may not appear in a Unicode text. So when
-the first character in an UTF-16 byte sequence appears to be a U+FFFE
+the first character in an UTF-16 byte sequence appears to be a \code{U+FFFE}
 the bytes have to be swapped on decoding. Unfortunately upto Unicode
-4.0 the character U+FEFF had a second purpose as a "ZERO WIDTH
+4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
-NO-BREAK SPACE": A character that has no width and doesn't allow a
+NO-BREAK SPACE}: A character that has no width and doesn't allow a
 word to be split. It can e.g. be used to give hints to a ligature
-algorithm. With Unicode 4.0 using U+FEFF as a ZERO WIDTH NO-BREAK
+algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
-SPACE has been deprecated (with U+2060 (WORD JOINER) assuming this
+SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
-role). Nevertheless Unicode software still must be able to handle
+this role). Nevertheless Unicode software still must be able to handle
-U+FEFF in both roles: As a BOM it's a device to determine the storage
+\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
 layout of the encoded bytes, and vanishes once the byte sequence has
-been decoded into a Unicode string; as a ZERO WIDTH NO-BREAK SPACE
+been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
 it's a normal character that will be decoded like any other.
 There's another encoding that is able to encoding the full range of
@@ -588,20 +589,20 @@ bits) and payload bits. The marker bits are a sequence of zero to six
 character):
 \begin{tableii}{l|l}{textrm}{}{Range}{Encoding}
-\lineii{U-00000000 ... U-0000007F}{0xxxxxxx}
+\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
-\lineii{U-00000080 ... U-000007FF}{110xxxxx 10xxxxxx}
+\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
-\lineii{U-00000800 ... U-0000FFFF}{1110xxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
-\lineii{U-00010000 ... U-001FFFFF}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
-\lineii{U-00200000 ... U-03FFFFFF}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
-\lineii{U-04000000 ... U-7FFFFFFF}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
 \end{tableii}
 The least significant bit of the Unicode character is the rightmost x
 bit.
-As UTF-8 is an 8bit encoding no BOM is required and any U+FEFF
+As UTF-8 is an 8bit encoding no BOM is required and any \code{U+FEFF}
 character in the decoded Unicode string (even if it's the first
-character) is treated as a ZERO WIDTH NO-BREAK SPACE.
+character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.
 Without external information it's impossible to reliably determine
 which encoding was used for encoding a Unicode string. Each charmap
@@ -609,14 +610,14 @@ encoding can decode any random byte sequence. However that's not
 possible with UTF-8, as UTF-8 byte sequences have a structure that
 doesn't allow arbitrary byte sequence. To increase the reliability
 with which a UTF-8 encoding can be detected, Microsoft invented a
-variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad
+variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
 program: Before any of the Unicode characters is written to the file,
-a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef,
+a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
-0xbb, 0xbf) is written. As it's rather improbably that any charmap
+\code{0xbb}, \code{0xbf}) is written. As it's rather improbably that any
-encoded file starts with these byte values (which would e.g. map to
+charmap encoded file starts with these byte values (which would e.g. map to
-   LATIN SMALL LETTER I WITH DIAERESIS
+   LATIN SMALL LETTER I WITH DIAERESIS \\
-   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
+   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
   INVERTED QUESTION MARK
 in iso-8859-1), this increases the probability that a utf-8-sig
@@ -624,9 +625,9 @@ encoding can be correctly guessed from the byte sequence. So here the
 BOM is not used to be able to determine the byte order used for
 generating the byte sequence, but as a signature that helps in
 guessing the encoding. On encoding the utf-8-sig codec will write
-0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding
+\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
-utf-8-sig will skip those three bytes if they appear as the first
+On decoding utf-8-sig will skip those three bytes if they appear as the
-three bytes in the file.
+first three bytes in the file.
 \subsection{Standard Encodings\label{standard-encodings}}
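
Illustration (not part of the commit): the behaviour described in the section touched above can be checked with a short script. This is a minimal sketch that assumes a modern Python 3 interpreter, where str plays the role of the unicode type the text refers to. The sample string and the variable names are made up for the example, and the exact wording of the error message differs from the Python 2 message quoted in the documentation.

```python
import codecs

# Hypothetical sample text: three ASCII characters followed by U+1234.
text = "abc\u1234"

# latin-1 maps only codepoints 0-255, so U+1234 cannot be encoded.
try:
    text.encode("latin-1")
    latin1_error = None
except UnicodeEncodeError as exc:
    latin1_error = exc  # keep a reference; 'exc' is unbound after the block

# The plain "utf-16" codec prepends a BOM in the platform's byte order.
utf16 = "a".encode("utf-16")
bom = utf16[:2]

# Decoding with "utf-16" uses the BOM to pick the byte order and drops it.
utf16_roundtrip = utf16.decode("utf-16")

# Codecs with an explicit byte order write no BOM at all.
utf16_le = "a".encode("utf-16-le")

# U+1234 falls in the three-byte row of the UTF-8 table.
utf8 = "\u1234".encode("utf-8")

# utf-8-sig writes the UTF-8 encoded BOM (0xef, 0xbb, 0xbf) first ...
sig = "a".encode("utf-8-sig")

# ... and skips it again on decoding, while plain utf-8 keeps U+FEFF
# as an ordinary character at the start of the string.
sig_as_utf8_sig = sig.decode("utf-8-sig")
sig_as_utf8 = sig.decode("utf-8")

# A charmap codec such as latin-1 decodes the same bytes without error,
# which is why the signature only raises the odds of a correct guess.
sig_as_latin1 = sig.decode("latin-1")
```

Here latin1_error.start is 3 (the position of U+1234), and sig_as_latin1 begins with the three characters listed in the text: LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK and INVERTED QUESTION MARK.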