Issue 19548: update codecs module documentation

- clarified the distinction between text encodings and other codecs
- clarified relationship with builtin open and the io module
- consolidated documentation of error handlers into one section
- clarified type constraints of some behaviours
- added tests for some of the new statements in the docs

Commit b9fdb7a4 authored Jan 07, 2015 by Nick Coghlan
9 changed files
--- a/Doc/glossary.rst
+++ b/Doc/glossary.rst
@@ -820,10 +820,13 @@ Glossary
      :meth:`~collections.somenamedtuple._asdict`. Examples of struct sequences
      include :data:`sys.float_info` and the return value of :func:`os.stat`.
+   text encoding
+      A codec which encodes Unicode strings to bytes.
   text file
      A :term:`file object` able to read and write :class:`str` objects.
      Often, a text file actually accesses a byte-oriented datastream
-      and handles the text encoding automatically.
+      and handles the :term:`text encoding` automatically.
      .. seealso::
         A :term:`binary file` reads and write :class:`bytes` objects.
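
The new glossary entry separates text encodings (str to bytes) from the other codecs the module ships. The sketch below is not part of the commit; it illustrates the distinction, assuming a Python 3.4+ interpreter where `str.encode()` rejects codecs that are not text encodings.

```python
import codecs

# A text encoding maps str to bytes (and back again when decoding).
utf8_bytes = "caf\u00e9".encode("utf-8")

# "rot_13" is a str-to-str codec, so it is not a text encoding.
rot13_text = codecs.encode("hello", "rot_13")

# "hex" is a bytes-to-bytes codec, also not a text encoding.
hex_bytes = codecs.encode(b"hi", "hex")

# str.encode() only accepts text encodings and rejects the others.
try:
    "hello".encode("rot_13")
    rejected = False
except LookupError:
    rejected = True

print(utf8_bytes, rot13_text, hex_bytes, rejected)
```

Codecs that are not text encodings remain reachable through `codecs.encode()` and `codecs.decode()`, which accept arbitrary codecs.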

--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
--- a/Doc/library/functions.rst
+++ b/Doc/library/functions.rst
@@ -939,15 +939,17 @@ are always available.  They are listed here in alphabetical order.
   *encoding* is the name of the encoding used to decode or encode the file.
   This should only be used in text mode.  The default encoding is platform
   dependent (whatever :func:`locale.getpreferredencoding` returns), but any
-   encoding supported by Python can be used.  See the :mod:`codecs` module for
+   :term:`text encoding` supported by Python
+   can be used.  See the :mod:`codecs` module for
   the list of supported encodings.
   *errors* is an optional string that specifies how encoding and decoding
   errors are to be handled--this cannot be used in binary mode.
-   A variety of standard error handlers are available, though any
+   A variety of standard error handlers are available
+   (listed under :ref:`error-handlers`), though any
   error handling name that has been registered with
   :func:`codecs.register_error` is also valid.  The standard names
-   are:
+   include:
   * ``'strict'`` to raise a :exc:`ValueError` exception if there is
     an encoding error.  The default value of ``None`` has the same
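
As an illustration of the *encoding* and *errors* parameters described in this hunk, here is a minimal sketch (not part of the commit) that reads a file containing a non-ASCII byte. The temporary file and the variable names are assumptions made for the example.

```python
import os
import tempfile

# Write raw bytes that are not valid ASCII (0xE9 is "é" in Latin-1).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as raw:
    raw.write(b"caf\xe9\n")

try:
    # errors="replace" substitutes U+FFFD for each undecodable byte.
    with open(path, "r", encoding="ascii", errors="replace") as f:
        replaced = f.read()

    # The default (errors=None) behaves like "strict" and raises.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True
finally:
    os.remove(path)

print(repr(replaced), strict_failed)
```

`UnicodeDecodeError` is a subclass of `ValueError`, which matches the wording of the `'strict'` bullet.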

--- a/Doc/library/stdtypes.rst
+++ b/Doc/library/stdtypes.rst
@@ -1512,7 +1512,7 @@ expression support in the :mod:`re` module).
   a :exc:`UnicodeError`. Other possible
   values are ``'ignore'``, ``'replace'``, ``'xmlcharrefreplace'``,
   ``'backslashreplace'`` and any other name registered via
-   :func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
+   :func:`codecs.register_error`, see section :ref:`error-handlers`. For a
   list of possible encodings, see section :ref:`standard-encodings`.
   .. versionchanged:: 3.1
@@ -2384,7 +2384,7 @@ arbitrary binary data.
   error handling scheme.  The default for *errors* is ``'strict'``, meaning
   that encoding errors raise a :exc:`UnicodeError`.  Other possible values are
   ``'ignore'``, ``'replace'`` and any other name registered via
-   :func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
+   :func:`codecs.register_error`, see section :ref:`error-handlers`. For a
   list of possible encodings, see section :ref:`standard-encodings`.
   .. note::
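
Both hunks above now point to the consolidated error-handlers section. The following sketch (not part of the commit) compares several standard handler names and registers one custom handler; the handler name `example_codepoint_marker` and its output format are hypothetical.

```python
import codecs

text = "caf\u00e9"

# Standard handlers applied to a character that ASCII cannot represent.
results = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

def codepoint_marker(exc):
    """Hypothetical handler: replace each unencodable character with <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end

codecs.register_error("example_codepoint_marker", codepoint_marker)
custom = text.encode("ascii", errors="example_codepoint_marker")

for name, value in results.items():
    print(name, value)
print("custom", custom)
```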

--- a/Doc/library/tarfile.rst
+++ b/Doc/library/tarfile.rst
@@ -794,7 +794,7 @@ metadata must be either decoded or encoded. If *encoding* is not set
 appropriately, this conversion may fail.
 The *errors* argument defines how characters are treated that cannot be
-converted. Possible values are listed in section :ref:`codec-base-classes`.
+converted. Possible values are listed in section :ref:`error-handlers`.
 The default scheme is ``'surrogateescape'`` which Python also uses for its
 file system calls, see :ref:`os-filenames`.
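
The `'surrogateescape'` scheme mentioned here can be illustrated without a tar archive. This is a minimal sketch (not part of the commit), using a made-up file name, of how undecodable bytes survive a round trip.

```python
# A byte string that is not valid UTF-8 (0xFF never appears in UTF-8).
raw_name = b"report-\xff.txt"

# Decoding maps the bad byte to a lone surrogate (U+DCFF) instead of failing.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte exactly.
restored = decoded.encode("utf-8", errors="surrogateescape")

print(ascii(decoded), restored == raw_name)
```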

--- a/Lib/codecs.py
+++ b/Lib/codecs.py
@@ -346,8 +346,7 @@ class StreamWriter(Codec):
        """ Creates a StreamWriter instance.
-            stream must be a file-like object open for writing
+            stream must be a file-like object open for writing.
-            (binary) data.
            The StreamWriter may use different error handling
            schemes by providing the errors keyword argument. These
@@ -421,8 +420,7 @@ class StreamReader(Codec):
        """ Creates a StreamReader instance.
-            stream must be a file-like object open for reading
+            stream must be a file-like object open for reading.
-            (binary) data.
            The StreamReader may use different error handling
            schemes by providing the errors keyword argument. These
@@ -450,13 +448,12 @@ class StreamReader(Codec):
        """ Decodes data from the stream self.stream and returns the
            resulting object.
-            chars indicates the number of characters to read from the
+            chars indicates the number of decoded code points or bytes to
-            stream. read() will never return more than chars
+            return. read() will never return more data than requested,
-            characters, but it might return less, if there are not enough
+            but it might return less, if there is not enough available.
-            characters available.
-            size indicates the approximate maximum number of bytes to
+            size indicates the approximate maximum number of decoded
-            read from the stream for decoding purposes. The decoder
+            bytes or code points to read for decoding. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible.  size
            is intended to prevent having to decode huge files in one
@@ -467,7 +464,7 @@ class StreamReader(Codec):
            will be returned, the rest of the input will be kept until the
            next call to read().
-            The method should use a greedy read strategy meaning that
+            The method should use a greedy read strategy, meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g.  if
            optional encoding endings or state markers are available
@@ -602,7 +599,7 @@ class StreamReader(Codec):
    def readlines(self, sizehint=None, keepends=True):
        """ Read all lines available on the input stream
-            and return them as list of lines.
+            and return them as a list.
            Line breaks are implemented using the codec's decoder
            method and are included in the list entries.
@@ -750,19 +747,18 @@ class StreamReaderWriter:
 class StreamRecoder:
-    """ StreamRecoder instances provide a frontend - backend
+    """ StreamRecoder instances translate data from one encoding to another.
-        view of encoding data.
        They use the complete set of APIs returned by the
        codecs.lookup() function to implement their task.
-        Data written to the stream is first decoded into an
+        Data written to the StreamRecoder is first decoded into an
-        intermediate format (which is dependent on the given codec
+        intermediate format (depending on the "decode" codec) and then
-        combination) and then written to the stream using an instance
+        written to the underlying stream using an instance of the provided
-        of the provided Writer class.
+        Writer class.
-        In the other direction, data is read from the stream using a
+        In the other direction, data is read from the underlying stream using
-        Reader instance and then return encoded data to the caller.
+        a Reader instance and then encoded and returned to the caller.
    """
    # Optional attributes set by the file wrappers below
@@ -774,22 +770,17 @@ class StreamRecoder:
        """ Creates a StreamRecoder instance which implements a two-way
            conversion: encode and decode work on the frontend (the
-            input to .read() and output of .write()) while
+            data visible to .read() and .write()) while Reader and Writer
-            Reader and Writer work on the backend (reading and
+            work on the backend (the data in stream).
-            writing to the stream).
-            You can use these objects to do transparent direct
+            You can use these objects to do transparent
-            recodings from e.g. latin-1 to utf-8 and back.
+            transcodings from e.g. latin-1 to utf-8 and back.
            stream must be a file-like object.
-            encode, decode must adhere to the Codec interface, Reader,
+            encode and decode must adhere to the Codec interface; Reader and
            Writer must be factory functions or classes providing the
-            StreamReader, StreamWriter interface resp.
+            StreamReader and StreamWriter interfaces resp.
-            encode and decode are needed for the frontend translation,
-            Reader and Writer for the backend translation. Unicode is
-            used as intermediate encoding.
            Error handling is done in the same way as defined for the
            StreamWriter/Readers.
@@ -864,7 +855,7 @@ class StreamRecoder:
 ### Shortcuts
-def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
+def open(filename, mode='r', encoding=None, errors='strict', buffering=1):
    """ Open an encoded file using the given mode and return
        a wrapped version providing transparent encoding/decoding.
@@ -874,10 +865,8 @@ def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
        codecs. Output is also codec dependent and will usually be
        Unicode as well.
-        Files are always opened in binary mode, even if no binary mode
+        Underlying encoded files are always opened in binary mode.
-        was specified. This is done to avoid data loss due to encodings
+        The default file mode is 'r', meaning to open the file in read mode.
-        using 8-bit values. The default file mode is 'rb' meaning to
-        open the file in binary read mode.
        encoding specifies the encoding which is to be used for the
        file.
@@ -913,13 +902,13 @@ def EncodedFile(file, data_encoding, file_encoding=None, errors='strict'):
    """ Return a wrapped version of file which provides transparent
        encoding translation.
-        Strings written to the wrapped file are interpreted according
+        Data written to the wrapped file is decoded according
-        to the given data_encoding and then written to the original
+        to the given data_encoding and then encoded to the underlying
-        file as string using file_encoding. The intermediate encoding
+        file using file_encoding. The intermediate data type
        will usually be Unicode but depends on the specified codecs.
-        Strings are read from the file using file_encoding and then
+        Bytes read from the file are decoded using file_encoding and then
-        passed back to the caller as string using data_encoding.
+        passed back to the caller encoded using data_encoding.
        If file_encoding is not given, it defaults to data_encoding.
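
The revised `EncodedFile` docstring describes bytes going in and bytes coming out, with the intermediate form depending on the codecs. A minimal sketch of both directions (not part of the commit), using in-memory streams as stand-ins for real files:

```python
import codecs
import io

# Reading: the underlying stream holds UTF-8, the caller sees Latin-1 bytes.
backend = io.BytesIO(b"\xc3\xbc")  # "ü" encoded as UTF-8
with codecs.EncodedFile(backend, "latin-1", "utf-8") as ef:
    read_back = ef.read()

# Writing: the caller supplies Latin-1 bytes, the stream receives UTF-8.
sink = io.BytesIO()
with codecs.EncodedFile(sink, "latin-1", "utf-8") as ef:
    ef.write(b"\xfc")
    # Inspect the stream before the with block closes it.
    written = sink.getvalue()

print(read_back, written, sink.closed)
```

Leaving the `with` block closes the underlying stream as well, so the written data is inspected inside the block.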

--- a/Lib/test/test_codecs.py
+++ b/Lib/test/test_codecs.py
@@ -1139,6 +1139,8 @@ class RecodingTest(unittest.TestCase):
        # Python used to crash on this at exit because of a refcount
        # bug in _codecsmodule.c
+        self.assertTrue(f.closed)
 # From RFC 3492
 punycode_testcases = [
    # A Arabic (Egyptian):
@@ -1591,6 +1593,16 @@ class IDNACodecTest(unittest.TestCase):
        self.assertEqual(encoder.encode("ample.org."), b"xn--xample-9ta.org.")
        self.assertEqual(encoder.encode("", True), b"")
+    def test_errors(self):
+        """Only supports "strict" error handler"""
+        "python.org".encode("idna", "strict")
+        b"python.org".decode("idna", "strict")
+        for errors in ("ignore", "replace", "backslashreplace",
+                "surrogateescape"):
+            self.assertRaises(Exception, "python.org".encode, "idna", errors)
+            self.assertRaises(Exception,
+                b"python.org".decode, "idna", errors)
 class CodecsModuleTest(unittest.TestCase):
    def test_decode(self):
@@ -1668,6 +1680,24 @@ class CodecsModuleTest(unittest.TestCase):
        for api in codecs.__all__:
            getattr(codecs, api)
+    def test_open(self):
+        self.addCleanup(support.unlink, support.TESTFN)
+        for mode in ('w', 'r', 'r+', 'w+', 'a', 'a+'):
+            with self.subTest(mode), \
+                    codecs.open(support.TESTFN, mode, 'ascii') as file:
+                self.assertIsInstance(file, codecs.StreamReaderWriter)
+    def test_undefined(self):
+        self.assertRaises(UnicodeError, codecs.encode, 'abc', 'undefined')
+        self.assertRaises(UnicodeError, codecs.decode, b'abc', 'undefined')
+        self.assertRaises(UnicodeError, codecs.encode, '', 'undefined')
+        self.assertRaises(UnicodeError, codecs.decode, b'', 'undefined')
+        for errors in ('strict', 'ignore', 'replace', 'backslashreplace'):
+            self.assertRaises(UnicodeError,
+                codecs.encode, 'abc', 'undefined', errors)
+            self.assertRaises(UnicodeError,
+                codecs.decode, b'abc', 'undefined', errors)
 class StreamReaderTest(unittest.TestCase):
    def setUp(self):
@@ -1801,13 +1831,10 @@ if hasattr(codecs, "mbcs_encode"):
 #    "undefined"
 # The following encodings don't work in stateful mode
-broken_unicode_with_streams = [
+broken_unicode_with_stateful = [
    "punycode",
    "unicode_internal"
 ]
-broken_incremental_coders = broken_unicode_with_streams + [
-    "idna",
-]
 class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
    def test_basics(self):
@@ -1827,7 +1854,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
                (chars, size) = codecs.getdecoder(encoding)(b)
                self.assertEqual(chars, s, "encoding=%r" % encoding)
-            if encoding not in broken_unicode_with_streams:
+            if encoding not in broken_unicode_with_stateful:
                # check stream reader/writer
                q = Queue(b"")
                writer = codecs.getwriter(encoding)(q)
@@ -1845,7 +1872,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
                    decodedresult += reader.read()
                self.assertEqual(decodedresult, s, "encoding=%r" % encoding)
-            if encoding not in broken_incremental_coders:
+            if encoding not in broken_unicode_with_stateful:
                # check incremental decoder/encoder and iterencode()/iterdecode()
                try:
                    encoder = codecs.getincrementalencoder(encoding)()
@@ -1894,7 +1921,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
        from _testcapi import codec_incrementalencoder, codec_incrementaldecoder
        s = "abc123"  # all codecs should be able to encode these
        for encoding in all_unicode_encodings:
-            if encoding not in broken_incremental_coders:
+            if encoding not in broken_unicode_with_stateful:
                # check incremental decoder/encoder (fetched via the C API)
                try:
                    cencoder = codec_incrementalencoder(encoding)
@@ -1934,7 +1961,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
        for encoding in all_unicode_encodings:
            if encoding == "idna": # FIXME: See SF bug #1163178
                continue
-            if encoding in broken_unicode_with_streams:
+            if encoding in broken_unicode_with_stateful:
                continue
            reader = codecs.getreader(encoding)(io.BytesIO(s.encode(encoding)))
            for t in range(5):
@@ -1967,7 +1994,7 @@ class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
        # Check that getstate() and setstate() handle the state properly
        u = "abc123"
        for encoding in all_unicode_encodings:
-            if encoding not in broken_incremental_coders:
+            if encoding not in broken_unicode_with_stateful:
                self.check_state_handling_decode(encoding, u, u.encode(encoding))
                self.check_state_handling_encode(encoding, u, u.encode(encoding))
@@ -2171,6 +2198,7 @@ class WithStmtTest(unittest.TestCase):
        f = io.BytesIO(b"\xc3\xbc")
        with codecs.EncodedFile(f, "latin-1", "utf-8") as ef:
            self.assertEqual(ef.read(), b"\xfc")
+        self.assertTrue(f.closed)
    def test_streamreaderwriter(self):
        f = io.BytesIO(b"\xc3\xbc")
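
The new `test_errors` and `test_undefined` cases pin down two behaviours: the `undefined` codec always fails, and `idna` accepts only the `'strict'` handler. A standalone sketch of the same checks outside `unittest` (the helper `raises` is hypothetical, not part of the commit):

```python
import codecs

def raises(exc_type, func, *args):
    """Hypothetical helper: report whether func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False

# "undefined" fails even for empty input and regardless of the handler.
undefined_rejects_empty = raises(UnicodeError, codecs.encode, "", "undefined")
undefined_ignores_handler = raises(
    UnicodeError, codecs.encode, "abc", "undefined", "ignore")

# "idna" works with "strict" and rejects any other handler name.
idna_strict = "python.org".encode("idna", "strict")
idna_rejects_replace = raises(Exception, "python.org".encode, "idna", "replace")

print(undefined_rejects_empty, undefined_ignores_handler,
      idna_strict, idna_rejects_replace)
```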

--- a/Misc/NEWS
+++ b/Misc/NEWS
@@ -265,6 +265,10 @@ IDLE
 Tests
 -----
+- Issue #19548: Added some additional checks to test_codecs to ensure that
+  statements in the updated documentation remain accurate. Patch by Martin
+  Panter.
 - Issue #22838: All test_re tests now work with unittest test discovery.
 - Issue #22173: Update lib2to3 tests to use unittest test discovery.
@@ -297,6 +301,10 @@ Build
 Documentation
 -------------
+- Issue #19548: Update the codecs module documentation to better cover the
+  distinction between text encodings and other codecs, together with other
+  clarifications. Patch by Martin Panter.
 - Issue #22914: Update the Python 2/3 porting HOWTO to describe a more automated
  approach.

--- a/Modules/_codecsmodule.c
+++ b/Modules/_codecsmodule.c
@@ -54,9 +54,9 @@ PyDoc_STRVAR(register__doc__,
 "register(search_function)\n\
 \n\
 Register a codec search function. Search functions are expected to take\n\
-one argument, the encoding name in all lower case letters, and return\n\
+one argument, the encoding name in all lower case letters, and either\n\
-a tuple of functions (encoder, decoder, stream_reader, stream_writer)\n\
+return None, or a tuple of functions (encoder, decoder, stream_reader,\n\
-(or a CodecInfo object).");
+stream_writer) (or a CodecInfo object).");
 static
 PyObject *codec_register(PyObject *self, PyObject *search_function)
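
The corrected docstring says a search function may return `None` when it does not recognise a name. A minimal Python-level sketch of that contract (not part of the commit); the codec name `exampleutf8` and the function `example_search` are hypothetical:

```python
import codecs

def example_search(name):
    """Hypothetical search function: one made-up alias for UTF-8."""
    if name == "exampleutf8":
        # Reuse the existing CodecInfo object for UTF-8.
        return codecs.lookup("utf-8")
    # Returning None tells the registry to try the next search function.
    return None

codecs.register(example_search)

info = codecs.lookup("exampleutf8")
encoded = "\u00e9".encode("exampleutf8")

# A name that no search function recognises ends in LookupError.
try:
    codecs.lookup("no_such_codec_example")
    unknown_found = True
except LookupError:
    unknown_found = False

print(info.name, encoded, unknown_found)
```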