#18891: Complete new provisional email API.

This adds EmailMessage and MIMEPart, subclasses of Message with new API methods, and a ContentManager class used by the new methods. It also adds a new policy setting, content_manager. The patch was reviewed by Stephen J. Turnbull and Serhiy Storchaka and reflects their feedback. Ideally, I will add some examples of using the new API to the documentation before the final release.

Commit e61e3fda authored Oct 16, 2013 by R David Murray
15 changed files
--- a/Doc/library/email.contentmanager.rst
+++ b/Doc/library/email.contentmanager.rst
--- a/Doc/library/email.message.rst
+++ b/Doc/library/email.message.rst
@@ -33,10 +33,11 @@ Here are the methods of the :class:`Message` class:

 .. class:: Message(policy=compat32)

-   The *policy* argument determiens the :mod:`~email.policy` that will be used
-   to update the message model.  The default value, :class:`compat32
-   <email.policy.Compat32>` maintains backward compatibility with the
-   Python 3.2 version of the email package.  For more information see the
+   If *policy* is specified (it must be an instance of a :mod:`~email.policy`
+   class) use the rules it specifies to update and serialize the representation
+   of the message.  If *policy* is not set, use the :class:`compat32
+   <email.policy.Compat32>` policy, which maintains backward compatibility with
+   the Python 3.2 version of the email package.  For more information see the
   :mod:`~email.policy` documentation.

   .. versionchanged:: 3.3 The *policy* keyword argument was added.
@@ -465,7 +466,8 @@ Here are the methods of the :class:`Message` class:
      to ``False``.


-   .. method:: set_param(param, value, header='Content-Type', requote=True, charset=None, language='')
+   .. method:: set_param(param, value, header='Content-Type', requote=True,
+                         charset=None, language='', replace=False)

      Set a parameter in the :mailheader:`Content-Type` header.  If the
      parameter already exists in the header, its value will be replaced with
@@ -482,6 +484,12 @@ Here are the methods of the :class:`Message` class:
      language, defaulting to the empty string.  Both *charset* and *language*
      should be strings.

+      If *replace* is ``False`` (the default) the header is moved to the
+      end of the list of headers.  If *replace* is ``True``, the header
+      will be updated in place.
+
+      .. versionchanged:: 3.4 ``replace`` keyword was added.
+

   .. method:: del_param(param, header='content-type', requote=True)
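
The effect of the new *replace* keyword on header order can be sketched as follows. This is an illustrative example, not part of the patch; it assumes Python 3.4 or later, and the helper name `header_order` is hypothetical.

```python
from email.message import Message


def header_order(replace):
    """Return the header names after adding a charset parameter."""
    msg = Message()
    msg['Content-Type'] = 'text/plain'
    msg['Subject'] = 'demo'
    # Adding a parameter changes the Content-Type value, so the header is
    # either rewritten in place (replace=True) or deleted and re-added at
    # the end of the header list (replace=False, the default).
    msg.set_param('charset', 'utf-8', replace=replace)
    return msg.keys()


moved = header_order(replace=False)
in_place = header_order(replace=True)
print(moved)
print(in_place)
```

With the default, Content-Type ends up after Subject; with `replace=True` it keeps its original position.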


--- a/Doc/library/email.policy.rst
+++ b/Doc/library/email.policy.rst
@@ -371,7 +371,7 @@ added matters.  To illustrate::
   to) :rfc:`5322`, :rfc:`2047`, and the current MIME RFCs.

   This policy adds new header parsing and folding algorithms.  Instead of
-   simple strings, headers are custom objects with custom attributes depending
+   simple strings, headers are ``str`` subclasses with attributes that depend
   on the type of the field.  The parsing and folding algorithm fully implement
   :rfc:`2047` and :rfc:`5322`.

@@ -408,6 +408,20 @@ added matters.  To illustrate::
      fields are treated as unstructured.  This list will be completed before
      the extension is marked stable.)

+   .. attribute:: content_manager
+
+      An object with at least two methods: get_content and set_content.  When
+      the :meth:`~email.message.Message.get_content` or
+      :meth:`~email.message.Message.set_content` method of a
+      :class:`~email.message.Message` object is called, it calls the
+      corresponding method of this object, passing it the message object as its
+      first argument, and any arguments or keywords that were passed to it as
+      additional arguments.  By default ``content_manager`` is set to
+      :data:`~email.contentmanager.raw_data_manager`.
+
+      .. versionadded:: 3.4
+
+
   The class provides the following concrete implementations of the abstract
   methods of :class:`Policy`:

@@ -427,7 +441,7 @@ added matters.  To illustrate::
      The name is returned unchanged.  If the input value has a ``name``
      attribute and it matches *name* ignoring case, the value is returned
      unchanged.  Otherwise the *name* and *value* are passed to
-      ``header_factory``, and the resulting custom header object is returned as
+      ``header_factory``, and the resulting header object is returned as
      the value.  In this case a ``ValueError`` is raised if the input value
      contains CR or LF characters.

@@ -435,7 +449,7 @@ added matters.  To illustrate::

      If the value has a ``name`` attribute, it is returned to unmodified.
      Otherwise the *name*, and the *value* with any CR or LF characters
-      removed, are passed to the ``header_factory``, and the resulting custom
+      removed, are passed to the ``header_factory``, and the resulting
      header object is returned.  Any surrogateescaped bytes get turned into
      the unicode unknown-character glyph.

@@ -445,9 +459,9 @@ added matters.  To illustrate::
      A value is considered to be a 'source value' if and only if it does not
      have a ``name`` attribute (having a ``name`` attribute means it is a
      header object of some sort).  If a source value needs to be refolded
-      according to the policy, it is converted into a custom header object by
+      according to the policy, it is converted into a header object by
      passing the *name* and the *value* with any CR and LF characters removed
-      to the ``header_factory``.  Folding of a custom header object is done by
+      to the ``header_factory``.  Folding of a header object is done by
      calling its ``fold`` method with the current policy.

      Source values are split into lines using :meth:`~str.splitlines`.  If
@@ -502,23 +516,23 @@ With all of these :class:`EmailPolicies <.EmailPolicy>`, the effective API of
 the email package is changed from the Python 3.2 API in the following ways:

   * Setting a header on a :class:`~email.message.Message` results in that
-     header being parsed and a custom header object created.
+     header being parsed and a header object created.

   * Fetching a header value from a :class:`~email.message.Message` results
-     in that header being parsed and a custom header object created and
+     in that header being parsed and a header object created and
     returned.

-   * Any custom header object, or any header that is refolded due to the
+   * Any header object, or any header that is refolded due to the
     policy settings, is folded using an algorithm that fully implements the
     RFC folding algorithms, including knowing where encoded words are required
     and allowed.

 From the application view, this means that any header obtained through the
-:class:`~email.message.Message` is a custom header object with custom
+:class:`~email.message.Message` is a header object with extra
 attributes, whose string value is the fully decoded unicode value of the
 header.  Likewise, a header may be assigned a new value, or a new header
 created, using a unicode string, and the policy will take care of converting
 the unicode string into the correct RFC encoded form.

-The custom header objects and their attributes are described in
+The header objects and their attributes are described in
 :mod:`~email.headerregistry`.
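
The statement that headers are ``str`` subclasses with type-specific attributes can be illustrated with a short sketch. It is not part of the patch and assumes Python 3.4 or later, where `EmailMessage` uses `email.policy.default`; the address is made up for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()  # defaults to email.policy.default
msg['To'] = 'Ann Example <ann@example.com>'

to_header = msg['To']
# The header behaves like a plain string...
is_str = isinstance(to_header, str)
# ...but an address header also exposes parsed Address objects.
first = to_header.addresses[0]
print(is_str, first.display_name, first.username, first.domain)
```
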
--- a/Doc/library/email.rst
+++ b/Doc/library/email.rst
@@ -53,6 +53,7 @@ Contents of the :mod:`email` package documentation:
   email.generator.rst
   email.policy.rst
   email.headerregistry.rst
+   email.contentmanager.rst
   email.mime.rst
   email.header.rst
   email.charset.rst

--- a/Doc/whatsnew/3.4.rst
+++ b/Doc/whatsnew/3.4.rst
@@ -280,6 +280,21 @@ result:  a bytes object containing the fully formatted message.

 (Contributed by R. David Murray in :issue:`18600`.)

+A pair of new subclasses of :class:`~email.message.Message` have been added,
+along with a new sub-module, :mod:`~email.contentmanager`.  All documentation
+is currently in the new module, which is being added as part of the new
+:term:`provisional <provisional package>` email API.  These classes provide a
+number of new methods that make extracting content from and inserting content
+into email messages much easier.  See the :mod:`~email.contentmanager`
+documentation for details.
+
+These API additions complete the bulk of the work that was planned as part of
+the email6 project.  The currently provisional API is scheduled to become final
+in Python 3.5 (possibly with a few minor additions in the area of error
+handling).
+
+(Contributed by R. David Murray in :issue:`18891`.)
+

 functools
 ---------

--- a/Lib/email/contentmanager.py
+++ b/Lib/email/contentmanager.py
--- a/Lib/email/message.py
+++ b/Lib/email/message.py
@@ -8,8 +8,6 @@ __all__ = ['Message']

 import re
 import uu
-import base64
-import binascii
 from io import BytesIO, StringIO

 # Intrapackage imports
@@ -679,7 +677,7 @@ class Message:
        return failobj

    def set_param(self, param, value, header='Content-Type', requote=True,
-                  charset=None, language=''):
+                  charset=None, language='', replace=False):
        """Set a parameter in the Content-Type header.

        If the parameter already exists in the header, its value will be
@@ -723,8 +721,11 @@ class Message:
                else:
                    ctype = SEMISPACE.join([ctype, append_param])
        if ctype != self.get(header):
-            del self[header]
-            self[header] = ctype
+            if replace:
+                self.replace_header(header, ctype)
+            else:
+                del self[header]
+                self[header] = ctype

    def del_param(self, param, header='content-type', requote=True):
        """Remove the given parameter completely from the Content-Type header.
@@ -905,3 +906,208 @@ class Message:

    # I.e. def walk(self): ...
    from email.iterators import walk
+
+
+class MIMEPart(Message):
+
+    def __init__(self, policy=None):
+        if policy is None:
+            from email.policy import default
+            policy = default
+        Message.__init__(self, policy)
+
+    @property
+    def is_attachment(self):
+        c_d = self.get('content-disposition')
+        if c_d is None:
+            return False
+        return c_d.lower() == 'attachment'
+
+    def _find_body(self, part, preferencelist):
+        if part.is_attachment:
+            return
+        maintype, subtype = part.get_content_type().split('/')
+        if maintype == 'text':
+            if subtype in preferencelist:
+                yield (preferencelist.index(subtype), part)
+            return
+        if maintype != 'multipart':
+            return
+        if subtype != 'related':
+            for subpart in part.iter_parts():
+                yield from self._find_body(subpart, preferencelist)
+            return
+        if 'related' in preferencelist:
+            yield (preferencelist.index('related'), part)
+        candidate = None
+        start = part.get_param('start')
+        if start:
+            for subpart in part.iter_parts():
+                if subpart['content-id'] == start:
+                    candidate = subpart
+                    break
+        if candidate is None:
+            subparts = part.get_payload()
+            candidate = subparts[0] if subparts else None
+        if candidate is not None:
+            yield from self._find_body(candidate, preferencelist)
+
+    def get_body(self, preferencelist=('related', 'html', 'plain')):
+        """Return best candidate mime part for display as 'body' of message.
+
+        Do a depth first search, starting with self, looking for the first part
+        matching each of the items in preferencelist, and return the part
+        corresponding to the first item that has a match, or None if no items
+        have a match.  If 'related' is not included in preferencelist, consider
+        the root part of any multipart/related encountered as a candidate
+        match.  Ignore parts with 'Content-Disposition: attachment'.
+        """
+        best_prio = len(preferencelist)
+        body = None
+        for prio, part in self._find_body(self, preferencelist):
+            if prio < best_prio:
+                best_prio = prio
+                body = part
+                if prio == 0:
+                    break
+        return body
+
+    _body_types = {('text', 'plain'),
+                   ('text', 'html'),
+                   ('multipart', 'related'),
+                   ('multipart', 'alternative')}
+    def iter_attachments(self):
+        """Return an iterator over the non-main parts of a multipart.
+
+        Skip the first of each occurrence of text/plain, text/html,
+        multipart/related, or multipart/alternative in the multipart (unless
+        they have a 'Content-Disposition: attachment' header) and include all
+        remaining subparts in the returned iterator.  When applied to a
+        multipart/related, return all parts except the root part.  Return an
+        empty iterator when applied to a multipart/alternative or a
+        non-multipart.
+        """
+        maintype, subtype = self.get_content_type().split('/')
+        if maintype != 'multipart' or subtype == 'alternative':
+            return
+        parts = self.get_payload()
+        if maintype == 'multipart' and subtype == 'related':
+            # For related, we treat everything but the root as an attachment.
+            # The root may be indicated by 'start'; if there's no start or we
+            # can't find the named start, treat the first subpart as the root.
+            start = self.get_param('start')
+            if start:
+                found = False
+                attachments = []
+                for part in parts:
+                    if part.get('content-id') == start:
+                        found = True
+                    else:
+                        attachments.append(part)
+                if found:
+                    yield from attachments
+                    return
+            parts.pop(0)
+            yield from parts
+            return
+        # Otherwise we more or less invert the remaining logic in get_body.
+        # This only really works in edge cases (ex: non-text relateds or
+        # alternatives) if the sending agent sets content-disposition.
+        seen = []   # Only skip the first example of each candidate type.
+        for part in parts:
+            maintype, subtype = part.get_content_type().split('/')
+            if ((maintype, subtype) in self._body_types and
+                    not part.is_attachment and subtype not in seen):
+                seen.append(subtype)
+                continue
+            yield part
+
+    def iter_parts(self):
+        """Return an iterator over all immediate subparts of a multipart.
+
+        Return an empty iterator for a non-multipart.
+        """
+        if self.get_content_maintype() == 'multipart':
+            yield from self.get_payload()
+
+    def get_content(self, *args, content_manager=None, **kw):
+        if content_manager is None:
+            content_manager = self.policy.content_manager
+        return content_manager.get_content(self, *args, **kw)
+
+    def set_content(self, *args, content_manager=None, **kw):
+        if content_manager is None:
+            content_manager = self.policy.content_manager
+        content_manager.set_content(self, *args, **kw)
+
+    def _make_multipart(self, subtype, disallowed_subtypes, boundary):
+        if self.get_content_maintype() == 'multipart':
+            existing_subtype = self.get_content_subtype()
+            disallowed_subtypes = disallowed_subtypes + (subtype,)
+            if existing_subtype in disallowed_subtypes:
+                raise ValueError("Cannot convert {} to {}".format(
+                    existing_subtype, subtype))
+        keep_headers = []
+        part_headers = []
+        for name, value in self._headers:
+            if name.lower().startswith('content-'):
+                part_headers.append((name, value))
+            else:
+                keep_headers.append((name, value))
+        if part_headers:
+            # There is existing content, move it to the first subpart.
+            part = type(self)(policy=self.policy)
+            part._headers = part_headers
+            part._payload = self._payload
+            self._payload = [part]
+        else:
+            self._payload = []
+        self._headers = keep_headers
+        self['Content-Type'] = 'multipart/' + subtype
+        if boundary is not None:
+            self.set_param('boundary', boundary)
+
+    def make_related(self, boundary=None):
+        self._make_multipart('related', ('alternative', 'mixed'), boundary)
+
+    def make_alternative(self, boundary=None):
+        self._make_multipart('alternative', ('mixed',), boundary)
+
+    def make_mixed(self, boundary=None):
+        self._make_multipart('mixed', (), boundary)
+
+    def _add_multipart(self, _subtype, *args, _disp=None, **kw):
+        if (self.get_content_maintype() != 'multipart' or
+                self.get_content_subtype() != _subtype):
+            getattr(self, 'make_' + _subtype)()
+        part = type(self)(policy=self.policy)
+        part.set_content(*args, **kw)
+        if _disp and 'content-disposition' not in part:
+            part['Content-Disposition'] = _disp
+        self.attach(part)
+
+    def add_related(self, *args, **kw):
+        self._add_multipart('related', *args, _disp='inline', **kw)
+
+    def add_alternative(self, *args, **kw):
+        self._add_multipart('alternative', *args, **kw)
+
+    def add_attachment(self, *args, **kw):
+        self._add_multipart('mixed', *args, _disp='attachment', **kw)
+
+    def clear(self):
+        self._headers = []
+        self._payload = None
+
+    def clear_content(self):
+        self._headers = [(n, v) for n, v in self._headers
+                         if not n.lower().startswith('content-')]
+        self._payload = None
+
+
+class EmailMessage(MIMEPart):
+
+    def set_content(self, *args, **kw):
+        super().set_content(*args, **kw)
+        if 'MIME-Version' not in self:
+            self['MIME-Version'] = '1.0'
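
A minimal usage sketch of the new `EmailMessage` methods added above, assuming Python 3.4 or later and the default `raw_data_manager`. The subject, body text, and file name are invented for illustration and do not come from the patch.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Report'
msg.set_content('Plain text body')                      # text/plain
msg.add_alternative('<p>HTML body</p>', subtype='html')  # -> multipart/alternative
msg.add_attachment(b'\x00\x01', maintype='application',
                   subtype='octet-stream',
                   filename='data.bin')                  # -> multipart/mixed

# get_body() walks the tree and picks the best match from preferencelist.
html_part = msg.get_body()
plain_part = msg.get_body(preferencelist=('plain',))

# iter_attachments() skips the body candidates and yields the rest.
attachments = list(msg.iter_attachments())

print(msg.get_content_type())
print(html_part.get_content_type(), plain_part.get_content_type())
print([part.get_filename() for part in attachments])
```

Each `add_*` call converts the message to the required multipart type on demand, so the caller never builds the container by hand.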
--- a/Lib/email/policy.py
+++ b/Lib/email/policy.py
@@ -5,6 +5,7 @@ code that adds all the email6 features.
 from email._policybase import Policy, Compat32, compat32, _extend_docstrings
 from email.utils import _has_surrogates
 from email.headerregistry import HeaderRegistry as HeaderRegistry
+from email.contentmanager import raw_data_manager

 __all__ = [
    'Compat32',
@@ -58,10 +59,22 @@ class EmailPolicy(Policy):
                           special treatment, while all other fields are
                           treated as unstructured.  This list will be
                           completed before the extension is marked stable.)
+
+    content_manager     -- an object with at least two methods: get_content
+                           and set_content.  When the get_content or
+                           set_content method of a Message object is called,
+                           it calls the corresponding method of this object,
+                           passing it the message object as its first argument,
+                           and any arguments or keywords that were passed to
+                           it as additional arguments.  The default
+                           content_manager is
+                           :data:`~email.contentmanager.raw_data_manager`.
+
    """

    refold_source = 'long'
    header_factory = HeaderRegistry()
+    content_manager = raw_data_manager

    def __init__(self, **kw):
        # Ensure that each new instance gets a unique header factory
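
The `content_manager` policy setting can be exercised with any object that provides `get_content` and `set_content`. The sketch below is an assumption-laden illustration, not code from the patch: the `ShoutingManager` class is hypothetical and handles only plain strings.

```python
from email.message import EmailMessage
from email.policy import default


class ShoutingManager:
    """Hypothetical content manager that returns the payload upper-cased."""

    def get_content(self, msg, *args, **kw):
        # The message object arrives as the first argument.
        return msg.get_payload().upper()

    def set_content(self, msg, text, *args, **kw):
        msg['Content-Type'] = 'text/plain'
        msg.set_payload(text)


# clone() returns a new policy with the given attribute overridden.
policy = default.clone(content_manager=ShoutingManager())

msg = EmailMessage(policy=policy)
msg.set_content('hello')      # delegates to ShoutingManager.set_content
content = msg.get_content()   # delegates to ShoutingManager.get_content
print(content)
```

A manager can also be supplied for a single call through the `content_manager` keyword of `get_content` and `set_content`, which takes precedence over the policy.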

--- a/Lib/email/utils.py
+++ b/Lib/email/utils.py
@@ -68,9 +68,13 @@ def _has_surrogates(s):
 # How to deal with a string containing bytes before handing it to the
 # application through the 'normal' interface.
 def _sanitize(string):
-    # Turn any escaped bytes into unicode 'unknown' char.
-    original_bytes = string.encode('ascii', 'surrogateescape')
-    return original_bytes.decode('ascii', 'replace')
+    # Turn any escaped bytes into unicode 'unknown' char.  If the escaped
+    # bytes happen to be utf-8 they will instead get decoded, even if they
+    # were invalid in the charset the source was supposed to be in.  This
+    # seems like it is not a bad thing; a defect was still registered.
+    original_bytes = string.encode('utf-8', 'surrogateescape')
+    return original_bytes.decode('utf-8', 'replace')
+


 # Helpers
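
The behavioural difference introduced in `_sanitize` can be reproduced outside the package. The two helper names below are hypothetical stand-ins for the old and new bodies of the function; the byte strings mirror the cases used in `test_headerregistry.py`.

```python
def sanitize_ascii(string):
    # Old behaviour: every escaped byte becomes U+FFFD.
    return string.encode('ascii', 'surrogateescape').decode('ascii', 'replace')


def sanitize_utf8(string):
    # New behaviour: escaped bytes that form valid UTF-8 are decoded;
    # anything else still becomes U+FFFD.
    return string.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')


# Simulate how the parser smuggles non-ASCII bytes through as surrogates.
valid_utf8 = b'utf-8\xe2\x80\x9d'.decode('ascii', 'surrogateescape')
invalid_utf8 = b'utf-8\xf1\xf2\xf3'.decode('ascii', 'surrogateescape')

old_result = sanitize_ascii(valid_utf8)
new_result = sanitize_utf8(valid_utf8)
new_invalid = sanitize_utf8(invalid_utf8)
print(ascii(old_result), ascii(new_result), ascii(new_invalid))
```
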

--- a/Lib/test/test_email/__init__.py
+++ b/Lib/test/test_email/__init__.py
@@ -2,6 +2,7 @@ import os
 import sys
 import unittest
 import test.support
+import collections
 import email
 from email.message import Message
 from email._policybase import compat32
@@ -42,6 +43,8 @@ class TestEmailBase(unittest.TestCase):
    # here we make minimal changes in the test_email tests compared to their
    # pre-3.3 state.
    policy = compat32
+    # Likewise, the default message object is Message.
+    message = Message

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
@@ -54,11 +57,23 @@ class TestEmailBase(unittest.TestCase):
        with openfile(filename) as fp:
            return email.message_from_file(fp, policy=self.policy)

-    def _str_msg(self, string, message=Message, policy=None):
+    def _str_msg(self, string, message=None, policy=None):
        if policy is None:
            policy = self.policy
+        if message is None:
+            message = self.message
        return email.message_from_string(string, message, policy=policy)

+    def _bytes_msg(self, bytestring, message=None, policy=None):
+        if policy is None:
+            policy = self.policy
+        if message is None:
+            message = self.message
+        return email.message_from_bytes(bytestring, message, policy=policy)
+
+    def _make_message(self):
+        return self.message(policy=self.policy)
+
    def _bytes_repr(self, b):
        return [repr(x) for x in b.splitlines(keepends=True)]

@@ -123,6 +138,7 @@ def parameterize(cls):

    """
    paramdicts = {}
+    testers = collections.defaultdict(list)
    for name, attr in cls.__dict__.items():
        if name.endswith('_params'):
            if not hasattr(attr, 'keys'):
@@ -134,7 +150,15 @@ def parameterize(cls):
                    d[n] = x
                attr = d
            paramdicts[name[:-7] + '_as_'] = attr
+        if '_as_' in name:
+            testers[name.split('_as_')[0] + '_as_'].append(name)
    testfuncs = {}
+    for name in paramdicts:
+        if name not in testers:
+            raise ValueError("No tester found for {}".format(name))
+    for name in testers:
+        if name not in paramdicts:
+            raise ValueError("No params found for {}".format(name))
    for name, attr in cls.__dict__.items():
        for paramsname, paramsdict in paramdicts.items():
            if name.startswith(paramsname):

--- a/Lib/test/test_email/test_contentmanager.py
+++ b/Lib/test/test_email/test_contentmanager.py
--- a/Lib/test/test_email/test_headerregistry.py
+++ b/Lib/test/test_email/test_headerregistry.py
@@ -661,7 +661,7 @@ class TestContentTypeHeader(TestHeaderBase):
            'text/plain; name="ascii_is_the_default"'),

        'rfc2231_bad_character_in_charset_parameter_value': (
-            "text/plain; charset*=ascii''utf-8%E2%80%9D",
+            "text/plain; charset*=ascii''utf-8%F1%F2%F3",
            'text/plain',
            'text',
            'plain',
@@ -669,6 +669,18 @@ class TestContentTypeHeader(TestHeaderBase):
            [errors.UndecodableBytesDefect],
            'text/plain; charset="utf-8\uFFFD\uFFFD\uFFFD"'),

+        'rfc2231_utf_8_in_supposedly_ascii_charset_parameter_value': (
+            "text/plain; charset*=ascii''utf-8%E2%80%9D",
+            'text/plain',
+            'text',
+            'plain',
+            {'charset': 'utf-8”'},
+            [errors.UndecodableBytesDefect],
+            'text/plain; charset="utf-8”"',
+            ),
+            # XXX: if the above were *re*folded, it would get tagged as utf-8
+            # instead of ascii in the param, since it now contains non-ASCII.
+
        'rfc2231_encoded_then_unencoded_segments': (
            ('application/x-foo;'
                '\tname*0*="us-ascii\'en-us\'My";'

--- a/Lib/test/test_email/test_message.py
+++ b/Lib/test/test_email/test_message.py
--- a/Lib/test/test_email/test_policy.py
+++ b/Lib/test/test_email/test_policy.py
@@ -30,6 +30,7 @@ class PolicyAPITests(unittest.TestCase):
        'raise_on_defect':          False,
        'header_factory':           email.policy.EmailPolicy.header_factory,
        'refold_source':            'long',
+        'content_manager':          email.policy.EmailPolicy.content_manager,
        })

    # For each policy under test, we give here what we expect the defaults to

--- a/Misc/NEWS
+++ b/Misc/NEWS
@@ -42,6 +42,9 @@ Core and Builtins
 Library
 -------

+- Issue #18891: Completed the new email package (provisional) API additions
+  by adding new classes EmailMessage, MIMEPart, and ContentManager.
+
 - Issue #18468: The re.split, re.findall, and re.sub functions and the group()
  and groups() methods of match object now always return a string or a bytes
  object.