Commit 96fd54ea authored by R. David Murray

#4661: add bytes parsing and generation to email (email version bump to 5.1.0)

The work on this is not 100% complete, but everything is present to
allow real-world testing of the code.  The only remaining major todo
item is to (hopefully!) enhance the handling of non-ASCII bytes in headers
converted to unicode by RFC2047 encoding them rather than replacing them with
'?'s.
parent 59fdd673
......@@ -22,6 +22,12 @@ the Generator on a :class:`~email.message.Message` constructed by program may
result in changes to the :class:`~email.message.Message` object as defaults are
filled in.
:class:`bytes` output can be generated using the :class:`BytesGenerator` class.
If the message object structure contains non-ASCII bytes, this generator's
:meth:`~BytesGenerator.flatten` method will emit the original bytes. Parsing a
binary message and then flattening it with :class:`BytesGenerator` should be
idempotent for standards compliant messages.
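A minimal sketch of the round trip described above, assuming Python 3.2 or later. The sample message and its addresses are invented for illustration, and `mangle_from_=False` is passed only so that body lines starting with "From " would be left alone.

```python
from email import message_from_bytes
from email.generator import BytesGenerator
from io import BytesIO

# Hypothetical sample message with an 8bit UTF-8 body.
raw = (
    b"From: sender@example.com\n"
    b"To: rcpt@example.com\n"
    b"Subject: round trip\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xc3\xa9\n"
)

msg = message_from_bytes(raw)

# BytesGenerator needs a binary file-like object as its output.
out = BytesIO()
BytesGenerator(out, mangle_from_=False).flatten(msg)
restored = out.getvalue()

print(restored == raw)
```

The comparison only holds as long as the parsed model is not modified before flattening.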
Here are the public methods of the :class:`Generator` class, imported from the
:mod:`email.generator` module:
......@@ -65,6 +71,13 @@ Here are the public methods of the :class:`Generator` class, imported from the
Note that for subparts, no envelope header is ever printed.
Messages parsed with a Bytes parser that have a
:mailheader:`Content-Transfer-Encoding` of 8bit will be converted to
use a 7bit Content-Transfer-Encoding. Any other non-ASCII bytes in the
message structure will be converted to '?' characters.
.. versionchanged:: 3.2 added support for re-encoding 8bit message bodies.
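As a hedged illustration of the re-encoding described above: flattening a bytes-parsed message with the string `Generator` should yield pure ASCII text. The sample message is invented, and which 7bit-safe encoding ends up in the output depends on the charset and the Python version, so the sketch only checks that no non-ASCII character survives.

```python
from email import message_from_bytes
from email.generator import Generator
from io import StringIO

# Hypothetical sample message with an 8bit UTF-8 body.
raw = (
    b"Subject: re-encode me\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xc3\xa9\n"
)

msg = message_from_bytes(raw)

buf = StringIO()
Generator(buf).flatten(msg)
text = buf.getvalue()

# Encoding to ASCII would raise UnicodeEncodeError if any
# non-ASCII character had survived in the generated text.
ascii_bytes = text.encode("ascii")
print(text)
```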
.. method:: clone(fp)
Return an independent clone of this :class:`Generator` instance with the
......@@ -76,11 +89,27 @@ Here are the public methods of the :class:`Generator` class, imported from the
:class:`Generator`'s constructor. This provides just enough file-like API
for :class:`Generator` instances to be used in the :func:`print` function.
As a convenience, see the methods :meth:`Message.as_string` and
``str(aMessage)``, a.k.a. :meth:`Message.__str__`, which simplify the generation
of a formatted string representation of a message object. For more detail, see
As a convenience, see the :class:`~email.message.Message` methods
:meth:`~email.message.Message.as_string` and ``str(aMessage)``, a.k.a.
:meth:`~email.message.Message.__str__`, which simplify the generation of a
formatted string representation of a message object. For more detail, see
:mod:`email.message`.
.. class:: BytesGenerator(outfp, mangle_from_=True, maxheaderlen=78, fmt=None)
This class has the same API as the :class:`Generator` class, except that
*outfp* must be a file-like object that will accept :class:`bytes` input to
its :meth:`write` method. If the message object structure contains non-ASCII
bytes, this generator's :meth:`~BytesGenerator.flatten` method will produce
them as-is, including preserving parts with a
:mailheader:`Content-Transfer-Encoding` of ``8bit``.
Note that even the :meth:`write` method API is identical: it expects
strings as input, and converts them to bytes by encoding them using
the ASCII codec.
.. versionadded:: 3.2
The :mod:`email.generator` module also provides a derived class, called
:class:`DecodedGenerator` which is like the :class:`Generator` base class,
except that non-\ :mimetype:`text` parts are substituted with a format string
......
......@@ -111,9 +111,17 @@ Here are the methods of the :class:`Message` class:
be decoded if this header's value is ``quoted-printable`` or ``base64``.
If some other encoding is used, or :mailheader:`Content-Transfer-Encoding`
header is missing, or if the payload has bogus base64 data, the payload is
returned as-is (undecoded). If the message is a multipart and the
*decode* flag is ``True``, then ``None`` is returned. The default for
*decode* is ``False``.
returned as-is (undecoded). In all cases the returned value is binary
data. If the message is a multipart and the *decode* flag is ``True``,
then ``None`` is returned.
When *decode* is ``False`` (the default) the body is returned as a string
without decoding the :mailheader:`Content-Transfer-Encoding`. However,
for a :mailheader:`Content-Transfer-Encoding` of 8bit, an attempt is made
to decode the original bytes using the ``charset`` specified by the
:mailheader:`Content-Type` header, using the ``replace`` error handler. If
no ``charset`` is specified, or if the ``charset`` given is not recognized by
the email package, the body is decoded using the default ASCII charset.
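The two return types described above can be sketched as follows, assuming a message parsed from bytes with an ``8bit`` UTF-8 body. The sample message is invented for illustration.

```python
from email import message_from_bytes

# Hypothetical sample message with an 8bit UTF-8 body.
raw = (
    b"Subject: payload types\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xc3\xa9\n"
)

msg = message_from_bytes(raw)

# decode=False (the default): a str, decoded with the declared charset.
as_text = msg.get_payload()

# decode=True: the body as bytes, with any transfer encoding undone.
as_bytes = msg.get_payload(decode=True)

print(type(as_text).__name__, type(as_bytes).__name__)
```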
.. method:: set_payload(payload, charset=None)
......@@ -160,6 +168,10 @@ Here are the methods of the :class:`Message` class:
Note that in all cases, any envelope header present in the message is not
included in the mapping interface.
In a model generated from bytes, any header values that (in contravention
of the RFCs) contain non-ASCII bytes will have those bytes transformed
into '?' characters when the values are retrieved through this interface.
.. method:: __len__()
......
......@@ -80,6 +80,14 @@ Here is the API for the :class:`FeedParser`:
if you feed more data to a closed :class:`FeedParser`.
.. class:: BytesFeedParser(_factory=email.message.Message)
Works exactly like :class:`FeedParser` except that the input to the
:meth:`~FeedParser.feed` method must be bytes rather than a string.
.. versionadded:: 3.2
Parser class API
^^^^^^^^^^^^^^^^
......@@ -131,7 +139,7 @@ class.
Similar to the :meth:`parse` method, except it takes a string object
instead of a file-like object. Calling this method on a string is exactly
equivalent to wrapping *text* in a :class:`StringIO` instance first and
equivalent to wrapping *text* in a :class:`~io.StringIO` instance first and
calling :meth:`parse`.
Optional *headersonly* is a flag specifying whether to stop parsing after
......@@ -139,25 +147,78 @@ class.
the entire contents of the file.
.. class:: BytesParser(_class=email.message.Message, strict=None)
This class is exactly parallel to :class:`Parser`, but handles bytes input.
The *_class* and *strict* arguments are interpreted in the same way as for
the :class:`Parser` constructor. *strict* is supported only to make porting
code easier; it is deprecated.
.. method:: parse(fp, headersonly=False)
Read all the data from the binary file-like object *fp*, parse the
resulting bytes, and return the message object. *fp* must support
both the :meth:`readline` and the :meth:`read` methods on file-like
objects.
The bytes contained in *fp* must be formatted as a block of :rfc:`2822`
style headers and header continuation lines, optionally preceded by an
envelope header. The header block is terminated either by the end of the
data or by a blank line. Following the header block is the body of the
message (which may contain MIME-encoded subparts, including subparts
with a :mailheader:`Content-Transfer-Encoding` of ``8bit``).
Optional *headersonly* is a flag specifying whether to stop parsing after
reading the headers or not. The default is ``False``, meaning it parses
the entire contents of the file.
.. method:: parsebytes(bytes, headersonly=False)
Similar to the :meth:`parse` method, except it takes a byte string object
instead of a file-like object. Calling this method on a byte string is
exactly equivalent to wrapping *bytes* in a :class:`~io.BytesIO` instance
first and calling :meth:`parse`.
Optional *headersonly* is as with the :meth:`parse` method.
.. versionadded:: 3.2
Since creating a message object structure from a string or a file object is such
a common task, two functions are provided as a convenience. They are available
a common task, four functions are provided as a convenience. They are available
in the top-level :mod:`email` package namespace.
.. currentmodule:: email
.. function:: message_from_string(s[, _class][, strict])
.. function:: message_from_string(s, _class=email.message.Message, strict=None)
Return a message object structure from a string. This is exactly equivalent to
``Parser().parsestr(s)``. Optional *_class* and *strict* are interpreted as
with the :class:`Parser` class constructor.
.. function:: message_from_bytes(s, _class=email.message.Message, strict=None)
Return a message object structure from a byte string. This is exactly
equivalent to ``BytesParser().parsebytes(s)``. Optional *_class* and
*strict* are interpreted as with the :class:`Parser` class constructor.
.. versionadded:: 3.2
.. function:: message_from_file(fp[, _class][, strict])
.. function:: message_from_file(fp, _class=email.message.Message, strict=None)
Return a message object structure tree from an open :term:`file object`.
This is exactly equivalent to ``Parser().parse(fp)``. Optional *_class*
and *strict* are interpreted as with the :class:`Parser` class constructor.
.. function:: message_from_binary_file(fp, _class=email.message.Message, strict=None)
Return a message object structure tree from an open binary :term:`file
object`. This is exactly equivalent to ``BytesParser().parse(fp)``.
Optional *_class* and *strict* are interpreted as with the :class:`Parser`
class constructor.
.. versionadded:: 3.2
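A short sketch of the two bytes-oriented convenience functions documented above. An in-memory :class:`io.BytesIO` stands in for a file opened in binary mode, and the sample message is invented for illustration.

```python
import email
from io import BytesIO

# Hypothetical sample message containing one raw non-ASCII byte.
raw = b"Subject: binary parse\n\nbody with a raw byte: \xe9\n"

# Parse directly from a bytes object.
msg_from_bytes = email.message_from_bytes(raw)

# Parse from a binary file-like object.
msg_from_file = email.message_from_binary_file(BytesIO(raw))

print(msg_from_bytes["Subject"])
print(msg_from_file.get_payload(decode=True))
```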
Here's an example of how you might use this at an interactive Python prompt::
>>> import email
......
......@@ -6,7 +6,7 @@
email messages, including MIME documents.
.. moduleauthor:: Barry A. Warsaw <barry@python.org>
.. sectionauthor:: Barry A. Warsaw <barry@python.org>
.. Copyright (C) 2001-2007 Python Software Foundation
.. Copyright (C) 2001-2010 Python Software Foundation
The :mod:`email` package is a library for managing email messages, including
......@@ -92,6 +92,44 @@ table also describes the Python compatibility of each version of the package.
+---------------+------------------------------+-----------------------+
| :const:`4.0` | Python 2.5 | Python 2.3 to 2.5 |
+---------------+------------------------------+-----------------------+
| :const:`5.0` | Python 3.0 and Python 3.1 | Python 3.0 to 3.2 |
+---------------+------------------------------+-----------------------+
| :const:`5.1` | Python 3.2 | Python 3.0 to 3.2 |
+---------------+------------------------------+-----------------------+
Here are the major differences between :mod:`email` version 5.1 and
version 5.0:
* It is once again possible to parse messages containing non-ASCII bytes,
and to reproduce such messages if the data containing the non-ASCII
bytes is not modified.
* New functions :func:`message_from_bytes` and :func:`message_from_binary_file`,
and new classes :class:`~email.parser.BytesFeedParser` and
:class:`~email.parser.BytesParser` allow binary message data to be parsed
into model objects.
* Given bytes input to the model, :meth:`~email.message.Message.get_payload`
will by default decode a message body that has a
:mailheader:`Content-Transfer-Encoding` of ``8bit`` using the charset specified
in the MIME headers and return the resulting string.
* Given bytes input to the model, :class:`~email.generator.Generator` will
convert message bodies that have a :mailheader:`Content-Transfer-Encoding` of
8bit to instead have a 7bit Content-Transfer-Encoding.
* New class :class:`~email.generator.BytesGenerator` produces bytes
as output, preserving any unchanged non-ASCII data that was
present in the input used to build the model, including message bodies
with a :mailheader:`Content-Transfer-Encoding` of 8bit.
Here are the major differences between :mod:`email` version 5.0 and version 4:
* All operations are on unicode strings. Text inputs must be strings,
text outputs are strings. Outputs are limited to the ASCII character
set and so can be encoded to ASCII for transmission. Inputs are also
limited to ASCII; this is an acknowledged limitation of email 5.0 and
means it can only be used to parse email that is 7bit clean.
Here are the major differences between :mod:`email` version 4 and version 3:
......
......@@ -4,7 +4,7 @@
"""A package for parsing, handling, and generating email messages."""
__version__ = '5.0.0'
__version__ = '5.1.0'
__all__ = [
'base64mime',
......@@ -16,7 +16,9 @@ __all__ = [
'iterators',
'message',
'message_from_file',
'message_from_binary_file',
'message_from_string',
'message_from_bytes',
'mime',
'parser',
'quoprimime',
......@@ -36,6 +38,13 @@ def message_from_string(s, *args, **kws):
from email.parser import Parser
return Parser(*args, **kws).parsestr(s)
def message_from_bytes(s, *args, **kws):
"""Parse a bytes string into a Message object model.
Optional _class and strict are passed to the Parser constructor.
"""
from email.parser import BytesParser
return BytesParser(*args, **kws).parsebytes(s)
def message_from_file(fp, *args, **kws):
"""Read a file and parse its contents into a Message object model.
......@@ -44,3 +53,11 @@ def message_from_file(fp, *args, **kws):
"""
from email.parser import Parser
return Parser(*args, **kws).parse(fp)
def message_from_binary_file(fp, *args, **kws):
"""Read a binary file and parse its contents into a Message object model.
Optional _class and strict are passed to the Parser constructor.
"""
from email.parser import BytesParser
return BytesParser(*args, **kws).parse(fp)
......@@ -482,3 +482,10 @@ class FeedParser:
if lastheader:
# XXX reconsider the joining of folded lines
self._cur[lastheader] = EMPTYSTRING.join(lastvalue).rstrip('\r\n')
class BytesFeedParser(FeedParser):
"""Like FeedParser, but feed accepts bytes."""
def feed(self, data):
super().feed(data.decode('ascii', 'surrogateescape'))
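A minimal usage sketch for the class above, assuming Python 3.2 or later. Each chunk is decoded with the ``surrogateescape`` handler before it reaches the text parser, so non-ASCII bytes can still be recovered from the resulting model. The chunk boundaries and message content are invented for illustration.

```python
from email.feedparser import BytesFeedParser

parser = BytesFeedParser()

# Feed the message incrementally, as it might arrive from a socket.
parser.feed(b"Subject: incremental\n")
parser.feed(b"\n")
parser.feed(b"body with a raw byte: \xe9\n")

msg = parser.close()

# The undecodable byte survives the trip through the str-based parser.
body = msg.get_payload(decode=True)
print(body)
```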
......@@ -24,8 +24,26 @@ SEMISPACE = '; '
# existence of which force quoting of the parameter value.
tspecials = re.compile(r'[ \(\)<>@,;:\\"/\[\]\?=]')
# How to figure out if we are processing strings that come from a byte
# source with undecodable characters.
_has_surrogates = re.compile(
'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search
# Helper functions
def _sanitize_surrogates(value):
# If the value contains surrogates, re-decode and replace the original
# non-ascii bytes with '?'s. Used to sanitize header values before letting
# them escape as strings.
if not isinstance(value, str):
# Header object
return value
if _has_surrogates(value):
original_bytes = value.encode('ascii', 'surrogateescape')
return original_bytes.decode('ascii', 'replace').replace('\ufffd', '?')
else:
return value
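The idea behind the helper above can be sketched in isolation. The names `has_escaped_bytes` and `sanitize_surrogates` below are local stand-ins written for this illustration, not the private helpers from the patch, and the detection is deliberately simpler than the regular expression in the patch.

```python
def has_escaped_bytes(value):
    # surrogateescape maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF.
    return any("\udc80" <= ch <= "\udcff" for ch in value)


def sanitize_surrogates(value):
    # Recover the original bytes, then decode again, turning every
    # byte that is not ASCII into a '?'.
    if not has_escaped_bytes(value):
        return value
    original = value.encode("ascii", "surrogateescape")
    return original.decode("ascii", "replace").replace("\ufffd", "?")


raw = b"Caf\xe9 menu"

# Smuggle the non-ASCII byte through a str as a lone surrogate.
smuggled = raw.decode("ascii", "surrogateescape")

# The smuggling is lossless; the sanitized form is not.
recovered = smuggled.encode("ascii", "surrogateescape")
cleaned = sanitize_surrogates(smuggled)
print(recovered == raw, cleaned)
```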
def _splitparam(param):
# Split header parameters. BAW: this may be too simple. It isn't
# strictly RFC 2045 (section 5.1) compliant, but it catches most headers
......@@ -184,44 +202,72 @@ class Message:
If the message is a multipart and the decode flag is True, then None
is returned.
"""
if i is None:
payload = self._payload
elif not isinstance(self._payload, list):
# Here is the logic table for this code, based on the email5.0.0 code:
# i decode is_multipart result
# ------ ------ ------------ ------------------------------
# None True True None
# i True True None
# None False True _payload (a list)
# i False True _payload element i (a Message)
# i False False error (not a list)
# i True False error (not a list)
# None False False _payload
# None True False _payload decoded (bytes)
# Note that Barry planned to factor out the 'decode' case, but that
# isn't so easy now that we handle the 8 bit data, which needs to be
# converted in both the decode and non-decode path.
if self.is_multipart():
if decode:
return None
if i is None:
return self._payload
else:
return self._payload[i]
# For backward compatibility, Use isinstance and this error message
# instead of the more logical is_multipart test.
if i is not None and not isinstance(self._payload, list):
raise TypeError('Expected list, got %s' % type(self._payload))
else:
payload = self._payload[i]
payload = self._payload
cte = self.get('content-transfer-encoding', '').lower()
# payload can be bytes here, (I wonder if that is actually a bug?)
if isinstance(payload, str):
if _has_surrogates(payload):
bpayload = payload.encode('ascii', 'surrogateescape')
if not decode:
try:
payload = bpayload.decode(self.get_param('charset', 'ascii'), 'replace')
except LookupError:
payload = bpayload.decode('ascii', 'replace')
elif decode:
try:
bpayload = payload.encode('ascii')
except UnicodeError:
# This won't happen for RFC compliant messages (messages
# containing only ASCII codepoints in the unicode input).
# If it does happen, turn the string into bytes in a way
# guaranteed not to fail.
bpayload = payload.encode('raw-unicode-escape')
if not decode:
return payload
# Decoded payloads always return bytes. XXX split this part out into
# a new method called .get_decoded_payload().
if self.is_multipart():
return None
cte = self.get('content-transfer-encoding', '').lower()
if cte == 'quoted-printable':
if isinstance(payload, str):
payload = payload.encode('ascii')
return utils._qdecode(payload)
return utils._qdecode(bpayload)
elif cte == 'base64':
try:
if isinstance(payload, str):
payload = payload.encode('ascii')
return base64.b64decode(payload)
return base64.b64decode(bpayload)
except binascii.Error:
# Incorrect padding
pass
return bpayload
elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
in_file = BytesIO(payload.encode('ascii'))
in_file = BytesIO(bpayload)
out_file = BytesIO()
try:
uu.decode(in_file, out_file, quiet=True)
return out_file.getvalue()
except uu.Error:
# Some decoding problem
pass
# Is there a better way to do this? We can't use the bytes
# constructor.
return bpayload
if isinstance(payload, str):
return payload.encode('raw-unicode-escape')
return bpayload
return payload
def set_payload(self, payload, charset=None):
......@@ -340,7 +386,7 @@ class Message:
Any fields deleted and re-inserted are always appended to the header
list.
"""
return [v for k, v in self._headers]
return [_sanitize_surrogates(v) for k, v in self._headers]
def items(self):
"""Get all the message's header fields and values.
......@@ -350,7 +396,7 @@ class Message:
Any fields deleted and re-inserted are always appended to the header
list.
"""
return self._headers[:]
return [(k, _sanitize_surrogates(v)) for k, v in self._headers]
def get(self, name, failobj=None):
"""Get a header value.
......@@ -361,7 +407,7 @@ class Message:
name = name.lower()
for k, v in self._headers:
if k.lower() == name:
return v
return _sanitize_surrogates(v)
return failobj
#
......@@ -381,7 +427,7 @@ class Message:
name = name.lower()
for k, v in self._headers:
if k.lower() == name:
values.append(v)
values.append(_sanitize_surrogates(v))
if not values:
return failobj
return values
......
......@@ -7,7 +7,7 @@
__all__ = ['Parser', 'HeaderParser']
import warnings
from io import StringIO
from io import StringIO, TextIOWrapper
from email.feedparser import FeedParser
from email.message import Message
......@@ -89,3 +89,47 @@ class HeaderParser(Parser):
def parsestr(self, text, headersonly=True):
return Parser.parsestr(self, text, True)
class BytesParser:
def __init__(self, *args, **kw):
"""Parser of binary RFC 2822 and MIME email messages.
Creates an in-memory object tree representing the email message, which
can then be manipulated and turned over to a Generator to return the
textual representation of the message.
The input must be formatted as a block of RFC 2822 headers and header
continuation lines, optionally preceded by a `Unix-from' header. The
header block is terminated either by the end of the input or by a
blank line.
_class is the class to instantiate for new message objects when they
must be created. This class must have a constructor that can take
zero arguments. Default is Message.Message.
"""
self.parser = Parser(*args, **kw)
def parse(self, fp, headersonly=False):
"""Create a message structure from the data in a binary file.
Reads all the data from the file and returns the root of the message
structure. Optional headersonly is a flag specifying whether to stop
parsing after reading the headers or not. The default is False,
meaning it parses the entire contents of the file.
"""
fp = TextIOWrapper(fp, encoding='ascii', errors='surrogateescape')
return self.parser.parse(fp, headersonly)
def parsebytes(self, text, headersonly=False):
"""Create a message structure from a byte string.
Returns the root of the message structure. Optional headersonly is a
flag specifying whether to stop parsing after reading the headers or
not. The default is False, meaning it parses the entire contents of
the file.
"""
text = text.decode('ASCII', errors='surrogateescape')
return self.parser.parsestr(text, headersonly)
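A brief sketch showing that the two entry points above are meant to be interchangeable, assuming Python 3.2 or later. The sample message is invented, and :class:`io.BytesIO` stands in for a real binary file.

```python
from email.parser import BytesParser
from io import BytesIO

# Hypothetical sample message containing one raw non-ASCII byte.
raw = b"Subject: parallel APIs\n\nbody with a raw byte: \xff\n"

# Parse from a bytes object and from a binary file-like object.
from_bytes = BytesParser().parsebytes(raw)
from_file = BytesParser().parse(BytesIO(raw))

# Stop after the header block when only the headers are of interest.
headers_only = BytesParser().parsebytes(raw, headersonly=True)

print(from_bytes.get_payload(decode=True) == from_file.get_payload(decode=True))
print(headers_only["Subject"])
```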
......@@ -92,6 +92,9 @@ Core and Builtins
Library
-------
- Issue #4661: email can now parse bytes input and generate either converted
7bit output or bytes output. Email version bumped to 5.1.0.
- Issue #1589: Add ssl.match_hostname(), to help implement server identity
verification for higher-level protocols.
......