Commit 65d9c968 authored by R David Murray

#15220: simplify and speed up feedparser's line splitting.

Original patch submitted by QNX, modified for clarity by me (mostly comments).
QNX reports a 30% speed up in average email parsing time.
parent 657c2835
@@ -98,24 +98,15 @@ class BufferedSubFile(object):
         """Push some new data into this object."""
         # Handle any previous leftovers
         data, self._partial = self._partial + data, ''
-        # Crack into lines, but preserve the newlines on the end of each
-        parts = NLCRE_crack.split(data)
-        # The *ahem* interesting behaviour of re.split when supplied grouping
-        # parentheses is that the last element of the resulting list is the
-        # data after the final RE. In the case of a NL/CR terminated string,
-        # this is the empty string.
-        self._partial = parts.pop()
-        #GAN 29Mar09 bugs 1555570, 1721862 Confusion at 8K boundary ending with \r:
-        # is there a \n to follow later?
-        if not self._partial and parts and parts[-1].endswith('\r'):
-            self._partial = parts.pop(-2)+parts.pop()
-        # parts is a list of strings, alternating between the line contents
-        # and the eol character(s). Gather up a list of lines after
-        # re-attaching the newlines.
-        lines = []
-        for i in range(len(parts) // 2):
-            lines.append(parts[i*2] + parts[i*2+1])
-        self.pushlines(lines)
+        # Crack into lines, but preserve the linesep characters on the end of each
+        parts = data.splitlines(True)
+        # If the last element of the list does not end in a newline, then treat
+        # it as a partial line. We only check for '\n' here because a line
+        # ending with '\r' might be a line that was split in the middle of a
+        # '\r\n' sequence (see bugs 1555570 and 1721862).
+        if parts and not parts[-1].endswith('\n'):
+            self._partial = parts.pop()
+        self.pushlines(parts)
 
     def pushlines(self, lines):
         # Reverse and insert at the front of the lines.
......
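The new splitting logic can be tried in isolation. The following is a minimal sketch, not the actual email.feedparser code: `LineBuffer` and its `lines` attribute are hypothetical stand-ins for `BufferedSubFile`, and completed lines are simply appended instead of going through `pushlines()`. It shows how a chunk boundary that falls inside a '\r\n' sequence is held back until the next push.

```python
class LineBuffer:
    """Hypothetical stand-in for BufferedSubFile; only push() mirrors the diff."""

    def __init__(self):
        self._partial = ''   # trailing data that is not yet a complete line
        self.lines = []      # completed lines (simplification of pushlines())

    def push(self, data):
        # Prepend whatever was left over from the previous call.
        data, self._partial = self._partial + data, ''
        # keepends=True preserves the line terminator on each element.
        parts = data.splitlines(True)
        # Hold back a final element that does not end in '\n'. A trailing
        # '\r' may be the first half of a '\r\n' split across two pushes.
        if parts and not parts[-1].endswith('\n'):
            self._partial = parts.pop()
        self.lines.extend(parts)


buf = LineBuffer()
buf.push('Subject: hi\r')    # chunk boundary falls inside '\r\n'
buf.push('\nbody line\n')    # completes the first line, adds a second
buf.push('tail')             # no terminator yet, so it stays partial
print(buf.lines)
print(repr(buf._partial))
```

After the three pushes, `buf.lines` holds the two complete lines with their terminators intact, and `'tail'` remains in `buf._partial` until more data arrives.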
@@ -253,6 +253,9 @@ Core and Builtins
 Library
 -------
 
+- Issue #15220: email.feedparser's line splitting algorithm is now simpler and
+  faster.
+
 - Issue #16743: Fix mmap overflow check on 32 bit Windows.
 
 - Issue #16996: webbrowser module now uses shutil.which() to find a
......