bpo-34043: Optimize tarfile uncompress performance (GH-8089)

tarfile._Stream has two buffers, one for compressed and one for uncompressed data. The two buffers are not aligned, so unnecessary bytes slicing happens on every chunk read. This commit bypasses the compressed buffer. In this benchmark [1], user time drops from 300ms to 250ms.

[1]: https://bugs.python.org/msg320763
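The idea can be illustrated with a minimal standalone sketch. It is not the code in Lib/tarfile.py: the function name `read_decompressed` and its parameters are hypothetical, and `pending` only stands in for compressed bytes that were already buffered before the loop starts. Each raw chunk from the file object goes straight to the decompressor, so the only slicing left is on the decompressed side.

```python
import io
import zlib


def read_decompressed(fileobj, decompressor, size, pending=b"", bufsize=16 * 512):
    """Return (data, leftover): up to `size` decompressed bytes plus any surplus.

    Hypothetical helper for illustration only. `pending` holds compressed
    bytes that were read earlier and must be consumed before the file object
    is touched again.
    """
    chunks = []
    count = 0
    while count < size:
        if pending:
            # Drain already-buffered compressed bytes first, without slicing.
            raw, pending = pending, b""
        else:
            # Feed whole chunks from the file directly to the decompressor.
            raw = fileobj.read(bufsize)
            if not raw:
                break
        out = decompressor.decompress(raw)
        chunks.append(out)
        count += len(out)
    data = b"".join(chunks)
    # Slice once, on the decompressed side only.
    return data[:size], data[size:]
```

A caller would keep `leftover` and prepend it to the next read, which is the role the uncompressed buffer plays in the commit message above.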

Commit 8d130913 authored Jul 06, 2018 by INADA Naoki Committed by GitHub Jul 06, 2018

Showing with 13 additions and 18 deletions

Lib/tarfile.py +12 -18

Misc/NEWS.d/next/Library/2018-07-04-21-14-35.bpo-34043.0YJNq9.rst +1 -0

--- a/Lib/tarfile.py
+++ b/Lib/tarfile.py
@@ -513,21 +513,10 @@ class _Stream:
            raise StreamError("seeking backwards is not allowed")
        return self.pos

-    def read(self, size=None):
-        """Return the next size number of bytes from the stream.
-           If size is not defined, return all bytes of the stream
-           up to EOF.
-        """
-        if size is None:
-            t = []
-            while True:
-                buf = self._read(self.bufsize)
-                if not buf:
-                    break
-                t.append(buf)
-            buf = b"".join(t)
-        else:
-            buf = self._read(size)
+    def read(self, size):
+        """Return the next size number of bytes from the stream."""
+        assert size is not None
+        buf = self._read(size)
        self.pos += len(buf)
        return buf

@@ -540,9 +529,14 @@ class _Stream:
        c = len(self.dbuf)
        t = [self.dbuf]
        while c < size:
-            buf = self.__read(self.bufsize)
-            if not buf:
-                break
+            # Skip underlying buffer to avoid unaligned double buffering.
+            if self.buf:
+                buf = self.buf
+                self.buf = b""
+            else:
+                buf = self.fileobj.read(self.bufsize)
+                if not buf:
+                    break
            try:
                buf = self.cmp.decompress(buf)
            except self.exception:

--- a/Misc/NEWS.d/next/Library/2018-07-04-21-14-35.bpo-34043.0YJNq9.rst
+++ b/Misc/NEWS.d/next/Library/2018-07-04-21-14-35.bpo-34043.0YJNq9.rst
+Optimize tarfile uncompress performance about 15% when gzip is used.
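
A rough way to exercise the affected code path is to read a gzip-compressed archive in stream mode (`r|gz`), which goes through `tarfile._Stream`. The sketch below is not the benchmark linked in the commit message; the helper names `make_tar_gz` and `read_stream`, the member name, and the payload size are assumptions chosen for illustration, and timings will vary by machine.

```python
import io
import tarfile
import time


def make_tar_gz(payload, name="data.bin"):
    """Build an in-memory .tar.gz holding one member (hypothetical helper)."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return raw.getvalue()


def read_stream(archive):
    """Read every member through the non-seekable stream interface."""
    total = 0
    # "r|gz" selects stream mode, so reads go through tarfile._Stream.
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r|gz") as tar:
        for member in tar:
            extracted = tar.extractfile(member)
            if extracted is not None:
                total += len(extracted.read())
    return total


if __name__ == "__main__":
    archive = make_tar_gz(b"x" * 10_000_000)
    start = time.perf_counter()
    size = read_stream(archive)
    elapsed = time.perf_counter() - start
    print(f"read {size} bytes in {elapsed:.3f}s")
```

Running the same script on interpreters built before and after this commit gives a before/after comparison, though a highly compressible payload like this one may not reflect real archives.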