dms: use ghostscript to convert PDF to text

For historical reasons, PDF to text conversion involved first converting the PDF to PNG, then the PNG to TIFF, and finally sending the TIFF to tesseract. This works, but it consumes a lot of resources with large PDFs, especially because the intermediate PNG/TIFF files are created at a resolution of 300 DPI, which easily needs several GB of RAM and temporary disk space. This was observed with the PDFs created by erp5_document_scanner, which are usually high quality (1 or 2 MB per page): even a one-page PDF sometimes took more than one minute to OCR.

Since version 9.53, ghostscript integrates the tesseract engine directly, so we no longer need to prepare a TIFF beforehand: we can send the PDF data directly to ghostscript.

This change uses ghostscript if available and otherwise falls back to the same pipeline as before. This allows the transition until all ERP5 instances are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before SlapOS included ghostscript 9.54, the ERP5 software release did not have ghostscript in $PATH, so we don't have to check the ghostscript version: we assume that if gs is in $PATH, we have a recent enough SlapOS.

This new approach is less tolerant of broken or password-protected PDFs, so we now check that the PDF is valid and not encrypted before trying to use OCR.
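The "use ghostscript if available, otherwise fall back" behaviour described above can be sketched as a standalone Python 3 snippet. This is only an illustration, not the ERP5 code: the function names `ocr_with_ghostscript` and `convert_pdf_to_text`, the `legacy_ocr` callback and the local `ConversionError` class are hypothetical stand-ins, and the `command` parameter exists only so that the example can be exercised without ghostscript installed. The ghostscript arguments are copied from the diff.

```python
import errno
import subprocess

# Command line copied from the diff: read the PDF on stdin, write text to stdout.
GS_OCR_COMMAND = (
    'gs', '-q', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE',
    '-dNOPROMPT', '-sDEVICE=ocr', '-r300x300', '-o', '-', '-f', '-',
)


class ConversionError(Exception):
    """Hypothetical stand-in for the ConversionError used in ERP5."""


def ocr_with_ghostscript(pdf_data, command=GS_OCR_COMMAND):
    """Run the OCR command on pdf_data (bytes).

    Returns the command output as bytes, or None when the executable
    cannot be found, so that the caller can fall back to another pipeline.
    """
    try:
        process = subprocess.Popen(
            list(command),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError as e:
        # ENOENT means the executable is missing; anything else is a real error.
        if e.errno != errno.ENOENT:
            raise
        return None
    output, error = process.communicate(pdf_data)
    if process.returncode:
        raise ConversionError(
            "Error invoking ghostscript.\noutput:%s\nerror:%s" % (output, error))
    return output.strip()


def convert_pdf_to_text(pdf_data, legacy_ocr, command=GS_OCR_COMMAND):
    """Try ghostscript first; use legacy_ocr (hypothetical) if it is missing."""
    text = ocr_with_ghostscript(pdf_data, command)
    if text is None:
        return legacy_ocr(pdf_data)
    return text
```

Returning `None` for a missing executable keeps "ghostscript is not installed" distinct from "ghostscript ran and failed": only the first case triggers the expensive legacy pipeline, while a non-zero exit status is reported as an error.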

b6d64daf · Jérome Perrin · Klaus Wölfel · 38574278 · b6d64daf
Commit b6d64daf authored May 28, 2021 by Jérome Perrin Committed by Klaus Wölfel Mar 17, 2022

Showing with 50 additions and 5 deletions

bt5/erp5_dms/DocumentTemplateItem/portal_components/document.erp5.PDFDocument.py +50 -5

--- a/bt5/erp5_dms/DocumentTemplateItem/portal_components/document.erp5.PDFDocument.py
+++ b/bt5/erp5_dms/DocumentTemplateItem/portal_components/document.erp5.PDFDocument.py
@@ -164,22 +164,67 @@ class PDFDocument(Image):
    raise NotImplementedError

  security.declarePrivate('_convertToText')
-  def _convertToText(self):
-    """
-      Convert the PDF text content to text with pdftotext
+  def _convertToText(self, format='txt'):  # pylint: disable=redefined-builtin
+    """Convert the PDF to text
+
+    If the PDF has text, return the text, otherwise try to do OCR using
+    tesseract.
    """
    if not self.hasData():
      return ''
+    data = str(self.getData())
+    try:
+      from PyPDF2 import PdfFileReader
+      from PyPDF2.utils import PdfReadError
+    except ImportError:
+      pass
+    else:
+      try:
+        if PdfFileReader(StringIO(data)).isEncrypted:
+          return ''
+      except PdfReadError:
+        return ''
+
    mime_type = 'text/plain'
    portal_transforms = self.getPortalObject().portal_transforms
    filename = self.getFilename()
-    result = portal_transforms.convertToData(mime_type, str(self.getData()),
+    result = portal_transforms.convertToData(mime_type, data,
                                             context=self, filename=filename,
                                             mimetype=self.getContentType())
    if result:
      return result
    else:
-      # Try to use OCR
+      # Try to use OCR from ghostscript, but tolerate that the command might
+      # not be available.
+      process = None
+      command = [
+          'gs', '-q', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE',
+          '-dNOPROMPT', '-sDEVICE=ocr', '-r300x300', '-o', '-', '-f', '-'
+      ]
+      try:
+        process = Popen(
+            command,
+            stdin=PIPE,
+            stdout=PIPE,
+            stderr=PIPE,
+            close_fds=True,
+        )
+        output, error = process.communicate(data)
+        if process.returncode:
+          raise ConversionError(
+              "Error invoking ghostscript.\noutput:%s\nerror:%s" % (output, error))
+        return output.strip()
+      except OSError as e:
+        if e.errno != errno.ENOENT:
+          raise
+      finally:
+        del process
+
+      # We don't have ghostscript, fall back to the expensive pipeline using:
+      #   pdf -- (Image._convert imagemagick) --> png
+      #       -- (PortalTransforms.png_to_tiff imagemagick) --> tiff
+      #       -- (PortalTransforms.tiff_to_text tesseract) --> text
+      #
      # As high dpi images are required, it may take some time to convert the
      # pdf.
      # It may be required to use activities to fill the cache and at the end,