dms: use ghostscript to convert PDF to text

For historical reasons, converting a PDF to text meant first converting the PDF to png, then the png to tiff, and finally sending the tiff to tesseract. This works, but it consumes a lot of resources with large PDFs, especially because the intermediate png/tiff files are created at a resolution of 300 DPI, which easily needs several GB of RAM and temporary disk space. This was observed with the PDFs created by erp5_document_scanner, which are usually high quality (1 or 2 MB per page): even a one-page PDF sometimes took more than one minute to OCR.

Since 9.53, ghostscript integrates the tesseract engine directly, so we don't need to prepare a tiff beforehand: we can send the PDF data directly to ghostscript.

This change uses ghostscript if available and otherwise falls back to the same pipeline as before. This allows a transition period until all ERP5 instances are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before SlapOS included ghostscript 9.54, the ERP5 software release did not have ghostscript in $PATH, so we don't have to check the ghostscript version: we assume that if gs is in $PATH, we have a recent enough SlapOS.

This new approach is less tolerant of broken or password-protected PDFs, so we now check that the PDF is valid and not encrypted before trying to use OCR.
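The "use ghostscript if available, otherwise fall back" logic described above can be sketched in isolation. This is a minimal Python 3 illustration, not the code from the commit (which targets ERP5's PDFDocument class): the function name `ocr_with_fallback`, the `fallback` callable standing in for the png/tiff/tesseract pipeline, and the use of `RuntimeError` in place of ERP5's `ConversionError` are all assumptions made for the example. The ghostscript flags are the ones used in the diff.

```python
import errno
import subprocess

# Same flags as in the commit: the "ocr" device reads the PDF from stdin
# ("-f -") and writes the recognised text to stdout ("-o -").
GS_OCR_COMMAND = (
    'gs', '-q', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE',
    '-dNOPROMPT', '-sDEVICE=ocr', '-r300x300', '-o', '-', '-f', '-',
)


def ocr_with_fallback(data, fallback, command=GS_OCR_COMMAND):
    """Run `command` on `data` (bytes) and return its stripped output.

    If the executable cannot be found, return `fallback(data)` instead.
    `fallback` is a hypothetical stand-in for the older, more expensive
    pipeline.
    """
    try:
        process = subprocess.Popen(
            list(command),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            close_fds=True,
        )
    except OSError as e:
        # ENOENT means the executable is not in $PATH: use the old pipeline.
        # Any other OS error is unexpected and is re-raised.
        if e.errno != errno.ENOENT:
            raise
        return fallback(data)
    output, error = process.communicate(data)
    if process.returncode:
        # The command exists but failed: report it rather than silently
        # falling back.
        raise RuntimeError(
            "Error invoking %s.\noutput:%r\nerror:%r"
            % (command[0], output, error))
    return output.strip()
```

A caller would pass the raw PDF bytes and a function implementing the previous pipeline, for example `ocr_with_fallback(pdf_bytes, legacy_pipeline)`, where `legacy_pipeline` is hypothetical. As in the commit, no version check is done: finding the executable is taken as a sign that it is recent enough.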

Commit f775724e authored May 28, 2021 by Jérome Perrin
2 changed files
--- a/bt5/erp5_dms/DocumentTemplateItem/portal_components/document.erp5.PDFDocument.py
+++ b/bt5/erp5_dms/DocumentTemplateItem/portal_components/document.erp5.PDFDocument.py
@@ -165,21 +165,66 @@ class PDFDocument(Image):
  security.declarePrivate('_convertToText')
  def _convertToText(self, format='txt'):  # pylint: disable=redefined-builtin
-    """
+    """Convert the PDF to text
-      Convert the PDF text content to text with pdftotext
+    If the PDF has text, return the text, otherwise try to do OCR using
+    tesseract.
    """
    if not self.hasData():
      return ''
+    data = str(self.getData())
+    try:
+      from PyPDF2 import PdfFileReader
+      from PyPDF2.utils import PdfReadError
+    except ImportError:
+      pass
+    else:
+      try:
+        if PdfFileReader(StringIO(data)).isEncrypted:
+          return ''
+      except PdfReadError:
+        return ''
    mime_type = 'text/plain'
    portal_transforms = self.getPortalObject().portal_transforms
    filename = self.getFilename()
-    result = portal_transforms.convertToData(mime_type, str(self.getData()),
+    result = portal_transforms.convertToData(mime_type, data,
                                             context=self, filename=filename,
                                             mimetype=self.getContentType())
    if result:
      return result
    else:
-      # Try to use OCR
+      # Try to use OCR from ghostscript, but tolerate that the command might
+      # not be available.
+      process = None
+      command = [
+          'gs', '-q', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE',
+          '-dNOPROMPT', '-sDEVICE=ocr', '-r300x300', '-o', '-', '-f', '-'
+      ]
+      try:
+        process = Popen(
+            command,
+            stdin=PIPE,
+            stdout=PIPE,
+            stderr=PIPE,
+            close_fds=True,
+        )
+        output, error = process.communicate(data)
+        if process.returncode:
+          raise ConversionError(
+              "Error invoking ghostscript.\noutput:%s\nerror:%s" % (output, error))
+        return output.strip()
+      except OSError as e:
+        if e.errno != errno.ENOENT:
+          raise
+      finally:
+        del process
+      # We don't have ghostscript, fallback to the expensive pipeline using:
+      #   pdf -- (Image._convert imagemagick) --> png
+      #       -- (PortalTransforms.png_to_tiff imagemagick) --> tiff
+      #       -- (PortalTransforms.tiff_to_text tesseract) --> text
+      #
      # As high dpi images are required, it may take some times to convert the
      # pdf.
      # It may be required to use activities to fill the cache and at the end,

--- a/bt5/erp5_dms/TestTemplateItem/portal_components/test.erp5.testDms.py
+++ b/bt5/erp5_dms/TestTemplateItem/portal_components/test.erp5.testDms.py
@@ -71,6 +71,7 @@ from AccessControl import Unauthorized
 from Products.ERP5Type import Permissions
 from DateTime import DateTime
 from ZTUtils import make_query
+import PyPDF2
 QUIET = 0
@@ -1946,13 +1947,34 @@ document.write('<sc'+'ript type="text/javascript" src="http://somosite.bg/utb.ph
  def test_PDFDocument_asTextConversion(self):
    """Test a PDF document with embedded images
-    To force usage of Ocropus portal_transform chain
+    To force usage of ghostscript with embedded tesseract OCR device
    """
-    portal_type = 'PDF'
+    document = self.portal.document_module.newContent(
-    module = self.portal.getDefaultModule(portal_type)
+        portal_type='PDF',
-    upload_file = makeFileUpload('TEST.Embedded.Image.pdf')
+        file=makeFileUpload('TEST.Embedded.Image.pdf'))
-    document = module.newContent(portal_type=portal_type, file=upload_file)
+    self.assertEqual(document.asText(), 'ERP5 is a free software.')
-    self.assertEqual('ERP5 is a free software.', document.asText())
+  def test_broken_pdf_asText(self):
+    class StringIOWithFilename(StringIO.StringIO):
+      filename = 'broken.pdf'
+    document = self.portal.document_module.newContent(
+        portal_type='PDF',
+        file=StringIOWithFilename('broken'))
+    self.assertEqual(document.asText(), '')
+    self.tic() # no activity failure
+  def test_password_protected_pdf_asText(self):
+    pdf_reader = PyPDF2.PdfFileReader(makeFileUpload('TEST.Embedded.Image.pdf'))
+    pdf_writer = PyPDF2.PdfFileWriter()
+    pdf_writer.addPage(pdf_reader.getPage(0))
+    pdf_writer.encrypt('secret')
+    encrypted_pdf_stream = StringIO.StringIO()
+    pdf_writer.write(encrypted_pdf_stream)
+    document = self.portal.document_module.newContent(
+        portal_type='PDF',
+        file=encrypted_pdf_stream)
+    self.assertEqual(document.asText(), '')
+    self.tic() # no activity failure
  def createRestrictedSecurityHelperScript(self):
    script_content_list = ['format=None, **kw', """