Commit b6d64daf authored by Jérome Perrin, committed by Klaus Wölfel

dms: use ghostscript to convert PDF to text

For historical reasons, PDF to text conversion first converted the PDF to
png, then the png to tiff, and then sent the tiff to tesseract. This works, but
it consumes a lot of resources with large PDFs, especially because the
intermediate png/tiff are created with a resolution of 300 DPI, which easily
needs several GB of RAM and temporary disk space.
This was observed with the PDFs created by erp5_document_scanner, which are
usually high quality (1 or 2 MB per page); even a one-page PDF sometimes
took more than one minute to OCR.
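
To give a feel for why 300 DPI intermediates are expensive, here is a rough
back-of-the-envelope sketch. The page size (A4) and pixel format (8-bit RGB,
uncompressed) are assumptions for illustration only; the commit does not state
them, and imagemagick may use a wider in-memory pixel format.

```python
# Rough estimate of the uncompressed size of one rasterised page.
# Assumptions (not from the commit): A4 paper, 8-bit RGB, no compression.
MM_PER_INCH = 25.4


def raster_size_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, size_in_bytes) for one rendered page."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


width_px, height_px, size = raster_size_bytes(210, 297, dpi=300)
print(width_px, height_px, "pixels,", round(size / 1e6, 1), "MB per page")
```

Under these assumptions a single page is about 26 MB in memory, and the old
pipeline holds it as both a png and a tiff, so a document of a few dozen pages
can plausibly reach the gigabyte range.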

Since version 9.53, ghostscript integrates the tesseract engine directly, so we
don't need to prepare a tiff beforehand: we can send the PDF data directly to
ghostscript.
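
As a standalone illustration of that direct call, the sketch below pipes PDF
bytes to ghostscript's ocr device using the same flags as the diff in this
commit. It is written for Python 3, whereas the ERP5 code in the diff is
Python 2. The function name, the `executable` parameter and the use of
RuntimeError are assumptions made for this example, not part of ERP5.

```python
import errno
from subprocess import PIPE, Popen


def ocr_pdf_with_ghostscript(pdf_bytes, executable="gs"):
    """Return the OCR text as bytes, or None if the executable is missing.

    `executable` is a hypothetical parameter added so the sketch can be
    exercised without ghostscript installed.
    """
    command = [
        executable, "-q", "-dQUIET", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-dNOPROMPT", "-sDEVICE=ocr", "-r300x300", "-o", "-", "-f", "-",
    ]
    try:
        process = Popen(command, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    except OSError as e:
        # ENOENT means the binary is not in $PATH; anything else is a real error.
        if e.errno != errno.ENOENT:
            raise
        return None
    # The PDF is read from stdin ("-f -") and the text is written to stdout ("-o -").
    output, error = process.communicate(pdf_bytes)
    if process.returncode:
        raise RuntimeError(
            "Error invoking ghostscript.\noutput:%r\nerror:%r" % (output, error))
    return output.strip()
```

A caller would treat a None result as "ghostscript not available" and switch
to another conversion path.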

This change uses ghostscript if available and otherwise falls back to the same
pipeline as before. This will allow the transition until all ERP5 instances
are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before
SlapOS included ghostscript 9.54, the ERP5 software release did not have
ghostscript in $PATH, so we don't have to check the ghostscript version: we
assume that if gs is in $PATH, we have a recent enough SlapOS.
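
The "gs in $PATH means recent enough" assumption could be expressed as the
small Python 3 sketch below. Note that the diff in this commit does not probe
$PATH up front; it simply tries to run gs and treats ENOENT as "not
available". The function name and return values here are hypothetical.

```python
import shutil


def choose_ocr_backend(executable="gs"):
    """Pick the OCR backend purely from the presence of `executable` in $PATH.

    No version check is done, mirroring the assumption in the commit message
    that any gs found in $PATH is recent enough.
    """
    return "ghostscript" if shutil.which(executable) else "legacy"


print(choose_ocr_backend())
```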

This new approach is less tolerant of broken or password-protected PDFs, so we
now check that the PDF is valid and not encrypted before trying to use OCR.
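
The commit performs this check with PyPDF2 (PdfFileReader(...).isEncrypted,
with PdfReadError treated as "invalid"). For readers without PyPDF2, the
sketch below shows a much cruder, dependency-free heuristic that serves the
same purpose. It is a swapped-in simplification, not the method used in the
commit: it can give false positives and does not validate the PDF structure.
The function name is hypothetical.

```python
def looks_ocr_ready(pdf_bytes):
    """Heuristic pre-check before OCR: PDF header present, no /Encrypt entry.

    This only scans raw bytes. A real parser (such as PyPDF2 in the commit)
    is needed to decide reliably whether a PDF is valid or encrypted.
    """
    # The "%PDF-" marker is expected near the start of the file.
    if b"%PDF-" not in pdf_bytes[:1024]:
        return False
    # Encrypted PDFs reference an encryption dictionary through /Encrypt.
    return b"/Encrypt" not in pdf_bytes
```

When the check fails, the caller would return an empty string instead of
attempting OCR, as the diff does.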
parent 38574278
@@ -164,22 +164,67 @@ class PDFDocument(Image):
     raise NotImplementedError
   security.declarePrivate('_convertToText')
-  def _convertToText(self):
-    """
-    Convert the PDF text content to text with pdftotext
+  def _convertToText(self, format='txt'): # pylint: disable=redefined-builtin
+    """Convert the PDF to text
+    If the PDF have text, return the text, otherwise try to do OCR using
+    tesseract.
     """
     if not self.hasData():
       return ''
+    data = str(self.getData())
+    try:
+      from PyPDF2 import PdfFileReader
+      from PyPDF2.utils import PdfReadError
+    except ImportError:
+      pass
+    else:
+      try:
+        if PdfFileReader(StringIO(data)).isEncrypted:
+          return ''
+      except PdfReadError:
+        return ''
     mime_type = 'text/plain'
     portal_transforms = self.getPortalObject().portal_transforms
     filename = self.getFilename()
-    result = portal_transforms.convertToData(mime_type, str(self.getData()),
+    result = portal_transforms.convertToData(mime_type, data,
                                              context=self, filename=filename,
                                              mimetype=self.getContentType())
     if result:
       return result
     else:
-      # Try to use OCR
+      # Try to use OCR from ghostscript, but tolerate that the command might
+      # not be available.
+      process = None
+      command = [
+        'gs', '-q', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE',
+        '-dNOPROMPT', '-sDEVICE=ocr', '-r300x300', '-o', '-', '-f', '-'
+      ]
+      try:
+        process = Popen(
+          command,
+          stdin=PIPE,
+          stdout=PIPE,
+          stderr=PIPE,
+          close_fds=True,
+        )
+        output, error = process.communicate(data)
+        if process.returncode:
+          raise ConversionError(
+            "Error invoking ghostscript.\noutput:%s\nerror:%s" % (output, error))
+        return output.strip()
+      except OSError as e:
+        if e.errno != errno.ENOENT:
+          raise
+      finally:
+        del process
+      # We don't have ghostscript, fallback to the expensive pipeline using:
+      #   pdf -- (Image._convert imagemagick) --> png
+      #       -- (PortalTransforms.png_to_tiff imagemagick) --> tiff
+      #       -- (PortalTransforms.tiff_to_text tesseract) --> text
+      #
       # As high dpi images are required, it may take some times to convert the
       # pdf.
       # It may be required to use activities to fill the cache and at the end,
......