Pipelines · dms: use ghostscript to convert PDF to text (f775724e) · Commits · nexedi / erp5

Commit f775724e authored May 28, 2021 by

Jérome Perrin

dms: use ghostscript to convert PDF to text

For historical reasons, PDF to text involved conversion first of the PDF to
png, then this png to tiff and the tiff was sent to tesseract. This works, but
it consumes a lot of resources with large PDFs, especially because the
intermediate png/tiff are created with a resolution of 300 DPI, which easily
needs serveral Go of RAM and temporary disk space.
This was obsorved with the PDF created by erp5_document_scanner, which are
usually high quality (1 or 2Mo per page) and even a one page PDF sometimes
took more than one minute to OCR.

Since 9.53 ghostscript integrates tesseract engine directly, we don't need to
prepare a tiff beforehand, we can directly send the PDF data to ghostscript.

These change use ghostscript if available and otherwise fallback to the same
pipeline as before. This will allow the transition until all ERP5 instances
are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before
SlapOS include ghostscript 9.54, ERP5 software release did not have ghostscript
in $PATH, so we don't have to check ghostscript version, we assume that if gs
is in $PATH, it means we have a recent enough SlapOS.

This new approach was less tolerant regarding broken/password-protected PDFs
so we perform a new check that the PDF is valid and not encrypted before
trying to use OCR.

parent d74981c3

Pipeline #15769 failed with stage in 0 seconds