nexedi / erp5 / Merge requests / !1420

Merged
Created May 20, 2021 by Jérome Perrin (@jerome, Owner) · 3 of 3 tasks completed

Lighter processing for OCR activities

When running OCR, we sometimes have issues because processing is "too heavy":

  • It uses 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert PDF -> PNG -> TIFF before sending the result to tesseract. Modern Ghostscript supports running tesseract directly, so we use that when it is available.
  • It uses 300% of CPU. This is fixed by setting OMP_THREAD_LIMIT when running tesseract. The setting only applies when running OCR on images; OCR embedded in Ghostscript does not seem to need it.
  • ... and it often crashes and has to be restarted. This is fixed by updating tesseract.
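
As an illustration of the CPU fix, here is a minimal sketch of capping tesseract's OpenMP threads from Python. It is not the code from this merge request: the helper names `tesseract_env` and `run_tesseract` are hypothetical, and the limit of 1 is an assumption, since the description does not state which value is used.

```python
import os
import subprocess


def tesseract_env(base_env=None, thread_limit=1):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    thread_limit=1 is an assumed value, not one taken from the merge request.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def run_tesseract(image_path, output_base, lang="eng", thread_limit=1):
    """Run `tesseract IMAGE OUTPUTBASE -l LANG` with a capped thread count."""
    return subprocess.run(
        ["tesseract", image_path, output_base, "-l", lang],
        env=tesseract_env(thread_limit=thread_limit),
        check=True,
        capture_output=True,
    )
```

Passing the variable through `env` keeps the limit local to the tesseract subprocess instead of changing the environment of the calling process.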

Updates of ghostscript and tesseract are part of slapos!985 (merged)
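
Because Ghostscript's built-in OCR is only used when it is available, the calling code needs some way to detect it. The following is a rough sketch of one possible check, not the detection logic from this merge request: it parses the device list printed by `gs -h` and looks for a device named `ocr`, which Ghostscript builds with Tesseract support provide. The function names are hypothetical, and whether the merge request relies on this particular device is an assumption.

```python
import shutil
import subprocess


def devices_from_help(help_text):
    """Collect device names from the 'Available devices:' section of `gs -h`."""
    devices = set()
    in_devices = False
    for line in help_text.splitlines():
        if line.startswith("Available devices:"):
            in_devices = True
            continue
        if in_devices:
            # Device names are listed on indented lines; the next
            # unindented line (e.g. "Search path:") ends the section.
            if not line.startswith(" "):
                break
            devices.update(line.split())
    return devices


def ghostscript_has_ocr(gs="gs"):
    """Return True if the `gs` binary is found and lists an `ocr` device."""
    if shutil.which(gs) is None:
        return False
    result = subprocess.run([gs, "-h"], capture_output=True, text=True)
    return "ocr" in devices_from_help(result.stdout)
```

When `ghostscript_has_ocr()` returns False, a caller could fall back to the image-based tesseract path described above.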

Edited May 31, 2021 by Jérome Perrin
Source branch: fix/tesseract-lighter-activities