OCR image-only PDFs #1583
Comments
Is this still an issue?
I think so. Tagging as a feature request until someone can confirm we have PDF OCR.
It should be working. Our installation of it has it working.
@DonRichards, and it isn't a Born-Digital specific feature? Can you test on a vanilla Islandora install to confirm?
I think this is still a problem after testing on my local.
This is still an issue. When Hypercube gets a PDF, it uses pdftotext instead. This was done as part of the RDM work. We did not realize that you can get text out of most text-containing files (and, if you want, into the Solr index) with https://www.drupal.org/project/file_extractor. I have a PR coming after I run the tests.
I do not have a PR coming. It turns out Hypercube on its own does not accept PDFs; you need a wrapper like ocrmypdf.
Linked to this as well: #1012
Popped up in the
Stemming from an email discussion:
Hypercube currently uses pdftotext to extract text embedded in a PDF, or tesseract to perform OCR on images. However, if a user uploads a scanned document as a PDF, no OCR is performed on it, so no text is produced.
Tesseract can't process PDFs natively (hence pdftotext), but we can use pdfimages to extract the images into a temporary directory and loop tesseract over them to produce the OCR text.
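For anyone picking this up, here is a rough Python sketch of the pdfimages + tesseract loop described above. It is not Hypercube's code: the function names and the injectable `run` parameter are hypothetical, and it assumes poppler's `pdfimages` (a build that supports `-png`) and the `tesseract` CLI are both on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    # pdfimages writes one file per embedded image, named <prefix>-NNN.png.
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    # Passing "stdout" as the output base makes tesseract print the text.
    return ["tesseract", str(image_path), "stdout"]


def ocr_image_only_pdf(pdf_path, run=subprocess.run):
    """Extract the images from a scanned PDF and OCR them in order.

    `run` is injectable (hypothetical design choice) so the loop can be
    exercised without the real binaries installed.
    """
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        run(pdfimages_command(pdf_path, prefix), check=True)
        # Zero-padded numbering means a plain sort keeps extraction order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `ocr_image_only_pdf("scan.pdf")` would return the concatenated text of every extracted image. A real implementation would probably try pdftotext first and only fall back to this loop when that yields no text, and it would need to handle pages that contain several images.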