OCR image-only PDFs #1583

Open
seth-shaw-unlv opened this issue Aug 19, 2020 · 9 comments
Labels
Type: feature request a proposal for a new feature in the software (should be justified by a ‘use case’)

Comments

@seth-shaw-unlv
Contributor

Stemming from an email discussion:

Hypercube currently uses pdftotext to extract text embedded in a PDF, or tesseract to perform OCR on images. However, if a user uploads a scanned document as a PDF, no OCR is performed on it, so no text is produced.

Tesseract can't process PDFs natively (hence pdftotext), but we can use pdfimages to extract the images into a temporary directory and loop tesseract over them to produce the OCR text.
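
A minimal sketch of that loop, written as a standalone Python helper rather than as Hypercube code. The function name `ocr_image_only_pdf` and the injectable `run` parameter are illustrative assumptions; the sketch also assumes the Poppler `pdfimages` and `tesseract` binaries are on the PATH and that `pdfimages -png` names its output `<prefix>-NNN.png`.

```python
import subprocess
import tempfile
from pathlib import Path


def ocr_image_only_pdf(pdf_path, run=subprocess.run):
    """Extract the images in a PDF with pdfimages, then OCR each with tesseract.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without the real binaries installed.
    """
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        # Writes page-000.png, page-001.png, ... into the temporary directory.
        run(["pdfimages", "-png", str(pdf_path), prefix], check=True)
        images = sorted(
            Path(tmp).glob("page-*.png"),
            # Sort by the numeric suffix so the text keeps document order.
            key=lambda p: int(p.stem.rsplit("-", 1)[1]),
        )
        for image in images:
            # Passing "stdout" as the output base makes tesseract print the text.
            result = run(
                ["tesseract", str(image), "stdout"],
                check=True,
                capture_output=True,
                text=True,
            )
            texts.append(result.stdout)
    # The temporary directory and its images are removed on exit.
    return "\n".join(texts)
```

A caller could try pdftotext first and fall back to this helper only when the extracted text is empty, which would cover both born-digital and scanned PDFs.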

@DonRichards
Member

Is this still an issue?

@seth-shaw-unlv
Contributor Author

I think so. Tagging as a feature request until someone can confirm we have PDF OCR.

@DonRichards
Member

It should be working; it works on our installation.

@seth-shaw-unlv
Contributor Author

@DonRichards, and it isn't a Born-Digital-specific feature? Can you test on a vanilla Islandora install to confirm?

@kstapelfeldt added the "Type: feature request" label and removed the "feature request" label on Sep 25, 2021
@DonRichards
Member

I think this is still a problem, after testing on my local install.

@rosiel
Member

rosiel commented Jul 18, 2022

This is still an issue. When Hypercube gets a PDF, it uses pdftotext instead of tesseract. This was done as part of the RDM work.

We did not realize that you can get text out of most text-containing files (and, if you want, into the Solr index) with https://www.drupal.org/project/file_extractor.

I have a PR coming after I run the tests.

@rosiel
Member

rosiel commented Jul 20, 2022

I do not have a PR coming. It turns out Hypercube on its own does not accept PDFs; you need a wrapper like ocrmypdf.
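
For illustration only, here is a rough sketch of calling ocrmypdf from Python. It is not how Hypercube or Islandora invokes it. The helper names are hypothetical; `--skip-text` and `--sidecar` are existing ocrmypdf options, and the sketch assumes the `ocrmypdf` executable is installed.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt):
    """Build an ocrmypdf command line that also writes the OCR text to a file."""
    return [
        "ocrmypdf",
        # Leave pages that already carry a text layer untouched.
        "--skip-text",
        # Write the recognized text to a plain-text sidecar file.
        "--sidecar",
        str(sidecar_txt),
        str(input_pdf),
        str(output_pdf),
    ]


def ocr_pdf(input_pdf, output_pdf, sidecar_txt, run=subprocess.run):
    """Run ocrmypdf; raises CalledProcessError if it exits with an error."""
    run(build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt), check=True)
```

The sidecar file is the part a search index would consume; the output PDF keeps the original page images with a text layer added.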

@DonRichards
Member

This is linked to #1012 as well.

@adam-vessey
Contributor

Popped up in the islandora/islandora queue: Islandora/islandora#910
