OCR image-only PDFs #1583

seth-shaw-unlv · 2020-08-19T18:33:24Z

Hypercube currently uses pdftotext to extract text embedded in a PDF or tesseract to perform OCR on images. However, if a user uploads a scanned document as a PDF, Hypercube won't perform OCR on it, so no text is produced.

Tesseract can't process PDFs natively (hence the pdftotext), but we can use pdfimages to extract the images into a temporary directory and loop tesseract over them to produce the OCR text.
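
A minimal sketch of that pdfimages-then-tesseract loop, written in Python for illustration only; it is not Hypercube's code. It assumes the poppler `pdfimages` and `tesseract` binaries are on the PATH, that `pdfimages -png` writes files named `<prefix>-NNN.png`, and that each extracted image is a full page scan. The function names and the injectable `run` parameter are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    # -png asks pdfimages to write PNG files named <prefix>-NNN.png.
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    # Using "stdout" as the output base makes tesseract print the text.
    return ["tesseract", str(image_path), "stdout"]


def ocr_image_only_pdf(pdf_path, run=subprocess.run):
    """Extract the images from a PDF and OCR them one by one.

    `run` is injectable (hypothetical design choice) so the loop can be
    exercised without the real binaries installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        run(pdfimages_command(pdf_path, prefix), check=True)

        pages = []
        # Zero-padded numbering keeps lexicographic order equal to
        # extraction order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `ocr_image_only_pdf("scan.pdf")` would return the concatenated text and remove the temporary directory afterwards. A real integration would probably still try pdftotext first and fall back to this path only when that yields no text.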

DonRichards · 2021-03-22T19:07:37Z

Is this still an issue?

seth-shaw-unlv · 2021-09-08T17:59:45Z

I think so. Tagging as a feature request until someone can confirm we have PDF OCR.

DonRichards · 2021-09-21T13:43:49Z

It should be working. Our installation has it working.

seth-shaw-unlv · 2021-09-21T14:42:36Z

@DonRichards, and it isn't a Born-Digital specific feature? Can you test on a vanilla Islandora install to confirm?

DonRichards · 2021-09-29T19:42:10Z

I think this is still a problem after testing on my local install.

rosiel · 2022-07-18T16:09:41Z

This is still an issue. When Hypercube gets a PDF, it uses pdftotext instead. This was done as part of the RDM work.

We did not realize that you can get text out of most text-containing files (and, if you want, into the Solr index) with https://www.drupal.org/project/file_extractor.

I have a PR coming after I run the tests.

rosiel · 2022-07-20T14:43:52Z

I do not have a PR coming. It turns out Tesseract on its own does not accept PDFs; you need a wrapper like ocrmypdf.

DonRichards · 2022-07-20T17:15:37Z

Linked to this as well #1012

adam-vessey · 2022-11-09T16:30:08Z

Popped up in the islandora/islandora queue: Islandora/islandora#910

seth-shaw-unlv added the feature request label Sep 8, 2021

kstapelfeldt added the Type: feature request label and removed the feature request label Sep 25, 2021

kstapelfeldt added this to Islandora Issues Queue Feb 1, 2022

kstapelfeldt moved this to Todo in Islandora Issues Queue Feb 1, 2022
