Harvest text from PDF images #73

richardmatthewsdev · 2024-07-16T23:16:35Z

As an Agency, I want the text in my PDF images to be harvestable by Supplejack, so that information about my legislation can be used accurately on the NZLegislation website.

Acceptance Criteria

Text in PDF images can be harvested
Viewers of the data can see whether the data was converted from a PDF or an image
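A minimal sketch of how both criteria could fit together: try plain text extraction first, fall back to OCR when nothing comes back, and keep a label saying which path produced the text. `ExtractionResult`, `extract_with_fallback`, the two injected callables and the `"text_extraction"` / `"ocr"` labels are hypothetical names for illustration, not the code in this PR, and treating an empty result as a failed extraction is an assumption.

```ruby
# Hypothetical result object: the harvested text plus how it was obtained.
ExtractionResult = Struct.new(:text, :process, keyword_init: true)

# extract_text and ocr are injected callables (assumptions for this sketch).
# In a real worker they would wrap a PDF text extractor and an OCR engine.
def extract_with_fallback(document, extract_text:, ocr:)
  text = extract_text.call(document).to_s.strip

  # Assumption: an empty result means plain text extraction has failed.
  return ExtractionResult.new(text: text, process: "text_extraction") unless text.empty?

  ExtractionResult.new(text: ocr.call(document).to_s.strip, process: "ocr")
end
```

Storing the `process` value next to the text is what would let a viewer tell OCR output apart from directly extracted text.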

Notes

The harvester must be able to determine dynamically whether it is dealing with an image or a text file
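One way to make that decision at harvest time is to look at the leading "magic" bytes of the payload instead of trusting the file extension. The sketch below is an assumption about how this could work, not the approach taken in this PR; `ContentSniffer` and its method names are hypothetical. A PDF header alone does not reveal whether the pages are scanned images, so this check only narrows down the file type.

```ruby
# Hypothetical helper: classify a payload by its leading bytes.
module ContentSniffer
  # Well-known file signatures, stored as binary strings.
  SIGNATURES = {
    pdf: "%PDF-".b,
    png: "\x89PNG\r\n\x1A\n".b,
    jpeg: "\xFF\xD8\xFF".b,
    gif: "GIF8".b
  }.freeze

  IMAGE_KINDS = %i[png jpeg gif].freeze

  # Returns :pdf, :png, :jpeg, :gif, :text or :unknown.
  def self.kind(bytes)
    head = bytes.to_s.b
    SIGNATURES.each { |kind, magic| return kind if head.start_with?(magic) }

    # Treat NUL-free, valid UTF-8 as text; everything else is unknown.
    utf8 = head.dup.force_encoding(Encoding::UTF_8)
    utf8.valid_encoding? && !utf8.include?("\u0000") ? :text : :unknown
  end

  def self.image?(bytes)
    IMAGE_KINDS.include?(kind(bytes))
  end
end
```

For example, `ContentSniffer.kind(File.binread(path))` could be used to route images straight to OCR while PDFs and text files go through normal extraction first.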

github-actions · 2024-07-17T01:13:39Z

Code quality score

Lovely, the code quality is unchanged for this PR 😊

| | Ruby file count | Similarity score (flay) | ABC complexity (flog) | Code smells (reek) | TOTALS |
|---|---|---|---|---|---|
| base | 90 | 6.12 | 5.36 | 17.15 | 28.63 |
| this branch | 90 | 6.12 | 5.36 | 17.15 | 28.63 |
| difference | 0 | 0.0 | 0.0 | 0.0 | 0.0 |

app/sidekiq/text_extraction_worker.rb

eoin-boost

Great feature, thanks for this @richardmatthewsdev

richardmatthewsdev added 5 commits July 16, 2024 16:09

feat(ocr_pdf): OCR pdfs when the text extraction has failed

a050c45

feat(ocr): Update the text extraction to say how the text was processed

b94ed14

fix(ocr): Improve Pipeline view so it says its running while a text extraction worker is running

1791145

fix(ocr): Update time while the ExtractionWorker is happening

133b461

Resolve merge conflicts

4cfe231

richardmatthewsdev added 6 commits July 19, 2024 15:30

Rubocop

0047ec4

Brakeman

c84dfe1

Attempt to install tesseract in the pipeline

e3184a0

attempt 2

f1b8ddd

Rubocop

e9a2e4d

Add missing language libraries for Tesseract

a0b73df

eoin-boost reviewed Jul 22, 2024

app/sidekiq/text_extraction_worker.rb (resolved)

eoin-boost approved these changes Jul 22, 2024

motizuki approved these changes Jul 22, 2024

richardmatthewsdev merged commit 3ba89b5 into main Jul 23, 2024
8 checks passed

richardmatthewsdev deleted the rm/ocr branch July 23, 2024 22:17
