Add HOCRConverter (fixes #650) #651

richardpaulhudson · 2021-07-29T20:14:50Z

Pull request

When text is extracted from a variety of types of PDF within a business process, those PDFs whose text is present only in image form need to be analysed with an OCR tool, which typically outputs hOCR. This converter extracts the explicit text information from the PDFs that do have it and uses it to generate a basic hOCR representation. That representation is designed to be used in conjunction with the image of the PDF in the same way as genuine OCR output would be, but without the inevitable OCR errors.
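To make the idea concrete, here is a minimal sketch of turning words with PDF coordinates into an hOCR fragment. It is not the HOCRConverter code from this PR: the function name `words_to_hocr`, the `Word` tuple layout and the sample coordinates are all hypothetical. The only facts it relies on are that hOCR describes boxes as `bbox left top right bottom` with the origin at the top-left of the page, while PDF coordinates start at the bottom-left, so the vertical axis has to be flipped.

```python
from html import escape
from typing import List, Tuple

# Hypothetical input shape: (text, x0, y0, x1, y1) in PDF coordinates,
# where the origin is the bottom-left corner of the page.
Word = Tuple[str, float, float, float, float]


def words_to_hocr(words: List[Word], page_width: float, page_height: float) -> str:
    """Render words as a one-page, one-line hOCR fragment (illustrative only)."""
    spans = []
    for text, x0, y0, x1, y1 in words:
        # hOCR boxes are "left top right bottom" measured from the top-left,
        # so flip the vertical axis using the page height.
        left, right = round(x0), round(x1)
        top, bottom = round(page_height - y1), round(page_height - y0)
        spans.append(
            f"<span class='ocrx_word' title='bbox {left} {top} {right} {bottom}'>"
            f"{escape(text)}</span>"
        )
    page_box = f"bbox 0 0 {round(page_width)} {round(page_height)}"
    return (
        f"<div class='ocr_page' title='{page_box}'>"
        f"<span class='ocr_line'>{' '.join(spans)}</span>"
        "</div>"
    )


# Made-up sample data on a US Letter page (612 x 792 points).
hocr = words_to_hocr(
    [("Hello", 100, 700, 150, 712), ("World", 155, 700, 210, 712)], 612, 792
)
print(hocr)
```

A real converter would also group words into lines and text boxes and emit a full HTML document around the page element; the sketch leaves that out to keep the coordinate handling visible.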

How Has This Been Tested?

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

layout = LAParams(all_texts=True)
extract_text_to_fp(in_file, out_file, output_type='hocr', laparams=layout)

tox also runs with Python 3.8 and 3.9.

Checklist

[x] I have added tests that prove my fix is effective or that my feature works
[x] I have added docstrings to newly created methods and classes
[x] I have optimized the code at least one time after creating the initial version
[x] I have updated the README.md or I verified that this is not necessary
[x] I have updated the readthedocs documentation or I verified that this is not necessary
[x] I have added a concise human-readable description of the change to CHANGELOG.md

willaaam · 2021-12-02T12:22:02Z

Would be amazing if this could be merged and included!

pietermarsman · 2022-01-25T20:15:48Z

Looks good to me.

I only wonder if this is something that should be added to pdfminer.six as core functionality. Alternatively, this could be something that everyone implements to their own liking. The composable api is perfectly suitable for adding functionality like this.

I'll post this question on the gitter.

pietermarsman · 2022-01-30T14:30:49Z

After some deliberation I'm positive on adding hocr as an output format. It has two advantages: direct comparison of the output to ocr tools and usage of other tools (e.g. visualization) built for hocr.

I'll do a more detailed review now.

pietermarsman

Thanks for the super nice PR!

Can you add tests showing this works. Ideally you would use the simple1.pdf for this.

This PR is already very good, but I like to use each change as an opportunity to improve pdfminer.six a bit. So I added some comments on how to improve this PR.

pdfminer/converter.py

pietermarsman · 2022-02-02T21:47:41Z

@richardpaulhudson I used this PR a bit for testing if the new CI pipeline is functioning properly. Now it is :)

pietermarsman · 2022-02-22T20:18:59Z

@richardpaulhudson any plans on working on this in the future?

richardpaulhudson · 2022-03-11T09:58:19Z

Hi @pietermarsman, thank you for the review and sorry for not responding sooner — I've changed employers in the meantime and there seem to be issues with where my GitHub notification mails are ending up. I hope to be able to pick up working on this in the next couple of months.

pietermarsman · 2022-03-19T16:57:15Z

FYI, I've changed this MR to merge into master. The develop branch will be removed, because soon we will work with version tags to indicate the releases and the distinction between develop and master becomes obsolete.

pietermarsman · 2022-06-25T20:54:02Z

bump ;)

richardpaulhudson · 2022-07-13T14:33:21Z

Sorry it's taken me so long to get back to this :-)

> Can you add tests showing this works. Ideally you would use the simple1.pdf for this.

I can certainly see the need for some sort of regression test, but am unsure how to approach it. What I actually did myself was:

- checked that the hOCR output passed hocr-check (from the hocr-tools package)
- commented in hocrjs and checked that the rendering of the content in the browser corresponded to the original PDF file

Neither of these lends itself easily to a regression test.

The options are:

1. a regression test that just checks the conversion is carried out successfully without an error
2. a regression test that checks the output of the conversion is equal to the output of my conversion which I have verified with the two steps above. Issues with this are:
   - there may be problems with the output that I'm not aware of because they weren't picked up by these two steps, but such a test would declare the output to be correct
   - tests comparing large amounts of output at once tend to be brittle
3. a regression test that checks the output of the conversion for specific features, although I'm unsure what these would be
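As an illustration of the last option, a feature-level check could parse the output and assert on a few structural properties instead of comparing the whole file. The sketch below uses only the standard library. The names `HOCRClassCounter`, `summarise_hocr` and `SAMPLE` are hypothetical, the sample is hand-written and not real converter output, and the assumption that the converter emits the `ocr_page` and `ocrx_word` classes would need to be checked against its actual output.

```python
from html.parser import HTMLParser
from typing import Dict


class HOCRClassCounter(HTMLParser):
    """Count hOCR classes and check that every bbox has four integer values."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Dict[str, int] = {}
        self.bboxes_ok = True

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Tally each class name found on the element.
        for name in (attributes.get("class") or "").split():
            self.counts[name] = self.counts.get(name, 0) + 1
        # A title may hold several ";"-separated properties; bbox comes first here.
        title = attributes.get("title") or ""
        if title.startswith("bbox"):
            parts = title.split(";")[0].split()[1:]
            if len(parts) != 4 or not all(p.isdigit() for p in parts):
                self.bboxes_ok = False


def summarise_hocr(hocr: str) -> HOCRClassCounter:
    counter = HOCRClassCounter()
    counter.feed(hocr)
    counter.close()
    return counter


# Hand-written sample standing in for real converter output (assumption).
SAMPLE = (
    "<div class='ocr_page' title='bbox 0 0 612 792'>"
    "<span class='ocr_line' title='bbox 100 80 210 92'>"
    "<span class='ocrx_word' title='bbox 100 80 150 92'>Hello</span> "
    "<span class='ocrx_word' title='bbox 155 80 210 92'>World</span>"
    "</span></div>"
)

summary = summarise_hocr(SAMPLE)
assert summary.counts.get("ocr_page") == 1
assert summary.counts.get("ocrx_word") == 2
assert summary.bboxes_ok
```

Checks like these stay valid when whitespace or attribute order changes, which avoids the brittleness of comparing large outputs, but they only catch the kinds of problems they explicitly look for.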

pietermarsman · 2022-08-08T20:27:11Z

I prefer option 1 (just checking if the code does not raise an error) or 2 (check for specific output). If you go for two, we do indeed need to have some output that we know is reasonably stable.

Having a test with output (option 2) is also a start of some documentation, as other developers can easily see what the expected output of the tool is.

pietermarsman · 2022-08-14T09:53:45Z

@richardpaulhudson Thanks for all your work!

Add HOCRConverter

7598220

Merge branch 'develop' into richardpaulhudson/develop

ff1d1db

Add line to README.md

aecb617

pietermarsman requested changes Jan 30, 2022

pietermarsman added 2 commits February 2, 2022 22:29

Merge branch 'develop' into richardpaulhudson/develop

cbfc3aa

Test cicd

f8bcb8e

pietermarsman self-requested a review February 2, 2022 21:39

pietermarsman previously approved these changes Feb 2, 2022

pietermarsman added 2 commits February 2, 2022 22:42

Test cicd 2

49fb8cb

Merge branch 'develop' into richardpaulhudson/develop

1145718

pietermarsman self-requested a review February 2, 2022 21:45

pietermarsman removed their request for review February 11, 2022 21:48

Merge branch 'develop' into richardpaulhudson/develop

f904d57

pietermarsman changed the base branch from develop to master March 19, 2022 16:42

pietermarsman dismissed their stale review via f904d57 March 19, 2022 19:46

Merge branch 'master' of https://github.com/pdfminer/pdfminer.six into develop

54c8a07

richardpaulhudson marked this pull request as draft July 13, 2022 11:36

Changes based on review comments

b5ef962

richardpaulhudson marked this pull request as ready for review July 26, 2022 15:44

Merge remote-tracking branch 'origin/master' into develop

08145cf

pietermarsman added 4 commits August 14, 2022 11:28

Remove whitespace changes to CHANGELOG.md

64da0a5

Remove duplicated html output

f667e67

Add link to hocr wiki

9407251

Add tests for extracting hocr and html

0995a4c

pietermarsman merged commit 77df431 into pdfminer:master Aug 14, 2022

bosd mentioned this pull request Aug 26, 2022

autodetect pdf type invoice-x/invoice2data#343

Open

bosd mentioned this pull request Sep 14, 2022

image to data invoice-x/invoice2data#393

Closed
