Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HOCRConverter (fixes #650) #651

Merged
merged 15 commits into from
Aug 14, 2022

Conversation

richardpaulhudson
Copy link

@richardpaulhudson richardpaulhudson commented Jul 29, 2021

Pull request

Fix #650
Fix #265

Where text is being extracted from a variety of types of PDF within a business process, those PDFs where the text is only present in image form will need to be analysed using an OCR tool which will typically output hOCR. This converter extracts the explicit text information from those PDFs that do have it and uses it to genxerate a basic hOCR representation that is designed to be used in conjunction with the image of the PDF in the same way as genuine OCR output would be, but without the inevitable OCR errors.

How Has This Been Tested?

layout = LAParams(all_texts=True)
extract_text_to_fp(in_file, out_file, output_type='hocr', laparams=layout)

tox also runs with Python 3.8 and 3.9.

Checklist

  • [ x] I have added tests that prove my fix is effective or that my feature
    works
  • [x ] I have added docstrings to newly created methods and classes
  • [x ] I have optimized the code at least one time after creating the initial
    version
  • [x ] I have updated the README.md or I am verified that this
    is not necessary
  • [x ] I have updated the readthedocs documentation or I
    verified that this is not necessary
  • [x ] I have added a consice human-readable description of the change to
    CHANGELOG.md

@willaaam
Copy link

willaaam commented Dec 2, 2021

Would be amazing if this could be merged and included!

@pietermarsman
Copy link
Member

Looks good to me.

I only wonder if this is something that should be added to pdfminer.six as core functionality. Alternatively, this could be something that everyone implements to their own liking. The composable api is perfectly suitable for adding functionality like this.

I'll post this question on the gitter.

@pietermarsman
Copy link
Member

After some delibration I'm positive on adding hocr as an output format. It has two advantages: direct comparison of the output to ocr tools and usage of other tools (e.g. visualization) built for hocr.

I'll do a more detailed review now.

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the super nice PR!

Can you add tests showing this works. Ideally you would use the simple1.pdf for this.

This PR is already very good, but I like to use each change as an opportunity to improve pdfminer.six a bit. So I added some comments on how to improve this PR.

pdfminer/converter.py Outdated Show resolved Hide resolved
pdfminer/converter.py Outdated Show resolved Hide resolved
pdfminer/converter.py Outdated Show resolved Hide resolved
pdfminer/converter.py Outdated Show resolved Hide resolved
pdfminer/converter.py Outdated Show resolved Hide resolved
pdfminer/converter.py Show resolved Hide resolved
@pietermarsman pietermarsman self-requested a review February 2, 2022 21:39
pietermarsman
pietermarsman previously approved these changes Feb 2, 2022
@pietermarsman pietermarsman self-requested a review February 2, 2022 21:45
@pietermarsman
Copy link
Member

@richardpaulhudson I used this PR a bit for testing if the new CI pipeline is functioning properly. Now it is :)

@pietermarsman pietermarsman removed their request for review February 11, 2022 21:48
@pietermarsman
Copy link
Member

@richardpaulhudson any plans on working on this in the future?

@richardpaulhudson
Copy link
Author

Hi @pietermarsman, thank you for the review and sorry for not responding sooner — I've changed employers in the meantime and there seem to be issues with where my GitHub notification mails are ending up. I hope to be able to pick up working on this in the next couple of months.

@pietermarsman pietermarsman changed the base branch from develop to master March 19, 2022 16:42
@pietermarsman
Copy link
Member

FYI, I've changed this MR to merge into master. The develop branch will be removed, because soon we will work with version tags to indicate the releases and the distinction between develop and master becomes obsolete.

@pietermarsman
Copy link
Member

bump ;)

@richardpaulhudson richardpaulhudson marked this pull request as draft July 13, 2022 11:36
@richardpaulhudson
Copy link
Author

Sorry it's taken me so long to get back to this :-)

Can you add tests showing this works. Ideally you would use the simple1.pdf for this.

I can certainly see the need for some sort of regression test, but am unsure how to approach it. What I actually did myself was:

  • checked the hOCR output passed hocr-check (from the hocr-tools package)
  • commented in hocrjs and checked the rendering of the content in the browser corresponded to the original PDF file

neither of which lend themselves easily to a regression test.

The options are:

  • a regression test that just checks the conversion is carried out successfully without an error
  • a regression test that checks the output of the conversion is equal to the output of my conversion which I have verified with the two steps above. Issues with this are:
    • there may be problems with the output that I'm not aware of because they weren't picked up by these two steps, but such a test would declare the output to be correct
    • tests comparing large amounts of output at once tend to be brittle
  • a regression test that checks the output of the conversion for specific features, although I'm unsure what these would be

@richardpaulhudson richardpaulhudson marked this pull request as ready for review July 26, 2022 15:44
@pietermarsman
Copy link
Member

I prefer option 1 (just checking if the code does not raise an error) or 2 (check for specific output). If you go for two, we do indeed need to have some output that we know is reasonably stable.

Having a test with output (option 2) is also a start of some documentation, as other developers can easily see what the expected output is of the tool

@pietermarsman pietermarsman merged commit 77df431 into pdfminer:master Aug 14, 2022
@pietermarsman
Copy link
Member

@richardpaulhudson Thanks for the all your work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New converter for the hOCR format Add hOCR output type for pdf2txt
3 participants