Fix grouping textlines when bounding box of parent container is wrong #386

Merged
merged 4 commits into from
Mar 14, 2020


pietermarsman
Member

@pietermarsman pietermarsman commented Mar 10, 2020

Pull request

Currently, LTLayoutContainer.group_textlines() searches for neighbouring text lines within the bounding box of the parent container. If that bounding box is wrong and some text lines fall outside it, no neighbours are found for those lines. group_textlines() then skips every text line that has no neighbours, so the content of these lines never appears in the output.

While the PDF is to blame in these cases, we can be more resilient to erroneous PDFs. A text line should not be dropped when no neighbouring lines are found. Instead, the line should be returned in its own LTTextBox.
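
To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not the pdfminer implementation: `Line`, `inside`, `find_neighbors`, `group_lines` and the `keep_isolated` flag are hypothetical names made up for this example, and the grouping is a simplified single pass. It only shows how a neighbour search restricted to a wrong container box loses a line, and how returning that line in its own box keeps its text.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Line:
    """Hypothetical stand-in for a text line; not a pdfminer class."""
    text: str
    bbox: BBox


def inside(inner: BBox, outer: BBox) -> bool:
    """Return True if `inner` lies completely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def find_neighbors(line: Line, lines: List[Line],
                   container: BBox, margin: float) -> List[Line]:
    """Toy neighbour search that only sees lines inside the container box."""
    found = []
    for other in lines:
        if not inside(other.bbox, container):
            continue  # invisible to the search, even if it is `line` itself
        # Vertical gap between the two boxes (negative when they overlap).
        gap = max(other.bbox[1] - line.bbox[3], line.bbox[1] - other.bbox[3])
        if gap <= margin:
            found.append(other)
    return found


def group_lines(lines: List[Line], container: BBox, margin: float,
                keep_isolated: bool = True) -> List[List[Line]]:
    """Group lines into boxes; optionally keep lines without neighbours."""
    boxes: List[List[Line]] = []
    assigned = set()
    for line in lines:
        if line in assigned:
            continue
        neighbors = find_neighbors(line, lines, container, margin)
        if not neighbors and not keep_isolated:
            continue  # behaviour described as the bug: the line is dropped
        members = [line] + [n for n in neighbors
                            if n != line and n not in assigned]
        assigned.update(members)
        boxes.append(members)
    return boxes


container = (0.0, 0.0, 100.0, 100.0)      # assumed (wrong) parent bounding box
lines = [
    Line("a", (0.0, 90.0, 50.0, 100.0)),
    Line("b", (0.0, 78.0, 50.0, 88.0)),
    Line("c", (0.0, 200.0, 50.0, 210.0)),  # lies outside the container box
]

dropped = group_lines(lines, container, margin=5.0, keep_isolated=False)
kept = group_lines(lines, container, margin=5.0, keep_isolated=True)
print([[l.text for l in box] for box in dropped])  # [['a', 'b']]
print([[l.text for l in box] for box in kept])     # [['a', 'b'], ['c']]
```

With `keep_isolated=False` the line `c` disappears from the result, which mirrors the missing figure text reported in #381. With `keep_isolated=True` it comes back as a box of its own.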

Fixes #381

How Has This Been Tested?

Tested with the erroneous PDF from #381. Added a unit test for this edge case.

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I have verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a concise human-readable description of the change to
    CHANGELOG.md

@pietermarsman pietermarsman added type: bug component: converter Related to any PDFLayoutAnalyzer labels Mar 10, 2020

@jvalls-axa jvalls-axa left a comment

This PR makes all text in figures appear again, even when the bounding box of the parent container is wrong.

Nice catch!

@pietermarsman pietermarsman merged commit 1d773dc into develop Mar 14, 2020
@pietermarsman pietermarsman deleted the fix-grouping-textlines branch February 2, 2022 21:33
Development

Successfully merging this pull request may close these issues.

pdf2txt.py - Missed figure texts
2 participants