Group text lines if they are centered (#382) #384

Merged: 6 commits into pdfminer:develop, Mar 23, 2020

@jstockwin jstockwin commented Mar 4, 2020

Description

When deciding whether two text lines should be merged into the same text box, also check whether they are centered above each other.
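
As a rough illustration of the idea (a minimal sketch, not the code in this PR): two lines can be treated as centered above each other when the midpoints of their horizontal extents agree within some tolerance. The function name `is_centrally_aligned` and the `tolerance` default are assumptions made for this example.

```python
def is_centrally_aligned(a_x0, a_x1, b_x0, b_x1, tolerance=0.5):
    """Return True if two horizontal extents share (roughly) the same midpoint.

    Hypothetical helper for illustration only; the name and the default
    tolerance are assumptions, not pdfminer API.
    """
    a_mid = (a_x0 + a_x1) / 2
    b_mid = (b_x0 + b_x1) / 2
    return abs(a_mid - b_mid) <= tolerance


# A short line centered over a longer one: different edges, same midpoint.
print(is_centrally_aligned(40.0, 60.0, 20.0, 80.0))
# Two left-aligned lines of different length do not share a midpoint.
print(is_centrally_aligned(20.0, 60.0, 20.0, 80.0))
```

A pure left-edge or right-edge comparison would reject the first pair, which is why centered text needs its own check.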

Fixes #382

Note: it may also be worth reading #383, as there may be a discussion to be had there about how these checks should deal with a negative line margin.

How Has This Been Tested?

Example PDF

Here is my test code (which simply loads the PDF and prints its text elements):

from pdfminer import converter, pdfdocument, pdfinterp, pdfpage, pdfparser
from pdfminer.layout import LAParams, LTTextContainer

path_to_file = "centered_text_example.pdf"

# Collect text containers from every page, not just the last one.
elements = []
with open(path_to_file, "rb") as pdf_file:
    parser = pdfparser.PDFParser(pdf_file)
    document = pdfdocument.PDFDocument(parser)
    resource_manager = pdfinterp.PDFResourceManager()
    # Default LAParams enable layout analysis, which groups lines into boxes.
    device = converter.PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = pdfinterp.PDFPageInterpreter(resource_manager, device)
    for page in pdfpage.PDFPage.create_pages(document):
        interpreter.process_page(page)
        results = device.get_result()
        elements.extend(
            element for element in results if isinstance(element, LTTextContainer)
        )
    device.close()

print(elements)

Before this change, you get:

[
    <LTTextBoxHorizontal(0) 123.800,344.802,216.392,367.878 'Long Text 1\n'>, 
    <LTTextBoxHorizontal(1) 146.100,366.002,194.070,411.778 'Text 1\nText 2\n'>, 
    <LTTextBoxHorizontal(2) 146.100,302.502,194.070,348.278 'Text 3\nText 4\n'>, 
    <LTTextBoxHorizontal(3) 39.700,24.176,43.200,42.152 ' \n'>, 
    <LTTextBoxHorizontal(4) 395.500,24.176,399.000,42.152 ' \n'>
]

After this change you get:

[
    <LTTextBoxHorizontal(0) 123.800,302.502,216.392,411.778 'Text 1\nText 2\nLong Text 1\nText 3\nText 4\n'>,
    <LTTextBoxHorizontal(1) 39.700,24.176,43.200,42.152 ' \n'>,
    <LTTextBoxHorizontal(2) 395.500,24.176,399.000,42.152 ' \n'>
]

(The last two elements in each list are just artefacts of the example PDF; ignore them.)
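
The numbers above can be checked by hand. The sketch below copies the three bounding boxes from the "before" output and shows that their left and right edges differ while their horizontal midpoints nearly coincide, and that the union of the three boxes matches the single merged box in the "after" output. The helper names `x_mid` and `union` are made up for this example.

```python
# Bounding boxes (x0, y0, x1, y1) copied from the "before" output.
long_text = (123.800, 344.802, 216.392, 367.878)  # 'Long Text 1'
text_1_2 = (146.100, 366.002, 194.070, 411.778)   # 'Text 1', 'Text 2'
text_3_4 = (146.100, 302.502, 194.070, 348.278)   # 'Text 3', 'Text 4'


def x_mid(bbox):
    """Horizontal midpoint of a bounding box (illustrative helper)."""
    x0, _, x1, _ = bbox
    return (x0 + x1) / 2


def union(*bboxes):
    """Smallest bounding box that contains all the given boxes."""
    return (
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    )


# The edges differ by roughly 22 points on each side ...
print(text_1_2[0] - long_text[0], long_text[2] - text_1_2[2])
# ... while the midpoints differ by only about a hundredth of a point.
print(abs(x_mid(long_text) - x_mid(text_1_2)))
# The union of the three boxes is the merged box from the "after" output.
print(union(long_text, text_1_2, text_3_4))
```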

Checklist

I will do these things once I get a bit of guidance on whether this is an acceptable change. I'll also need a pointer on how to add a relevant test.

  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the README.md and other documentation, or I am sure that this is not necessary
  • I have added a concise human-readable description of the change to CHANGELOG.md
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial version

@pietermarsman added the labels "component: converter" and "type: new feature" on Mar 14, 2020
@pietermarsman

> I will do these things once I get a bit of guidance on whether this is an acceptable change.

It is.

> I'll also need a pointer on how to add a relevant test.

Look at tests/test_layout.py for inspiration.

@pietermarsman pietermarsman left a comment

Looks good!

pdfminer/layout.py (outdated, resolved)
@jstockwin

@pietermarsman I've added 4 new commits, addressing your comments and then adding tests.

I couldn't find any existing tests for the function I've changed, but I think the test I added covers it fairly well (even if it takes a while of working through the bounding boxes to see what it's doing).

Before my change, this test would fail because the centrally_aligned_overlapping text boxes would not be included.

Please could you take a look?

@pietermarsman pietermarsman left a comment

I love the new *_aligned_with methods. They make find_neighbours() so much easier to read. Also the tests are really good :)

I've left 3 small cosmetic suggestions.
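
To illustrate why small alignment predicates help readability, here is a minimal sketch of how such methods could compose inside a neighbour search. It is not the code merged in this PR: the `Line` class is a stand-in, the method names are only modeled on the `*_aligned_with` pattern mentioned above, and the sketch ignores the vertical-distance and line-height checks a real layout analyzer would also need.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Stand-in for a horizontal text line; not pdfminer's class."""

    x0: float
    y0: float
    x1: float
    y1: float

    def is_left_aligned_with(self, other, tolerance=0.0):
        return abs(other.x0 - self.x0) <= tolerance

    def is_right_aligned_with(self, other, tolerance=0.0):
        return abs(other.x1 - self.x1) <= tolerance

    def is_centrally_aligned_with(self, other, tolerance=0.0):
        self_mid = (self.x0 + self.x1) / 2
        other_mid = (other.x0 + other.x1) / 2
        return abs(other_mid - self_mid) <= tolerance


def find_neighbours(line, candidates, tolerance):
    """Return candidates aligned with `line` on the left, right, or center."""
    return [
        other
        for other in candidates
        if other is not line
        and (
            line.is_left_aligned_with(other, tolerance)
            or line.is_right_aligned_with(other, tolerance)
            or line.is_centrally_aligned_with(other, tolerance)
        )
    ]


title = Line(20.0, 100.0, 80.0, 110.0)
centered = Line(40.0, 88.0, 60.0, 98.0)      # same midpoint as `title`
unrelated = Line(150.0, 88.0, 200.0, 98.0)   # no shared edge or midpoint
print(find_neighbours(title, [title, centered, unrelated], tolerance=0.5))
```

With each condition named, the filter reads as a plain list of alignment rules, and a new rule (such as central alignment) becomes a one-line addition.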

CHANGELOG.md (outdated, resolved)
pdfminer/layout.py (outdated, resolved)
pdfminer/layout.py (outdated, resolved)
@jstockwin

@pietermarsman PR updated with your comments

@pietermarsman pietermarsman left a comment

Looks good!

@pietermarsman pietermarsman merged commit 1cc1b96 into pdfminer:develop Mar 23, 2020
@jstockwin jstockwin deleted the group-centered-text branch March 26, 2020 09:40
Development

Successfully merging this pull request may close these issues.

Centered text on different lines don't get grouped into text boxes