Speed up layout of text boxes #141

Merged
merged 5 commits into from
Nov 8, 2018

timb07
Contributor

@timb07 timb07 commented Apr 11, 2018

Processing some PDFs takes a lot of time in layout.py; in particular (for one PDF I've tried) in LTLayoutContainer.group_textboxes().

This PR has two changes:

  1. Remove utils.csort() and replace all uses with native list.sort()
  2. In group_textboxes(), make dists a sortedcontainers.SortedListWithKey() instead of a list, which removes the need to re-sort dists repeatedly
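
The idea behind the second change can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: it uses `bisect.insort` as a stand-in for `sortedcontainers.SortedListWithKey`, and the `(distance, id1, id2)` tuples are hypothetical placeholders for the entries in `dists`. The point is that each new entry is placed at its sorted position on insertion, so the whole list never needs to be re-sorted.

```python
import bisect

# Hypothetical stand-in for dists: (distance, id1, id2) tuples,
# kept in ascending order at all times.
dists = []
for entry in [(5.0, "a", "b"), (1.5, "c", "d"), (3.2, "a", "c")]:
    bisect.insort(dists, entry)  # insert in sorted position, no full re-sort

# The closest pair is always at the front.
closest = dists.pop(0)

# A newly formed group re-enters the list at its sorted position.
bisect.insort(dists, (2.0, "cd", "a"))
```

A plain list still pays for shifting elements on every insert and pop, so this only shows the invariant being maintained; a dedicated sorted container is designed to make those operations cheaper.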

Note that replacing csort() (which uses "Decorate-Sort-Undecorate" to ensure the sort is stable) with the native sort doesn't change the behaviour, because Python's built-in sort has been stable since Python 2.2.
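
A small sketch of why the two approaches agree. The `csort_like` helper below is a hypothetical Decorate-Sort-Undecorate function written for illustration; it is not the actual `utils.csort()` implementation, and the `boxes` data is made up. Items with equal keys keep their input order in both results.

```python
def csort_like(objs, key):
    # Decorate: pair each item with its key and its original index,
    # so that ties are broken by input order.
    decorated = [(key(obj), i, obj) for i, obj in enumerate(objs)]
    # Sort on (key, index) only; the items themselves are never compared.
    decorated.sort(key=lambda t: (t[0], t[1]))
    # Undecorate: drop the key and index again.
    return [obj for (_, _, obj) in decorated]


# Hypothetical data: several items share the same sort key.
boxes = [("b", 2), ("a", 1), ("c", 2), ("d", 1)]

by_dsu = csort_like(boxes, key=lambda box: box[1])
by_native = sorted(boxes, key=lambda box: box[1])
```

Both lists come out as `[("a", 1), ("d", 1), ("b", 2), ("c", 2)]`: within each key, the original order is preserved.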

For the test PDF I have (which motivated these changes), running pdf2txt.py now takes around 7 seconds instead of 17 seconds.

(I also tested these changes with the file in #128, but no speed up was seen for that input.)

@timb07 timb07 mentioned this pull request Jun 25, 2018
dists = [ (c,d,obj1,obj2) for (c,d,obj1,obj2) in dists
if (obj1 in plane and obj2 in plane) ]
removed = {obj1, obj2}
to_remove = [ (c,d,obj1,obj2) for (c,d,obj1,obj2) in dists
Member

Why are elements being removed from the dists list? Is it because accumulating the elements to be removed and then removing them is less expensive than rebuilding dists from the elements that satisfy the condition (obj1 in plane and obj2 in plane)?
Also, instead of using the list comprehension, you could loop through the elements in dists, check which of them satisfy the condition, and remove them in the same loop. Something like:

for (c, d, obj1, obj2) in list(dists):  # iterate over a copy so removals do not skip entries
    if obj1 in removed or obj2 in removed:
        dists.remove((c, d, obj1, obj2))
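
One caveat with removing inside the loop: mutating a Python list while iterating over that same list can skip entries, which is one reason to iterate over a copy or to collect the entries first and remove them afterwards. The sketch below compares the two on made-up data; the string objects and distance values are hypothetical, and the collect-then-remove step only mirrors the `to_remove` pattern visible in the diff, not its exact code.

```python
# Hypothetical stand-in data: (c, d, obj1, obj2) with strings as objects.
dists = [
    (0, 1.0, "a", "b"),
    (0, 2.0, "a", "c"),
    (0, 3.0, "b", "d"),
    (0, 4.0, "c", "d"),
]
removed = {"a", "b"}

# Naive: remove from the same list that is being iterated.
naive = list(dists)
for entry in naive:
    if entry[2] in removed or entry[3] in removed:
        naive.remove(entry)

# Two-step: collect matching entries first, then remove them.
safe = list(dists)
to_remove = [e for e in safe if e[2] in removed or e[3] in removed]
for e in to_remove:
    safe.remove(e)
```

Here `safe` ends up with only `(0, 4.0, "c", "d")`, while `naive` still contains `(0, 2.0, "a", "c")` because the iterator stepped past it after the first removal.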

@tataganesh tataganesh merged commit e03ecab into pdfminer:master Nov 8, 2018
@tataganesh
Member

For now, I am merging this. Might take a look at the requested changes later.
