
Layout speedup rebased #315

Merged 2 commits on Oct 31, 2019
Conversation

mikkkee (Contributor) commented Oct 23, 2019

See #302. Plus a small speedup from using a set of removed objects in the loops.

Checklist

  • There are (added) tests that prove my fix is effective or that my feature works
  • I have updated the README.md and other documentation, or I am sure that this is not necessary
  • I have added a concise human-readable description of the change to CHANGELOG.md
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial version

pietermarsman (Member) commented Oct 23, 2019

Thanks for rebasing!

I have a couple of questions I want answered before merging this:

  • Is this implementation equivalent to the earlier implementation? (I'm going to review it.)
  • Are there other ways we can speed up the code? For example, using a heap data structure instead of lists.
  • How can we monitor the performance of pdfminer.six in general? I hope there exists a tool that monitors the execution time of (parts of) the test suite. That way we can see whether commits increase or decrease the execution time.

pietermarsman (Member):

I can confirm that it is faster. I tried a PDF with 48 pages and it goes from 69 seconds to 36.

pietermarsman (Member):

I have my doubts about removing the order from the list. I think the order is important because it determines which page elements are merged first. In the original implementation, two elements are merged if they have the lowest distance of all possible combinations of elements. In your implementation, arbitrary elements are merged (an element being either an original PDF element or a group of already merged elements). I'll try to create an example.

pietermarsman (Member) commented Oct 24, 2019

Found an example by including random.shuffle(boxes) at the start of .group_textboxes(). This changes the order of the textboxes randomly, mimicking the arbitrary order in which a PDF can provide the elements of a page. And since .group_objects() and .group_textlines() do not enforce any order, this arbitrary order of PDF elements is carried on to .group_textboxes().
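
For reference, the probe looks roughly like this (a sketch; the exact signature of group_textboxes() in layout.py may differ):

import random

def group_textboxes(self, laparams, boxes):
    # Shuffle to mimic the arbitrary element order a PDF may use.
    random.shuffle(boxes)
    ...  # original grouping logic continues unchanged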

Output PR without shuffle

Hello

World

Hello

World

H e l

l o

W o r

l d

H e l

l o

W o r

l d

Example run with shuffle

Hello

Hello

H e l

World

World

l o

W o r

H e l

l o

W o r

l d

l d

I think that merging boxes with low distance first yields a more intuitive text order than using arbitrary order. Thus, I am against merging this PR in its current state. That being said, I think this code should still be optimised.

Alternative speedup:

  • Use heap queues. With heaps it is easy to get the minimum element of a set, which is precisely what we want here (see the sketch after this list).
  • Drop the inline isany() function. I'm not sure what it is for. It checks if there are objects between two other objects. But if objects with the lowest distance are merged first, this is never the case (I think).
  • Drop plane. It is only used (as intended) in isany. In the other places it acts just like a list or set. Replacing it with a set might speed things up even more.
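
A minimal sketch of the heap idea (hypothetical names; dist() stands in for the existing distance function):

import heapq
import itertools

# One entry per pair: (distance, id1, id2, obj1, obj2). The ids break
# ties, because layout objects do not define __lt__.
dists = [(dist(a, b), id(a), id(b), a, b)
         for a, b in itertools.combinations(boxes, 2)]
heapq.heapify(dists)

# heappop always yields the pair with the smallest distance, in O(log n).
d, _, _, obj1, obj2 = heapq.heappop(dists)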

@mikkkee do you want to work on this?

mikkkee (Contributor, Author) commented Oct 25, 2019

Thanks @pietermarsman, nice find! Apologies, I didn't dig much into the sorted-list part. Let me try to get this done over the weekend.

mikkkee (Contributor, Author) commented Oct 25, 2019

@pietermarsman On how to monitor performance: what do you think of adding CI checks on test time using, e.g., nose-timer?

pietermarsman (Member):

Yes, I was thinking about that too. That would guarantee it does not get much worse. But with Travis CI I expect the run times to vary based on how many jobs are running and on which hardware they run. And having tests that fail occasionally because of the environment is just annoying.

mikkkee (Contributor, Author) commented Oct 28, 2019

Hi @pietermarsman, thanks for the advice. I've used heapq to ensure we are merging the closest objects first in the latest commit. The speedup is about 2x, from 81 s to 42 s.

I didn't change the plane part because a profile shows that the improvement from removing isany and making plane a set is tiny (<1 s of the total 42 s in my case).

For performance monitoring, I think the first step can be to simply add timing to the tests to observe the stability of the CI environment. No need to set any failure criteria on test running time for now.

pietermarsman (Member) left a review:

Your changes look good! But I hope you have some time to polish this function even more, because I think it is not very intuitive.

Could you try improving the docstrings of the methods, the variable names, and the overall flow of the method? I've made some suggestions about this in the review.

I also added a checklist to the description of the PR. This is the default todo list for all PRs.

I created a new issue for improving the distance function (#322) and possibly removing the isany method so you don't have to worry about that.

Comment on lines 661 to 666
        removed = set([id1, id2])
        dists = [ele for ele in dists
                 if (ele[2] not in removed) and (ele[3] not in removed)]
        for other in plane:
-           dists.add((0, dist(group, other), group, other))
+           dists.append((0, dist(group, other), id(group), id(other), group, other))
        heapq.heapify(dists)
pietermarsman (Member):

I think this can be even faster by keeping a set of object ids that are already done. Then dists does not need to be rebuilt every time (saving sorting time), new dists can be pushed onto the heap immediately, and after popping a dist off the heap you can check whether the object is already done (fast, using a set).

This code is then simplified to:

heapq.heappush(dists, (0, dist(group, other), id(group), id(other), group, other))
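
In other words, the lazy-deletion pattern. A sketch with placeholder names (merge(), dist(), plane, and dists stand in for the PR's actual objects):

import heapq

done = set()  # ids of objects that have already been merged away
while dists:
    _, d, id1, id2, obj1, obj2 = heapq.heappop(dists)
    if id1 in done or id2 in done:
        continue  # stale entry; skip it instead of rebuilding the heap
    group = merge(obj1, obj2)
    done.update((id1, id2))
    plane.remove(obj1)
    plane.remove(obj2)
    for other in plane:
        heapq.heappush(dists, (0, dist(group, other), id(group), id(other), group, other))
    plane.add(group)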

mikkkee (Contributor, Author):

Nice, it seems it only takes 5 s (vs 81 s) now.

pietermarsman (Member):

Wow, that's really fast!

Comment on lines 642 to 643
# Appending id before obj for tuple comparison as __lt__ is disabled for obj
# by default.
pietermarsman (Member):

Move comments to docstring.

        for other in plane:
-           dists.add((0, dist(group, other), group, other))
+           dists.append((0, dist(group, other), id(group), id(other), group, other))
        heapq.heapify(dists)
        plane.add(group)
        assert len(plane) == 1, str(len(plane))
pietermarsman (Member):

Could you remove this assertion and the one at the start of the function? Having these inline asserts is annoying because they are not very helpful if something goes wrong.

        # Appending id before obj for tuple comparison as __lt__ is disabled
        # for obj by default.
        dists.append((0, dist(obj1, obj2), id(obj1), id(obj2), obj1, obj2))
heapq.heapify(dists)
plane = Plane(self.bbox)
plane.extend(boxes)
while dists:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace with while len(dists) > 0.

pietermarsman (Member):

> For performance monitoring, I think the first step can be to simply add timing to the tests to observe the stability of the CI environment. No need to set any failure criteria on test running time for now.

Indeed, this is a good start. It would make it easier to check if a PR changes the execution time. I've created issue #323 for this.

mikkkee (Contributor, Author) commented Oct 30, 2019

Hi @pietermarsman, please check the diff for the latest commits, which reflects the items in the checklist.

I have updated my forked branch from the main repo's develop branch, so many commits got included here. Let me know if you want me to send another PR. Thanks.

pietermarsman (Member) commented Oct 30, 2019

> Hi @pietermarsman, please check the diff for the latest commits, which reflects the items in the checklist.

Thanks a lot. This looks good. I will take some time tonight to check it properly.

> I have updated my forked branch from the main repo's develop branch, so many commits got included here. Let me know if you want me to send another PR. Thanks.

I think the best way is to use an interactive rebase. This allows you to pick the commits of the upstream develop branch first (in order), and then your own commits. Then you force-push that to your own develop branch. Are you able to do that?
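
Roughly like this (a sketch; the remote and branch names are assumptions about your setup):

git fetch upstream
git rebase -i upstream/develop        # upstream commits first, then your own
git push --force-with-lease origin develop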

Speedup by:
1/ Using a heap instead of a SortedList and avoiding rebuilding the heap
   in each iteration.
2/ Avoiding a potentially huge number of variable assignments in the list
   comprehension.
3/ Avoiding repeatedly evaluating `obj is obj` in the list comprehension
   by storing id(obj).

mikkkee (Contributor, Author) commented Oct 30, 2019

> I think the best way is to use an interactive rebase. This allows you to pick the commits of the upstream develop branch first (in order), and then your own commits. Then you force-push that to your own develop branch. Are you able to do that?

Yup, done. Thanks!

pietermarsman (Member):

I've tested it with the PDF reference (1000 pages). The execution time goes from 2m30s to 2m15s. Not a big difference.

The output is almost exactly the same! The only difference is a couple of parentheses in equations that are switched.

pietermarsman (Member):

The sample/nonfree/i1040nr.pdf file goes from 40s to 31s.

What file are you using to test it? Is there anything special about it? Going from 80s to 5s is a much bigger difference than I am measuring.

mikkkee (Contributor, Author) commented Oct 30, 2019

The file I used is 01.pdf. It is a 31x21 table. The script I used for testing is:

import time

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams

start = time.time()
with open('01.pdf', 'rb') as f:
    parser = PDFParser(f)
    document = PDFDocument(parser)
    laparams = LAParams(
        char_margin=1.0,
        line_margin=0.5,
        word_margin=0.1,
        detect_vertical=True,
        all_texts=True,
    )
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Layout analysis runs inside process_page(), so this loop exercises
    # group_textboxes() on every page.
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
end = time.time()
print("Time elapsed: {}s".format(end - start))

mikkkee (Contributor, Author) commented Oct 31, 2019

> Is there anything special about it?

Not sure what exactly makes the file so slow, but in my test profile, the while loop was executed 952,764 times.
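
For reference, numbers like that can be gathered with the standard-library profiler (a generic sketch, not the exact command used here; extract_layout() is a hypothetical wrapper around the script above, and the call counts of the functions inside the loop approximate the iteration count):

import cProfile
import pstats

cProfile.run('extract_layout()', 'layout.prof')
stats = pstats.Stats('layout.prof')
stats.sort_stats('cumulative').print_stats(10)  # top 10 by cumulative time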

pietermarsman (Member):

Wow, on my laptop it goes from 200s to 10s 👍
