Processing some PDFs takes a lot of time in `layout.py`; in particular (for one PDF I've tried) in `LTLayoutContainer.group_textboxes()`.

This PR has two changes:
- Remove `utils.csort()` and replace all uses with the native `list.sort()`.
- In `group_textboxes()`, make `dists` a `sortedcontainers.SortedListWithKey()` instead of a list, which removes the need to re-sort `dists` repeatedly.

Note that replacing `csort()` (which uses "Decorate-Sort-Undecorate" to ensure the sort is stable) with the native sort doesn't change the behaviour, since Python's sort has been stable since Python 2.2.

For the test PDF I have (which motivated these changes), running `pdf2txt.py` now takes around 7 seconds instead of 17 seconds.

(I also tested these changes with the file in #128, but no speed-up was seen for that input.)
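
For reviewers, here is a minimal sketch of why swapping a Decorate-Sort-Undecorate helper for the native sort should be behaviour-preserving. `dsu_sort` is a hypothetical stand-in written for illustration, not the actual `utils.csort()` source, and the sample data is made up:

```python
def dsu_sort(objs, key):
    """Hypothetical Decorate-Sort-Undecorate helper (illustration only).

    Each item is decorated with (key, original index) so that ties on the
    key are broken by input order, which makes the sort stable by construction.
    """
    decorated = [(key(obj), i, obj) for i, obj in enumerate(objs)]
    # Indices are unique, so the comparison never reaches obj itself.
    decorated.sort()
    return [obj for _, _, obj in decorated]


# Made-up items with duplicate keys: (name, sort key)
boxes = [("a", 2), ("b", 1), ("c", 2), ("d", 1)]

by_dsu = dsu_sort(boxes, key=lambda b: b[1])

# The native sort is stable, so items with equal keys keep their input order.
by_native = sorted(boxes, key=lambda b: b[1])

print(by_dsu == by_native)
```

Both approaches order ties by their position in the input, so the results should match for any key function whose values are comparable.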