pdf2txt: clean up construction of LAParams from arguments #682
Merged
Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```
Related: Issue pdfminer#477, PR pdfminer#479
* avoid specifying default values twice
* construct LAParams earlier, rather than passing its components around
* fix crash with --boxes_flow=disabled
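The traceback above boils down to a raw command-line string reaching arithmetic that expects a number. A minimal sketch of that failure and of one way to avoid it follows. `parse_boxes_flow` and `sort_weight` are hypothetical names used only for illustration, not functions from pdfminer, and mapping `"disabled"` to `None` is an assumption about how the option is meant to be interpreted.

```python
from typing import Optional


def parse_boxes_flow(raw: str) -> Optional[float]:
    """Hypothetical helper: turn the raw option string into a typed value.

    Assumption: "disabled" is represented as None; anything else must
    parse as a float.
    """
    if raw == "disabled":
        return None
    return float(raw)


def sort_weight(boxes_flow: float, x0: float) -> float:
    # Same shape as the expression shown in the last traceback frame.
    return (1 - boxes_flow) * x0


# Passing the unconverted string reproduces the TypeError from the traceback.
try:
    sort_weight("disabled", 10.0)
except TypeError as exc:
    print("unconverted string:", exc)

# Converting first yields either None (for the caller to handle) or a float.
print(parse_boxes_flow("disabled"))
print(sort_weight(parse_boxes_flow("0.5"), 10.0))
```

Converting at the argument-parsing boundary means the layout code only ever sees `None` or a float, never the string.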
pietermarsman approved these changes on Jan 25, 2022
Thanks so much! Again a very nice improvement!
Will do some small changes myself and then merge it.
…om parsed_args into init of LAParams. And move all parsed_args post processing to the parse_args() method.
I did some more changes:
LGTM, thanks.
Beants added a commit to HiTalentAlgorithms/pdfminer.six that referenced this pull request on Feb 14, 2022
* develop:
  Check blackness in github actions (pdfminer#711)
  Changed `log.info` to `log.debug` in six files (pdfminer#690)
  Update README.md batch for Continuous integration
  Update actions.yml so that it will run for all PR's
  Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
  Added feature: page labels (pdfminer#680)
  Remove obsolete returns (pdfminer#707)
  Revert "Remove obsolete returns"
  Remove obsolete returns
  Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
  Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
  Change log.info into log.debug to make pdfinterp.py less verbose
  Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
  export type annotations in package (pdfminer#679)
  fix typos in PR template (pdfminer#681)
  pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
  Fixes jbig2 writer to write valid jb2 files
  Add support for JPEG2000 image encoding
  Added test case for CCITTFaxDecoder (pdfminer#700)
  Attempt to handle decompression error on some broken PDF files (pdfminer#637)
This PR cleans up the way pdf2txt constructs the `LAParams` object it passes to `extract_text_to_fp`. Specifically:

`LAParams._validate` runs to validate the chosen parameters.

As a deliberate side-effect of point 2 above, this fixes the following crash:
Thus, this supersedes PR #657 (which was a minimal fix for just that crash).
Related: Issue #477, PR #479
Fix #540
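The approach described here (build the parameter object once, early, and let it validate itself) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from this PR: `LayoutParams` is a hypothetical stand-in for `LAParams`, `build_params` is a hypothetical helper, and the default values and the -1 to 1 range check are assumptions chosen for the example.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayoutParams:
    """Hypothetical stand-in for LAParams; defaults are stated only here."""

    char_margin: float = 2.0  # illustrative default (assumption)
    boxes_flow: Optional[float] = 0.5  # illustrative default (assumption)

    def __post_init__(self) -> None:
        # Validate at construction time so bad values fail before any page
        # is processed. The accepted range is an assumption for this sketch.
        if self.boxes_flow is not None and not -1.0 <= self.boxes_flow <= 1.0:
            raise ValueError("boxes_flow must be None or between -1 and 1")


def build_params(argv: List[str]) -> LayoutParams:
    parser = argparse.ArgumentParser()
    # default=None means "not given", so the parser never repeats a default.
    parser.add_argument("--char-margin", type=float, default=None)
    parser.add_argument("--boxes-flow", type=str, default=None)
    args = parser.parse_args(argv)

    # Forward only the options the user actually supplied.
    overrides = {}
    if args.char_margin is not None:
        overrides["char_margin"] = args.char_margin
    if args.boxes_flow is not None:
        overrides["boxes_flow"] = (
            None if args.boxes_flow == "disabled" else float(args.boxes_flow)
        )
    return LayoutParams(**overrides)


print(build_params(["--boxes-flow=disabled"]))
```

Because the parser only records what the user typed, each default has a single home, and every later function can accept one already-validated object instead of a set of loose keyword arguments.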
How Has This Been Tested?
`pdf2txt --boxes_flow=disabled`, and verified that (a) there was no crash, and (b) the alternate layout algorithm was used in `LTLayoutContainer.analyze`.

Checklist
works
version
I have updated the README.md or I have verified that this is not necessary
I have updated the readthedocs documentation or I verified that this is not necessary
CHANGELOG.md