Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed #616

Merged
merged 21 commits into from
Sep 27, 2021

Conversation

htInEdin
Copy link

@htInEdin htInEdin commented May 5, 2021

Pull request

I have only just caught up with pdfminer.six, was using pdfminer.20191103

Two bug fixes:

pdfminer.20191103 allowed me to use a TextIOWrapper as the output stream for a TextConverter, but this fails with pdfminer.six because _is_binary_stream fails to recognise TextIOWrapper as non-binary.

All that's needed is to test for instances of io.TextIOBase

Fixes #615.

Also, fixed a bug in psparser which failed to remove all of an escaped \r\n.

Fixes #624.

How Has This Been Tested?

I have a link extractor (patched version of pdfx) which ran on a 200 page pdf, finding 1300+ links, using pdfminer.20191103, crashed with pdfminer.20201018, runs again with the same output as before when patched with these changes. Running on a sample of 900+ pdf files from Common Crawl of August 2019, it finds more and cleaner links that when using pdfminer.20191103.

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial version
  • I have updated the README.md or I am verified that this is not necessary
  • I have updated the readthedocs documentation or verified that this is not necessary
  • Commit includes a ChangeLog, I don't understand the new format, sorry, but have supplied an old-format version

@htInEdin htInEdin changed the title Hst Bug: _is_binary_stream should recognize TextIOWrapper as non-binary May 6, 2021
@htInEdin htInEdin changed the title Bug: _is_binary_stream should recognize TextIOWrapper as non-binary Bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed May 27, 2021
@pietermarsman
Copy link
Member

@htInEdin I did some changes to make it more readable. Could you double-check the psparser.py logic for literal strings? I find that these kinds of things are very easy to get wrong.

@htInEdin
Copy link
Author

htInEdin commented Aug 31, 2021 via email

@pietermarsman
Copy link
Member

@htInEdin bump ;)

(Just a friendly reminder, no hurry)

@htInEdin
Copy link
Author

I've reviewed your work above, thanks, and _parse_string(_1) in particular, and I think it's good to go.

@htInEdin
Copy link
Author

htInEdin commented Sep 22, 2021 via email

…lowing

 the Python official guidance that warning.warn is directed at _developers_,
 not users

 * (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning,
			PDFNoValidXRefWarning

 * (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning

 * (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning

 * (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather
				  than PDFNoValidXRefWarning
@htInEdin
Copy link
Author

Bother, I still haven't quite gotten the hang of branch management. The above 4 commits were meant to be for a new 'preferLoggingToWarning' pull request, now merged into this one :-(.
d6fc079 is the correct commit to pull from for this request. Let me know if I need to do some surgery...

@pietermarsman pietermarsman merged commit 33d7dde into pdfminer:develop Sep 27, 2021
@pietermarsman
Copy link
Member

@htInEdin I think I fixed it by reverting them.

Merged it now. Thanks for the work!

@htInEdin htInEdin deleted the hst branch September 28, 2021 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants