Type annotations #661
@pietermarsman, @jstockwin -- this got far out of hand from what I originally intended in scope (I've now annotated nearly the entire codebase), but feels like it is ready for review and discussion. There's plenty of scope for changes, but I'm hoping the value is obvious just from the number of issues it has already uncovered (see comments in the PR).
Impressive work! 👍 This should help a lot to keep pdfminer.six maintainable and bug free (well... less bugs).
I only had time for a shallow review. I want to look at it thoroughly before merging.
tools/dumppdf.py
# Likely bug: writing bytes to text I/O.
out.write(obj.get_rawdata())  # type: ignore [arg-type]
Can you add a test case that shows this crashes? If it does, we can change this code such that it doesn't.
Ok. Note to self:
$ tools/dumppdf.py -o out -a --raw-stream samples/simple1.pdf
Traceback (most recent call last):
  File "tools/dumppdf.py", line 388, in <module>
    main()
  File "tools/dumppdf.py", line 378, in main
    dumppdf(
  File "tools/dumppdf.py", line 259, in dumppdf
    dumpallobjs(outfp, doc, codec, show_fallback_xref)
  File "tools/dumppdf.py", line 133, in dumpallobjs
    dumpxml(out, obj, codec=codec)
  File "tools/dumppdf.py", line 65, in dumpxml
    out.write(obj.get_rawdata())  # type: ignore [arg-type]
TypeError: write() argument must be str, not bytes
$ tools/dumppdf.py -o out -a --binary-stream samples/simple1.pdf
Traceback (most recent call last):
  File "tools/dumppdf.py", line 388, in <module>
    main()
  File "tools/dumppdf.py", line 378, in main
    dumppdf(
  File "tools/dumppdf.py", line 259, in dumppdf
    dumpallobjs(outfp, doc, codec, show_fallback_xref)
  File "tools/dumppdf.py", line 133, in dumpallobjs
    dumpxml(out, obj, codec=codec)
  File "tools/dumppdf.py", line 68, in dumpxml
    out.write(obj.get_data())  # type: ignore [arg-type]
TypeError: write() argument must be str, not bytes
I've added a test for this (which fails), but it's hard to tell what the expected output actually is here -- both paths were trying to write raw binary data into the middle of an XML file. We could reach around the file and write raw bytes there, but would that be useful? We could choose a codec like base64, but would that be useful? I'm tempted to mark this as a known issue and leave it for someone else to decide how to fix it :)
Yep, indeed. Let's not make this bigger than it already is.
All the comments and type ignores are a big improvement, explicitly showing where we can improve.
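For readers following the thread, here is a minimal sketch of the failure discussed above and of the base64 option that was floated. It is not the code in tools/dumppdf.py: `raw` stands in for the result of `obj.get_rawdata()`, `io.StringIO` stands in for the text-mode output file, and the `<data encoding="base64">` element is a hypothetical output format chosen only for illustration.

```python
import base64
import io

# Stand-in for obj.get_rawdata(): arbitrary binary stream contents.
raw = b"\x00\xffbinary stream data"

# Stand-in for the text-mode output file used by the XML dumper.
text_out = io.StringIO()

# Text streams accept only str, so writing bytes raises TypeError.
error_message = None
try:
    text_out.write(raw)  # type: ignore[arg-type]
except TypeError as exc:
    error_message = str(exc)

# One option from the discussion: encode the bytes as base64 text first.
# The element name and attribute here are hypothetical.
encoded = base64.b64encode(raw).decode("ascii")
text_out.write(f'<data encoding="base64">{encoded}</data>')
xml_fragment = text_out.getvalue()
```

Whether base64 output would actually be useful to consumers of the XML dump is exactly the open question raised in the comment, so this only shows that the option is mechanically simple.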
Squashed commit of the following:

commit fa229f7
Merge: eaab3c6 c3e3499
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:33:06 2021 -0700

    Merge branch 'develop' into mypy (and fixed types)

commit eaab3c6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 20:00:45 2021 -0700

    reformat all multi-line function defs to one-arg-per-line

commit 3fe2b69
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:48 2021 -0700

    ccitt nit -- avoid casting needlessly

commit 15983d8
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:58:36 2021 -0700

    tweak CHANGELOG

commit 13dc0ba
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:43:46 2021 -0700

    add failing tests for dumppdf crash

commit 6b509c5
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:24:23 2021 -0700

    ccitt: apply misc PR feedback

commit feb031b
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:18:26 2021 -0700

    add missing None return type to all __init__ methods

commit c0d62d6
Author: Andrew Baumann <ab@ab.id.au>
Date: Mon Sep 6 15:13:08 2021 -0700

    minor cleanup, remove a few more Any types

commit b52a059
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 22:37:28 2021 -0700

    tighten up types, avoid Any in favour of explicit casts

commit e58fd48
Author: Andrew Baumann <ab@ab.id.au>
Date: Sun Sep 5 14:10:49 2021 -0700

    annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes)

commit 6052906
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:37:38 2021 -0700

    python 3.7 back-compat

commit 4dbcf87
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:32:43 2021 -0700

    annotate pdfminer.jbig2

commit 0d40b7c
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 22:31:33 2021 -0700

    annotate pdf2txt.py

commit 5f82eb4
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Sep 4 09:16:31 2021 -0700

    cleanup: make Plane generic

commit 624fc92
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:16:51 2021 -0700

    bluntly ignore calls to cryptography.hazmat

commit 96b2043
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 23:01:06 2021 -0700

    finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2

commit 0ab5863
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 21:51:56 2021 -0700

    annotate pdffont

commit 4b689f1
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 18:30:02 2021 -0700

    annotate a couple more scripts; document sketchy code

commit 291981f
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 15:02:01 2021 -0700

    pacify flake8

commit 45d2ce9
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 14:31:48 2021 -0700

    annotate dumppdf, and comment likely bugs

commit 7278d83
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:49:58 2021 -0700

    enable mypy on tests and tools, fix one implicit reexport bug

commit 4a83166
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:59 2021 -0700

    pdfdocument: per dumppdf.py, get_dest accepts either bytes or str

commit 43701e1
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 13:25:00 2021 -0700

    layout: LAParams.boxes_flow may be None

commit 164f816
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:45:09 2021 -0700

    add whitespace, pacify flake8

commit 893b9fb
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:40:33 2021 -0700

    support old Python without typing.Protocol

commit dc24508
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Sep 3 09:12:03 2021 -0700

    Move "# type: ignore" comments to fix mypy on Python < 3.8

    The placement of these comments got more flexible in 3.8 due to
    python/mypy#1032

    Satisfying older Python and fitting in flake8's 79-character line
    limit was quite a challenge!

commit da03afe
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:59:58 2021 -0700

    fix text output from HTMLConverter

commit 5401276
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 22:40:22 2021 -0700

    annotate high_level.py and the immediately-reachable internal APIs (mostly converters)

commit cc49051
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Sep 2 17:04:35 2021 -0700

    * expand and improve annotations in cmap, encryption/decompression and fonts
    * disallow untyped calls; this way, we have a core set of typed code that can grow over time (just not for ccitt, because there's a ton of work lurking there)
    * expand "typing: none" comments to suppress a specific error code

commit 92df54b
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:50:59 2021 -0700

    update CHANGELOG

commit f72aaea
Merge: ff787a9 8ea9f10
Author: Andrew Baumann <ab@ab.id.au>
Date: Wed Sep 1 20:47:03 2021 -0700

    Merge branch 'develop' into mypy

commit ff787a9
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 21:46:14 2021 -0700

    be more precise about types on ps/pdf stacks, remove most of the Any annotations

commit be15501
Author: Andrew Baumann <ab@ab.id.au>
Date: Sat Aug 21 10:13:58 2021 -0700

    silence missing imports, (maybe?) hook to tox

commit ff4b6a9
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 22:49:06 2021 -0700

    turn on more strict checks, and untangle the layout mess with generics

    Status:
    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes"
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
    Found 5 errors in 4 files (checked 27 source files)

    pdfdevice.py:191 appears to be a real bug

commit 5c9c0b1
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 17:22:41 2021 -0700

    finish annotating layout

commit 0e6871c
Author: Andrew Baumann <ab@ab.id.au>
Date: Fri Aug 20 16:54:46 2021 -0700

    general progress on annotations

    * finish utils
    * annotate more of pdfinterp, pdfdevice
    * document reason for # type: ignore comments
    * fix cyclic imports
    * satisfy flake8

commit 17d59f4
Author: Andrew Baumann <ab@ab.id.au>
Date: Thu Aug 19 21:38:50 2021 -0700

    WIP on type annotations

    With the possible exception of psparser.py, this is far from complete.

    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
@0xabu Thanks for this big effort! I'm going to check the code again in a while (when I'm less drowsy) and merge it if everything looks good.
I've not had a chance to look at everything in detail, but the bits I've looked at LGTM. Thanks for working on this, it's a nice improvement! 🎉
Could not spot any potential problems.
But to be honest, if there is a potential problem, I'm not sure I would find it. There are so many changes.
But since mypy is happy (that's a good sign! 😄 ) and the tests are happy, and both I and @jstockwin and you are happy, I'm pretty confident it will work.
Thanks @0xabu for all this work!
No worries. Hopefully it makes life easier in the future. I have a few improvements of my own I can now tackle.
Um, there are some unqualified dependencies on typing_extensions:
I guess these should be wrapped up like this?:
And a note in the README.md that 3.5--3.7 have a dependency on |
Found the culprit: typing-extensions is a dependency of mypy and it is installed during the cicd tests. Will create a fix.
Darn, I guess we were getting that as an implicit dependency via mypy and I didn't notice. I think only the use in pdf2txt.py is worth keeping, so I could do as you suggest there, but we probably also have to add typing_extensions as an explicit dependency in setup.py?
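The snippets from this exchange did not survive on this page. As a general illustration only, and not a reconstruction of what was proposed, one common way to guard a `typing_extensions` import is a version check that mypy understands. `Protocol` is used here because the commit log mentions supporting old Python without `typing.Protocol`; the `SupportsWrite` class is hypothetical.

```python
import sys

if sys.version_info >= (3, 8):
    # Protocol is part of the standard library from Python 3.8 onwards.
    from typing import Protocol
else:
    # Older interpreters need the typing_extensions backport, which must
    # then be installed explicitly rather than arriving via mypy.
    from typing_extensions import Protocol


class SupportsWrite(Protocol):
    """Hypothetical structural type, for illustration only."""

    def write(self, data: str) -> object:
        ...
```

With a guard like this, the backport only needs to be declared for old interpreters, for example with an environment marker such as `typing_extensions; python_version < "3.8"` in the install requirements. Whether that fits this project's setup.py is an assumption.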
This PR adds type annotations and mypy checks to pdfminer. This can catch bugs early, and helps developer productivity -- besides serving as documentation, it also enables a much richer experience with autocompletion, type hints, intellisense, etc. in modern editors.
Many of the annotations are non-trivial (e.g. unions, generics, and more use of Any than one might prefer) because much of pdfminer is highly dynamic in its use of types, and the codebase has lots of sites where multiple types are possible (partly due to PDF parsing, and partly as a result of Python 2 legacy, e.g. with str and bytes).
My general strategy for dealing with type errors raised by mypy was:
(e.g. assert x is not None)
(that might be violated, e.g., by a non-conforming PDF file), or if
there's a valid conversion that mypy cannot recognise (e.g. the size of
tuples returned by chop and friends, or the type of struct.unpack), add
a cast. This has no runtime effect, but serves as documentation that
something potentially-fishy is going on. With some refactoring and stronger
annotations, some of these casts could be removed in the future.
should eventually be to remove these as annotations expand to cover
more of the code.
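The techniques named in this description (assertions, casts) and the `# type: ignore` comments discussed in the review can be sketched as follows. The function names are hypothetical and none of this is taken from the pdfminer source; it only shows what each technique looks like in isolation.

```python
import struct
from typing import Optional, Tuple, cast


def name_length(name: Optional[str]) -> int:
    # Assertion: states an invariant and narrows Optional[str] to str.
    # Unlike a cast, it is checked at runtime.
    assert name is not None
    return len(name)


def read_point(data: bytes) -> Tuple[int, int]:
    # Cast: struct.unpack is typed as returning a tuple of Any, so mypy
    # cannot see that ">HH" yields exactly two ints. cast() has no runtime
    # effect; it documents an assumption the reader should be wary of.
    return cast(Tuple[int, int], struct.unpack(">HH", data))


def parse_int(token: object) -> int:
    # Last resort: silence one specific error code on one line. The code
    # "call-overload" is what mypy is expected to report here; treat the
    # exact code as an assumption.
    return int(token)  # type: ignore[call-overload]
```

Naming the error code in the ignore comment keeps the suppression narrow, so an unrelated error introduced later on the same line is still reported.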
Related: issue #362.