Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type annotations #661

Merged
merged 2 commits into from
Oct 9, 2021
Merged

Type annotations #661

merged 2 commits into from
Oct 9, 2021

Conversation

0xabu
Copy link
Contributor

@0xabu 0xabu commented Aug 20, 2021

This PR adds type annotations and mypy checks to pdfminer. This can catch bugs early, and helps developer productivity -- besides serving as documentation, it also enables a much richer experience with autocompletion, type hints, intellisense, etc. in modern editors.

Many of the annotations are non-trivial (e.g. unions, generics, and more use of Any than one might prefer) because much of pdfminer is highly dynamic in its use of types, and the codebase has lots of sites where multiple types are possible (partly due to PDF parsing, and partly as a result of Python2 legacy, e.g. with str and bytes).

My general strategy to dealing with type errors raised by mypy was:

  • If the code would obviously crash at runtime, add an assertion
    (e.g. assert x is not None)
  • Otherwise, if there's an unchecked conversion between object types
    (that might be violated, e.g., by a non-conforming PDF file), or if
    there's a valid conversion that mypy cannot recognise (e.g. the size of
    tuples returned by chop and friends, or the type of struct.unpack), add
    a cast. This has no runtime effect, but serves as documentation that
    something potentially-fishy is going on. With some refactoring and stronger
    annotations, some of these casts could be removed in the future.
  • Otherwise, add a # type: ignore comment to silence the issue -- the goal
    should eventually be to remove these as annotations expand to cover
    more of the code.

Related: issue #362.

@0xabu

This comment has been minimized.

@0xabu 0xabu changed the title [RFC] work in progress on type annotations [RFC] type annotations Aug 21, 2021
@0xabu 0xabu marked this pull request as ready for review September 3, 2021 06:11
@0xabu 0xabu changed the title [RFC] type annotations Type annotations Sep 3, 2021
@0xabu
Copy link
Contributor Author

0xabu commented Sep 4, 2021

@pietermarsman, @jstockwin -- this got far out of hand from what I originally intended in scope (I've now annotated nearly the entire codebase), but feels like it is ready for review and discussion. There's plenty of scope for changes, but I'm hoping the value is obvious just from the number of issues it has already uncovered (see comments in the PR).

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! 👍 This should help a lot to keep pdfminer.six maintainable and bug free (well... less bugs).

I only had time for a shallow review. I want to look at it thoroughly before merging.

pdfminer/high_level.py Outdated Show resolved Hide resolved
pdfminer/arcfour.py Outdated Show resolved Hide resolved
pdfminer/ccitt.py Outdated Show resolved Hide resolved
pdfminer/ccitt.py Outdated Show resolved Hide resolved
pdfminer/ccitt.py Outdated Show resolved Hide resolved
pdfminer/encodingdb.py Show resolved Hide resolved
setup.py Show resolved Hide resolved
tools/dumppdf.py Outdated
Comment on lines 64 to 65
# Likely bug: writing bytes to text I/O.
out.write(obj.get_rawdata()) # type: ignore [arg-type]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case that shows this crashes? If it does, we can change this code such that it doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Note to self:

$ tools/dumppdf.py -o out -a --raw-stream samples/simple1.pdf
Traceback (most recent call last):
  File "tools/dumppdf.py", line 388, in <module>
    main()
  File "tools/dumppdf.py", line 378, in main
    dumppdf(
  File "tools/dumppdf.py", line 259, in dumppdf
    dumpallobjs(outfp, doc, codec, show_fallback_xref)
  File "tools/dumppdf.py", line 133, in dumpallobjs
    dumpxml(out, obj, codec=codec)
  File "tools/dumppdf.py", line 65, in dumpxml
    out.write(obj.get_rawdata())  # type: ignore [arg-type]
TypeError: write() argument must be str, not bytes

$ tools/dumppdf.py -o out -a --binary-stream samples/simple1.pdf
Traceback (most recent call last):
  File "tools/dumppdf.py", line 388, in <module>
    main()
  File "tools/dumppdf.py", line 378, in main
    dumppdf(
  File "tools/dumppdf.py", line 259, in dumppdf
    dumpallobjs(outfp, doc, codec, show_fallback_xref)
  File "tools/dumppdf.py", line 133, in dumpallobjs
    dumpxml(out, obj, codec=codec)
  File "tools/dumppdf.py", line 68, in dumpxml
    out.write(obj.get_data())  # type: ignore [arg-type]
TypeError: write() argument must be str, not bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test for this (which fails), but it's hard to tell what the expected output actually is here -- both paths were trying to write raw binary data into the middle of an XML file. We could reach around the file and write raw bytes there, but would that be useful? We could choose a codec like base64, but would that be useful? I'm tempted to mark this as a known issue and leave it for someone else to decide how to fix it :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, indeed. Let's not make this bigger than it already is.

All the comments and type ignores are a big improvement, explicitly showing were we can improve.

Squashed commit of the following:

commit fa229f7
Merge: eaab3c6 c3e3499
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 20:33:06 2021 -0700

    Merge branch 'develop' into mypy (and fixed types)

commit eaab3c6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 20:00:45 2021 -0700

    reformat all multi-line function defs to one-arg-per-line

commit 3fe2b69
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:58:48 2021 -0700

    ccitt nit -- avoid casting needlessly

commit 15983d8
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:58:36 2021 -0700

    tweak CHANGELOG

commit 13dc0ba
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:43:46 2021 -0700

    add failing tests for dumppdf crash

commit 6b509c5
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:24:23 2021 -0700

    ccitt: apply misc PR feedback

commit feb031b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:18:26 2021 -0700

    add missing None return type to all __init__ methods

commit c0d62d6
Author: Andrew Baumann <ab@ab.id.au>
Date:   Mon Sep 6 15:13:08 2021 -0700

    minor cleanup, remove a few more Any types

commit b52a059
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sun Sep 5 22:37:28 2021 -0700

    tighten up types, avoid Any in favour of explicit casts

commit e58fd48
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sun Sep 5 14:10:49 2021 -0700

    annotate ccitt.py, and fix one definite bug (array.tostring was renamed tobytes)

commit 6052906
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:37:38 2021 -0700

    python 3.7 back-compat

commit 4dbcf87
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:32:43 2021 -0700

    annotate pdfminer.jbig2

commit 0d40b7c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 22:31:33 2021 -0700

    annotate pdf2txt.py

commit 5f82eb4
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Sep 4 09:16:31 2021 -0700

    cleanup: make Plane generic

commit 624fc92
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 23:16:51 2021 -0700

    bluntly ignore calls to cryptography.hazmat

commit 96b2043
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 23:01:06 2021 -0700

    finish annotating, and disallow_untyped_defs for pdfminer.* _except_ ccitt and jbig2

commit 0ab5863
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 21:51:56 2021 -0700

    annotate pdffont

commit 4b689f1
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 18:30:02 2021 -0700

    annotate a couple more scripts; document sketchy code

commit 291981f
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 15:02:01 2021 -0700

    pacify flake8

commit 45d2ce9
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 14:31:48 2021 -0700

    annotate dumppdf, and comment likely bugs

commit 7278d83
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:49:58 2021 -0700

    enable mypy on tests and tools, fix one implicit reexport bug

commit 4a83166
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:25:59 2021 -0700

    pdfdocument: per dumppdf.py, get_dest accepts either bytes or str

commit 43701e1
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 13:25:00 2021 -0700

    layout: LAParams.boxes_flow may be None

commit 164f816
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:45:09 2021 -0700

    add whitespace, pacify flake8

commit 893b9fb
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:40:33 2021 -0700

    support old Python without typing.Protocol

commit dc24508
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Sep 3 09:12:03 2021 -0700

    Move "# type: ignore" comments to fix mypy on Python < 3.8

    The placement of these comments got more flexible in 3.8 due to
    python/mypy#1032

    Satisfying older Python and fitting in flake8's 79-character line
    limit was quite a challenge!

commit da03afe
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 22:59:58 2021 -0700

    fix text output from HTMLConverter

commit 5401276
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 22:40:22 2021 -0700

    annotate high_level.py and the immediately-reachable internal APIs (mostly converters)

commit cc49051
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Sep 2 17:04:35 2021 -0700

     * expand and improve annotations in cmap, encryption/decompression and fonts
     * disallow untyped calls; this way, we have a core set of
       typed code that can grow over time
       (just not for ccitt, because there's a ton of work lurking there)
     * expand "typing: none" comments to suppress a specific error code

commit 92df54b
Author: Andrew Baumann <ab@ab.id.au>
Date:   Wed Sep 1 20:50:59 2021 -0700

    update CHANGELOG

commit f72aaea
Merge: ff787a9 8ea9f10
Author: Andrew Baumann <ab@ab.id.au>
Date:   Wed Sep 1 20:47:03 2021 -0700

    Merge branch 'develop' into mypy

commit ff787a9
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Aug 21 21:46:14 2021 -0700

    be more precise about types on ps/pdf stacks, remove most of the Any annotations

commit be15501
Author: Andrew Baumann <ab@ab.id.au>
Date:   Sat Aug 21 10:13:58 2021 -0700

    silence missing imports, (maybe?) hook to tox

commit ff4b6a9
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 22:49:06 2021 -0700

    turn on more strict checks, and untangle the layout mess with generics

    Status:
    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/pdfdevice.py:191: error: Argument 1 to "write" of "IO" has incompatible type "str"; expected "bytes"
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
    Found 5 errors in 4 files (checked 27 source files)

    pdfdevice.py:191 appears to be a real bug

commit 5c9c0b1
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 17:22:41 2021 -0700

    finish annotating layout

commit 0e6871c
Author: Andrew Baumann <ab@ab.id.au>
Date:   Fri Aug 20 16:54:46 2021 -0700

    general progress on annotations
     * finish utils
     * annotate more of pdfinterp, pdfdevice
     * document reason for # type: ignore comments
     * fix cyclic imports
     * satisfy flake8

commit 17d59f4
Author: Andrew Baumann <ab@ab.id.au>
Date:   Thu Aug 19 21:38:50 2021 -0700

    WIP on type annotations

    With the possible exception of psparser.py, this is far from complete.

    $ mypy pdfminer
    pdfminer/ccitt.py:565: error: Cannot find implementation or library stub for module named "pygame"
    pdfminer/ccitt.py:565: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    pdfminer/pdfdocument.py:7: error: Skipping analyzing "cryptography.hazmat.backends": found module but no type hints or library stubs
    pdfminer/pdfdocument.py:8: error: Skipping analyzing "cryptography.hazmat.primitives.ciphers": found module but no type hints or library stubs
    pdfminer/image.py:84: error: Cannot find implementation or library stub for module named "PIL"
@pietermarsman
Copy link
Member

@0xabu Thanks for this big effort! I'm going to check the code again in a while (when I'm less drowsy) and merge it if everything looks good.

Copy link
Member

@jstockwin jstockwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not had a chance to look at everything in detail, but the bits I've looked at LGTM. Thanks for working on this, it's a nice improvement! 🎉

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could not spot any potential problems.

But to be honest, if there is a potential problem, I'm not sure I would find it. There are so many changes.

But since mypy is happy (that's a good sign! 😄 ) and the test are happy, and both I and @jstockwin and you are happy, I'm pretty confident it will work.

@pietermarsman pietermarsman merged commit 9406040 into pdfminer:develop Oct 9, 2021
@0xabu 0xabu deleted the mypy branch October 9, 2021 16:02
@pietermarsman
Copy link
Member

Thanks @0xabu for all this work!

@0xabu
Copy link
Contributor Author

0xabu commented Oct 10, 2021

No worries. Hopefully it makes life easier in the future. I have a few improvements of my own I can now tackle.

@htInEdin
Copy link

Um, there are some unqualified dependencies on typing_extensions:

>: ack typing_extensions --python
tools/pdf2txt.py
8:from typing_extensions import Literal

pdfminer/utils.py
10:from typing_extensions import Literal

I guess these should be wrapped up like this?:

if sys.version_info < (3,8):
  from typing_extensions import Literal
else:
  from typing import Literal

And a note in the README.md that 3.5--3.7 have a dependency on typing_extensions?

@pietermarsman
Copy link
Member

Found the culprit: typing-extensions is a dependency of mypy and it is installed during the cicd tests. Will create a fix.

@0xabu
Copy link
Contributor Author

0xabu commented Oct 11, 2021

Darn, I guess we were getting that as an implicit dependency via mypy and I didn't notice.

I think only the use in pdf2txt.py is worth keeping, so I could do as you suggest there, but we probably also have to add typing_extensions as an explicit dependency in setup.py?

https://pypi.org/project/typing-extensions/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants