Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swap pycryptodome to the faster, smaller, and industry standard cryto… #456

Merged

Conversation

lithiumFlower
Copy link
Contributor

@lithiumFlower lithiumFlower commented Jul 10, 2020

There is an outstanding issue #429

While the size of pycryptodome is not as much as titled, it is still quite large (~40 MB).

Cryptography and all of its dependencies is ~7 MB. The dependencies (cffi, six, pycparser) are well known and common in other packages.

Cryptography defaults to using its shipped openssl library and has become the industry standard for python.
Why is openssl advantageous compared to pycryptodome?

  1. All crypto routines are purely written in C and wrapped in python.
  2. openssl utilizes hardware accelerated AES. This is the most significant performance boost a symmetric encryption routine can get.
  3. While openssl has had its security shortcomings in the past, it's one of the most used and updated cryto libs in existence. This means comparatively hardened security compared to pycryptodome. Of course, as far as I can tell from my cursory look at the codebase pdfminer only decrypts. This makes the security impact low.

How Has This Been Tested?
I ran the shipped test suits including make crypts on the sample pdfs.

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works (I have not added existing codepaths, the existing testcases test these)
  • I have added docstrings to newly created methods and classes (no new methods or classes)
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I am verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a consice human-readable description of the change to
    CHANGELOG.md

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me. Some small requests.

I tested with all the pdf's in samples/encryption/ and they work :) I believe that those pdf's are not part of our test suite. It would be great if you can add them. They are encrypted with the password "foo".

setup.py Show resolved Hide resolved
pdfminer/pdfdocument.py Outdated Show resolved Hide resolved
pdfminer/pdfdocument.py Outdated Show resolved Hide resolved
pdfminer/pdfdocument.py Outdated Show resolved Hide resolved
@pietermarsman pietermarsman linked an issue Jul 11, 2020 that may be closed by this pull request
@lithiumFlower
Copy link
Contributor Author

lithiumFlower commented Jul 12, 2020

@pietermarsman When you say that the files in samples_files/encryption are not part of the test suite, do you mean they're not used on travis? They are currently run as part of the Makefile's crypts task. If you use the parent Makefile and run make tests it will run nosetests, the processing of the free and nonfree pdfs in sample_files, as well as the crypts target mentioned previously.

I can add/refactor to use any of these targets instead of the existing tox.ini logic if that's what you're suggesting. However, if travis doesn't have the required programs installed then I'm not interested in debugging that.

The options would be:
1. Run all three tests (nosetests, free/nonfree, crypts) - simply call the parent Makefile's test target instead of nosetests .. in tox.ini
2 Run nosetests and crypts - leave the current nose call and add an additional child Makefile call to make crypts

If that is what you were suggesting then please let me know which would be preferred. If not, then I didn't understand your comment.

@corneliusroemer
Copy link

There is an outstanding issue #429

While the size of pycryptodome is not as much as titled, it is still quite large (~40 MB).

Cryptography and all of its dependencies is ~7 MB. The dependencies (cffi, six, pycparser) are well known and common in other packages.

I'm the one who opened the issue. I think there's a slight misunderstanding as to the origin of the extra 260MB for pycryptodome.

The fact is, that in order to install pdfminer on Alpine Linux, all the requirements can add to 260 MB. I needed to install the following, which all count towards the 260 MB:
apk add gcc g++ make libffi-dev openssl-dev

My personal request was to somehow change pdfminer so that it has less bloating (be it pdfminer module proper or requirements) when installing on Alpine Linux. Whether switching to a different crypto package helps, I don't know. It depends as much on its requirements as on the package itself. So I guess one would have to test that specifically.

@lithiumFlower
Copy link
Contributor Author

lithiumFlower commented Jul 13, 2020

@corneliusroemer w.r.t to your comment #456 (comment)

I'm the one who opened the issue. I think there's a slight misunderstanding as to the origin of the extra 260MB for pycryptodome.

Thanks for the heads up, though I am aware of those particulars (I commented on the issue itself). If there were a prebuilt wheel of pycryptodome for Alpine it would probably be of size similar or on the order of the windows size (~40 MB).

@corneliusroemer
Copy link

@lithiumFlower Thanks! That makes sense. Ah right, it's called wheel if you don't need build tools!

@pietermarsman
Copy link
Member

@pietermarsman When you say that the files in samples_files/encryption are not part of the test suite, do you mean they're not used on travis?

Yes, that's what I mean. My preferred solution is to add them to the nosetest test suite. In particular, test_pdf2txt.py is the place to test if we can extract a pdf without error. For the encrypted pdf's we don't have any nosetests yet. I'm also ok with removing the makefile in the encryption directory to eliminate this alternative testing option.

@lithiumFlower
Copy link
Contributor Author

lithiumFlower commented Jul 16, 2020

@pietermarsman I've removed the samples Makefile and ported its functionality to tests/test_samples.py. I've used a new test file as its purpose is different from test_pdf2txt.py. Additionally, there were no .refs in the repository meaning there was nothing to test against to ensure you've not changed the output, so I added them for each pdf. From the travis build we can see that nonfree/kampo.pdf parses differently with python 3.8 than the other versions. I don't have any familiarity with the internals of pdfminer to say why this would be.

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant adding it to test_tools_pdf2txt.py. This only tests if the pdf can be processed without errors and that's also what we want to test with this PR.

I agree that comparing the text output for some pdf's might be a good idea. But I rather add those tests as part of a different PR where we choose the pdf's deliberately.

@lithiumFlower
Copy link
Contributor Author

lithiumFlower commented Jul 19, 2020

@pietermarsman Ok. In the interest of merging the specific improvement of pycryptodome -> cryptography I'm going to revert to before the Makefile port and make that a separate MR for the Makefile -> nosetests port that people can do with what they please. Then, making the swap is available and ready if a maintainer chooses to use it.

How does that sound to you?

@pietermarsman
Copy link
Member

pietermarsman commented Jul 20, 2020

Probably easier if I do it myself. Can you give my chances one more look before I merge it?

@lithiumFlower
Copy link
Contributor Author

LGTM

@pietermarsman pietermarsman merged commit c10cf3c into pdfminer:develop Jul 20, 2020
@pietermarsman
Copy link
Member

Great! Thanks for all the work :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Removing Pycryptodome Dependency: Massive bloat (extra 260 MB)
3 participants