-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unbundle the cmaps data #207
Comments
Hi @pombredanne. Could you point out as to how should I go about doing this? |
I agree that it is a little bit weird to include these into the package. But on the other hand, it worked for years and it is really easy to use. Also, they don't change a lot. Alternatives are:
Enough alternatives, but I'm not sure why the current setup needs to change. |
I think that it would be better to not use pickle for cmaps, my main concern being security (not size). I read some discussion in some closed PR so I tought of giving my opinion here. |
I think its this PR: #117 The benefit of having the cmaps is the fact that pdfminer.six has "batteries-included": CJK just works. And I'm not seeing the downsides of having them. Why is the increased package size a problem? The security issue with pickles is a valid one. People need to trust the pdfminer.six community in order to trust the pickles. This is also mentioned by @KOLANICH in #117. I'm even more worried about the compatibility of the pickles with our code. Not sure what kind of objects are in the files, but if we change the corresponding code classes it might break them. In summary, I'm more concerned about the file types than about the file size. |
Its also very possible that pickle has a large overhead and that changing the file type reduces the file size, or makes it easier to compress. |
Some observations:
Preferred solution here is to pack pdfminer.six with compressed json cmap's. Instead of json, any other format that does not execute arbitrary code is also fine. |
I've started working on this. Storing the cmap's as gzipped jsons takes 8.2mb. Loading the gzipped jsons takes 0.3 seconds (on average). Loading the pickles takes 0.13 seconds. Often you will only need 1 file, but it is a pitty that it is slower. I also tried storing all data in a single gzipped json, this takes 0.34 seconds. An uncompressed json takes 26mb and 0.25 seconds to load. Another problem with json is that it does not support int keys, and they are there. Using a binary serialization format with schema could be an option. Maybe protobuf or avro. |
These should IMHO be made available only as an option and in a separate wheel... The current wheel is ~ 6MB and the almost all of it except 100KB is from the cmap data.
The text was updated successfully, but these errors were encountered: