Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unbundle the cmaps data #207

Open
pombredanne opened this issue Nov 11, 2018 · 7 comments · May be fixed by #940
Open

Unbundle the cmaps data #207

pombredanne opened this issue Nov 11, 2018 · 7 comments · May be fixed by #940

Comments

@pombredanne
Copy link
Contributor

These should IMHO be made available only as an option and in a separate wheel... The current wheel is ~ 6MB and the almost all of it except 100KB is from the cmap data.

@tataganesh
Copy link
Member

Hi @pombredanne. Could you point out as to how should I go about doing this?

@pietermarsman
Copy link
Member

I agree that it is a little bit weird to include these into the package. But on the other hand, it worked for years and it is really easy to use. Also, they don't change a lot.

Alternatives are:

Enough alternatives, but I'm not sure why the current setup needs to change.

@typhoon71
Copy link
Contributor

typhoon71 commented Jul 22, 2020

I think that it would be better to not use pickle for cmaps, my main concern being security (not size).
Generating them during install would be nice, but using another compression would be good too.

I read some discussion in some closed PR so I tought of giving my opinion here.

@pietermarsman
Copy link
Member

I think its this PR: #117

The benefit of having the cmaps is the fact that pdfminer.six has "batteries-included": CJK just works. And I'm not seeing the downsides of having them. Why is the increased package size a problem?

The security issue with pickles is a valid one. People need to trust the pdfminer.six community in order to trust the pickles. This is also mentioned by @KOLANICH in #117. I'm even more worried about the compatibility of the pickles with our code. Not sure what kind of objects are in the files, but if we change the corresponding code classes it might break them.

In summary, I'm more concerned about the file types than about the file size.

@pietermarsman
Copy link
Member

Its also very possible that pickle has a large overhead and that changing the file type reduces the file size, or makes it easier to compress.

@pietermarsman
Copy link
Member

pietermarsman commented Jan 27, 2024

Some observations:

  • The pickles are built from the text files in cmapsrc/. These are 7.4mb. The pickles are 6.6mb.
  • There is lots of redundancy in the text files. Zipping them results in 1.9mb.
  • I've checked the content of the pickles and these are just simple dicts with str and int keys and values. Json would do just as fine.
  • Rebuilding the pickles takes a couple of seconds, so doing that on runtime is not an option.
  • The files are only loaded in CMapDB in cmapdb.py

Preferred solution here is to pack pdfminer.six with compressed json cmap's. Instead of json, any other format that does not execute arbitrary code is also fine.

@pietermarsman pietermarsman linked a pull request Jan 27, 2024 that will close this issue
5 tasks
@pietermarsman
Copy link
Member

pietermarsman commented Jan 27, 2024

I've started working on this. Storing the cmap's as gzipped jsons takes 8.2mb.

Loading the gzipped jsons takes 0.3 seconds (on average). Loading the pickles takes 0.13 seconds. Often you will only need 1 file, but it is a pitty that it is slower.

I also tried storing all data in a single gzipped json, this takes 0.34 seconds.

An uncompressed json takes 26mb and 0.25 seconds to load.

Another problem with json is that it does not support int keys, and they are there.

Using a binary serialization format with schema could be an option. Maybe protobuf or avro.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants