ObjStm compression and PDF linearization doesn't work together #3603

Closed
SteveHawk opened this issue Jun 21, 2024 · 12 comments
Labels: fix developed (release schedule to be determined); wontfix (no intention to resolve)

Comments

@SteveHawk

Description of the bug

Since v1.24.1 introduced the use_objstms option in Document.save(), setting use_objstms=1 and linear=True together does not work on some documents and results in a broken PDF file. On versions >= 1.24.3, some documents even cause the program to crash.

How to reproduce the bug

Here's a minimal reproducible program:

import fitz

def test(filename: str) -> None:
    with fitz.open(filename) as doc:
        doc.ez_save("output.pdf", use_objstms=1, linear=True)
    with fitz.open("output.pdf") as doc:
        for page in doc:
            page.get_pixmap(dpi=72)

test("2401.08541v1.pdf")
test("1706.03762v7.pdf")

We ran into the problem when processing some internal documents, but managed to reproduce the issue with two random papers downloaded from arXiv. Here are the files:

1706.03762v7.pdf
2401.08541v1.pdf

When running the program, it prints error logs like the ones below during pixmap generation, possibly because the file is broken.

MuPDF error: syntax error: cannot find XObject resource 'Im1'

MuPDF error: syntax error: cannot find XObject resource 'Im2'

MuPDF error: syntax error: cannot find XObject resource 'Im3'

MuPDF error: syntax error: cannot find XObject resource 'Fm1'

MuPDF error: syntax error: cannot find XObject resource 'Fm2'

MuPDF error: syntax error: cannot find XObject resource 'Fm3'

MuPDF error: syntax error: cannot find XObject resource 'Fm4'

MuPDF error: syntax error: cannot find XObject resource 'Fm5'

The resulting PDF file is either blank or contains only some lines with no text when opened in Ubuntu's Evince document viewer. Opening it in Chrome does show text, but the font is altered and the figures are gone.

Also, turning on garbage collection seems to affect the crash pattern: when using ez_save, the first file crashes the program; when using save with no garbage collection, the second file crashes the program. Both crash with the following log:

realloc(): invalid next size
fish: Job 1, 'python test.py' terminated by signal SIGABRT (Abort)

PyMuPDF version

1.24.5

Operating system

Linux

Python version

3.11

@JorjMcKie JorjMcKie added the upstream bug bug outside this package label Jun 21, 2024
@JorjMcKie
Collaborator

Thank you for submitting this.

This happens inside the base library MuPDF. I am going to transfer the issue to the team for investigation.

@JorjMcKie
Collaborator

MuPDF issue reference https://bugs.ghostscript.com/show_bug.cgi?id=707835

@SteveHawk
Author

@JorjMcKie Thanks!

@JorjMcKie JorjMcKie added wontfix no intention to resolve and removed upstream bug bug outside this package labels Aug 16, 2024
@JorjMcKie
Collaborator

The MuPDF team has determined that object streams and linearization cannot be used together.
Therefore this issue cannot be fixed, and you must not use object streams when saving with linearization, and vice versa.
We will update the documentation and also prevent specifying concurrent use of these options.

@JorjMcKie JorjMcKie reopened this Aug 19, 2024
@JorjMcKie
Collaborator

Re-opening until the corresponding changes have been published.

@JorjMcKie JorjMcKie added the fix developed release schedule to be determined label Aug 23, 2024
@JorjMcKie
Collaborator

The label "fix developed" refers to changes that prevent specifying the options "linear" and "use_objstms" together in Document.save() and related methods.
A corresponding adjustment in the documentation has also been made.
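
For readers on versions without that change, the same restriction can be enforced in calling code. The following is only a minimal sketch: the helper name `check_save_options` is hypothetical and not part of PyMuPDF, and the error message is made up for illustration. It only inspects keyword arguments, so it does not show how the library itself implements the check.

```python
def check_save_options(**options):
    """Reject the option pair this issue reports as incompatible.

    Hypothetical helper, not part of PyMuPDF: it validates keyword
    arguments before they are passed on to Document.save().
    """
    if options.get("linear") and options.get("use_objstms"):
        raise ValueError("'linear' and 'use_objstms' cannot be used together")
    return options


# Intended use (requires PyMuPDF and an input file, so shown as comments):
#   import fitz
#   with fitz.open("input.pdf") as doc:
#       doc.save("output.pdf", **check_save_options(deflate=True, use_objstms=1))
```

Passing both `linear=True` and `use_objstms=1` to the helper raises `ValueError` before any file is written.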

@SteveHawk
Author

@JorjMcKie Thanks for the follow-up!

Just out of curiosity, can you elaborate on why these two options cannot be used together? For example, is it due to implementation limitations, is it impossible by design or by the spec, or is it something else?

I'd like to know if this is something we can maybe hope for getting implemented in the future.

@JorjMcKie
Collaborator

The first thing to realize is that this PDF concept has always been problematic. The PDF specification does not seem to make anyone happy because of its overall (undue) complexity, and its actual benefits are doubted by many.

We do see a trend away from linear file formats towards files with optimum compression that can be downloaded fast as a whole across today's highspeed networks.

Second, the goal of fast web access contradicts the desire for the smallest possible PDF file. A linear PDF by its very nature duplicates information - thus adding to the file size. These duplicated structures should not be compressed anyway in order to foster easy access to objects needed early.
Using object streams means that object definitions vanish from the set of directly accessible PDF objects: they will become members inside some compressed other object. In order to see what they are, that containing stream object must be decompressed. All this adds to the memory requirement of the hosting file server and of course also to its processing power consumption.

When combining linearity with standard compression plus the maximum garbage collection, the resulting file size is already a good-enough result for PDFs intended for page-wise display across internet connections.

So the MuPDF team came to the described conclusion - which will not ever be reverted as far as we can see.
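
The two mutually exclusive choices described above can be sketched as two sets of keyword arguments for Document.save(). This is an illustration only: the helper `save_kwargs` and the goal names "web" and "size" are hypothetical, and whether `linear` is accepted at all depends on the installed PyMuPDF version, so please check its documentation first.

```python
def save_kwargs(goal):
    """Return keyword arguments for Document.save() for one of two goals.

    Hypothetical helper: "web" favors page-wise display over a network,
    "size" favors the smallest file. The two are never combined.
    """
    # Maximum garbage collection plus standard stream compression.
    common = {"garbage": 4, "deflate": True}
    if goal == "web":
        return {**common, "linear": True}
    if goal == "size":
        return {**common, "use_objstms": 1}
    raise ValueError(f"unknown goal: {goal!r}")


# Intended use (requires PyMuPDF and an input file, so shown as comments):
#   import fitz
#   with fitz.open("input.pdf") as doc:
#       doc.save("web.pdf", **save_kwargs("web"))
#       doc.save("small.pdf", **save_kwargs("size"))
```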

@SteveHawk
Author

Your explanation is super informative and helpful, thank you so much!

this PDF concept has always been problematic

Soooo true, couldn't agree more. They are a nightmare to deal with.

We do see a trend away from linear file formats towards files with optimum compression that can be downloaded fast as a whole across today's highspeed networks.

This is very interesting. At work, we have met some extreme edge cases where gigantic PDF files can be tens of thousands of pages long and over a gigabyte in size. More compression means saving a buck on object storage, and linearization means customers can view the file faster (we have some enterprise customers who only have a 20 Mbps connection to their desks, where a file could take over 10 minutes to load, which sucks). That's why I tried to use these two options together.

These duplicated structures should not be compressed anyway in order to foster easy access to objects needed early.

In order to see what they are, that containing stream object must be decompressed.

That explains a lot. So I guess object stream compression versus linearization is pretty much a space-time tradeoff where we have to pick one.

@JorjMcKie
Collaborator

I understand. Thanks for your appreciation!
You may already have an idea of how much space object streams can save. This may range from single-digit to low two-digit percentages.
The old version of the Adobe spec (the one with 1310 pages) is in linear format and is 30 MB in size. Saving it with object streams, and thereby giving up the linear structures, took a lot of time (I think it was 30 minutes or so). The version now sitting on my computer is 20 MB.
That is a nice saving of 33%. But paging through it does take longer until all compressed object definitions have been unwrapped.
The new version, PDF 32000-1:2008 with 756 pages, has never been linearized (a notable fact per se!) and is originally 20 MB.
Saving it with object streams was very fast (seconds), and the result is 8.3 MB.
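
As a quick check of the figures quoted above, the relative savings can be computed as follows. The function name `percent_saved` is made up for this sketch; the sizes are the rounded values from this comment, so the results are approximate.

```python
def percent_saved(before_mb, after_mb):
    """Return the size reduction as a percentage of the original size."""
    return 100.0 * (before_mb - after_mb) / before_mb


# Old Adobe spec: 30 MB linear, 20 MB with object streams.
old_spec = percent_saved(30, 20)
# PDF 32000-1:2008: 20 MB originally, 8.3 MB with object streams.
new_spec = percent_saved(20, 8.3)

print(f"old spec: {old_spec:.1f}% saved")
print(f"new spec: {new_spec:.1f}% saved")
```

With these inputs the first case comes out at roughly a third and the second at a bit under 60%.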

@SteveHawk
Author

That's so cool to know!

I did know how much space object streams can save, and I went back to check on some of the impressive cases I tried. It turns out that the last example I experimented with was originally 19.9 MB, and object streams compressed it down to exactly 8.3 MB. What a coincidence!

@julian-smith-artifex-com
Collaborator

Fixed in 1.24.10.
