Unable to read files with new version, which worked previously #761

Open
Lucky-1994 opened this issue May 30, 2022 · 6 comments
Labels
component:document (Related to PDFDocument)
status: needs more info
type:anomaly (Errors caused by deviations from the PDF Reference)

Comments

@Lucky-1994
Lucky-1994 commented May 30, 2022

Bug report

We have some files that PDF Miner could previously process, but since upgrading from version 20211012 to 20220506 this is no longer possible. I debugged to find the root cause, and it seems to be related to these changes:

except PDFNoValidXRef:
    if fallback:
        parser.fallback = True
        newxref = PDFXRefFallback()
        newxref.load(parser)
        self.xrefs.append(newxref)

Previously the fallback code was always executed, which made these PDF files work. Now it runs only when the exception is raised, and that does not happen for these documents.
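
To make the difference concrete, here is a toy model of the two control flows. It is not pdfminer's actual code: `NoValidXRef`, `read_declared_xref`, and `scan_for_objects` are hypothetical stand-ins for `PDFNoValidXRef`, the normal xref reader, and `PDFXRefFallback`, and the `fallback` argument is left out for brevity. The sample "PDF" is an assumed case where the declared xref parses without error but does not list every object.

```python
class NoValidXRef(Exception):
    """Hypothetical stand-in for pdfminer's PDFNoValidXRef."""


def read_declared_xref(pdf):
    # Stand-in for the normal xref reader: fails only if no table is present.
    if pdf["declared_xref"] is None:
        raise NoValidXRef()
    return dict(pdf["declared_xref"])


def scan_for_objects(pdf):
    # Stand-in for PDFXRefFallback: locate every object by scanning the file.
    return dict(pdf["objects"])


def load_xrefs_old(pdf):
    """Behavior as described for 20211012: the fallback scan always runs."""
    xrefs = []
    try:
        xrefs.append(read_declared_xref(pdf))
    except NoValidXRef:
        pass
    xrefs.append(scan_for_objects(pdf))
    return xrefs


def load_xrefs_new(pdf):
    """Behavior as described for 20220506: the scan runs only on failure."""
    xrefs = []
    try:
        xrefs.append(read_declared_xref(pdf))
    except NoValidXRef:
        xrefs.append(scan_for_objects(pdf))
    return xrefs


def can_resolve(xrefs, objid):
    # An object is reachable if any loaded table knows its offset.
    return any(objid in xref for xref in xrefs)


# A file whose xref parses fine but does not list object 3.
stale_pdf = {"declared_xref": {1: 10, 2: 50}, "objects": {1: 10, 2: 50, 3: 90}}

print(can_resolve(load_xrefs_old(stale_pdf), 3))  # True
print(can_resolve(load_xrefs_new(stale_pdf), 3))  # False
```

In this model, the new flow never sees object 3 because no exception is raised, which matches the symptom described here.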

Sadly, I cannot provide these documents.

@pietermarsman
Member

Hi,

Thanks for opening this issue. This was a risk we explicitly took when changing how the fallback is used in #684 (comment).

I would like to keep this issue open for a while, gather example PDFs, and get a feeling for how big the impact of this change is. Since the xref is a pretty important part of a PDF, I expect almost all PDFs to have a valid one.

@Lucky-1994
Author

Would it be possible to add a setting so that we can decide for ourselves whether we want the best performance or higher compatibility with files that were not created properly? This would help us a lot.

@pietermarsman
Member

Yes, that's one way in which we could solve this issue.

@prgx-erodri02

I also have documents that are affected by this. It manifests as pages with no content. The documents are contracts, and I'm checking whether I can share a sample.

As an example, the following is the qpdf --check output for a 2-page document where pdfminer does not return any text content for the second page:
PDF Version: 1.7
File is not encrypted
File is linearized
WARNING: linearized file contains an uncompressed object after a compressed one in a cross-reference stream
WARNING: shared object 2 length mismatch: hint table = 2995; computed = 857
WARNING: shared object 3 length mismatch: hint table = 2717; computed = 857
WARNING: shared object 4 length mismatch: hint table = 6076; computed = 2995
WARNING: shared object 5 length mismatch: hint table = 825; computed = 857
WARNING: shared object 6 length mismatch: hint table = 7179; computed = 857
WARNING: shared object 7 length mismatch: hint table = 881; computed = 857
WARNING: shared object 8 length mismatch: hint table = 6963; computed = 2717
WARNING: page length mismatch for page 0: hint table = 28605; computed length = 10966 (offset = 1314)
WARNING: page length mismatch for page 1: hint table = 9515; computed length = 807 (offset = 29919)
WARNING: page 1: shared object 19: in hint table but not computed list
WARNING: page 1: shared object 20: in hint table but not computed list
WARNING: page 1: shared object 21: in hint table but not computed list
WARNING: page 1: shared object 22: in hint table but not computed list
WARNING: page 1: shared object 26: in computed list but not hint table
WARNING: page 1: shared object 29: in computed list but not hint table
WARNING: page 1: shared object 32: in computed list but not hint table
WARNING: page 1: shared object 35: in computed list but not hint table
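
For triaging many files, output like the above can be summarized with a small helper. This is only a sketch: it assumes `qpdf` is on the PATH, that warning lines start with `WARNING:` as in the output quoted here, and the function names are made up for illustration.

```python
import subprocess


def summarize_qpdf_check(output):
    """Count warning lines and note whether qpdf reports linearization."""
    lines = [line.strip() for line in output.splitlines()]
    return {
        "linearized": "File is linearized" in lines,
        "warnings": sum(1 for line in lines if line.startswith("WARNING:")),
    }


def run_qpdf_check(path):
    # qpdf exits non-zero when it finds warnings or errors, so no check=True.
    result = subprocess.run(
        ["qpdf", "--check", path], capture_output=True, text=True
    )
    # Combine both streams so warnings are counted wherever qpdf prints them.
    return summarize_qpdf_check(result.stdout + "\n" + result.stderr)


sample = """PDF Version: 1.7
File is not encrypted
File is linearized
WARNING: page length mismatch for page 0: hint table = 28605; computed length = 10966 (offset = 1314)
WARNING: page 1: shared object 19: in hint table but not computed list
"""
print(summarize_qpdf_check(sample))  # {'linearized': True, 'warnings': 2}
```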

Metadata about the document:
Application: wkhtmltopdf 0.12.5
PDF Producer: Qt 4.8.7

Some sort of compatibility flag would be great.

@prgx-erodri02

As additional information, the document loads fine at commit dc530f3 on master, but does not load properly at commit 4b138a6 on master. So it's definitely the linked change, #684, that caused this regression.

@pietermarsman
Member

@prgx-erodri02 Can you share them, and also describe what you expect the output to look like? I expect that in almost all cases the issue can be fixed by repairing the PDF with something like mutool. That will repair the xref and make your documents parsable.
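
A minimal sketch of that repair step, driven from Python. It assumes MuPDF's `mutool` is installed and on the PATH and uses its `clean` subcommand, which rewrites the file and should produce a fresh xref; the helper names and file names are placeholders, not part of pdfminer.

```python
import subprocess


def build_repair_command(src, dst):
    # `mutool clean INPUT OUTPUT` rewrites the PDF into a new file.
    return ["mutool", "clean", src, dst]


def repair_pdf(src, dst):
    """Run mutool and raise CalledProcessError if it fails."""
    subprocess.run(build_repair_command(src, dst), check=True)
    return dst


# Placeholder file names, shown without running mutool.
print(build_repair_command("contract.pdf", "contract.repaired.pdf"))
```

The repaired copy can then be passed to pdfminer in place of the original file.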

In 1.5 years there has been little activity here, so I think it is fair to assume that most PDFs have either an up-to-date xref or no xref.

@pietermarsman added the type:anomaly label (Errors caused by deviations from the PDF Reference) and removed the type: bug label on Jan 16, 2024