Unable to read files with new version, which worked previously #761
Comments
Hi, thanks for opening this issue. This was a risk we explicitly took when changing how the fallback is used in #684 (comment). I would like to keep this issue open for a while, gather example PDFs and get a feeling for how big the impact of this change is. Since the xref is a pretty important part of a PDF, I expect almost all PDFs will have a valid one.
Is there a possibility of adding a setting so that we can decide for ourselves whether we want the best performance or higher compatibility with files that were not created properly? This would help us a lot.
Yes, that's one way in which we could solve this issue.
I also have documents that are affected by this. It manifests as pages with no content. The documents are contracts, and I'm checking whether it is possible to share a sample. As an example, the following is the qpdf --check output for a 2-page document where pdfminer is not returning any text content for the second page: Metadata about the document: Some sort of compatibility flag would be great.
@prgx-erodri02 Can you share them? Please also write down what you expect the output to look like. I expect that in almost all cases the issue can be fixed by repairing the PDF with something like mutool. That will repair the xref and make your documents parsable. In 1.5 years there has been little activity here, so I think it is fair to assume that most PDFs have either an up-to-date xref or no xref.
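For anyone who wants to try the repair route suggested above, here is a minimal sketch of what it could look like from Python. It assumes `mutool` (from MuPDF) is installed and on the PATH, and that `mutool clean input.pdf output.pdf` is enough to rebuild the xref for your files. The helper names `build_repair_command`, `repair_pdf` and `extract_text_with_repair` are made up for illustration and are not part of pdfminer.six.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_repair_command(src, dst, tool="mutool"):
    """Build the argument list for `mutool clean <src> <dst>`."""
    return [tool, "clean", str(src), str(dst)]


def repair_pdf(src, dst):
    """Rewrite src into dst with mutool, which regenerates the xref table."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool was not found on PATH")
    subprocess.run(build_repair_command(src, dst), check=True)
    return Path(dst)


def extract_text_with_repair(src):
    """Repair a copy of the PDF first, then extract text from the copy."""
    # Imported lazily so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    with tempfile.TemporaryDirectory() as tmp:
        repaired = repair_pdf(src, Path(tmp) / "repaired.pdf")
        return extract_text(str(repaired))
```

Calling `extract_text_with_repair("contract.pdf")` (a placeholder file name) would then run the extraction on the repaired copy and leave the original file untouched. Whether this recovers the missing pages depends on how the xref in the affected documents is broken.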
Bug report
We have some files that could previously be processed with pdfminer.six, but after upgrading from version 20211012 to 20220506 this is no longer possible. I debugged to find the root cause, and it seems to be related to these changes:
pdfminer.six/pdfminer/pdfdocument.py
Lines 723 to 728 in 86e3487
Previously the fallback code was always executed, which made these PDF files work. Now it only runs when the exception is raised, and that exception does not occur for these documents.
Sadly, I cannot provide these documents.
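To make the difference concrete, here is a rough, self-contained sketch of the two behaviours. It is not the actual pdfminer.six code: `NoValidXRef`, `load_object_index` and the `always_fallback` flag are hypothetical names, and the xref table and the fallback scan are modelled as plain dictionaries that map object ids to byte offsets.

```python
class NoValidXRef(Exception):
    """Hypothetical stand-in for an 'xref could not be parsed' error."""


def load_object_index(declared_xref, scanned_objects, always_fallback):
    """Merge object locations from the declared xref and a fallback scan.

    declared_xref: dict from the file's xref table, or None if unparsable.
    scanned_objects: dict found by scanning the whole file for objects.
    always_fallback: True models the old behaviour, False the new one.
    """
    sources = []
    try:
        if declared_xref is None:
            raise NoValidXRef("no usable xref table")
        sources.append(declared_xref)
        parsed_ok = True
    except NoValidXRef:
        parsed_ok = False

    # Old behaviour: always add the scan. New behaviour: only after a failure.
    if always_fallback or not parsed_ok:
        sources.append(scanned_objects)

    index = {}
    for source in sources:
        for objid, offset in source.items():
            # Earlier sources take precedence over later ones.
            index.setdefault(objid, offset)
    return index


# An xref that parses without error but does not list object 2.
declared = {1: 10}
scanned = {1: 10, 2: 250}

old_index = load_object_index(declared, scanned, always_fallback=True)
new_index = load_object_index(declared, scanned, always_fallback=False)
```

In this toy model, object 2 is reachable only in `old_index`, because the incomplete xref parses cleanly and never triggers the fallback in the new flow. That would match the symptom described here, assuming the affected files have an xref that is syntactically valid but incomplete or outdated.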