Unable to read files with new version, which worked previously #761

Lucky-1994 · 2022-05-30T13:28:53Z

Bug report

We have some files that PDF Miner could process before, but since upgrading from version 20211012 to 20220506 this is no longer possible. I debugged to find the root cause, and it seems to be related to these changes:

pdfminer.six/pdfminer/pdfdocument.py

Lines 723 to 728 in 86e3487

        except PDFNoValidXRef:
            if fallback:
                parser.fallback = True
                newxref = PDFXRefFallback()
                newxref.load(parser)
                self.xrefs.append(newxref)

Previously the fallback code always ran, which made these PDF files work. Now it only runs when the exception is raised, and that does not happen for these documents.
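To illustrate the difference, here is a minimal toy model of the two control flows. It is not pdfminer.six code: `NoValidXRef`, `read_regular_xref`, `scan_for_objects`, `load_old` and `load_new` are hypothetical stand-ins, and a "PDF" is just a dict listing which object ids its xref table knows about and which objects actually exist in the file.

```python
class NoValidXRef(Exception):
    """Hypothetical stand-in for pdfminer's PDFNoValidXRef."""


def read_regular_xref(pdf):
    # Toy rule: a missing table (None) counts as "no valid xref".
    if pdf["xref"] is None:
        raise NoValidXRef()
    return set(pdf["xref"])


def scan_for_objects(pdf):
    # Toy version of the fallback: scan the whole file for objects.
    return set(pdf["objects"])


def load_old(pdf, fallback=True):
    """Behaviour as described for 20211012: the fallback always runs."""
    xrefs = []
    try:
        xrefs.append(read_regular_xref(pdf))
    except NoValidXRef:
        pass
    if fallback:
        xrefs.append(scan_for_objects(pdf))
    return xrefs


def load_new(pdf, fallback=True):
    """Behaviour as described for 20220506: fallback only on the exception."""
    xrefs = []
    try:
        xrefs.append(read_regular_xref(pdf))
    except NoValidXRef:
        if fallback:
            xrefs.append(scan_for_objects(pdf))
    return xrefs


def reachable(xrefs):
    # Object ids that can be resolved through any of the loaded xrefs.
    return set().union(*xrefs)


# An xref that parses fine but does not list every object in the file.
stale = {"xref": [1, 2, 3], "objects": [1, 2, 3, 4, 5]}
print(sorted(reachable(load_old(stale))))
print(sorted(reachable(load_new(stale))))
```

In this model the file with an incomplete but parseable xref loses objects 4 and 5 under the new flow, because no exception is raised and the scan never happens. A file with no xref at all behaves the same in both versions.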

Sadly, I cannot provide these documents.

pietermarsman · 2022-06-25T19:35:15Z

Hi,

Thanks for opening this issue. This was a risk we explicitly took when changing how the fallback is used in #684 (comment).

I would like to keep this issue open for a while, gather example PDFs, and get a feeling for how big the impact of this change is. Since the xref is a pretty important part of the PDF, I expect almost all PDFs will have a valid one.

Lucky-1994 · 2022-07-20T08:17:14Z

Would it be possible to add a setting so that we can decide for ourselves whether we want the best performance or higher compatibility with files that were not created properly? This would help us a lot.

pietermarsman · 2022-08-08T20:15:55Z

Yes, that's one way in which we could solve this issue.

prgx-erodri02 · 2024-01-10T09:53:20Z

I also have documents that are affected by this. It manifests as pages with no content. The documents are contracts, and I'm checking whether it is possible to share a sample.

As an example, the following is a qpdf --check output for a 2 page document where pdfminer is not returning any text content for the second page:
PDF Version: 1.7
File is not encrypted
File is linearized
WARNING: linearized file contains an uncompressed object after a compressed one in a cross-reference stream
WARNING: shared object 2 length mismatch: hint table = 2995; computed = 857
WARNING: shared object 3 length mismatch: hint table = 2717; computed = 857
WARNING: shared object 4 length mismatch: hint table = 6076; computed = 2995
WARNING: shared object 5 length mismatch: hint table = 825; computed = 857
WARNING: shared object 6 length mismatch: hint table = 7179; computed = 857
WARNING: shared object 7 length mismatch: hint table = 881; computed = 857
WARNING: shared object 8 length mismatch: hint table = 6963; computed = 2717
WARNING: page length mismatch for page 0: hint table = 28605; computed length = 10966 (offset = 1314)
WARNING: page length mismatch for page 1: hint table = 9515; computed length = 807 (offset = 29919)
WARNING: page 1: shared object 19: in hint table but not computed list
WARNING: page 1: shared object 20: in hint table but not computed list
WARNING: page 1: shared object 21: in hint table but not computed list
WARNING: page 1: shared object 22: in hint table but not computed list
WARNING: page 1: shared object 26: in computed list but not hint table
WARNING: page 1: shared object 29: in computed list but not hint table
WARNING: page 1: shared object 32: in computed list but not hint table
WARNING: page 1: shared object 35: in computed list but not hint table

Metadata about the document:
Application: wkhtmltopdf 0.12.5
PDF Producer: Qt 4.8.7

Some sort of compatibility flag would be great.

prgx-erodri02 · 2024-01-10T09:59:11Z

As additional information, the document loads fine when I check out commit dc530f3 on master, but does not load properly at commit 4b138a6 on master. So it's definitely the linked issue, #684, that caused this regression.

pietermarsman · 2024-01-16T20:38:50Z

@prgx-erodri02 Can you share them? And also write down what you expect the output to look like. I expect that in almost all cases the issue can be fixed by repairing the PDF with something like mutool. That will repair the xref and make your documents parsable.

In 1.5 years there has been little activity here, so I think it is fair to assume that most PDFs have either an up-to-date xref or no xref.
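As a rough sketch of the repair step suggested above, a small wrapper could run `mutool clean` on a file before handing it to pdfminer. This assumes MuPDF's `mutool` is installed and on the PATH; the helper names `build_repair_command` and `repair_pdf` are made up for illustration, and whether the rewritten file then parses correctly depends on the document.

```python
import shutil
import subprocess
from pathlib import Path


def build_repair_command(src, dst):
    # `mutool clean input.pdf output.pdf` rewrites the file; MuPDF
    # rebuilds the cross-reference data while doing so.
    return ["mutool", "clean", str(src), str(dst)]


def repair_pdf(src, dst):
    """Write a repaired copy of src to dst and return its path."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool was not found on PATH")
    subprocess.run(build_repair_command(src, dst), check=True)
    return Path(dst)
```

The repaired copy can then be passed to pdfminer in place of the original, for example `repair_pdf("contract.pdf", "contract.fixed.pdf")` with placeholder file names.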

pietermarsman added the type: bug and component:document (Related to PDFDocument) labels Jun 25, 2022

pietermarsman added the status: needs more info label Aug 9, 2022

pietermarsman added the type:anomaly (Errors caused by deviations from the PDF Reference) label and removed the type: bug label Jan 16, 2024
