Find a work-around for scanned pdf #2

mvicenzi · 2022-04-25T12:33:45Z

Scanned PDF documents are not read properly, and operations on them fail.
Investigate this issue: is it a PyPDF2 limitation? Is there a workaround?

Possible starting points:

pubpub-zz · 2022-05-07T08:17:13Z

@mvicenzi, can you provide some sample PDFs and test code for analysis?

mvicenzi · 2022-05-07T16:43:36Z

@pubpub-zz you can find the simplest code to reproduce the error here. The sample pdf file is this one, which was generated with a scanner.

pubpub-zz · 2022-05-07T20:04:34Z

I've tried with the dev version and got no errors. Can you retry with the latest version of PyPDF2?

mvicenzi · 2022-05-07T21:46:52Z

I upgraded PyPDF2 from 1.27.3 to 1.27.12, but I still see errors.
I'm on Python 3.10.2.

Traceback (most recent call last):
  File "C:\Users\matte\Desktop\pdf_tools\debugging_scanned.py", line 7, in <module>
    merger.append(path1, bookmark=None, pages=None, import_bookmarks=True)
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\merger.py", line 227, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\merger.py", line 157, in merge
    pages = (0, pdfr.getNumPages())
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 318, in getNumPages
    self._flatten()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 706, in _flatten
    catalog = self.trailer[TK.ROOT].getObject()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic.py", line 553, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic.py", line 198, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 801, in getObject
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

I also tried cloning the repo again to check whether uploading the PDF had somehow fixed the file, but no. Still the same error as above.

pubpub-zz · 2022-05-10T21:37:15Z

Understood: your PDF breaks the rule that the xref table must start at object 0, which raises a PdfReadError when strict mode is enabled. PyPDF2 switched to strict=False as the default, but PdfFileMerger was overlooked. I will push the fix; in the meantime, you can initialize the merger with strict=False:

from PyPDF2 import PdfFileMerger
merger = PdfFileMerger(strict=False)
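
To make the strict / non-strict distinction concrete, here is a toy model of the check that produces the error in the traceback. It is only a sketch and not PyPDF2's actual code: `ToyReadError` and `check_object_id` are hypothetical names, and the assumption that non-strict mode emits a warning and carries on with the ID found in the file is an illustration of the idea, not a statement about the library's internals.

```python
import warnings


class ToyReadError(Exception):
    """Hypothetical stand-in for PyPDF2.errors.PdfReadError."""


def check_object_id(expected, actual, strict):
    """Compare the (id, generation) pair promised by the xref table
    with the pair actually found in the file.

    Assumption for illustration: strict mode raises on a mismatch,
    non-strict mode warns and continues with the ID found in the file.
    """
    if expected == actual:
        return actual
    message = (
        f"Expected object ID ({expected[0]} {expected[1]}) does not match "
        f"actual ({actual[0]} {actual[1]}); xref table not zero-indexed."
    )
    if strict:
        raise ToyReadError(message)
    warnings.warn(message)
    return actual


if __name__ == "__main__":
    # Same mismatch as in the traceback: xref says object 8, file holds object 7.
    print(check_object_id((8, 0), (7, 0), strict=False))
```

With `strict=True` the same call raises instead of returning, which mirrors why the merger failed on the scanned file until it was constructed with `strict=False`.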

mvicenzi added the bug and enhancement labels Apr 25, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue May 10, 2022

BUG : Merger to be initialize with strict=False default

0019e3b

tracked in mvicenzi/pdf_tools#2 as said in title

pubpub-zz mentioned this issue May 10, 2022

BUG : Merger to be initialize with strict=False default py-pdf/pypdf#871

Merged

MartinThoma pushed a commit to py-pdf/pypdf that referenced this issue May 13, 2022

MAINT: Initialize PdfMerger with strict=False by default (#871)

a9c31a4

See mvicenzi/pdf_tools#2

mvicenzi linked a pull request May 25, 2022 that will close this issue

2 find a work around for scanned pdf #4

Merged

mvicenzi closed this as completed in #4 May 25, 2022
