Find a work-around for scanned pdf #2

Closed
mvicenzi opened this issue Apr 25, 2022 · 5 comments · Fixed by #4
Labels
bug Something isn't working enhancement New feature or request

Comments

@mvicenzi
Owner

Scanned PDF documents are not read properly, and operations on them fail.
Investigate this issue: is it a PyPDF2 limitation? Is there a workaround?

Possible starting points:

@mvicenzi mvicenzi added bug Something isn't working enhancement New feature or request labels Apr 25, 2022
@pubpub-zz

pubpub-zz commented May 7, 2022

@mvicenzi, can you provide some sample PDFs and test code for analysis?

@mvicenzi
Owner Author

mvicenzi commented May 7, 2022

@pubpub-zz you can find the simplest code to reproduce the error here. The sample pdf file is this one, which was generated with a scanner.

@pubpub-zz

I've tried with the dev version without any error. Can you retry with the latest version of PyPDF2?

@mvicenzi
Owner Author

mvicenzi commented May 7, 2022

I upgraded PyPDF2 from 1.27.3 to 1.27.12, but I still see errors.
I'm on Python 3.10.2.

Traceback (most recent call last):
  File "C:\Users\matte\Desktop\pdf_tools\debugging_scanned.py", line 7, in <module>
    merger.append(path1, bookmark=None, pages=None, import_bookmarks=True)
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\merger.py", line 227, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\merger.py", line 157, in merge
    pages = (0, pdfr.getNumPages())
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 318, in getNumPages
    self._flatten()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 706, in _flatten
    catalog = self.trailer[TK.ROOT].getObject()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic.py", line 553, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic.py", line 198, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Users\matte\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 801, in getObject
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

I also tried cloning the repo again to check whether uploading the PDF had somehow fixed the file, but no: still the same error as above.
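The error message above says the xref table is not zero-indexed. As a rough way to check a file for this, the sketch below scans the raw bytes for the first classic cross-reference section header and reports its first object number. The helper names (`first_xref_start`, `xref_is_zero_indexed`) are hypothetical and not part of PyPDF2. The check is only a heuristic: it assumes a plain `xref` table rather than a cross-reference stream, it looks only at the first section it finds, and in files with incremental updates later sections may legitimately start at other numbers.

```python
import re

# Header of a classic cross-reference section: the keyword "xref" followed by
# "<first object number> <entry count>". The negative lookbehind skips the
# "startxref" keyword that appears near the end of the file.
_XREF_HEADER = re.compile(rb"(?<!start)xref\s+(\d+)\s+(\d+)")


def first_xref_start(pdf_bytes):
    """Return the first object number of the first xref section, or None.

    Hypothetical helper for illustration; not a PyPDF2 function.
    """
    match = _XREF_HEADER.search(pdf_bytes)
    if match is None:
        # No classic table found (the file may use an xref stream instead).
        return None
    return int(match.group(1))


def xref_is_zero_indexed(pdf_bytes):
    """Return True/False for a classic table, or None if none was found."""
    start = first_xref_start(pdf_bytes)
    return None if start is None else start == 0
```

To try it on a file, read the file in binary mode and pass the bytes, for example `xref_is_zero_indexed(open(path, "rb").read())`, where `path` is whichever PDF you want to inspect.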

@pubpub-zz

I understand now: your PDF does not respect the rule that the xref table should start at 0, which raises a PdfReadError when strict mode is enabled. PyPDF2 moved to strict=False as the default, but PdfFileMerger was overlooked. I will push a fix; in the meantime, you can initialize the merger with strict=False:

merger = PdfFileMerger(strict=False)
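For context, here is a minimal sketch of how that workaround might fit into a small merge script. It assumes the PyPDF2 1.x API used in this thread (`PdfFileMerger` with `append`, `write`, and `close`). The function names `make_lenient_merger` and `merge_pdfs` and the `merger_factory` parameter are hypothetical, added only for illustration.

```python
def make_lenient_merger():
    """Create a merger with strict mode disabled (the workaround above)."""
    # Imported here so the rest of the sketch can be exercised without PyPDF2.
    from PyPDF2 import PdfFileMerger  # PyPDF2 1.x class name

    return PdfFileMerger(strict=False)


def merge_pdfs(paths, output_path, merger_factory=make_lenient_merger):
    """Append each input PDF in order and write the result to output_path.

    merger_factory is a hypothetical hook; by default it builds a
    PdfFileMerger with strict=False.
    """
    merger = merger_factory()
    try:
        for path in paths:
            merger.append(path)
        with open(output_path, "wb") as fh:
            merger.write(fh)
    finally:
        # Release any file handles the merger opened, even on failure.
        merger.close()
```

A call such as `merge_pdfs(["scan1.pdf", "scan2.pdf"], "merged.pdf")` would then use the non-strict merger; the file names here are placeholders.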
