Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to handle decompression error on some broken PDF files #637

Merged
merged 8 commits into from
Dec 11, 2021

Conversation

sthenault
Copy link

@sthenault sthenault commented Jul 1, 2021

from times to times we go through files where no text is detected, while readers
like evince reads the pdf nicely. After digging it occured this is because the
PDF includes some badly compressed data. This may be fixed by uncompressing byte
per byte and ignoring the error on the last check bytes (arbitrarily found to be
the 3 last).

This has been largely inspired by py-pdf/pypdf#422
and the test file has been taken from there, so credits to @zegrep.

Fix #636

Checklist

  • [* ] I have added tests that prove my fix is effective or that my feature
    works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I am verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a consice human-readable description of the change to
    CHANGELOG.md

from times to times we go through files where no text is detected, while readers
like evince reads the pdf nicely. After digging it occured this is because the
PDF includes some badly compressed data. This may be fixed by uncompressing byte
per byte and ignoring the error on the last check bytes (arbitrarily found to be
the 3 last).

This has been largely inspired by py-pdf/pypdf#422
and the test file has been taken from there, so credits to @zegrep.
Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Like the test case.

Suggested some small changes to improve code quality.

Comment on lines 201 to 202
if i < len(data) - 3:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 3 seems to be cherry-picked specifically for the PDF in case. Maybe just always ignore the error if not in settings.STRICTmode?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it has been empirically found using this file and another one :) The idea is that the error fixed here comes from some CRC check at the end of the data, which has been found to be in the last 4 bytes. This could probably be confirmed by the zlib spec. https://docs.python.org/3/library/zlib.html#zlib.adler32 is about int32 checksum, which seems consistent.

If we just plain raise the error in STRICT mode, the corrupted file won't be readable in this mode as before this patch. This may be fine but could be even done before calling decompress_corrupted

Will a some comment about this in the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is in settings.STRICT mode it should raise the error asap and not try the decompress_corrupted() at all. That's how things are right now, so no changes needed there.

Since decompress_corrupted only deals with the non strict mode, I think it's fair to just extract as much as possible and not raise the error ever.

But I can also see your argument that when it raises an error before the last 4 bytes there is data missing. And we should make the user aware of that. Maybe raise a warning in that case? See pdfpage.py:140 for an example on how to do that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is in settings.STRICT mode it should raise the error asap and not try the decompress_corrupted() at all.

It is still trying decompress_corrupted() in strict mode. That should not happen.

pdfminer/pdftypes.py Outdated Show resolved Hide resolved
pdfminer/pdftypes.py Show resolved Hide resolved
from times to times we go through files where no text is detected, while readers
like evince reads the pdf nicely. After digging it occured this is because the
PDF includes some badly compressed data. This may be fixed by uncompressing byte
per byte and ignoring the error on the last check bytes (arbitrarily found to be
the 3 last).

This has been largely inspired by py-pdf/pypdf#422
and the test file has been taken from there, so credits to @zegrep.
where zlib error is detected before the CRC checksum.
@pietermarsman pietermarsman merged commit 10f6fb4 into pdfminer:develop Dec 11, 2021
Beants added a commit to HiTalentAlgorithms/pdfminer.six that referenced this pull request Feb 14, 2022
* develop:
  Check blackness in github actions (pdfminer#711)
  Changed `log.info` to  `log.debug` in six files (pdfminer#690)
  Update README.md batch for Continuous integration
  Update actions.yml so that it will run for all PR's
  Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
  Added feature: page labels (pdfminer#680)
  Remove obsolete returns (pdfminer#707)
  Revert "Remove obsolete returns"
  Remove obsolete returns
  Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
  Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
  Change log.info into log.debug to make pdfinterp.py less verbose
  Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
  export type annotations in package (pdfminer#679)
  fix typos in PR template (pdfminer#681)
  pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
  Fixes jbig2 writer to write valid jb2 files
  Add support for JPEG2000 image encoding
  Added test case for CCITTFaxDecoder (pdfminer#700)
  Attempt to handle decompression error on some broken PDF files (pdfminer#637)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No content detected on some pdf files
2 participants