Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] Enable fallback in case of exceptions #684

Merged
merged 12 commits into from
Feb 1, 2022

Conversation

tongbaojia
Copy link
Contributor

@tongbaojia tongbaojia commented Oct 18, 2021

Pull request

  • We were investigating the use of the fallback kwarg in PDFDocument, and discovered a pass statement. We think when the exception is caught, the fallback should be changed to True, as commented out.
  • The commit that changed this behavior was: 846cd18#diff-19c70cd6ee74a06a40b8124e99c5da47009d4b7d3d08f588290052c7538. Without further context, the pass feels like a missed comment out to us.
  • One line change in the exception of PDFNoValidXRef
  • Doesn't have an issue yet.

How Has This Been Tested?

This change seems to be valid for the document we are processing.

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I am verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a consice human-readable description of the change to
    CHANGELOG.md

@tongbaojia tongbaojia changed the title [Fix fallback options] [Fix] Enable fallback in case of exceptions Oct 18, 2021
@0xabu
Copy link
Contributor

0xabu commented Oct 21, 2021

This code has been unchanged since 2014...

Do you have a sample PDF file that demonstrates an issue with the current logic?
Do you get an exception or other concrete failure?

@tongbaojia
Copy link
Contributor Author

This code has been unchanged since 2014...

Do you have a sample PDF file that demonstrates an issue with the current logic? Do you get an exception or other concrete failure?

Hi,

I agree this has not been changed for a long time. It is a little bit worrying to update the working code.

We discussed the code a bit more, please see the updated PR for the new proposed change.

Currently, the behavior is not affected if PDFNoValidXRef is raised.
After the change:

  • if fallback is False, the behavior is the same as before, regardless of if PDFNoValidXRef is raised.
  • if fallback is True
    • if PDFNoValidXRef is raised, same behavior as before
    • if PDFNoValidXRef is NOT raised, we won't trigger the downstream fallback logic. We think this is the right way, as the code before the commit mentioned above had fallback = True only when the PDFNoValidXRef is raised. We tested on ~ 1000 doc and didn't observe any difference in any cases, whether the fallback option is needed or not.

This will increase efficiency to load PDFDocument for cases where fallback is not needed.

@pietermarsman
Copy link
Member

I agree on many things. For one, this could be a major performance gain if only parts of the PDF are read, or if no caching is used. The PDFXRefFallback requires reading the entire PDF file and indexing all the objects.

I also agree that it is scary to change this since it has been the same for so long.

The old implementation is very safe, it always indexes all objects (if fallback=True which is always used as a default). But this has a big performance cost. The new implementation fixes this performance cost, and only indexes all objects if there is no xref specified in the PDF. This is less safe, for example if the PDF has one or more xrefs but they do not index all objects. Such a PDF is obviously broken, but not unimaginable. Some proof that this rarely happens is provided by the ~1000 PDF's that @tongbaojia used to test the new implementation.

Rather than trusting on the integrity of PDF's, I would prefer an implementation that gets best of both worlds: not index all objects by default, but does allow to get the position of all objects if needed.

A potential way to do this is to load the PDFXRefFallback lazily, only when an object id is not found in the regular PDFXRefs. E.g. by changing its get_pos() method to also call self.load(parser) if that has not happened before.

@pietermarsman
Copy link
Member

I've did some more thinking on this and think the current fix is good to go.

In the unlikely event of a broken PDF with an xref that does not list all objects that are internally referenced, we can create another PR to create the lazy fallback option. For now, lets assume that if the xref is there, it also lists all the objects that are referenced in the PDF. If the xref is not there, we will use the fallback.

@pietermarsman pietermarsman merged commit 4b138a6 into pdfminer:develop Feb 1, 2022
@pietermarsman
Copy link
Member

@tongbaojia Thanks!

@tongbaojia
Copy link
Contributor Author

@pietermarsman Thanks for accepting it!

Indeed there is a very significant gain in creating PDFDocument Objects with the update (I think when we measured it the speed increased almost by a factor of 10).

Feel free to tag me in case this update raises errors in the future.

Beants added a commit to HiTalentAlgorithms/pdfminer.six that referenced this pull request Feb 14, 2022
* develop:
  Check blackness in github actions (pdfminer#711)
  Changed `log.info` to  `log.debug` in six files (pdfminer#690)
  Update README.md batch for Continuous integration
  Update actions.yml so that it will run for all PR's
  Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
  Added feature: page labels (pdfminer#680)
  Remove obsolete returns (pdfminer#707)
  Revert "Remove obsolete returns"
  Remove obsolete returns
  Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
  Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
  Change log.info into log.debug to make pdfinterp.py less verbose
  Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
  export type annotations in package (pdfminer#679)
  fix typos in PR template (pdfminer#681)
  pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
  Fixes jbig2 writer to write valid jb2 files
  Add support for JPEG2000 image encoding
  Added test case for CCITTFaxDecoder (pdfminer#700)
  Attempt to handle decompression error on some broken PDF files (pdfminer#637)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants