[Fix] Enable fallback in case of exceptions #684

tongbaojia · 2021-10-18T15:04:44Z

Pull request

We were investigating the use of the fallback kwarg in PDFDocument and discovered a pass statement in the except clause. We think that when the exception is caught, fallback should be set to True, as in the commented-out code.
The commit that changed this behavior was: 846cd18#diff-19c70cd6ee74a06a40b8124e99c5da47009d4b7d3d08f588290052c7538. Without further context, the pass looks to us like a leftover from a line that was commented out by mistake.
This is a one-line change in the except clause for PDFNoValidXRef.
There is no issue for this yet.

How Has This Been Tested?

This change seems to be valid for the document we are processing.

Checklist

I have added tests that prove my fix is effective or that my feature works
I have added docstrings to newly created methods and classes
I have optimized the code at least one time after creating the initial version
I have updated the README.md or I verified that this is not necessary
I have updated the readthedocs documentation or I verified that this is not necessary
I have added a concise human-readable description of the change to CHANGELOG.md

0xabu · 2021-10-21T17:13:37Z

This code has been unchanged since 2014...

Do you have a sample PDF file that demonstrates an issue with the current logic?
Do you get an exception or other concrete failure?

tongbaojia · 2021-10-22T14:04:17Z

This code has been unchanged since 2014...

Do you have a sample PDF file that demonstrates an issue with the current logic? Do you get an exception or other concrete failure?

Hi,

I agree this has not been changed for a long time, and it is a little worrying to change working code.

We discussed the code a bit more; please see the updated PR for the newly proposed change.

Currently, whether PDFNoValidXRef is raised has no effect on the behavior.
After the change:

if fallback is False, the behavior is the same as before, regardless of whether PDFNoValidXRef is raised.
if fallback is True
- if PDFNoValidXRef is raised, the behavior is the same as before
- if PDFNoValidXRef is NOT raised, we no longer trigger the downstream fallback logic. We think this is correct, because the code before the commit mentioned above set fallback = True only when PDFNoValidXRef was raised. We tested on ~1000 documents and observed no difference in any case, whether or not the fallback was needed.

This makes loading a PDFDocument more efficient in cases where the fallback is not needed.
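To make the cases above concrete, here is a minimal, self-contained sketch of the two control flows. It does not use pdfminer itself: `FakeParser`, `read_xref`, `scan_all_objects`, `load_xrefs_old` and `load_xrefs_new` are hypothetical stand-ins for illustration, and only the name `PDFNoValidXRef` is taken from the discussion.

```python
class PDFNoValidXRef(Exception):
    """Stand-in for the pdfminer exception of the same name."""


class FakeParser:
    """Hypothetical parser stub that counts expensive full-file scans."""

    def __init__(self, has_valid_xref):
        self.has_valid_xref = has_valid_xref
        self.fallback_scans = 0

    def read_xref(self):
        # Models reading the regular xref; fails if the PDF has none.
        if not self.has_valid_xref:
            raise PDFNoValidXRef("no valid xref found")
        return "regular-xref"

    def scan_all_objects(self):
        # Models the fallback: read the whole file and index every object.
        self.fallback_scans += 1
        return "fallback-xref"


def load_xrefs_old(parser, fallback=True):
    """Old flow: the exception is swallowed, the fallback runs whenever enabled."""
    xrefs = []
    try:
        xrefs.append(parser.read_xref())
    except PDFNoValidXRef:
        pass
    if fallback:
        xrefs.append(parser.scan_all_objects())
    return xrefs


def load_xrefs_new(parser, fallback=True):
    """Proposed flow: the fallback runs only if the regular xref is missing."""
    xrefs = []
    try:
        xrefs.append(parser.read_xref())
    except PDFNoValidXRef:
        if fallback:
            xrefs.append(parser.scan_all_objects())
    return xrefs
```

In this sketch, a parser with a valid xref triggers one full scan under the old flow and none under the new one, while a parser without a valid xref triggers exactly one scan in both.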

pietermarsman · 2022-01-30T16:03:34Z

I agree on many points. For one, this could be a major performance gain if only parts of the PDF are read, or if no caching is used. PDFXRefFallback requires reading the entire PDF file and indexing all the objects.

I also agree that it is scary to change this, since it has been the same for so long.

The old implementation is very safe: it always indexes all objects (if fallback=True, which is the default). But this has a big performance cost. The new implementation removes this cost and indexes all objects only if no xref is specified in the PDF. This is less safe, for example if the PDF has one or more xrefs that do not index all objects. Such a PDF is obviously broken, but not unimaginable. The ~1000 PDFs that @tongbaojia used to test the new implementation provide some evidence that this rarely happens.

Rather than trusting the integrity of PDFs, I would prefer an implementation that gets the best of both worlds: it does not index all objects by default, but can still find the position of any object if needed.

A potential way to do this is to load the PDFXRefFallback lazily, only when an object id is not found in the regular PDFXRefs, e.g. by changing its get_pos() method to also call self.load(parser) if that has not happened before.
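A rough sketch of that lazy idea, under stated assumptions: `DictXRef`, `LazyFallbackXRef` and `find_object` are hypothetical names, positions are plain integers, and a missing object id is signalled with `KeyError`. The real pdfminer classes have richer interfaces, so this only illustrates the deferral, not an actual patch.

```python
class DictXRef:
    """Hypothetical regular xref backed by a dict of {objid: position}."""

    def __init__(self, offsets):
        self.offsets = offsets

    def get_pos(self, objid):
        return self.offsets[objid]  # raises KeyError if not indexed


class LazyFallbackXRef:
    """Hypothetical fallback xref that scans the file only on first lookup."""

    def __init__(self, scan):
        self._scan = scan     # callable returning {objid: position}
        self._offsets = None  # None means the full scan has not run yet
        self.loads = 0        # counts how often the expensive scan ran

    def get_pos(self, objid):
        if self._offsets is None:
            self._offsets = self._scan()
            self.loads += 1
        return self._offsets[objid]


def find_object(xrefs, objid):
    """Try each xref in order; later entries are consulted only on a miss."""
    for xref in xrefs:
        try:
            return xref.get_pos(objid)
        except KeyError:
            continue
    raise KeyError(objid)


# Usage: the regular xref misses object 2, so the fallback is loaded once.
regular = DictXRef({1: 10})
lazy = LazyFallbackXRef(lambda: {1: 10, 2: 55})
xrefs = [regular, lazy]
```

With the fallback placed last in the list, PDFs whose xref is complete never pay for the full scan, and broken PDFs pay for it once, on the first object the regular xref does not list.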

pietermarsman · 2022-02-01T00:11:59Z

I did some more thinking on this and think the current fix is good to go.

In the unlikely event of a broken PDF whose xref does not list all the objects that are internally referenced, we can create another PR to add the lazy fallback option. For now, let's assume that if the xref is there, it also lists all the objects that are referenced in the PDF. If the xref is not there, we will use the fallback.

pietermarsman · 2022-02-01T00:21:07Z

@tongbaojia Thanks!

tongbaojia · 2022-02-01T00:44:05Z

@pietermarsman Thanks for accepting it!

Indeed, there is a very significant gain in creating PDFDocument objects with the update (I think when we measured it, the speed increased by almost a factor of 10).

Feel free to tag me in case this update raises errors in the future.

* develop:
- Check blackness in github actions (pdfminer#711)
- Changed `log.info` to `log.debug` in six files (pdfminer#690)
- Update README.md batch for Continuous integration
- Update actions.yml so that it will run for all PR's
- Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
- Added feature: page labels (pdfminer#680)
- Remove obsolete returns (pdfminer#707)
- Revert "Remove obsolete returns"
- Remove obsolete returns
- Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
- Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
- Change log.info into log.debug to make pdfinterp.py less verbose
- Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
- export type annotations in package (pdfminer#679)
- fix typos in PR template (pdfminer#681)
- pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
- Fixes jbig2 writer to write valid jb2 files
- Add support for JPEG2000 image encoding
- Added test case for CCITTFaxDecoder (pdfminer#700)
- Attempt to handle decompression error on some broken PDF files (pdfminer#637)

tongbaojia and others added 7 commits June 8, 2020 17:58

check obj type

226009d

update changelog

4b36d30

Update CHANGELOG.md

3913e65

merge

1e8538f

Merge remote-tracking branch 'upstream/develop' into develop

7ee0116

Merge remote-tracking branch 'upstream/develop' into develop

830161b

add changes

1a4d4c2

tongbaojia changed the title ~~[Fix fallback options]~~ [Fix] Enable fallback in case of exceptions Oct 18, 2021

update change

ab2cbef

update changelog

247ba9e

pietermarsman added 3 commits February 1, 2022 01:15

Use fallback in except clause

4f01c50

Update changelog.md

6adbc1d

Merge branch 'develop' into TT_fix_fallback

3a86ee6

pietermarsman merged commit 4b138a6 into pdfminer:develop Feb 1, 2022

pietermarsman mentioned this pull request Mar 20, 2022

Exception when parsing PDF 'portfolios' #403

Closed

pietermarsman mentioned this pull request Jun 25, 2022

Unable to read files with new version, which worked previously #761

Closed
