Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for Paeth PNG filter compression (predictor value = 4) #537

Merged
merged 10 commits into from
Aug 26, 2021

Conversation

edugonza
Copy link

@edugonza edugonza commented Oct 27, 2020

Pull request

Fixes #339 Added support for Paeth PNG compression filter (predictor value = 4). Fixed bugs in other filters.

How Has This Been Tested?

I have tested this with a private PDF file that I am not allowed to share.

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I am verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a consice human-readable description of the change to
    CHANGELOG.md

@cslotboom
Copy link

Kind of random, but I was having issues with reading certain pdfs. Implementing your code in this patch fixed my problem completely and, as far as I can tell, I've had no errors since including it. Looking forward for this being integrated into the base package!

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @edugonza,

Thanks for opening this PR and taking the time to improve pdfminer.six.

The code of the PR looks good. I did not check if the actual implementation of the type 4 predictor matches the libpng specification. But since you added the link to the spec yourself I think it's ok.

One request: can add a test case with a pdf with this type of path predictor? I know yours is proprietary but perhaps you can find an example on the internet or even generate one.

@pietermarsman
Copy link
Member

@cslotboom is it possible to share your pdf with a type 4 png path predictor?

@edugonza
Copy link
Author

Hi @pietermarsman,

Unfortunately, I was not able to create a PDF with a PNG encoded with this type of predictor. I do not have any open/free document to use. If you know or have documentation on how to create such a document, I could try again.

Best regards
Edu

@cslotboom
Copy link

I'm not 100% I have an example that uses this type of predictor explicitly, I just know that this code has fixed my errors.

I'm using pdfminer to read lines in PDF drawings. From what I've found, any PDF with lines in it has a risk of not working without this patch. I suspect that the scale/rounding of the pdf had something to do with it, as it seemed random which pdfs worked and which didn't.
I can do my best to dig something up this weekend.

@fgregg
Copy link

fgregg commented Jul 29, 2021

this also fixed issues for me. thanks @edugonza!

@cslotboom
Copy link

For reference, this is roughly the type of pdf that would fail without this patch. Can not say for certain if it has a 4 png path predictor or not.
TestFloor.pdf

@pietermarsman
Copy link
Member

@cslotboom The *TestFloor.pdf` is also working using the development branch of pdfminer.six.

The *2018.pdf` shared in the linked issue (I attached it here again) does indeed show that it fixes the issue.

2018.pdf

@pietermarsman
Copy link
Member

@edugonza Thanks for your work!

Do you have some references for the changes to predictor values of 1 and 3? This changes current behavior and I'm always very careful with that, but it also looks like it is an improvement. Anyway, I would like a comment that explains what the algorithm is based on because (I think) it is not part of the PNG specification you already linked. Let me know if you have something.

@edugonza
Copy link
Author

Hi @pietermarsman ,

If I remember well, the PNG specification in the link describes how to apply the filter to raw data. However, the piece of code at hand performs the inverse operation, to decode data from the filtered version to raw. That is why the code does not seem to apply the filter as it is, but undoes what the specification describes.

I hope that explains the differences.
KR
Edu

for j, r in enumerate(line1):
left = int(line2[j-bpp]) if j-bpp >= 0 else 0
up = int(line0[j])
c = ((r + math.floor(left + up)) // 2) & 255
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line must be:

c = (r + math.floor(left + up) // 2) & 255

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be wrong in the current implementation as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it in the latest commit

…nd add pieces of the docs to show what is going on.
@pietermarsman
Copy link
Member

@edugonza I went over all the filter types, refactored the names for better understanding and now I'm pretty confident that everything is according to the docs.

I want to check one last time before I merge. If you have time, can you validate/check/compare the current implementation against the docs as well?

I would like to get this right in one go :)

@pietermarsman pietermarsman merged commit ea00f56 into pdfminer:develop Aug 26, 2021
@pietermarsman
Copy link
Member

@edugonza thanks for all the work!

@edugonza
Copy link
Author

Hi @pietermarsman,

Sorry for my late reply. Thank you for all the fixes and corrections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unsupported predictor value
4 participants