Inverted black and white from optimization #1015

Open
Jmuccigr opened this issue Sep 18, 2022 · 9 comments

Comments

@Jmuccigr
Contributor

I'm working with a PDF that contains only TIFF images, created with ImageMagick and then assembled into a PDF with img2pdf. After optimization the images come out with black and white inverted; forcing no optimization leaves the images OK. This looks like the same result as #419.

@jbarlow83
Collaborator

Check that you have the latest pikepdf. 5.6.1 introduced a possible fix to some black/white inversion issues.

@Jmuccigr
Contributor Author

I've got 6.0.2.

@Jmuccigr
Contributor Author

Any thoughts?

@jbarlow83
Collaborator

Thoughts

  • it's hard to get monochrome right because there are various options to invert that are not always respected by all programs
  • because of the above, it's hard to investigate without a PDF
  • you could use qpdf's new --json features as a way of showing me the structure of the PDF without the content
  • using a heuristic is really tempting
  • I don't know when I'll have bandwidth
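
For anyone following along, here is a minimal sketch of why monochrome is easy to get wrong. As I read the PDF specification, a CCITT image has two independent switches that can each flip black and white: the `BlackIs1` entry in the filter parameters and the image's `Decode` array. The function names below are hypothetical and this is not OCRmyPDF's or pikepdf's code, only an illustration of how the two settings combine for a 1-bit DeviceGray image.

```python
def displayed_gray(fax_pixel_is_black, black_is_1=False, decode=(0.0, 1.0)):
    """Gray level (0.0 = black, 1.0 = white) shown for one CCITT pixel.

    Assumes a 1-bit DeviceGray image; names are illustrative only.
    """
    # Step 1: the CCITT filter emits a sample bit. With BlackIs1 false
    # (the default), black pixels become 0; with BlackIs1 true, they become 1.
    if black_is_1:
        sample = 1 if fax_pixel_is_black else 0
    else:
        sample = 0 if fax_pixel_is_black else 1

    # Step 2: the Decode array maps the 1-bit sample linearly onto gray.
    d_min, d_max = decode
    return d_min + sample * (d_max - d_min)


def looks_inverted(black_is_1=False, decode=(0.0, 1.0)):
    """True if a black fax pixel would be painted white."""
    return displayed_gray(True, black_is_1, decode) == 1.0


if __name__ == "__main__":
    for black_is_1 in (False, True):
        for decode in ((0.0, 1.0), (1.0, 0.0)):
            print(black_is_1, decode, looks_inverted(black_is_1, decode))
```

A program that honors only one of the two switches, or drops one while rewriting the image, ends up with the opposite polarity, which would be consistent with the symptom reported here.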

@alirf81

alirf81 commented Oct 24, 2022

Any updates on this issue? I have similar problems and the version of pikepdf is 6.2.1

@jbarlow83
Collaborator

@alirf81 If you'd like to move things along faster, please submit a reproducible example PDF and command line.

@poldy8

poldy8 commented Nov 13, 2022

Hi there. Thank you so much for working on and maintaining this project.

I have been experiencing a similar issue: when I try to optimize a particular PDF (without performing OCR) and have it converted into a regular PDF (rather than PDF/A), the resulting PDF also has black and white inverted. I have tried it on two PDFs (of scanned books) so far, and it keeps happening to one of them, which has a little bit of a black margin on every other page (I don't know if that's relevant). I use the following command:

ocrmypdf --output-type pdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf

If I do it without --output-type pdf everything seems fine.

I am running macOS 12.6.1 and OCRmyPDF 14.0.1, and I just updated/upgraded everything with Homebrew. As you can probably tell, I'm not a superuser, so I don't know how to get the structure of PDFs, etc.

If I'm not using the best command to optimize an already OCRed PDF and have it saved as a regular PDF, I'd appreciate your help on that as well.

Is there a way to quickly verify whether a pdf is regular or pdf/a on macos, without using, say, Adobe Acrobat?

Many thanks!
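
On the question above about checking whether a file is a regular PDF or PDF/A without Acrobat: a rough sketch, assuming the file's XMP metadata is stored uncompressed and in a single-byte encoding (my understanding is that PDF/A expects an unfiltered metadata stream). It only looks for the `pdfaid:part` marker that PDF/A files use to declare themselves, so it reports what the file claims and does not validate conformance. The function name `claimed_pdfa_part` is hypothetical.

```python
import re
import sys
from pathlib import Path

# Matches both XMP spellings: <pdfaid:part>1</pdfaid:part> and pdfaid:part="1".
_PDFA_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")


def claimed_pdfa_part(path):
    """Return the PDF/A part number the file declares, or None if none is found."""
    data = Path(path).read_bytes()
    match = _PDFA_PART.search(data)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for name in sys.argv[1:]:
        part = claimed_pdfa_part(name)
        if part is None:
            print(f"{name}: no PDF/A marker found (probably a regular PDF)")
        else:
            print(f"{name}: declares PDF/A-{part}")
```

Saved as, say, `check_pdfa.py`, it can be run from Terminal as `python3 check_pdfa.py output.pdf`. A dedicated validator is still needed to confirm real conformance.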

@Jmuccigr
Contributor Author

Hmm, if I use pdfimages to extract the image from my PDF, it produces a ccitt/params pair which, when I run fax2tiff on it, produces the same kind of inverted image. If I tell pdfimages to output a PNG instead, the image has the expected colors.

@vejkse

vejkse commented Jan 19, 2023

[I had to delete and repost this comment because I made a mistake and uploaded the wrong files. Sorry…]

Here is an example, with everything that led to its creation. It’s a blank page, but all the pages with text from the same original file, created using the same process, got inverted in the end.

  1. The original PDF file was A.pdf, but when OCRing it (i.e. the other pages with text in them), the result had spaces between almost every letter, so I decided to extract the images, rebuild a PDF file, and re-OCR the result.
  2. pdfimages -tiff A.pdf B
  3. img2pdf --output C.pdf B-000.tif
  4. ocrmypdf --language eng --output-type pdf C.pdf D.pdf — The resulting file D.pdf is now correctly OCRed, without spaces between the letters, but white-on-black rather than black-on-white.

Here are all the files, except B-000.tif since GitHub doesn’t allow me to upload it.
A.pdf
C.pdf
D.pdf

Versions:

  • OCRmyPDF 14.0.1
  • python-pikepdf 6.2.6
  • ghostscript 9.56.1
  • img2pdf 0.4.4
  • poppler 22.12.0 (for pdfimages)

5 participants