Inverted black and white from optimization #1015

Jmuccigr · 2022-09-18T14:52:18Z

I'm working with a PDF that contains only TIFF images, created with ImageMagick and then assembled into a PDF with img2pdf. Optimization inverts black and white; forcing no optimization leaves the images OK. Seems like the same result as #419.

jbarlow83 · 2022-09-18T18:47:51Z

Check that you have the latest pikepdf. 5.6.1 introduced a possible fix to some black/white inversion issues.

Jmuccigr · 2022-09-19T10:28:40Z

I've got 6.0.2.

Jmuccigr · 2022-09-25T11:13:39Z

Any thoughts?

jbarlow83 · 2022-09-28T05:46:58Z

Thoughts:

It's hard to get monochrome right because there are various options to invert that are not always respected by all programs.
Because of the above, it's hard to investigate without a PDF.
You could use qpdf's new --json features as a way of showing me the structure of the PDF without the content.
Using a heuristic is really tempting.
I don't know when I'll have bandwidth.
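To illustrate the "various options to invert" point, here is a minimal sketch of how two of those options interact for a 1-bit DeviceGray image compressed with CCITTFaxDecode, as I read the PDF specification: the /BlackIs1 flag in /DecodeParms and the image's /Decode array. The function name `black_run_renders_black` is hypothetical and is not part of OCRmyPDF or pikepdf. The model is deliberately simplified and ignores image masks and other color spaces.

```python
from typing import Sequence


def black_run_renders_black(
    black_is_1: bool = False,
    decode: Sequence[float] = (0.0, 1.0),
) -> bool:
    """Simplified model: does a 'black' run in the fax data display as black?

    Assumes a 1-bit DeviceGray image using CCITTFaxDecode (hypothetical helper).
    """
    # With /BlackIs1 false (the default), the filter emits 0 for black pixels;
    # with /BlackIs1 true, it emits 1 for black pixels.
    sample = 1 if black_is_1 else 0
    # /Decode maps sample 0 to decode[0] and sample 1 to decode[1].
    gray = decode[sample]
    # In DeviceGray, 0.0 is black and 1.0 is white.
    return gray < 0.5


if __name__ == "__main__":
    for flag in (False, True):
        for dec in ((0.0, 1.0), (1.0, 0.0)):
            print(flag, dec, black_run_renders_black(flag, dec))
```

Under this model the two settings cancel each other out when both are applied, so a program that rewrites the image but drops or ignores just one of them would flip black and white. Whether that is what happens in this issue is not established here.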

alirf81 · 2022-10-24T10:07:14Z

Any updates on this issue? I have a similar problem, and my pikepdf version is 6.2.1.

jbarlow83 · 2022-10-24T10:44:15Z

@alirf81 If you'd like to move things along faster, please submit a reproducible example PDF and command line.

poldy8 · 2022-11-13T15:08:01Z

Hi there. Thank you so much for working on and maintaining this project.

I have been experiencing a similar issue: when I try to optimize a particular PDF (without performing OCR) and have it converted into a regular PDF (rather than PDF/A), the resulting PDF also has black and white inverted. I have tried it on two PDFs (of scanned books) so far, and it keeps happening to one of them, which has a bit of a black margin on every other page (I don't know if that's relevant). I use the following command:

ocrmypdf --output-type pdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf

If I do it without --output-type pdf everything seems fine.

I am running macOS 12.6.1, and OCRmyPDF 14.0.1; and just homebrew updated/upgraded everything. As you probably can tell I'm not a superuser, so I don't know how to get the structure of pdfs, etc.

If I'm not using the best command to optimize an already OCRed PDF and save it as a regular PDF, I'd appreciate your help on that as well.

Is there a way to quickly verify whether a PDF is regular or PDF/A on macOS, without using, say, Adobe Acrobat?

Many thanks!
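On the question of checking for PDF/A without Acrobat: PDF/A files declare their conformance in XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. Below is a rough sketch, using only the Python standard library, that scans the raw bytes of a file for that declaration. The function name `claimed_pdfa_level` is hypothetical. It assumes the metadata stream is stored uncompressed, and it only reports what the file claims; it does not validate conformance.

```python
import re
import sys
from typing import Optional

# Matches both the element form (<pdfaid:part>2</pdfaid:part>)
# and the attribute form (pdfaid:part="2").
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def claimed_pdfa_level(data: bytes) -> Optional[str]:
    """Return the PDF/A level the file claims (e.g. '2b'), or None if no claim is found."""
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conf = _CONF.search(data)
    if conf is not None:
        level += conf.group(1).decode("ascii").lower()
    return level


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        print(claimed_pdfa_level(fh.read()) or "no PDF/A claim found")
```

Saved as, say, `check_pdfa.py` (a hypothetical filename), it could be run as `python3 check_pdfa.py output.pdf`. A result of "no PDF/A claim found" suggests a regular PDF, though a compressed metadata stream would also produce it.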

Jmuccigr · 2022-11-13T16:49:16Z

Hmm, if I use pdfimages to extract the image from my PDF, it produces a ccitt/params pair which, when I run fax2tiff on it, produces the same kind of inverted image. If I tell pdfimages to output a PNG, the image has the expected colors.

vejkse · 2023-01-19T14:10:13Z

[I had to delete and repost this comment because I made a mistake and uploaded the wrong files. Sorry…]

Here is an example, with everything that led to its creation. It’s a blank page, but all the pages with text from the same original file, created using the same process, got inverted in the end.

The original PDF file was A.pdf, but when OCRing it (i.e. the other pages with text in them), the result had spaces between almost all the letters, so I decided to extract the images, rebuild a PDF file, and re-OCR the result:
pdfimages -tiff A.pdf B
img2pdf --output C.pdf B-000.tif
ocrmypdf --language eng --output-type pdf C.pdf D.pdf
The resulting file D.pdf is now correctly OCRed, without spaces between the letters, but white-on-black rather than black-on-white.

Here are all the files, except B-000.tif since GitHub doesn’t allow me to upload it.
A.pdf
C.pdf
D.pdf

Versions:

OCRmyPDF 14.0.1
python-pikepdf 6.2.6
ghostscript 9.56.1
img2pdf 0.4.4
poppler 22.12.0 (for pdfimages)
