Enable image compression #1546

finevine · 2023-01-12T14:05:55Z

Explanation

I want to replace the images in a PDF with compressed ones.
Getting the images and saving them to disk works like a charm with the example in the docs,
but I cannot change them in the PDF.

Code Example

How would your feature be used?

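from pypdf import PdfReader, PdfWriter
reader = PdfReader(input_file_path, strict=False)
writer = PdfWriter()
for page in reader.pages:
    # Proposed API: assign the compressed images back to the page.
    # compress() is a placeholder for a user-supplied function.
    page.images = [compress(image_file_object) for image_file_object in page.images]
    writer.add_page(page)
...  # your new feature in action!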

I have found some code that attempts to implement this feature:

but it doesn't work as expected.

MartinThoma · 2023-01-12T21:56:10Z

@pubpub-zz Running the following code compresses all content streams with DEFLATE, including the images, right?

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)

with open("out.pdf", "wb") as f:
    writer.write(f)

MartinThoma · 2023-01-13T08:07:44Z

There are also quite a lot of image compression algorithms

finevine · 2023-01-13T09:40:54Z

> @pubpub-zz Running the following code compresses all content streams with DEFLATE, including the images, right?
> [compress_content_streams() example from MartinThoma's comment above]

Yes, it does, but as mentioned in the docs, it does not work on all PDF files (at least it does not increase the file size the way GS does).


+---------------+-------------------------------------------+------------+-----------+---------+---------+-----------+---------+--------------+-----------+------------------------------+
| Original (MB) | Type                                      | pypdf (MB) | Reduction | Quality | gs (MB) | Reduction | Quality | pikepdf (MB) | Reduction | Quality                      |
+---------------+-------------------------------------------+------------+-----------+---------+---------+-----------+---------+--------------+-----------+------------------------------+
|           108 | Images and OCR                            |        8.6 |    92.04% | good    |   137.3 |   -27.13% | good    |          163 |   -50.93% | good                         |
|            20 | Traditional text and images               |         20 |     0.00% | good    |    12.3 |    38.50% | good    |           11 |    45.00% | good for some images         |
|            44 | Repeated huge photos in a Word doc to PDF |         44 |     0.00% | good    |     4.4 |    90.00% | good    |           21 |    52.27% | very poor on repeated images |
|          56.6 | 400 pages of a book converted to images   |       56.6 |     0.00% | good    |    69.2 |   -22.26% | good    |         53.5 |     5.48% | good                         |
|           5.8 | 96 pages of text and images               |        3.6 |    37.93% | good    |       3 |    48.28% | good    |          2.4 |    58.62% | OK except for alpha layer    |
+---------------+-------------------------------------------+------------+-----------+---------+---------+-----------+---------+--------------+-----------+------------------------------+

The pypdf column uses the method you quoted.
The gs column uses https://github.com/theeko74/pdfc (a call to Ghostscript).
The pikepdf column uses more or less https://github.com/theeko74/pdfc (resizing big images with Pillow: image.resize((width/2, height/2), Image.BILINEAR)).
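
For reference, the ratio columns in the table are consistent with reduction = (original - compressed) / original, so a negative value means the tool made the file larger. A small sketch of that arithmetic follows; the function name is made up for illustration.

```python
def reduction_percent(original_mb: float, compressed_mb: float) -> float:
    """Return the size reduction as a percentage of the original size.

    Negative values mean the output is larger than the input.
    """
    if original_mb <= 0:
        raise ValueError("original size must be positive")
    return (original_mb - compressed_mb) / original_mb * 100


# First row of the table: a 108 MB file
print(f"{reduction_percent(108, 8.6):.2f}%")    # pypdf: 92.04%
print(f"{reduction_percent(108, 137.3):.2f}%")  # gs: -27.13%
```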

pubpub-zz · 2023-01-13T21:00:53Z

> @pubpub-zz Running the following code compresses all content streams with DEFLATE, including the images, right?
> [compress_content_streams() example from MartinThoma's comment above]

From a quick look at the code, it seems that only the page content streams are deflated, not the image XObjects, which is where big images are stored.

@finevine,
I dislike the idea of implementing lossy compression: there would be too many options. However, it might be possible to implement a visitor function to replace the images.
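
One reason a lossless pass alone tends not to shrink image-heavy files: DEFLATE gains a lot on text-like page content but almost nothing on data that is already compressed, as JPEG image streams usually are. Below is a minimal sketch with the standard-library zlib module. The sample data is an assumption made for illustration: pseudo-random bytes stand in for an already-compressed image stream.

```python
import random
import zlib


def deflate_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level=9)) / len(data)


# Repetitive drawing operators, similar in spirit to a page content stream
content_like = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 2000

# Pseudo-random bytes as a stand-in for an already-compressed image stream
rng = random.Random(0)
image_like = bytes(rng.getrandbits(8) for _ in range(50_000))

print(f"content-like stream: {deflate_ratio(content_like):.3f}")
print(f"image-like stream:   {deflate_ratio(image_like):.3f}")
```

The first ratio should come out close to zero and the second close to one, which is why reducing image data itself (resizing or re-encoding) is what actually moves the file size.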

finevine · 2023-01-13T21:25:44Z

Yeah, but the idea could be to implement a setter for images and let people change the images and act on each one, as pikepdf offers, but without too many options (pikepdf is too complicated, I think). That was the idea behind my original feature request title. I'm willing to help but don't know how to... It's only an idea. Having only a getter on images is frustrating!

pubpub-zz · 2023-01-31T22:18:11Z

Some notes/ideas about image setting
(just a draft, to be cleaned up):

from io import BytesIO

from PIL import Image
from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject, NullObject

w = PdfWriter()
w.append("resources/labeled-edges-center-image.pdf")

for p in w.pages:
    for image_file_object in p.images:
        print(image_file_object.name)
        # Re-encode the extracted image as a one-page PDF held in memory
        ii = Image.open(BytesIO(image_file_object.data))
        b = BytesIO()
        ii.save(b, "pdf", quality=60, resolution=19.0, optimize=True)
        rrr = PdfReader(b)
        # XObject key: the image name without its file extension
        n = NameObject("/" + "".join(image_file_object.name.split(".")[:-1]))
        ind = p["/Resources"]["/XObject"].raw_get(n)
        # Null out the old image object to clean up the file
        # (idnum is 1-based, while _objects is a 0-based list)
        w._objects[ind.idnum - 1] = NullObject()
        # Point the page at a clone of the image Pillow wrote (named "/image")
        p["/Resources"]["/XObject"][n] = (
            rrr.pages[0]["/Resources"]["/XObject"]["/image"].clone(w).indirect_reference
        )
w.write("tt.pdf")

Edit: code updated.
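
A note on the XObject key in the draft: it is derived from image_file_object.name by dropping the part after the last dot, but joining the remaining parts with an empty string also removes any other dots in the name. Here is a hedged sketch of a helper that strips only the extension, assuming the name has the form "XObject name, dot, extension". The name xobject_key is hypothetical and not part of pypdf.

```python
def xobject_key(image_name: str) -> str:
    """Map an extracted image name such as "Im0.png" to the key "/Im0"."""
    stem, dot, _extension = image_name.rpartition(".")
    # rpartition returns an empty stem and separator when there is no dot
    return "/" + (stem if dot else image_name)


print(xobject_key("Im0.png"))    # /Im0
print(xobject_key("fig.1.jpg"))  # /fig.1 (the draft's expression gives /fig1)
```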


finevine assigned MartinThoma Jan 12, 2023

MartinThoma changed the title ~~Replace images in a page.images~~ Enable image compression Jan 12, 2023

MartinThoma added the is-feature A feature request label Jan 12, 2023

finevine closed this as completed Jan 13, 2023

finevine reopened this Jan 13, 2023

pubpub-zz mentioned this issue Feb 2, 2023

Images with transparency mask are not correctly extracted #1599

Closed

pubpub-zz mentioned this issue May 20, 2023

ENH: Add capability to replace image #1849

Merged

MartinThoma closed this as completed in #1849 Jun 13, 2023

MartinThoma pushed a commit that referenced this issue Jun 13, 2023

ENH: Add capability to replace image (#1849)

4a0d73f

Having the capability to replace images trivially extends to compressing a PDF file size by reducing the contained images. Closes #1546
