Problems rendering objects independently in PDF #1501

Closed
davidmarogy opened this issue Nov 19, 2021 · 7 comments

@davidmarogy

davidmarogy commented Nov 19, 2021

Hello,

since the WeasyPrint update to 53.3, all images, text, etc. are rendered as one big image in the PDF.
Using Adobe Acrobat Pro in preflight mode, no text objects, image objects, or vector objects are shown anymore. For better understanding I attached both files. The first file shows the PDF created with WeasyPrint version 52.5 and the second with version 53.3. Furthermore, fonts are also no longer detected in preflight mode.

image (4)

image (5)

The problem with this is that I am converting my PDF to a CMYK format using Ghostscript. Now that the whole document is just one big image, Ghostscript cannot detect any text elements and does not format it correctly. On version 52.5 this worked without any problems.

Here is my code for generating the PDF. All my styles are inserted directly into the HTML:

def create_document_pdf(source_document, target_path="") -> bytes:
    relative_document_html_url = reverse("document_html", kwargs={"document_id": source_document.id})
    document_html_url = backend_url(relative_document_html_url)
    fonts = source_document.shop.organization.fonts.all().order_by("name")
    if fonts.exists():
        font_config = FontConfiguration()
        result = HTML(document_html_url).write_pdf(font_config=font_config)
    else:
        result = HTML(document_html_url).write_pdf()
    return result

Do I need to configure something else to render them separately? Or is this now the standard?
I attached both PDFs and the preflight output of Adobe Acrobat Pro (PDF analysis -> List page objects, grouped by object type).

Best Regards,
David

converted_newest_weasy.pdf
converted_old_weasyprint.pdf
image (6)

@davidmarogy davidmarogy changed the title from "Problems Rendering objects independently in PDF" to "Problems rendering objects independently in PDF" on Nov 19, 2021
@liZe
Member

liZe commented Nov 22, 2021

> Hello,

Hello!

> since the Weasyprint Update 53.3 all images, texts etc. are rendered as one big image in the PDF.

That’s strange…

We’ve changed the way PDFs are generated since version 53, as we decided to use our own PDF generator instead of relying on Cairo. So, it’s totally normal to get something different between version 52 and version 53.

But the objects included in the PDF file should be quite close. We don’t embed everything in one big image, and we actually create the same types of objects as before: text, vector drawing, etc. The PDF files also include fonts, for sure.

In the PDF file you attached, the content is actually an image. But this PDF is generated by Ghostscript, and this is why it’s only an image: Ghostscript, for some reason, transformed the whole content into a single image.

The reason why Ghostscript was able to keep the objects with version 52 but not with 53 is a mystery. Do you have any message in Ghostscript’s logs related to this topic?

@liZe
Member

liZe commented Dec 11, 2021

Hello!

Did you find the time to check Ghostscript’s logs?

@liZe
Member

liZe commented Jan 3, 2022

Feel free to reopen if needed.

@liZe liZe closed this as completed Jan 3, 2022
@davidmarogy
Author

davidmarogy commented Jan 27, 2022

Hello,

@liZe sorry, I cannot reopen this issue. Should I create a new one, or could you reopen it?

I am really sorry, I totally forgot about my issue because I reverted WeasyPrint to the older 52.5 version. But now, because of some word-break errors in 52.5, I had to switch back to the newer version.

I tried it again with WeasyPrint version 54 and the problem occurred again.

It seems this problem occurs if you convert the PDF to PostScript using Ghostscript. I need this step to convert RGB to CMYK with plain black instead of rich black. Ghostscript doesn't really say that an error occurred:

GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.

Additionally, I had the same problem converting a PDF that contained a transparent image, but in this particular test I didn't have one. Here is a quote from the Ghostscript documentation.

I added a zip with my newest test, containing all the converted PDFs and PostScript files. Perhaps you can find a difference between them.

It would be nice if WeasyPrint added an option to convert the PDF to CMYK :D so that the Ghostscript step is not needed.

Here is my ghostscript code:

def convert_pdf_to_cmyk(pdf_bytes: bytes) -> bytes:
    if pdf_bytes is not None:
        with NamedTemporaryFile(prefix="touriprint_pdf_", suffix=".pdf") as rgb_pdf_file:
            rgb_pdf_file.write(pdf_bytes)
            rgb_pdf_file.seek(0)
            # Converting pdf from RGB to CMYK
            # https://stackoverflow.com/questions/6241282/converting-pdf-to-cmyk-with-identify-recognizing-cmyk
            # HACK: to convert rich black to plain CMYK black, we first convert the PDF to
            # PostScript, then back to PDF using a colour conversion script.
            # By default RGB->CMYK will create rich black instead of plain K black
            # (https://stackoverflow.com/questions/6248563/converting-any-pdf-to-black-k-only-cmyk/9024346#9024346).
            with NamedTemporaryFile(prefix="ghostscript_", suffix=".ps") as ghostscript_file:
                command = [
                    "gs",
                    "-q",
                    "-o",
                    ghostscript_file.name,
                    "-dNOPAUSE",
                    "-dBATCH",
                    "-sDEVICE=ps2write",
                    rgb_pdf_file.name,
                ]
                subprocess.check_call(command)
                with NamedTemporaryFile(prefix="converted_", suffix=".pdf") as converted_pdf_file:
                    command = [
                        "gs",
                        "-q",
                        "-o",
                        converted_pdf_file.name,
                        "-sDEVICE=pdfwrite",
                        "-dNOPAUSE",
                        "-dBATCH",
                        "-sProcessColorModel=DeviceCMYK",
                        "-sColorConversionStrategy=CMYK",
                        "-sColorConversionStrategyForImages=CMYK",
                        "-dOverrideICC",
                        "-dEncodeColorImages=true",
                        os.path.join(DOCUMENT_DATA_DIR, "rgb_to_plain_cmyk_black.ps"),
                        ghostscript_file.name,
                    ]
                    subprocess.check_call(command)
                    pdf_bytes = converted_pdf_file.read()
    return pdf_bytes

weasyprint_test_new_old_version.zip

@liZe
Member

liZe commented Jan 28, 2022

> Would be nice if Weasyprint would add an option to convert the pdf to CMYK :D so that the ghostscript step is not needed.

Yes, that would be awesome. There’s already an open issue for that: #1091.

We really don’t know why Ghostscript generates an image with this PDF. Asking the Ghostscript devs is probably the best solution to know what’s going on, and to see if it’s possible to "fix" that in WeasyPrint.

@davidmarogy
Author

davidmarogy commented Feb 3, 2022

Ah okay, how is the progress on that issue? Would it also be possible to convert rich black text to plain black?

I opened an issue and asked the developers for help: https://bugs.ghostscript.com/show_bug.cgi?id=704872
I am looking forward to their answer.

Update: It seems that the new engine creates the PDF with content whose alpha is set to 1. Ghostscript tries to identify files that use pointless transparency, but with the output of the new engine it can no longer detect it.
Here is a short quote from their answer.


In fact, the transparency is pointless, since all the graphics states set the alpha to 1 and there is no other transparency in the file, but we can't know that without processing the whole file (and even then there are cases where it would be difficult, as well as highly time-consuming, to be certain, such as examining the value of every image sample).

Now we do try to avoid pointless transparency, but there are limits.

The '52' file is created using Cairo, which is a well known producer of this sort of thing and we can detect that it doesn't really need the transparency.

The '54' file is produced in an utterly different manner and I suspect is using a totally different PDF engine. In this case the transparency definition has moved from the page level to a Form XObject in the depths of the document, and we can no longer detect the fact that it does not truly use transparency.
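
To make the point in the quote concrete, here is a rough sketch of the kind of check involved: it scans the raw bytes of a PDF for `/ca` and `/CA` (fill and stroke alpha) entries and reports whether all of them are 1. The function names are hypothetical, and the approach is only a heuristic: it assumes the graphics state dictionaries are stored uncompressed, and it ignores soft masks, blend modes, and image sample data, which is exactly why a real tool cannot decide this cheaply.

```python
import re
from typing import List

# Matches "/ca <number>" (fill alpha) and "/CA <number>" (stroke alpha)
# as they would appear in an uncompressed ExtGState dictionary.
ALPHA_ENTRY = re.compile(rb"/(?:ca|CA)\s+([0-9]*\.?[0-9]+)")


def declared_alphas(pdf_bytes: bytes) -> List[float]:
    """Return every alpha constant found in the raw PDF bytes (heuristic)."""
    return [
        float(match.group(1).decode("ascii"))
        for match in ALPHA_ENTRY.finditer(pdf_bytes)
    ]


def transparency_looks_pointless(pdf_bytes: bytes) -> bool:
    """True if every alpha constant that was found is exactly 1 (opaque)."""
    return all(alpha == 1.0 for alpha in declared_alphas(pdf_bytes))
```

Running this on the two attached PDFs would only show whether any non-opaque alpha constant is declared; it says nothing about where the transparency group is defined, which is the part Ghostscript's detection depends on.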

@liZe
Member

liZe commented Feb 7, 2022

Thanks for opening the ticket with Ghostscript.

As explained by the gs devs, using PostScript as an intermediate format looks like a bad idea, as PostScript misses a lot of features that PDF has, including transparency management.
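
A minimal sketch of what skipping the PostScript step could look like, reusing only Ghostscript options that already appear in the script above. The helper name `build_direct_cmyk_command` is hypothetical, and this is untested against the files in this issue: in particular, it does not include the `rgb_to_plain_cmyk_black.ps` hack, so whether text ends up as plain K black rather than rich black is an open question.

```python
import subprocess


def build_direct_cmyk_command(input_pdf: str, output_pdf: str) -> list:
    """Build a gs command that converts PDF to PDF without going through ps2write."""
    return [
        "gs",
        "-q",
        "-o",  # -o sets the output file and implies -dBATCH -dNOPAUSE
        output_pdf,
        "-sDEVICE=pdfwrite",
        "-sProcessColorModel=DeviceCMYK",
        "-sColorConversionStrategy=CMYK",
        input_pdf,
    ]


def convert_pdf_to_cmyk_directly(input_pdf: str, output_pdf: str) -> None:
    """Run the conversion; raises CalledProcessError if gs exits with an error."""
    subprocess.check_call(build_direct_cmyk_command(input_pdf, output_pdf))
```

Since PDF stays the format end to end, the transparency information is not flattened away by a PostScript round trip, which is the reason the content was turned into an image in the first place.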

We could try to avoid useless transparency, but that would require a lot of work: we’ve tried to add a quick fix, but the problem is complicated to "solve" (and even Cairo didn’t "solve" it)… And again, there’s no bug in WeasyPrint or in Ghostscript, there’s just an optimization in Ghostscript for Cairo that doesn’t work anymore now that we don’t use Cairo.

(For the record: we tried to avoid calling set_state in set_alpha when the alpha is 1 and the previous alpha state is None. It works for simple cases, but it fails, for example, when the state was previously changed by a function other than set_alpha. The "real" optimization would be to keep track of the current alpha state through the whole drawing process, but that’s the PDF drawing library’s work, not WeasyPrint’s.)
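
For illustration, the "real" optimization described above could be sketched roughly as follows. This is not WeasyPrint's or its PDF library's actual code: the class `AlphaTrackingStream`, its methods, and the `/a<alpha> gs` naming are all hypothetical. The idea is that the stream remembers the current alpha across save/restore pairs and only emits a graphics-state change when the alpha actually changes.

```python
class AlphaTrackingStream:
    """Hypothetical drawing stream that skips redundant alpha changes."""

    def __init__(self):
        self.operations = []          # PDF operators emitted so far
        self._alpha_stack = [1.0]     # PDF's initial alpha is 1 (opaque)

    def push_state(self):
        # "q" saves the graphics state, including the current alpha.
        self.operations.append("q")
        self._alpha_stack.append(self._alpha_stack[-1])

    def pop_state(self):
        # "Q" restores the previously saved graphics state.
        self.operations.append("Q")
        self._alpha_stack.pop()

    def set_alpha(self, alpha):
        # Emit nothing when the requested alpha is already in effect.
        if alpha == self._alpha_stack[-1]:
            return
        self._alpha_stack[-1] = alpha
        self.operations.append(f"/a{alpha} gs")


stream = AlphaTrackingStream()
stream.set_alpha(1)        # already opaque: nothing emitted
stream.push_state()
stream.set_alpha(0.5)      # real change: emitted
stream.pop_state()
stream.set_alpha(1)        # restored state is opaque again: nothing emitted
```

With tracking like this, a document that never uses an alpha other than 1 would never reference a transparency-related graphics state at all, so tools such as Ghostscript would have nothing to detect.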
