Is it possible to create a PDF with UTF-8 character encoding? #181

shaolinh84 · 2017-12-08T14:02:31Z

This is my failing test in Kotlin:

    @Test
    fun test_parseToPdf_convertsMarkdownToPdfWithUTF8CharacterSet() {
        val markdown = "Общие"
        val inputStream = markdownParser.parseToPdf(markdown, "test").inputStream()
        val fileText = pdfText(inputStream)
        assertThat(fileText).contains("Общие")
    }

    private fun pdfText(input: InputStream): String? {
        return try {
            // Close the document once its text has been extracted.
            PDDocument.load(input).use { document ->
                PDFTextStripper().getText(document)
            }
        } catch (e: Exception) {
            e.printStackTrace()
            null
        }
    }

This is my parser class:

    class MarkdownParser(private val parser: Parser,
                         private val htmlRenderer: HtmlRenderer) {

        fun parseToHtml(markdownContent: String): String {
            val document = parser.parse(markdownContent)
            return htmlRenderer.render(document)
        }

        fun parseToPdf(markdownContent: String, path: String): ByteArray {
            val options = PegdownOptionsAdapter.flexmarkOptions(
                    Extensions.ALL and (Extensions.ANCHORLINKS or Extensions.EXTANCHORLINKS_WRAP).inv()
            ).toMutable()

            val html = parseToHtml(markdownContent)
            val out = ByteArrayOutputStream()

            PdfConverterExtension.exportToPdf(out, html, path, options)

            return out.toByteArray()
        }

    }

Parser is com.vladsch.flexmark.parser.Parser, HtmlRenderer is com.vladsch.flexmark.html.HtmlRenderer.

Since I only pass an OutputStream to the PdfConverterExtension, I have no control over how the data is written. Is it possible to create a PDF with UTF-8 characters? The HTML content still has the correct encoding.

vsch · 2017-12-08T18:48:04Z

@shaolinh84, I am looking into it because it seems that openhtmltopdf is not converting the characters in the HTML (taken from the String variable passed to openhtmltopdf):

    <html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body>
    <ul>
        <li>Test PDF with Unicode chars: Общие</li>
    </ul>

    </body></html>

The resulting PDF is: [screenshot not preserved]

It could be some configuration that is missing.

vsch · 2017-12-08T19:05:27Z

@shaolinh84, it seems that the PDF conversion depends on which fonts are used and whether they contain the given Unicode characters.
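
To see which character sets a font would have to cover, one option is to list the Unicode scripts that occur in the text. The following is a minimal sketch using only the JDK; the class and method names (`ScriptSurvey`, `scriptsIn`) are hypothetical and not part of flexmark-java or openhtmltopdf. It only reports scripts and does not inspect any font.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

class ScriptSurvey {
    /** Returns the Unicode scripts used in the text, ignoring characters shared across scripts. */
    static Set<UnicodeScript> scriptsIn(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            // COMMON covers spaces, digits and punctuation; INHERITED covers combining marks.
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    public static void main(String[] args) {
        // The list item from the HTML in this thread.
        System.out.println(scriptsIn("Test PDF with Unicode chars: Общие"));
    }
}
```

For the sample text this reports Latin and Cyrillic, so any font used for the PDF would need glyphs for both.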

You could bypass the flexmark-java PDF converter, build your own PDF conversion from the code used in the converter, and register fonts that contain the characters you need. I have not done this yet, so it is a theoretical solution.

The code in PDF converter extension is:

    public static void exportToPdf(final OutputStream os, final String html, final String url, final PdfRendererBuilder.TextDirection defaultTextDirection) {
        try {
            // There are more options on the builder than shown below.
            PdfRendererBuilder builder = new PdfRendererBuilder();

            if (defaultTextDirection != null) {
                builder.useUnicodeBidiSplitter(new ICUBidiSplitter.ICUBidiSplitterFactory());
                builder.useUnicodeBidiReorderer(new ICUBidiReorderer());
                builder.defaultTextDirection(defaultTextDirection); // OR RTL
            }

            org.jsoup.nodes.Document doc;
            doc = Jsoup.parse(html);

            Document dom = DOMBuilder.jsoup2DOM(doc);
            builder.withW3cDocument(dom, url);
            builder.toStream(os);
            builder.run();
        } catch (Exception e) {
            e.printStackTrace();
            // LOG exception
        } finally {
            try {
                os.close();
            } catch (IOException e) {
                // swallow
            }
        }
    }

From what I understand, PdfRendererBuilder has a method for adding a font to the PDF conversion:

    public PdfRendererBuilder useFont(FSSupplier<InputStream> supplier, String fontFamily, Integer fontWeight, PdfRendererBuilder.FontStyle fontStyle, boolean subset) {
        this._fonts.add(new PdfRendererBuilder.AddedFont(supplier, fontWeight, fontFamily, subset, fontStyle));
        return this;
    }

A better example is in the issues for openhtmltopdf: danfickle/openhtmltopdf#129

a-reznic · 2017-12-18T13:17:15Z

I have the same issue...

vsch · 2019-01-24T22:42:20Z

A solution to the font problem is to define an embedded TrueType font in the style or stylesheet and set the body tag to use this font. OpenHtmlToPDF will use the characters from the font which has them defined.

For example, include the Noto Serif/Sans/Mono fonts and add the noto-serif, noto-sans and noto-mono families to the CSS so that the PDF can use them for rendering text.

However, the PDF converter requires TrueType fonts, and the Noto CJK fonts are OpenType fonts, which cannot be used. The solution is to download a TrueType Unicode font that supports the CJK character set and add it to the custom rendering profile to be used for PDF.
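
Given the TrueType requirement above, it can help to check what kind of font a file actually contains before adding it to the stylesheet. The sketch below reads the 4-byte tag at the start of the file, which distinguishes TrueType outlines from CFF-based OpenType fonts and font collections. The class name `FontKind` is hypothetical, and treating this tag as the criterion the PDF converter applies is an assumption, not something confirmed in this thread.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class FontKind {
    /** Classifies a font by the big-endian 4-byte tag at the start of the file. */
    static String classify(int tag) {
        switch (tag) {
            case 0x00010000: // sfnt version 1.0, TrueType outlines
            case 0x74727565: // 'true', older Apple TrueType
                return "TrueType outlines";
            case 0x4F54544F: // 'OTTO', OpenType with CFF outlines
                return "OpenType/CFF outlines";
            case 0x74746366: // 'ttcf', a collection of several fonts
                return "font collection";
            default:
                return "unknown";
        }
    }

    /** Reads the tag from the first four bytes of the stream. */
    static String classify(InputStream in) throws IOException {
        return classify(new DataInputStream(in).readInt());
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FontKind /path/to/font-file
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(classify(in));
        }
    }
}
```

A file reported as "OpenType/CFF outlines" would be a candidate for replacement with a TrueType alternative.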

For my test I used arialuni.ttf from https://www.wfonts.com/font/arial-unicode-ms

If the fonts are installed in /usr/local/fonts/, add the following to the stylesheet:

@font-face {
  font-family: 'noto-cjk';
  src: url('file:///usr/local/fonts/arialuni.ttf');
  font-weight: normal;
  font-style: normal;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-Regular.ttf');
  font-weight: normal;
  font-style: normal;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-Bold.ttf');
  font-weight: bold;
  font-style: normal;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-BoldItalic.ttf');
  font-weight: bold;
  font-style: italic;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-Italic.ttf');
  font-weight: normal;
  font-style: italic;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-Regular.ttf');
  font-weight: normal;
  font-style: normal;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-Bold.ttf');
  font-weight: bold;
  font-style: normal;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-BoldItalic.ttf');
  font-weight: bold;
  font-style: italic;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-Italic.ttf');
  font-weight: normal;
  font-style: italic;
}


@font-face {
  font-family: 'noto-mono';
  src: url('file:///usr/local/fonts/NotoMono-Regular.ttf');
  font-weight: normal;
  font-style: normal;
}

body {
    font-family: 'noto-sans', 'noto-cjk', sans-serif;
    overflow: hidden;
    word-wrap: break-word;
    font-size: 14px;
}

var,
code,
kbd,
pre {
    font: 0.9em 'noto-mono', Consolas, "Liberation Mono", Menlo, Courier, monospace;
}
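
Since the @font-face rules above differ only in family, file name, weight and style, they can also be generated when the font directory varies between machines. This is a minimal sketch in plain Java; `FontFaceCss`, `fontFace` and `fourStyles` are hypothetical names, and the `-Regular/-Bold/-BoldItalic/-Italic` suffixes simply follow the Noto file names used above.

```java
class FontFaceCss {
    /** Builds one @font-face rule in the same layout as the stylesheet above. */
    static String fontFace(String family, String baseUrl, String fileName,
                           boolean bold, boolean italic) {
        return "@font-face {\n"
                + "  font-family: '" + family + "';\n"
                + "  src: url('" + baseUrl + fileName + "');\n"
                + "  font-weight: " + (bold ? "bold" : "normal") + ";\n"
                + "  font-style: " + (italic ? "italic" : "normal") + ";\n"
                + "}\n";
    }

    /** Emits the regular, bold, bold-italic and italic rules for one family. */
    static String fourStyles(String family, String baseUrl, String filePrefix) {
        return fontFace(family, baseUrl, filePrefix + "-Regular.ttf", false, false)
                + fontFace(family, baseUrl, filePrefix + "-Bold.ttf", true, false)
                + fontFace(family, baseUrl, filePrefix + "-BoldItalic.ttf", true, true)
                + fontFace(family, baseUrl, filePrefix + "-Italic.ttf", false, true);
    }

    public static void main(String[] args) {
        // Assumed font location, taken from the example above; must end with a slash.
        String base = "file:///usr/local/fonts/";
        System.out.print(fontFace("noto-cjk", base, "arialuni.ttf", false, false));
        System.out.print(fourStyles("noto-serif", base, "NotoSerif"));
        System.out.print(fourStyles("noto-sans", base, "NotoSans"));
        System.out.print(fontFace("noto-mono", base, "NotoMono-Regular.ttf", false, false));
    }
}
```

The generated text can be placed in a style element ahead of the body and code rules shown above.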

The sample PdfConverter.java has been updated, and a wiki page with this information has been added: PDF-Renderer-Converter

vsch mentioned this issue May 22, 2018

Is it possible to export Arabic and Chinese characters in pdf export #234

Closed
