Is it possible to create a PDF with UTF-8 character encoding? #181

Open
shaolinh84 opened this issue Dec 8, 2017 · 4 comments
shaolinh84 commented Dec 8, 2017

This is my failing test in Kotlin:

    @Test
    fun test_parseToPdf_convertsMarkdownToPdfWithUTF8CharacterSet() {
        val markdown = "Общие"
        val inputStream = markdownParser.parseToPdf(markdown, "test").inputStream()
        val fileText = pdfText(inputStream)
        assertThat(fileText).contains("Общие")
    }

    private fun pdfText(input: InputStream): String? {
        return try {
            // `use` closes the PDDocument once the text has been extracted.
            PDDocument.load(input).use { document ->
                PDFTextStripper().getText(document)
            }
        } catch (e: Exception) {
            e.printStackTrace()
            null
        }
    }

This is my parser class:

class MarkdownParser(private val parser: Parser,
                     private val htmlRenderer: HtmlRenderer) {

    fun parseToHtml(markdownContent: String): String {
        val document = parser.parse(markdownContent)
        return htmlRenderer.render(document)
    }

    fun parseToPdf(markdownContent: String, path: String): ByteArray {
        val options = PegdownOptionsAdapter.flexmarkOptions(
                Extensions.ALL and (Extensions.ANCHORLINKS or Extensions.EXTANCHORLINKS_WRAP).inv()
        ).toMutable()

        val html = parseToHtml(markdownContent)
        val out = ByteArrayOutputStream()

        PdfConverterExtension.exportToPdf(out, html, path, options)

        return out.toByteArray()
    }

}

Parser is com.vladsch.flexmark.parser.Parser, HtmlRenderer is com.vladsch.flexmark.html.HtmlRenderer.

Because I only pass an OutputStream to PdfConverterExtension, I have no control over how the data is written. Is it possible to create a PDF with UTF-8 characters? The HTML content still has the correct encoding.

vsch (Owner) commented Dec 8, 2017

@shaolinh84, I am looking into it. It seems that openhtmltopdf is not converting the characters in the HTML (taken from the String variable passed to openhtmltopdf):

<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body>
<ul>
    <li>Test PDF with Unicode chars: Общие</li>
</ul>

</body></html>

The resulting PDF is:

[image: screenshot of the resulting PDF]

It could be some configuration that is missing.

vsch (Owner) commented Dec 8, 2017

@shaolinh84, it seems that the PDF conversion depends on the fonts that are used and on whether they contain the required Unicode characters.
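One way to test the font-coverage explanation before rendering is to list the characters a candidate font cannot display. The sketch below uses only the JDK's `java.awt.Font`. The class `GlyphCoverage` and the method `missingCodePoints` are hypothetical helpers, not part of flexmark-java or openhtmltopdf, and AWT's view of a font is only an approximation of what the PDF renderer will do with it.

```java
import java.awt.Font;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class GlyphCoverage {

    /**
     * Returns the distinct non-whitespace code points of {@code text}
     * for which {@code canDisplay} is false, in order of first appearance.
     */
    public static List<String> missingCodePoints(String text, IntPredicate canDisplay) {
        List<String> missing = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .filter(cp -> !canDisplay.test(cp))
            .distinct()
            .forEach(cp -> missing.add(new String(Character.toChars(cp))));
        return missing;
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, load that TrueType file; otherwise fall back
        // to the JDK's logical serif font (an assumption for the demo only).
        Font font = args.length > 0
            ? Font.createFont(Font.TRUETYPE_FONT, new File(args[0]))
            : new Font(Font.SERIF, Font.PLAIN, 12);

        // Font.canDisplay(int) reports whether the font has a glyph for a code point.
        System.out.println("Missing: " + missingCodePoints("Общие", font::canDisplay));
    }
}
```

An empty list suggests the font covers the text. A non-empty list points to characters that are likely to be dropped or replaced in the PDF if this is the only font available.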

You could bypass the flexmark-java PDF converter, build your own PDF conversion from the code the converter uses, and register fonts that contain the characters you need. I have not done this yet, so it is a theoretical solution.

The code in the PDF converter extension is:

    public static void exportToPdf(final OutputStream os, final String html, final String url, final PdfRendererBuilder.TextDirection defaultTextDirection) {
        try {
            // There are more options on the builder than shown below.
            PdfRendererBuilder builder = new PdfRendererBuilder();

            if (defaultTextDirection != null) {
                builder.useUnicodeBidiSplitter(new ICUBidiSplitter.ICUBidiSplitterFactory());
                builder.useUnicodeBidiReorderer(new ICUBidiReorderer());
                builder.defaultTextDirection(defaultTextDirection); // OR RTL
            }

            org.jsoup.nodes.Document doc;
            doc = Jsoup.parse(html);

            Document dom = DOMBuilder.jsoup2DOM(doc);
            builder.withW3cDocument(dom, url);
            builder.toStream(os);
            builder.run();
        } catch (Exception e) {
            e.printStackTrace();
            // LOG exception
        } finally {
            try {
                os.close();
            } catch (IOException e) {
                // swallow
            }
        }
    }

From what I understand, the PDF renderer builder has a method for adding a font to the PDF conversion:

    public PdfRendererBuilder useFont(FSSupplier<InputStream> supplier, String fontFamily, Integer fontWeight, PdfRendererBuilder.FontStyle fontStyle, boolean subset) {
        this._fonts.add(new PdfRendererBuilder.AddedFont(supplier, fontWeight, fontFamily, subset, fontStyle));
        return this;
    }

A better example is in the issues for openhtmltopdf: danfickle/openhtmltopdf#129

a-reznic commented

I have the same issue...

vsch (Owner) commented Jan 24, 2019

A solution to the font problem is to define an embedded TrueType font in the style or stylesheet and set the body tag to use this font. OpenHtmlToPDF will use the characters from the font which has them defined.

For example, include the Noto Serif/Sans/Mono fonts and add the noto-serif, noto-sans and noto-mono families to the CSS so that the PDF can use them for rendering text.

However, the PDF converter requires TrueType fonts, and the Noto CJK fonts are OpenType fonts, which cannot be used. The solution is to download a TrueType Unicode font that supports the CJK character set and add it to the custom rendering profile used for PDF.
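A file extension does not reliably say which kind of font a file contains, so it can help to check the file header before referencing a font. The sketch below reads the 4-byte sfnt version tag at the start of the file: `0x00010000` (or `true`) marks TrueType outlines, `OTTO` marks OpenType with CFF outlines, and `ttcf` marks a font collection. The class `FontFlavor` is a hypothetical helper. Passing this check does not guarantee that the PDF converter will accept the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FontFlavor {

    /** Classifies a font from the first four bytes of its file (big-endian sfnt tag). */
    public static String classify(byte[] header) {
        if (header == null || header.length < 4) {
            return "Unknown";
        }
        int tag = ByteBuffer.wrap(header, 0, 4).getInt(); // ByteBuffer is big-endian by default
        switch (tag) {
            case 0x00010000: // standard TrueType outlines
            case 0x74727565: // 'true', used by some Apple TrueType fonts
                return "TrueType";
            case 0x4F54544F: // 'OTTO': OpenType with CFF outlines
                return "OpenType/CFF";
            case 0x74746366: // 'ttcf': TrueType/OpenType collection
                return "Collection";
            default:
                return "Unknown";
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FontFlavor /usr/local/fonts/arialuni.ttf
        byte[] header = new byte[4];
        int read;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            read = in.readNBytes(header, 0, 4); // Java 9+
        }
        System.out.println(args[0] + ": " + (read == 4 ? classify(header) : "Unknown"));
    }
}
```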

For my test I used arialuni.ttf from https://www.wfonts.com/font/arial-unicode-ms

If the fonts are installed in /usr/local/fonts/, add the following to the stylesheet:

@font-face {
  font-family: 'noto-cjk';
  src: url('file:///usr/local/fonts/arialuni.ttf');
  font-weight: normal;
  font-style: normal;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-Regular.ttf');
  font-weight: normal;
  font-style: normal;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-Bold.ttf');
  font-weight: bold;
  font-style: normal;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-BoldItalic.ttf');
  font-weight: bold;
  font-style: italic;
}

@font-face {
  font-family: 'noto-serif';
  src: url('file:///usr/local/fonts/NotoSerif-Italic.ttf');
  font-weight: normal;
  font-style: italic;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-Regular.ttf');
  font-weight: normal;
  font-style: normal;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-Bold.ttf');
  font-weight: bold;
  font-style: normal;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-BoldItalic.ttf');
  font-weight: bold;
  font-style: italic;
}

@font-face {
  font-family: 'noto-sans';
  src: url('file:///usr/local/fonts/NotoSans-Italic.ttf');
  font-weight: normal;
  font-style: italic;
}


@font-face {
  font-family: 'noto-mono';
  src: url('file:///usr/local/fonts/NotoMono-Regular.ttf');
  font-weight: normal;
  font-style: normal;
}

body {
    font-family: 'noto-sans', 'noto-cjk', sans-serif;
    overflow: hidden;
    word-wrap: break-word;
    font-size: 14px;
}

var,
code,
kbd,
pre {
    font: 0.9em 'noto-mono', Consolas, "Liberation Mono", Menlo, Courier, monospace;
}
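If the HTML is produced in code, as in the original question, the stylesheet has to be placed into the document before it is passed to the PDF exporter. The sketch below builds one `@font-face` rule and inserts a `<style>` element before `</head>`. `FontFaceCss`, `fontFace` and `withStyle` are hypothetical names, and the approach assumes the HTML string already contains a `</head>` tag (otherwise the style element is simply prepended, which relies on the HTML parser being lenient).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FontFaceCss {

    /** Builds a single @font-face rule pointing at a local font file. */
    public static String fontFace(String family, Path fontFile, String weight, String style) {
        return "@font-face {\n"
            + "  font-family: '" + family + "';\n"
            + "  src: url('" + fontFile.toUri() + "');\n" // toUri() yields a file: URL
            + "  font-weight: " + weight + ";\n"
            + "  font-style: " + style + ";\n"
            + "}\n";
    }

    /** Inserts the CSS as a <style> element just before </head>, or prepends it. */
    public static String withStyle(String html, String css) {
        String styleElement = "<style>\n" + css + "</style>";
        int head = html.indexOf("</head>");
        if (head < 0) {
            return styleElement + html;
        }
        return html.substring(0, head) + styleElement + html.substring(head);
    }

    public static void main(String[] args) {
        String css =
            fontFace("noto-sans", Paths.get("/usr/local/fonts/NotoSans-Regular.ttf"), "normal", "normal")
            + "body { font-family: 'noto-sans', sans-serif; }\n";
        String html = "<html><head></head><body><p>Общие</p></body></html>";
        // The result is the string that would be handed to the PDF exporter.
        System.out.println(withStyle(html, css));
    }
}
```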

The sample PdfConverter.java has been updated, and a wiki page with this information has been added: PDF-Renderer-Converter
