Arabic and hebrew texts not supporting #270

Open
Srimathi-Thirumoorthy opened this issue Feb 13, 2024 · 6 comments

Comments

@Srimathi-Thirumoorthy

Arabic and Hebrew text is not supported.

@mohamnag

What kind of help is required here? I'm a native Farsi speaker and can help change the code and verify the results, but I'm not sure which piece of code needs to change. I took a short look at the latest version and couldn't spot where an element with Unicode text gets drawn.

@mohamnag

FYI, I tracked it down to the method com.lowagie.text.pdf.BaseFont#convertToBytes(java.lang.String). It looks like the encoding is always set to Cp1252, which I would not expect to render any non-Latin characters. Maybe setting the charset properly there (I don't know how) would fix the issue, possibly together with using a font that contains the required glyphs.
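
The point about Cp1252 can be checked with the JDK alone. The following is a minimal sketch, not the code path that OpenPDF or Flying Saucer actually runs: the class `Cp1252Check` and its methods are hypothetical names introduced here, and the only assumption is that the text ends up being encoded with the Cp1252 charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of Flying Saucer or OpenPDF.
final class Cp1252Check {

    // "windows-1252" is the JDK's canonical name for the Cp1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns true only if every character of the text has a Cp1252 byte.
    static boolean fitsCp1252(String text) {
        return CP1252.newEncoder().canEncode(text);
    }

    // Encodes to Cp1252 and decodes back. Characters that Cp1252 cannot
    // represent are replaced by the charset's replacement byte, '?'.
    static String roundTrip(String text) {
        return new String(text.getBytes(CP1252), CP1252);
    }
}
```

Calling `Cp1252Check.fitsCp1252("تست فارسی")` returns `false`, and `roundTrip` on the same string keeps only the space and replaces every letter with `?`. So if the text really is converted with this charset, the letters are lost before any font is consulted.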

@asolntsev
Contributor

@mohamnag Hi. Wow, thank you for debugging this problem with fonts.
Yes, now I see: FS always uses the encoding winansi (which I guess means Cp1252). I don't know why, but it has been used from the very beginning (01.02.2006) :)

I think we can change this encoding. Can you provide a simple example of such HTML and a font, so we can add it to the FS tests?

@mohamnag

Well, I went ahead and used a custom font for which I can set the encoding. Unfortunately, the result was still problematic.

Let's take this sample HTML:

<html lang="fa">
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
    <style>
        .rtl-font {
            font-family: Vazirmatn;
            direction: rtl;
        }
    </style>
</head>
<body>
<div style="background-color: blue">
    تست فارسی
</div>
<div class="rtl-font" style="background-color: green">
    تست فارسی
</div>
<div dir="rtl" style="background-color: red; font-family: Vazirmatn">
    تست فارسی
</div>
</body>
</html>

I have the font (available for free from https://github.com/rastikerdar/vazirmatn/releases/tag/v33.003) unzipped into the resources directory, and this is my Java code:

        try (OutputStream outputStream = new FileOutputStream("build/pdf/method4.pdf")) {
            // parse and improve HTML
            Document document = Jsoup.parse(new File(inputHtml.getFile()), "UTF-8");
            document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
            var htmlString = document.html();

            // initialize Flying Saucer
            ITextRenderer renderer = new ITextRenderer();
            SharedContext sharedContext = renderer.getSharedContext();
            sharedContext.setPrint(true);
            sharedContext.setInteractive(false);

            renderer
                    .getFontResolver()
                    .addFont(
                            Main.class.getClassLoader().getResource("Vazirmatn/ttf/Vazirmatn-Regular.ttf").toString(),
                            BaseFont.IDENTITY_H,
                            true
                    );

            renderer.setDocumentFromString(htmlString);

            renderer.layout();
            renderer.createPDF(outputStream);
            // relative resources: see https://www.baeldung.com/java-html-to-pdf#dependencies-4
        }

Now this is the output that FS gives me:
[image: PDF output from Flying Saucer]

And this is what a browser gives me (ignoring that the font is not applied):
[image: browser rendering of the same HTML]

There are two problems here:

  1. Letter joining: Farsi/Arabic letters connect and change shape depending on their position and neighbouring letters. This is not handled.
  2. The RTL direction is not applied: the first letter ت should be the right-most one, but it is the left-most.

In general, I would first solve this using a custom font (which certainly has all the characters) and then maybe look into fixing the charset for the default font.
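
Both problems listed above can be illustrated with the JDK alone. This is only a sketch under stated assumptions: `RtlSketch` and its methods are hypothetical names, it works on single `char` units, and it is neither a full text shaper nor what Flying Saucer or OpenPDF do internally. `toVisualOrder` uses `java.text.Bidi` to reorder text from logical (typing) order into visual (left-to-right on the page) order. `baseLetterOf` uses Unicode NFKC normalization to show that the separate isolated, final, initial and medial presentation forms all belong to one base letter.

```java
import java.text.Bidi;
import java.text.Normalizer;

// Hypothetical helper for illustration; not Flying Saucer or OpenPDF code.
final class RtlSketch {

    // Problem 2 (direction): reorder characters from logical order to
    // visual order, as a renderer that draws glyphs left to right needs.
    static String toVisualOrder(String logical) {
        // The base direction is taken from the first strong character.
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int n = logical.length();
        byte[] levels = new byte[n];
        Character[] chars = new Character[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i);
            chars[i] = logical.charAt(i);
        }
        // Reverses runs according to their embedding levels, in place.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder visual = new StringBuilder(n);
        for (Character c : chars) {
            visual.append(c);
        }
        return visual.toString();
    }

    // Problem 1 (joining): map an Arabic presentation form, for example
    // U+FE97 (initial TEH), back to its base letter, here U+062A.
    static String baseLetterOf(char presentationForm) {
        return Normalizer.normalize(
                String.valueOf(presentationForm), Normalizer.Form.NFKC);
    }
}
```

For the sample text, `toVisualOrder` puts the first logical letter ت at the end of the returned string, which is the right-most position when the glyphs are drawn left to right. Choosing the correct contextual form for each letter is the shaping step that this sketch does not attempt.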

@mohamnag

By the way, you have probably seen this example of RTL rendering using OpenPDF, but I just want to mention it: https://github.com/LibrePDF/OpenPDF/blob/master/pdf-toolbox/src/test/java/com/lowagie/examples/fonts/styles/RightToLeft.java

I don't know whether this differs from what FS does under the hood when working with OpenPDF, but I couldn't find any of those methods being called.

@mohamnag

I also found this post: https://groups.google.com/g/flying-saucer-users/c/n0CfuYfpQ6I/m/3iJIaZ4IAAAJ
and a whole thread there that is related to this ticket.
