Bi-directional text rendering issues in translated version of book #1413

Closed
moaminsharifi opened this issue Oct 23, 2023 · 17 comments

@moaminsharifi
Collaborator

There are a number of issues with the rendering of bi-directional text in RTL languages. These issues can cause text to be displayed incorrectly, making it difficult or impossible to read.

Some of the most common issues include:

  • Text being displayed in the wrong order, from left to right instead of right to left.
  • Text being displayed overlapping.
  • Text being displayed with incorrect spacing.
  • Text being displayed with incorrect kerning.

Issue description

For example, when you build the latest version of the repo in Persian, the first page looks like this:
[Screenshot: Welcome to Comprehensive Rust 🦀 - Comprehensive Rust 🦀 - issue in BiDi]

It looks right to someone who doesn't know RTL languages; however, if you are learning in your native language, it can be confusing.
[Screenshot: Welcome to Comprehensive Rust 🦀 - Comprehensive Rust 🦀 - issue in BiDi - highlight it]

This is how it should be rendered in RTL languages:

[Screenshot: Welcome to Comprehensive Rust 🦀 - Comprehensive Rust 🦀 - fix issue in BiDi]

Proposed solution for the bi-directional issue

Why not fix this directly in mdBook?

My first step, before opening a GitHub issue here, was to check the Contributing to mdBook document:

The current PR backlog is beyond what we can process at this time. Only issues that have an E-Help-wanted or Feature accepted label will likely receive reviews.
mdBook/CONTRIBUTING.md
As far as I can tell, supporting BiDi is not a priority for them at the moment.

My proposed method:

My fix for the BiDi issue with the HTML tags is to add the dir="auto" HTML attribute with JS just after rendering, and to add "unicode-bidi:embed;" to those classes.

References:
#671
mdBook issue#1486 Support for Right to Left

@moaminsharifi
Collaborator Author

My pull request is at:
#1414

@Manishearth

> My fix for the BiDi issue with the HTML tags is to add the dir="auto" HTML attribute with JS just after rendering, and to add "unicode-bidi:embed;" to those classes.

No, please don't dynamically change dir. The solution to this is to apply the appropriate dir tags to the English content. Using unicode-bidi: embed on them is also fine.

@mgeisler
Collaborator

Thanks @moaminsharifi for raising this. We are probably the first project to test out the brand new right-to-left functionality in mdbook. So your feedback is valuable to find and fix problems.

Note that the blocks of English text you highlighted appear because the translation is incomplete. Would the page look correct otherwise? Also, you should connect with the people in #671 to discuss the problems — a small group of volunteers showed up recently and started working hard on the translation.

Thanks @Manishearth for jumping in here! I have near-zero knowledge of this, so I need help from others here.

What I can explain is how the translation system works:

  • We parse the English Markdown files and extract translatable text from them.
  • We look this text up in the PO file and if we find a translation, we emit this text instead.

This probably gives us limited options to apply any special tags; in particular, it gives us limited ways to apply tags from within a translation (unless the "tag" can be encoded as a Unicode character in the replacement text?).
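
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual mdbook-i18n-helpers code: `translate_messages` is a hypothetical name, and a plain `HashMap` stands in for the PO file. It follows the gettext convention that an empty msgstr means the message is not translated yet, in which case the English text is emitted unchanged.

```rust
use std::collections::HashMap;

/// Look up each extracted message in a catalog and emit the translation
/// when one exists, otherwise fall back to the original (English) text.
/// ASSUMPTION: the HashMap is a stand-in for a parsed PO file.
fn translate_messages(messages: &[&str], catalog: &HashMap<&str, &str>) -> Vec<String> {
    messages
        .iter()
        .map(|&msgid| match catalog.get(msgid) {
            // A non-empty msgstr is a real translation.
            Some(&msgstr) if !msgstr.is_empty() => msgstr.to_string(),
            // Missing or empty msgstr: keep the source text as-is.
            _ => msgid.to_string(),
        })
        .collect()
}
```

With a catalog that only translates "foo", the messages "foo", "bar", "baz" come out as the translated "foo" followed by the untouched English "bar" and "baz", which is how LTR text ends up inside an RTL page.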

The extract_messages_list_with_paragraphs test and other extract_messages tests show the kinds of "messages" we extract from the Markdown input.

I hope this helps a bit with the overall flow, otherwise I'm happy to explain more.

@moaminsharifi
Collaborator Author

moaminsharifi commented Oct 23, 2023

Thanks for your answer, @mgeisler. Let's dive in.
Do we need this? Yes, because at some point it is unavoidable to mix English (LTR text) and Persian (RTL text).
The current method is to set this attribute on the html tag in order to make the page RTL; however, mixing several HTML tags causes problems, as in the first screenshot.

<html ... dir="{{ text_direction }}">

What I do to support BiDi is simply add dir="auto" to the opening of each tag, like this:
<p dir="auto">
As a result, there is no more conflict.

The issue also applies in reverse: putting RTL text inside LTR text raises the same issue.

@moaminsharifi
Collaborator Author

> > My fix for the BiDi issue with the HTML tags is to add the dir="auto" HTML attribute with JS just after rendering, and to add "unicode-bidi:embed;" to those classes.
>
> No, please don't dynamically change dir. The solution to this is to apply the appropriate dir tags to the English content. Using unicode-bidi: embed on them is also fine.

I think we can also create a renderer without changing mdBook, as specified in https://rust-lang.github.io/mdBook/format/configuration/renderers.html.

Then, if the language is RTL and a piece of text isn't translated, add dir="ltr" to it (or use another approach, such as the unicode-bidi CSS property).

But the whole point of my PR #1414 was to make it as simple as possible, with some JS and CSS in the client browser.

@Manishearth

> I think we can also create a renderer without changing mdBook

I mean, if you don't want to change mdBook, using straight up divs and spans should work just fine

@moaminsharifi
Collaborator Author

> > I think we can also create a renderer without changing mdBook
>
> I mean, if you don't want to change mdBook, using straight up divs and spans should work just fine

If I understand your comment correctly, conflicting content can also occur in h1-h6, p, details, ul, and li tags.
My method sits at the end of the pipeline, with rendering in the client browser; otherwise, at the very least, the entire processing system needs to be changed or mdBook needs to be updated.

@Manishearth

> If I understand your comment correctly, conflicting content can also occur in h1-h6, p, details, ul, and li tags.

Yes, you can stick dir attrs on spans or divs within them.

The core thing is that base directionality is a function of the author's intent. Using auto directionality on all these elements works somewhat, but it is fraught when you e.g. have a bullet point that starts with a word (probably a proper noun) in the Latin script but is supposed to be an RTL sentence overall. Take the sentence "Rust َچہَ١ ہے", which should be RTL but will be detected as LTR instead because it happens to start with "Rust".
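
To see why that sentence is misdetected: dir="auto" takes the base direction from the first strongly directional character in the text. Below is a deliberately simplified sketch of that rule. The function name and the hard-coded character ranges are assumptions made for illustration; the real rule is defined in terms of Unicode bidi classes, so this sketch will misjudge some characters (for example, combining marks inside the Arabic block).

```rust
/// Base direction of a run of text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    Ltr,
    Rtl,
}

/// Rough approximation of the "first strong character" rule behind dir="auto".
/// ASSUMPTION: a few hard-coded ranges stand in for the real Unicode bidi classes.
fn first_strong_direction(text: &str) -> Option<Direction> {
    text.chars().find_map(|c| match c {
        // Hebrew, Arabic and neighbouring right-to-left blocks, plus presentation forms.
        '\u{0590}'..='\u{08FF}' | '\u{FB1D}'..='\u{FDFF}' | '\u{FE70}'..='\u{FEFC}' => {
            Some(Direction::Rtl)
        }
        // Any other letter is treated as left-to-right in this sketch.
        c if c.is_alphabetic() => Some(Direction::Ltr),
        // Digits, spaces and punctuation are not strong; keep scanning.
        _ => None,
    })
}
```

Calling `first_strong_direction("Rust \u{06C1}\u{06D2}")` yields `Some(Direction::Ltr)` because the scan stops at the "R", regardless of the Arabic-script letters that follow. Only an explicit dir attribute from the author can express that the sentence as a whole is RTL.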

The author actually has to have the ability to signal this intent. Fortunately, they do: Markdown allows HTML inside it. It should ideally be rare to need a directionality shift.

The one thing that may not work is where you need to affect the block layout of an item.

> My method is at the end of the pipeline with rendering in the client browser

I mean, it's not: changing the dir attribute will retrigger the rendering pipeline. It's incorrect, and also has the general problems dir=auto has.

@mgeisler
Collaborator

> Yes, you can stick dir attrs on spans or divs within them.

Small note: we don't have spans or divs in the Markdown (of course). Any solution should hopefully preserve the Markdown files the way they look now: as fairly regular Markdown files without a lot of HTML.

The best place to inject anything would be during translation (I believe). We do know the source and target languages and we have full control over the Markdown AST at this point. So we could for example wrap untranslated text with <div dir="..."> ... </div> when we see that the source language is a left-to-right language and the target language is right-to-left.

We would have to be careful about this, though, since we are still working in the Markdown layer. We're talking about transforming

- foo
- bar
- baz

into

- Translated foo
<div dir="...">

- bar

</div>

- Translated baz

This is not a completely faithful translation, since the div will break the list.
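
A minimal sketch of that idea for a standalone block of text is shown below. All names here (`Direction`, `emit_block`) are hypothetical and not part of mdbook-i18n-helpers; the sketch only covers the simple case and does nothing about the list-breaking problem described above.

```rust
/// Base direction of a language.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    Ltr,
    Rtl,
}

impl Direction {
    /// Value to use in an HTML dir attribute.
    fn as_attr(self) -> &'static str {
        match self {
            Direction::Ltr => "ltr",
            Direction::Rtl => "rtl",
        }
    }
}

/// Emit the translation if there is one. Otherwise emit the original text,
/// wrapped in a div carrying the source direction when it differs from the
/// target direction. Blank lines keep the inner text parsed as Markdown.
fn emit_block(
    original: &str,
    translation: Option<&str>,
    source: Direction,
    target: Direction,
) -> String {
    match translation {
        Some(text) => text.to_string(),
        None if source != target => {
            format!("<div dir=\"{}\">\n\n{}\n\n</div>", source.as_attr(), original)
        }
        None => original.to_string(),
    }
}
```

For an English source and a Persian target, `emit_block("bar", None, Direction::Ltr, Direction::Rtl)` produces the untranslated text inside a `<div dir="ltr">`, while translated text passes through untouched.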

> Fortunately, they do: Markdown allows HTML inside it. It should ideally be rare to need a directionality shift.

This is probably where the disconnect is: the translation pipeline (https://github.com/google/mdbook-i18n-helpers) does not give translators a full Markdown file at a time. Instead it extracts text from the Markdown AST and replaces this with a translation (still in the AST).

The translator also doesn't know or "intend" to do a directionality shift: the shifts today are there because the translation is very incomplete.

@moaminsharifi, I tried asking this above: would the page look okay (or nearly okay) if all paragraphs and list items were translated? Could you perhaps try experimenting with a smaller page? For example, does 1.2. میان‌برهای صفحه کلی look correct?

If the fully-translated pages are usable, then the problem is much smaller: it will in some sense fix itself as the translation progresses. If the fully-translated pages are also broken, well then we have a bigger problem and I'll need the help of you both to fix it.

@moaminsharifi
Collaborator Author

@mgeisler, It seems okay, not great, but okay.

1.2. میان‌برهای صفحه کلی

At this point in the conversation, I suggest that we invest in translating the book first; once we have a fully translated version, we can then deal with injecting RTL text into the LTR version.

My point in opening this issue was to say that, as a web developer, I can see how important it is to make multi-language versions of websites (in this case, a book) and how the BiDi issue makes it hard for readers of other languages to follow. Now we know how to approach this from different perspectives and how to get around issues where they arise.

For now it's better to archive this issue and the PR. The next person who wants to translate into an RTL language can check them out, and as you said (@mgeisler), the problem is smaller than what I showed.

- Translated foo
<div dir="..."> 
 - bar 
 </div> 
 - Translated baz

Thanks to @mgeisler and @Manishearth for your contributions to the conversation.

@mgeisler
Collaborator

> @mgeisler, It seems okay, not great, but okay.

Thanks for confirming!

> For now it's better to archive this issue and the PR. The next person who wants to translate into an RTL language can check them out, and as you said (@mgeisler), the problem is smaller than what I showed.

If you like, we could add a note about the problem of mixing the two directionalities to the translation instructions. The note could point to this issue and then perhaps someone can help us improve the situation down the line.

I'm blind to the issues myself, so I appreciate you and @Manishearth looking at them. I can mostly help you by explaining the mechanism we use to do the translations — which we've mostly built ourselves so we have the option of modifying it to an extent.

I'll close this for now and then we can revisit later if new information appears.

@mgeisler closed this as not planned on Oct 25, 2023

@Manishearth

> Small note: we don't have spans or divs in the Markdown (of course).

Hm? Markdown supports embedding most HTML tags

> So we could for example wrap untranslated text with <div dir="..."> ... </div> when we see that the source language is a left-to-right language and the target language is right-to-left.

Yeah that would be nice. In some cases it is ideal to do it on the wrapping tag.

> If the fully-translated pages are usable, then the problem is much smaller: it will in some sense fix itself as the translation progresses. If the fully-translated pages are also broken, well then we have a bigger problem and I'll need the help of you both to fix it.

This is the assumption I have been operating off of: this ought to be a rare problem, and when it crops up we should use the existing HTML solutions for embedding languages.

> The translator also doesn't know or "intend" to do a directionality shift: the shifts today are there because the translation is very incomplete.

Well, they will in some cases; especially around code.

> This is probably where the disconnect is: the translation pipeline (https://github.com/google/mdbook-i18n-helpers) does not give translators a full Markdown file at a time. Instead it extracts text from the Markdown AST and replaces this with a translation (still in the AST).

No, I understand, but within a markdown chunk you can still use HTML tags, yes?

@Manishearth

FWIW, one thing that turned out to be useful in the Rust website translation was that I hooked up a custom function called ENGLISH that would set the lang attribute when found, specifically for English-language words (like "Rust") in a translated page. It is primarily necessary when the element is in all caps and has an "i" in it, since Turkish uppercases differently.
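
A helper in that spirit might look like the sketch below. The function name `english_span` and the choice of a `<span>` element are assumptions; the comment above does not show how ENGLISH was actually implemented.

```rust
/// Wrap an English word or phrase in a span tagged lang="en", so that
/// language-sensitive rendering (such as uppercasing) treats it as English.
/// ASSUMPTION: hypothetical helper, not the Rust website's ENGLISH function.
fn english_span(text: &str) -> String {
    // Escape the characters that would otherwise be read as HTML markup.
    let escaped = text
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;");
    format!("<span lang=\"en\">{}</span>", escaped)
}
```

For example, `english_span("Rust")` returns `<span lang="en">Rust</span>`, which a translator could paste into a translated string.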

@mgeisler
Collaborator

mgeisler commented Oct 26, 2023

> > Small note: we don't have spans or divs in the Markdown (of course).
>
> Hm? Markdown supports embedding most HTML tags

Sorry, I meant to say that we try to avoid HTML in our English source files today. I was afraid you wanted us to mark up things with spans in the Markdown, which would be doable, but costly in terms of readability and maintenance.

> > The translator also doesn't know or "intend" to do a directionality shift: the shifts today are there because the translation is very incomplete.
>
> Well, they will in some cases; especially around code.

I see, you're right then. I was thinking of the case where one list item is in English because the translation is still in progress.

> > This is probably where the disconnect is: the translation pipeline (https://github.com/google/mdbook-i18n-helpers) does not give translators a full Markdown file at a time. Instead it extracts text from the Markdown AST and replaces this with a translation (still in the AST).
>
> No, I understand, but within a markdown chunk you can still use HTML tags, yes?

Yes, that is correct! The translators can inject arbitrary HTML into the translation.

So maybe that is what is needed? Could translators translate the example on Code Samples like this:

#: src/cargo/code-samples.md:13
msgid ""
"```rust,editable\n"
"fn main() {\n"
"    println!(\"Edit me!\");\n"
"}\n"
"```"
msgstr ""
"<div dir=\"...\">\n"
"\n"
"```rust,editable\n"
"fn main() {\n"
"    println!(\"Edit me!\");\n"
"}\n"
"```\n"
"\n"
"</div>"

That is, wrap the Markdown code block in a div element with dir set to something special?

If there is a common correct way to do this, then we could talk about detecting such code blocks in mdbook-gettext and do this automatically for certain languages.
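
As a rough sketch of what such automatic detection could look like: the functions below are hypothetical and not part of mdbook-gettext, the fence check is a simple string test rather than a real Markdown parse, and the dir value is left to the caller because the right value is still an open question here.

```rust
/// Heuristic check: does this message consist of a single fenced code block?
/// ASSUMPTION: a plain string test, not a real Markdown parser.
fn is_fenced_code_block(message: &str) -> bool {
    let trimmed = message.trim();
    trimmed.starts_with("```") && trimmed.ends_with("```") && trimmed.lines().count() >= 2
}

/// Wrap a fenced code block in a div with the given dir value when the
/// target language is right-to-left; leave every other message untouched.
fn wrap_code_block(message: &str, dir: &str, target_is_rtl: bool) -> String {
    if target_is_rtl && is_fenced_code_block(message) {
        // Blank lines around the fence keep it parsed as Markdown inside the div.
        format!("<div dir=\"{}\">\n\n{}\n\n</div>", dir, message.trim())
    } else {
        message.to_string()
    }
}
```

With this shape, translators would not need to write the div by hand in each msgstr; the wrapping would be applied once, at the point where messages are emitted.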

@moaminsharifi
Collaborator Author

I checked it out. It seems that because ``` is not converted purely to HTML, and we have some JavaScript that renders it, this doesn't work at all:

[Screenshot: اجرای cargo روی ماشین local (Running cargo on the local machine) - Comprehensive Rust 🦀]

@Manishearth

Well for me code blocks are already forced-LTR, which is desired behavior anyway. I was thinking about other block level elements.

> I checked it out. It seems that because ``` is not converted purely to HTML, and we have some JavaScript that renders it, this doesn't work at all:

No, it's not because of the JS, it's because the code block CSS has a direction: ltr !important , which is probably the right call for Rust code anyway.

But yeah, the code block HTML is tricky and you'd need to make the Ace editor support RTL as well, which would take a while; it has a lot of tricky layout bits.

@moaminsharifi
Collaborator Author

[Screenshot]
It comes from:
```rust
fn main() {
    println!("Edit me!");
}
```
at src/cargo/running-locally.md:43.
How is the style set? It uses the user's default text alignment:
[Screenshot: google github io_comprehensive-rust_cargo_running-locally]
On the other hand, when we have:
```rust,editable
fn main() {
    println!("Edit me!");
}
```
at src/cargo/code-samples.md:13, it works well: because of the word editable in the Markdown, the block is converted from <pre> to <code class="editable ace_editor ...

[Screenshot: google github io_comprehensive-rust_cargo_code-samples]

cc: @mgeisler

mgeisler pushed a commit that referenced this issue Oct 31, 2023
Part of #671 
and #1413

The code parts of the content are always in English and must be
`text-align: left`, but this conflicts with `<html ... dir=rtl >`.

---------

Co-authored-by: Kaveh <hamidrkp@riseup.net>