Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support RTL languages in the canvas renderer #1899

Closed
wants to merge 5 commits into from

Conversation

Tyriar
Copy link
Member

@Tyriar Tyriar commented Jan 17, 2019

Part of #701

Before:

screen shot 2019-01-17 at 9 15 13 am

After:

screen shot 2019-01-17 at 9 15 21 am

This does slow things down (~50% slower) but I think @jerch is working on improving the character joiner code to be faster, if the registerCharacterJoiner API works with codes instead of a big string it shouldn't make much difference.

DOM/WebGL renderer support will work when they support the character joiner API.

@Tyriar Tyriar self-assigned this Jan 17, 2019
@Tyriar Tyriar requested a review from jerch January 17, 2019 17:18
@Tyriar Tyriar mentioned this pull request Jan 17, 2019
@babakks
Copy link

babakks commented Jan 17, 2019

It's doing good in individual words (e.g., "اینجا" instead of "ا‌ج‌ن‌ی‌ا"), but still puts the words in reverse order. The correct sentence is: "اینجا را بخوانید" (not "بخوانید را اینجا").

(I'm a Persian native.)

@Tyriar
Copy link
Member Author

Tyriar commented Jan 17, 2019

@babakks I included space as well, I think it looks right now.

@Tyriar
Copy link
Member Author

Tyriar commented Jan 17, 2019

I guess it will all fall apart when a line wraps though

@babakks
Copy link

babakks commented Jan 17, 2019

@babakks I included space as well, I think it looks right now.

I meant the order of the words:
image

@Tyriar
Copy link
Member Author

Tyriar commented Jan 17, 2019

@babakks did you check the latest?

screen shot 2019-01-17 at 2 01 09 pm

@babakks
Copy link

babakks commented Jan 18, 2019

@babakks did you check the latest?

screen shot 2019-01-17 at 2 01 09 pm

That's perfect. Good job!

@jerch
Copy link
Member

jerch commented Jan 20, 2019

@Tyriar
Wow nice job! Sadly I cannot test it myself. I think we should get native peeps involved that want to help testing it thoroughly, before going further down the rabbit hole. Bidi is complicated and full of edgecases, that might need to be addressed in a certain way to work with terminals. @babakks interested?

I put the character joiner on my TODO list, it certainly would benefit from some overhaul after we are done with buffer transistions.

@Tyriar
Copy link
Member Author

Tyriar commented Jan 21, 2019

I think this gets pretty close to macOS' default terminal where I'm pretty sure it all falls apart on wrapped lines

Terminal.app (اینجا را بخوانید wrapped):

screen shot 2019-01-21 at 10 48 07 am

Maybe this will be fine after optimizations to have very basic support (on non-wrapped lines).

@Tyriar Tyriar added the reference A closed issue/pr that is expected to be useful later as a reference label Jan 26, 2019
@Tyriar
Copy link
Member Author

Tyriar commented Jan 26, 2019

Closing this off, would probably be just as bad having it work sometimes and really mess things up other times. If anything it had some lessons learned 😃

@Tyriar Tyriar closed this Jan 26, 2019
@HKalbasi
Copy link

HKalbasi commented Feb 6, 2022

@Tyriar this looks like a definite improvement, what it exactly messed up? If it doesn't make problem for ltr, it would be better than current situation.

@Tyriar
Copy link
Member Author

Tyriar commented Feb 7, 2022

@HKalbasi words ended up in reverse order and there were likely other problems with wrapped words. I found I didn't understand the problem fully so I wasn't comfortable making a half fix.

@HKalbasi
Copy link

HKalbasi commented Feb 8, 2022

I have an approximate knowledge of the bidi unicode algorithm, since I'm a native RTL speaker and this problem is so common. Let me try to explain it:

There are four kind of unicode characters in bidi:

  • Strong LTR: alphabet of LTR lanugages, e.g. a, b, x, ....
  • Strong RTL: alphabet of RTL languages, like ا ب ج ...
  • Weak LTR: numbers like ۱ 1 ۲ 2 ...
  • Neutral: white spaces, dot, comma, math signs and ...

The original categories are a bit more verbose.

For rendering a text:

  • Divide the sentence into blocks that consist of consecutive characters with the same type.
  • Merge two blocks that there is only neutral and weak LTR blocks between them.
  • Put blocks in the direction of first strong character of the text. Most terminals omit this step and always put blocks in LTR.
  • Render each block per its type. Render weak LTR blocks inside RTL blocks as LTR.

An example:

some ltr SOME RTL 123 MORE, RTL some ltr
LLLLNLLLNRRRRNRRRNWWWNRRRRNNRRRNLLLLNLLL
LLLLLLLLNRRRRRRRRRRRRRRRRRRRRRRNLLLLLLLL
some ltr LTR ,EROM 123 LTR EMOS some ltr

From what I'm seeing here, it will become correct per the unicode spec if we join all RTLs and neutrals and weak LTRs.

About line wrapping, I don't see the problem, and how RTL differs from LTR in it. The screen shot above looks good to me.

@Tyriar
Copy link
Member Author

Tyriar commented Feb 8, 2022

@HKalbasi thanks, the spec for the algorithm is quite long though, so I imagine reading through that and working through the subtleties will take quite a while. There is also this terminal-specific one which recommends a different approach and is also very long https://terminal-wg.pages.freedesktop.org/bidi/. After reading all that and understanding the problem fully I'm guessing it's probably possible via character joiners but I'm not certain which is what concerns me. I feel like it'll take a pretty significant amount of effort to do this right, the ambiguity also bugs me as I cannot truly test this as I don't know any of these languages.

@HKalbasi
Copy link

HKalbasi commented Feb 8, 2022

I think anything here is better than nothing, reversed words are better than reversed letters, screenshots here are even better, and if we join RTLs with weaks and neutrals, it would make RTL fine, even if it doesn't work right for all cases.

Also I would like to know what is the limit/cost of joining? Is it possible to join everything and let the browser solve bidi?

@Tyriar
Copy link
Member Author

Tyriar commented Feb 8, 2022

Joining the whole line would cause some pretty bad performance issues. All joined characters get their own texture, meaning essentially all repeat characters/words would be cache misses and need to get re-rendered, that would thrash the backing texture and cause it to get cleared frequently.

It seems like some combination of joined characters (for rtl words) and then the ability to reorder the words while keeping the word textures separate would be needed for it to function properly, right now the character joiner system isn't capable of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reference A closed issue/pr that is expected to be useful later as a reference
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants