U+0670 superscript alef should be written with horizontal spacing when input after fathah #217

Closed
adamiturabi opened this issue Apr 17, 2021 · 14 comments

@adamiturabi

When U+0670 superscript alef is input after a fathah, I believe it should be written with horizontal spacing after the fathah. This image should illustrate what I mean:
image
However, even in the "better image" the vertical positioning is not correct: it is too high in ذلك and too low in هذا. (I faked it with U+202F in the former and tatweel in the latter.)

Thank you for your continued support to this great typeface.

@khaledhosny
Member

In the first case, the small alef should be placed on tatweel هَـٰذا, in the second case it should be placed on a non-breaking space ذَ ٰلك (U+00A0, though copying from here might change it into regular space):

I can support NNBSP (U+202F) as an alternate base as well.
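
For reference, the two inputs recommended above can be spelled out as explicit code point sequences. This is only a minimal Python sketch using the standard `unicodedata` module; the variable names and the `describe` helper are illustrative and not part of Amiri or any library.

```python
import unicodedata

FATHA = "\u064E"        # ARABIC FATHA
TATWEEL = "\u0640"      # ARABIC TATWEEL (kashida)
NBSP = "\u00A0"         # NO-BREAK SPACE
DAGGER_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF

# heh + fatha + tatweel + dagger alef + thal + alef
hadha = "\u0647" + FATHA + TATWEEL + DAGGER_ALEF + "\u0630\u0627"
# thal + fatha + NBSP + dagger alef + lam + kaf
dhalika = "\u0630" + FATHA + NBSP + DAGGER_ALEF + "\u0644\u0643"

def describe(text):
    """Return one 'U+XXXX NAME' string per code point, in logical order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for word in (hadha, dhalika):
    print("\n".join(describe(word)), end="\n\n")
```

Writing the seat characters as escapes avoids the copy-and-paste problem mentioned above, where a no-break space may silently turn into a regular space.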

The issue arises from the fact that small alef (and other small letters) follow different rules in Quranic orthography: sometimes they are used as combining marks, and other times as standalone letters that don’t affect letter joining. Unicode recommends using a base character in the second case instead of encoding an alternate set of small letters with different properties.

@adamiturabi
Author

Thanks for the quick reply. The main use that I see in the Quran is that superscript alef is used

  1. On vowel letters و and ى. For example عَلَىٰ ، صَلَوٰة
  2. After a fathah, in which case it is floating like in هذا and ذلك

It seems to me that both usages will be supported if the sequence U+064E->U+0670 offsets the superscript alef horizontally without affecting letter joining. Otherwise if U+0670 is input without U+064E then it may be placed vertically on top of the letter.

Calibri Arabic handles it this way as far as I can see.

Unless there is other usage besides the above two which won't be handled?
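
The contextual rule described in this comment can be sketched as a small classifier. This is a hypothetical illustration of the proposed behaviour only, not how Amiri or Calibri Arabic is implemented; the function name and the "floating"/"stacked" labels are made up for the example.

```python
FATHA = "\u064E"        # ARABIC FATHA
DAGGER_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF

def classify_dagger_alefs(text):
    """Return (index, kind) for every U+0670 in text.

    kind is "floating" when the dagger alef directly follows a fatha,
    and "stacked" otherwise (for example directly on waw or alef maksura).
    """
    result = []
    for i, ch in enumerate(text):
        if ch == DAGGER_ALEF:
            after_fatha = i > 0 and text[i - 1] == FATHA
            result.append((i, "floating" if after_fatha else "stacked"))
    return result

# ain, fatha, lam, fatha, alef maksura, dagger alef
ala = "\u0639\u064E\u0644\u064E\u0649\u0670"
# heh, fatha, dagger alef, thal, alef
hadha = "\u0647\u064E\u0670\u0630\u0627"

print(classify_dagger_alefs(ala))    # [(5, 'stacked')]
print(classify_dagger_alefs(hadha))  # [(2, 'floating')]
```

Under this rule the outcome depends entirely on whether a fatha was typed, so an unvocalized word has no way to request the floating form.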

Thanks again.

@khaledhosny
Member

The problem is that depending on the presence of a fatha is a hack and goes against Unicode making the small alef a combining mark. Using a character as a seat is more reliable (you can have هـٰذا without requiring a fatha) and is actually what Unicode recommends (I can’t find the text right now, so you will have to take my word on this). I used to do the fatha hack but was later convinced that it is better not to. Using kashida/NBSP/NNBSP is likely to work in more fonts (even if with less optimal rendering) than the fatha hack.
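
The combining-mark status mentioned here can be inspected with Python's standard `unicodedata` module. The snippet below is a sketch for checking character properties only; it says nothing about how a particular font positions the glyphs.

```python
import unicodedata

HEH = "\u0647"
FATHA = "\u064E"
DAGGER_ALEF = "\u0670"
TATWEEL = "\u0640"
NBSP = "\u00A0"

# General category and canonical combining class of each character.
for ch in (FATHA, DAGGER_ALEF, TATWEEL, NBSP):
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),
          unicodedata.combining(ch),
          unicodedata.name(ch))

# Fatha and dagger alef are both nonspacing marks on the same base, so
# canonical ordering sorts them by combining class regardless of typing order.
typed = HEH + DAGGER_ALEF + FATHA
reordered = unicodedata.normalize("NFD", typed)
print(reordered == HEH + FATHA + DAGGER_ALEF)
```

The fatha and the dagger alef report category `Mn` (nonspacing mark), while tatweel and the no-break space are not marks and can therefore act as a seat.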

@adamiturabi
Author

adamiturabi commented Apr 18, 2021

Thanks again for the detailed explanation. I happily take your word on it. However, do you think it will be harmless to add the fatha+dagger alef hack (in addition to explicit input over tatweel/nbsp)? After all, it will separate character input from glyph typesetting, which, as I understand, is an underlying tenet of Unicode philosophy.

It will also make the behavior similar to how Amiri deals with inline hamza in words like خطيءة. Amiri correctly joins the ي with the ة, unlike most other fonts, which require superscript hamza over a tatweel.

@moyogo

moyogo commented Apr 18, 2021

I think the use of tatweel and no-break space was proposed in L2/09-358 and there is a UTC action item 139-A60 for a formal proposal.

The Unicode 13.0.0 chapter 9 doesn't mention this use of tatweel or no-break space but this is similar to the use of hamza above on tatweel.

@khaledhosny
Member

it will separate character input from glyph typesetting, which, as I understand, is an underlying tenet of Unicode philosophy.

I already had this before (even before Calibri Arabic was designed) but I removed it for the reasons above, and I’d rather people followed a standard way to encode this sequence (with reasonable fallback for fonts that don’t handle it nicely) rather than depend on font-specific hacks. I’d have preferred a more semantic way to encode this, but Unicode seems to be reluctant (the cleanest way would be a separate character, and I encourage you to work on a proposal to Unicode if you feel strongly enough about this issue).

It will also make the behavior similar to how Amiri deals with inline hamza in words like خطيءة.

This is also another non-standard feature of Amiri that I wish to drop at some point for the exact same reasons.

I think the use of tatweel and no-break space was proposed in L2/09-358 and there is a UTC action item 139-A60 for a formal proposal.

Thanks @moyogo for the links.

@adamiturabi
Author

Thank you @khaledhosny and @moyogo .

I’d have preferred a more semantic way to encode this, but Unicode seems to be reluctant (the cleanest way would be a separate character, and I encourage you to work on a proposal to Unicode if you feel strongly enough about this issue).

There are a couple of Unicode documents by Thomas Milo discussing this issue:
https://unicode.org/L2/L2014/14109-inline-chars.pdf
https://unicode.org/L2/L2013/13226-koran-ortho.pdf

The Unicode 13.0.0 chapter 9 doesn't mention this use of tatweel or no-break space but this is similar to the use of hamza above on tatweel.

It will also make the behavior similar to how Amiri deals with inline hamza in words like خطيءة.
This is also another non-standard feature of Amiri that I wish to drop at some point for the exact same reasons.

There are rare cases in non-Quranic script where superscript hamzah over a tatweel character will not suffice. For example, لَءَّال la22aal (pearl-seller) will break the mandatory lam-alef ligature if written with tatweel: لَـَّٔال.
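
To illustrate why the tatweel seat gets in the way here, the following sketch drops all nonspacing marks and checks whether lam and alef remain adjacent in the underlying text. It is a logical-order check only, written for this example; real ligature formation happens in the shaping engine and the font.

```python
import unicodedata

LAM, ALEF = "\u0644", "\u0627"

def base_skeleton(text):
    """Drop nonspacing marks (category Mn), keeping only base characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def lam_alef_adjacent(text):
    """True if a lam is directly followed by an alef once marks are skipped."""
    return LAM + ALEF in base_skeleton(text)

# lam, fatha, tatweel, hamza above, fatha, shadda, alef, lam
with_tatweel = "\u0644\u064E\u0640\u0654\u064E\u0651\u0627\u0644"
# lam, fatha, alef: an ordinary vocalized lam-alef
plain_lam_alef = "\u0644\u064E\u0627"

print(lam_alef_adjacent(with_tatweel))    # False: tatweel sits between lam and alef
print(lam_alef_adjacent(plain_lam_alef))  # True
```

Because tatweel is a base character rather than a mark, it separates the lam from the alef in the character stream.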

It seems a complicated situation that you are definitely more qualified to address. As a user, however, semantic encoding is quite nice to have.

Thanks for discussing.

@adamiturabi
Author

adamiturabi commented Apr 28, 2021

@khaledhosny @moyogo I hope it’s ok if I re-open this discussion a bit. I appreciate your point about not wanting to have a font-specific hack.

Doing some research, I found this description of U+034F ͏COMBINING GRAPHEME JOINER (CGJ): https://en.wikipedia.org/wiki/Combining_Grapheme_Joiner

The discussion on the rendering of Hebrew diacritics seems quite relevant.

Could we use CGJ in the case of dagger alef and hamza? Here is how it could potentially be used:

Dagger alef:

Input sequence Rendering
heh + dagger image
heh + fatha + dagger image
heh + fatha + CGJ + dagger image
heh + CGJ + dagger image
heh + dagger + thal image
heh + CGJ + dagger + thal image
heh + fatha + CGJ + thal image
thal + dagger image
thal + CGJ + dagger image
thal + fatha + CGJ + dagger image
waw + dagger image
waw + CGJ + dagger image

This way one common method can be used for both joining characters and non-joining characters (dal, thal, waw, etc.), instead of using tatweel for joining characters and NBSP for non-joining characters. Also, we are not relying on the presence of fatha to determine whether to horizontally offset the dagger. (I now appreciate your point about wanting to have هـٰذا displayed without a fatha on the heh.)
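
As a rough sketch of what the scheme would mean for existing text, the function below rewrites the seat-based encoding (tatweel, NBSP or NNBSP followed by dagger alef) into the CGJ-based sequence suggested in this comment. The CGJ usage is only the idea floated in this thread, not a standardized encoding, and `seats_to_cgj` is a hypothetical helper name.

```python
import re

CGJ = "\u034F"          # COMBINING GRAPHEME JOINER
DAGGER_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF
SEATS = "\u0640\u00A0\u202F"  # tatweel, NBSP, NNBSP

# A seat character immediately followed by a dagger alef.
_SEAT_THEN_DAGGER = re.compile(f"[{SEATS}]{DAGGER_ALEF}")

def seats_to_cgj(text):
    """Replace '<seat> + dagger alef' with 'CGJ + dagger alef'."""
    return _SEAT_THEN_DAGGER.sub(CGJ + DAGGER_ALEF, text)

# heh, fatha, tatweel, dagger alef, thal, alef
hadha = "\u0647\u064E\u0640\u0670\u0630\u0627"
# thal, fatha, NBSP, dagger alef, lam, kaf
dhalika = "\u0630\u064E\u00A0\u0670\u0644\u0643"

for word in (hadha, dhalika):
    print([f"U+{ord(ch):04X}" for ch in seats_to_cgj(word)])
```

Both words end up with the same "fatha + CGJ + dagger" sequence, which is the uniformity argued for above; a dagger alef typed directly on its base letter is left untouched.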

Floating hamza

The implementation for hamza is a bit muddier, since users now expect uni0621 (standalone hamza) to break the joining of characters.

But one possible method could be to use CGJ with uni0654 “hamza above”. If CGJ comes before uni0654 then it will appear above the baseline without affecting the joining of the previous character to the next character.

Input sequence Rendering
Meem + fatha + lam + fatha + CGJ + hamza above + fathatan + alef image
lam + fatha + CGJ + hamza above + shaddah + fatha + alef image
sheen + yeh + CGJ + hamza above + alef image
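
One property worth checking for the "yeh + CGJ + hamza above" row is how it behaves under normalization. The sketch below uses Python's standard `unicodedata` module and only demonstrates Unicode normalization behaviour; it does not show anything about font rendering.

```python
import unicodedata

CGJ = "\u034F"             # COMBINING GRAPHEME JOINER
YEH = "\u064A"             # ARABIC LETTER YEH
HAMZA_ABOVE = "\u0654"     # ARABIC HAMZA ABOVE
YEH_WITH_HAMZA = "\u0626"  # ARABIC LETTER YEH WITH HAMZA ABOVE

plain = YEH + HAMZA_ABOVE
with_cgj = YEH + CGJ + HAMZA_ABOVE

# Without CGJ, NFC composes the pair into the precomposed letter U+0626.
print(unicodedata.normalize("NFC", plain) == YEH_WITH_HAMZA)

# With CGJ in between, the sequence survives NFC unchanged.
print(unicodedata.normalize("NFC", with_cgj) == with_cgj)

# CGJ itself is a nonspacing mark with combining class 0.
print(unicodedata.category(CGJ), unicodedata.combining(CGJ))
```

So the CGJ keeps the floating-hamza sequence distinct from yeh with hamza above even after text has been normalized.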

If you think this idea has merit, I can try creating a formal proposal. Please let me know what you think, as I'm only a user and haven't studied Unicode development in detail.

Thank you.

@khaledhosny
Member

Using CGJ is not a bad idea. I don't personally care much which method is used; all I care about is a standardized way that can represent the text reliably. Any solution can be made to produce the same output by the font.

@adamiturabi
Author

adamiturabi commented May 3, 2021

I've written a draft proposal here: https://github.com/adamiturabi/arabic-inline-unicode/blob/main/index.pdf

I'd appreciate it if you could take a look. Also, if you could mention it to others who might be interested in this implementation and who might be able to give it some traction.

Thanks.

@khaledhosny
Member

Looks good. A few comments:

  • It is not that fonts consider hamza to be breaking (non-joining) character, it is Unicode that specifies this and OpenType shaping engines enforce it (as they rely on Unicode for joining behaviour). Fonts that want to change this behaviour will have to jump through many hoops to achieve it.
  • The use of ـئـ for medial seat-less hamza is specified by the Arabic Academy in Cairo (مجمع اللغة العربية) as part of its effort to “simplify” hamza rules; it is not just people being lazy.
  • For dagger alef, comparison can be made between it and small waw and small yeh, where both have combining and non-combining variants atomically encoded.
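
The first point can be pictured with a toy model of the joining logic. The table below is a small hand-copied excerpt of Unicode Joining_Type values (D = dual-joining, R = right-joining, U = non-joining, C = join-causing), and `joins_forward` is a simplified, hypothetical stand-in for what a shaping engine does, not actual HarfBuzz or OpenType code.

```python
# Excerpt of Joining_Type values for the letters used in this example.
JOINING_TYPE = {
    "\u0621": "U",  # hamza
    "\u0627": "R",  # alef
    "\u0629": "R",  # teh marbuta
    "\u062E": "D",  # khah
    "\u0637": "D",  # tah
    "\u0640": "C",  # tatweel
    "\u064A": "D",  # yeh
}
# Marks are transparent to joining: fatha, shadda, hamza above, dagger alef.
TRANSPARENT = {"\u064E", "\u0651", "\u0654", "\u0670"}

def joins_forward(text, i):
    """True if the letter at index i connects to the next non-transparent letter."""
    if JOINING_TYPE.get(text[i], "U") not in ("D", "C"):
        return False
    for ch in text[i + 1:]:
        if ch in TRANSPARENT:
            continue
        return JOINING_TYPE.get(ch, "U") in ("D", "R", "C")
    return False

# khah, tah, yeh, hamza, teh marbuta
with_hamza = "\u062E\u0637\u064A\u0621\u0629"
# the same word without the standalone hamza
without_hamza = "\u062E\u0637\u064A\u0629"

print(joins_forward(with_hamza, 2))     # False: hamza (U) breaks the join
print(joins_forward(without_hamza, 2))  # True: yeh joins to teh marbuta
```

In this model the break after yeh comes from the character property of U+0621 alone, before any font lookup is involved.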

@adamiturabi
Author

adamiturabi commented May 14, 2021

Thank you. I've incorporated your feedback. You can see the diffs here: adamiturabi/arabic-inline-unicode@7a5179d

The updated PDF is in the same location: https://github.com/adamiturabi/arabic-inline-unicode/blob/main/index.pdf

Regarding your last point, I wasn't sure exactly what you meant by making a comparison, because we are not proposing a separate encoding for a breaking dagger alef. But under the CGJ scheme, the breaking "small waw" and "small yeh" technically won't be needed any more, so I've mentioned that.

Also, attempting to tag @roozbehp here.

@roozbehp

What is the question for me?

@adamiturabi
Author

adamiturabi commented May 15, 2021

What is the question for me?

@roozbehp Thanks for responding. I see that you have some prior work regarding proposing the handling of Arabic inline characters:

I think the use of tatweel and no-break space was proposed in L2/09-358 and there is a UTC action item 139-A60 for a formal proposal.

I've written a document on this issue and a proposed solution, matching one in L2/09-358R, here: https://github.com/adamiturabi/arabic-inline-unicode/blob/main/index.pdf

It would be great if you could provide feedback and recommend how to proceed with proposing a solution to Unicode.
