[DESIGN] Update bidi design document to show proposed design #871

Merged: 2 commits, Sep 2, 2024
Changes from 1 commit

18 changes: 15 additions & 3 deletions exploration/bidi-usability.md
@@ -279,6 +279,8 @@ portions of the syntax in order to make messages appear correctly.

The second part of the hybrid approach would be to recommend ("SHOULD") the "strict isolation"
design for serializers.
+(Note that "strict" and "super-loose" use non-identical productions with the name `bidi`.
+These serve different purposes and are consistent with strict being narrower than super-loose.)
This syntax is a subset of the super-loose syntax and can be applied selectively to messages that
have RTL sequences or which have problematic display.

@@ -431,7 +433,17 @@ Add isolates and strongly directional marks to required and optional whitespace
This would permit users to get the effects described by the above design,
as long as they use isolates/marks in a "responsible" way.

-(Omitting other changes found in #673)
+The exception to this is the namespace separator, used in `identifier`.
+This requires the ability to insert isolates or strongly directional marks
+between the namespace and name portions, where whitespace is not permitted.
+This is the only location in the syntax where such characters might be needed
+but whitespace is not at least optional.
+This could be defined as:
```abnf
ns-separator = [bidi] ":" [bidi]
Collaborator

Why allow for the bidi after : if we're not allowing it to show up elsewhere after a starting sigil?

Member Author

Your example $_مصر‎⁩ can be made to render correctly using LRI/PDI around the identifier (which the syntax allows and the strict form encourages). The [bidi] production allows starting or ending strongly directional marks to prevent spillover in the middle of a namespaced identifier, e.g. $_م1صر:‎_م2صر‎⁩

I'll admit that it's generally better practice to put the strongly directional character at the end (before the : separator), but I didn't want to make the syntax ultra-fussy: whatever stew of strongly directional characters and a colon are not part of either the namespace or the name.

The separator is different from sigils because the sigils are all at token-start, whereas the namespace separator is embedded into the word-token.

code points for the first example: \u2066$_\u0645\u0635\u0631\u200e\u2069
code points for the second example: \u2066$_\u06451\u0635\u0631:\u200e_\u06452\u0635\u0631\u200e\u2069

Collaborator

Your example $_مصر‎⁩ can be made to render correctly using LRI/PDI around the identifier (which the syntax allows and the strict form encourages). [...]

code points for the first example: \u2066$_\u0645\u0635\u0631\u200e\u2069

That's rendering for me with the _ next to the $, which is wrong? The _ here has neutral directionality, and it's the first character of a name which as a whole is RTL. So we ought to have some way to render it as the right-most character, but that's not possible without some bidi control after the $.

Member Author

Fair enough, although a display like $⁧_مصر⁩⁩ is in itself super-weird (and it's namespaced friend $⁧_م1صر⁩:⁧_م2صر⁩⁩ is even weirder) from the point of view that it's an RTL isolated run inside an LTR isolated run (or, in the latter case, two RTL runs inside an LTR run). This makes the name tokens display correctly RTL while still forcing the overall token (including the sigil) to display LTR. That is, this sequence:

\u2066$\u2067_\u06451\u0635\u0631\u2069:\u2067_\u06452\u0635\u0631\u2069\u2069

... with 6 of the 18 code points being bidi isolates.

There's a similar problem with options, where we're trying to force them to be LTR-ordered (option = value/م1صر‎=م2صر⁩) when RTL really really wants that to be displayed as ⁦م1صر=م2صر⁩.

So you're right: my proposal does compromise some aspects of RTL display as part of our insistence that messages are intended to work in an LTR editing environment with LTR syntax. Either way, logical order is all that matters and users should be careful about using mixed direction for non-literal values. We give them the tools to make their bidi literals and patterns work RTL-normally as well as the tools to let non-RTL speakers read placeholders LTR (like most developers debugging messages). It is not perfect. Do we need that last bit of cruft to allow the sigil and identifier to be separate? (Not asking rhetorically, btw. What do people think? Where can we get developer/translator feedback?)

Note: It took a minute of fiddling to get the namespaced example to not get entangled with the markdown in this comment--all in service of getting the underscore on the right side.

Member

I think the proposed solution solves the 99% case, and trying to solve the remaining 1% will be tricky. Moreover, the people who really have to deal with messages are the translators, and they will need tooling to effectively do their jobs at scale — and that tooling has far more freedom to deal with bidi issues than we have in plain text. So I don't think we need to go any further down this path.

Member Author

> Tbf, _ has neutral direction and it is relatively common to use it as a prefix for identifiers.

Yes, I know. So do all of the sigils (with some of them being enclosing punctuation).

I guess my argument would be that the sigil is, from a user perspective, "part of the name", e.g. $foo, not $ with foo being separate. In an RTL display context, the sigil, being a neutral, will naturally reverse sides to be at the start, (e.g. ⁧$_مصر⁩). For us to maintain an LTR bias while making only the name token isolated is a lot more work--for tools and users.

The counterargument would be that identifier and name are their own things (and many are generated from the user's environment). The various sigils and such in MF2 are thus "not part of the name" and should be treated separately. Allowing users to insert these characters is not the same as obligating them to use them and I could see us making the affordance.

In the screenshot below, the display is set to RTL. The placeholders have an LRI/PDI inside the {/}.

  • The first line, lacking any other bidi controls, is illegible.
  • The second line uses RLI/PDI around each token. This presents sigils on the right, which an RTL speaker would understand, but the tokens are in LTR order. I don't specifically object to this presentation.
  • The third line uses LRI/PDI around each token and approximates what I'm suggesting above.
  • The fourth line uses the markup you're suggesting, which is RLI/PDI around each name and LRI/PDI around the sigil+identifier.

image

Here's the data used to produce the screenshot. Note that I do use a U+200E next to the : separator.

Example: {\u2066$_\u06351\u0636\u0637 :_\u06352\u0636\u0637:_\u06353\u0636\u0637 _\u06354\u0636\u0637=_\u06355\u0636\u0637\u2069}<br>
Example RLI: {\u2066\u2067$_\u06351\u0636\u0637\u2069 \u2067:_\u06352\u0636\u0637:_\u06353\u0636\u0637\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}<br>
Example LRI: {\u2066\u2066$_\u06351\u0636\u0637\u2069 \u2066:_\u06352\u0636\u0637\u200e:_\u06353\u0636\u0637\u2069 \u2066_\u06354\u0636\u0637\u2069=\u2066_\u06355\u0636\u0637\u2069\u2069}<br>
Example EAO: {\u2066\u2066$\u2067_\u06351\u0636\u0637\u2069\u2069 \u2066:\u2067_\u06352\u0636\u0637\u2069\u200e:\u2067_\u06353\u0636\u0637\u2069\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}<br>

Collaborator

> The various sigils and such in MF2 are thus "not part of the name" and should be treated separately.

That would be my position. In particular for variables, AFAIK all current implementations would require parameters like { foo: 42 } in order to resolve a $foo, and not e.g. { $foo: 42 }.

In case it matters, here's a slightly shorter way to get the same results as in the last example (The LR isolation around the sigil+identifier isn't required as the placeholder already isolates its contents, and the LRM next to the : has no effect):

Example EAO: {\u2066$\u2067_\u06351\u0636\u0637\u2069 :\u2067_\u06352\u0636\u0637\u2069:\u2067_\u06353\u0636\u0637\u2069 \u2067_\u06354\u0636\u0637\u2069=\u2067_\u06355\u0636\u0637\u2069\u2069}

Member Author

> In case it matters, here's a slightly shorter way to get the same results as in the last example (The LR isolation around the sigil+identifier isn't required as the placeholder already isolates its contents, and the LRM next to the : has no effect):

Your formulation is correct. Note that the LRM is necessary if we don't permit isolates inside the identifier production (which is not what you are proposing), because the : is a neutral and wants to extend whatever run is to either side of it.

Collaborator

I think something got damaged somewhere in the several copy/paste operations.

To make things readable I replaced:

\u0635 A
\u0636 B
\u0637 C

\u2066 [LRI]
\u2069 [PDI]
\u200E [LRM]
\u2067 [RLI]

And the resulting strings above are:

Example: {[LRI]$_A1BC :_A2BC:_A3BC _A4BC=_A5BC[PDI]}<br>
Example RLI: {[LRI][RLI]$_A1BC[PDI] [RLI]:_A2BC:_A3BC[PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}<br>
Example LRI: {[LRI][LRI]$_A1BC[PDI] [LRI]:_A2BC[LRM]:_A3BC[PDI] [LRI]_A4BC[PDI]=[LRI]_A5BC[PDI][PDI]}<br>
Example EAO: {[LRI][LRI]$[RLI]_A1BC[PDI][PDI] [LRI]:[RLI]_A2BC[PDI][LRM]:[RLI]_A3BC[PDI][PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}<br>

Example EAO: {[LRI]$[RLI]_A1BC[PDI] :[RLI]_A2BC[PDI]:[RLI]_A3BC[PDI] [RLI]_A4BC[PDI]=[RLI]_A5BC[PDI][PDI]}

Taking out the bidi control characters we are left with

{$_A1BC :_A2BC:_A3BC _A4BC=_A5BC}

That is not valid MF2 syntax.
So I don't know what this is trying to fix.


But yesterday I played a bit and put together a web page that one can use to interactively play with this.

https://mihai-nita.net/tmp/mf2bidi.html

And I argue that:

  • isolates are enough (FSI, LRI, RLI, PDI)
  • there is no need for any control character between $ and the name proper, even when there is an _ there.

Member Author

> That is not valid MF2 syntax.

Um... how is it not valid?

$_A1BC is a valid operand.
:_A2BC is a valid namespace. :_A3BC is a valid function name.
_A4BC is a valid option name.
_A5BC is a valid unquoted literal.

It's even a valid message, in that a simple message might consist solely of a placeholder.

What am I missing?

```
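
The escaped strings in the review thread above are easier to compare once the bidi controls are spelled out. The following is a minimal Python sketch of that kind of substitution; the helper names `decode`, `label`, and `strip_controls` and the label table are hypothetical, introduced only for illustration, and are not part of any MF2 implementation. It decodes the literal `\uXXXX` escapes, replaces each control with a bracketed label, and counts the isolates in the namespaced example from the thread.

```python
import re

# Bidi controls mentioned in the thread. The bracketed labels are illustrative.
LABELS = {
    "\u2066": "[LRI]",
    "\u2067": "[RLI]",
    "\u2068": "[FSI]",
    "\u2069": "[PDI]",
    "\u200e": "[LRM]",
    "\u200f": "[RLM]",
    "\u061c": "[ALM]",
}
ISOLATES = {"\u2066", "\u2067", "\u2068", "\u2069"}


def decode(escaped: str) -> str:
    # Turn literal backslash-u escapes (as pasted in the thread) into characters.
    return re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), escaped)


def label(text: str) -> str:
    # Replace each bidi control with its visible label.
    return "".join(LABELS.get(ch, ch) for ch in text)


def strip_controls(text: str) -> str:
    # Drop all listed bidi controls, leaving the logical-order content.
    return "".join(ch for ch in text if ch not in LABELS)


# The namespaced example from the thread, with RLI/PDI around each name.
sample = decode(
    r"\u2066$\u2067_\u06451\u0635\u0631\u2069:\u2067_\u06452\u0635\u0631\u2069\u2069"
)
print(len(sample), sum(ch in ISOLATES for ch in sample))  # prints "18 6"
print(label(sample))
```

Running `strip_controls` on the same string leaves only the sigil, the two names, and the `:` separator, which is the text a parser would see if it ignored the controls.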

Here are the other ABNF changes:

```abnf
; strongly directional marks and bidi isolates
@@ -460,7 +472,7 @@ s = ( SP / HTAB / CR / LF / %x3000 )
### Strict isolation all the time

Apply bidi isolates in a strict way.
-The main differences to the proposed solution is:
+In this design:
1. The open/close isolate characters are syntactically required to be paired.
This introduces parse errors for unpaired invisible characters,
which could lead to bad user experiences.
@@ -480,7 +492,7 @@ markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["
/ "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close
/ "{" LRI [s] "/" identifier [bidi] *(s option) *(s attribute) [s] close-isolate "}" ; close
identifier = [(namespace ns-separator)] name
-ns-separator = [bidi] ":"
+ns-separator = [bidi] ":" [bidi]
bidi = [ %x200E-200F / %x061C ]
```
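
To make the effect of the `ns-separator` change concrete, here is a small Python sketch that transcribes the two productions shown in this hunk into regular expressions. It assumes the `bidi` production exactly as written above (at most one of U+200E, U+200F, or U+061C); the names `NS_SEPARATOR`, `OLD_NS_SEPARATOR`, and `is_ns_separator` are hypothetical and chosen only for this illustration.

```python
import re

# bidi = [ %x200E-200F / %x061C ]  -> at most one LRM, RLM, or ALM
BIDI = "[\u200e\u200f\u061c]?"

# New form from the diff: ns-separator = [bidi] ":" [bidi]
NS_SEPARATOR = re.compile(BIDI + ":" + BIDI)

# Previous form from the diff: ns-separator = [bidi] ":"
OLD_NS_SEPARATOR = re.compile(BIDI + ":")


def is_ns_separator(text: str, pattern: re.Pattern = NS_SEPARATOR) -> bool:
    # True only if the whole string is a single ns-separator.
    return pattern.fullmatch(text) is not None


print(is_ns_separator(":"))                         # True
print(is_ns_separator(":\u200e"))                   # True
print(is_ns_separator(":\u200e", OLD_NS_SEPARATOR)) # False
print(is_ns_separator("\u2067:"))                   # False
```

The third call corresponds to the case discussed in the review thread, where an LRM follows the `:`; only the updated production accepts it. The last call shows that isolates such as RLI are not covered by this `bidi` production.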
