[DESIGN] Implement changes to bidi to permit non-strict formation #811

Merged on Aug 12, 2024 (16 commits)
218 changes: 140 additions & 78 deletions exploration/bidi-usability.md
@@ -54,34 +54,42 @@ the plain-text of the message and the Unicode Bidirectional Algorithm (UBA, UAX#
can interact in ways that make the _message_ unintelligible or difficult to parse visually.

Machines do not have a problem parsing _messages_ that contain RTL characters,
but users need to be able to discern what a _message_ does,
what _variant_ will be selected,
or what a _placeholder_ will evaluate to.
but users need to be able to discern what a _message_ does.
For example, users need to be able to match _keys_ in a _variant_ to _selectors_
in a `.match` statement.
Or they want to know how a _pattern_ will be evaluated,
such as understanding the _options_ and _values_ in a _placeholder_.

In addition, it is possible to construct messages that use bidi characters to spoof
users into believing that a _message_ does something different than what it actually does.

The current syntax does not permit bidi controls in _name_ tokens,
_unquoted_ literals,
or in the whitespace portions of a _message_.
_unquoted literals_,
or in the non-pattern whitespace portions of a _message_.

Permitting the **isolate** controls and the standalone strongly-directional markers
Permitting the Unicode bidi **isolate** characters and the standalone strongly-directional markers
would enable tools, including translation tools, and users who are writing in RTL languages
to format a _message_ so that its plain-text representation and its function
are unambiguous.

The isolate controls are paired invisible control characters inserted around a portion of a string.
The start of an isolate sequence is one of:
The isolates are paired invisible characters inserted around a portion of a string.
The start of an isolated sequence is one of:
- U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
- U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
- U+2068 FIRST-STRONG ISOLATE (FSI)

The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).
The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).

The characters inside an isolate sequence have the initial string (paragraph) direction
corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI).
The isolate sequence is **isolated** from surrounding text.
This means that the surrounding text treats it as-if the sequence were a single neutral character.
The characters inside an isolated sequence have the initial string direction
corresponding to the starting character (
left-to-right for `LRI`,
right-to-left for `RLI`,
or <a href="https://www.w3.org/TR/i18n-glossary#auto-direction">auto</a> for `FSI`).
They are called "isolates" because the enclosed text is **isolated** from surrounding text
while being processed using the Unicode Bidirectional Algorithm (UBA).
The surrounding text treats the sequence as-if it were a single neutral character,
while the interior sequence is processed using the base direction specified by the isolate
starting character.
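
As a minimal illustration of the isolate characters listed above, the following Python sketch wraps a string in an isolate pair and asks the Unicode Character Database for each character's bidi class. The helper name `isolate` is hypothetical and is not part of MF2 or of any library.

```python
import unicodedata

# The four isolate characters described above.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def isolate(text: str, opener: str = FSI) -> str:
    """Wrap text in an isolate pair: opener + text + PDI (hypothetical helper)."""
    if opener not in (LRI, RLI, FSI):
        raise ValueError("opener must be LRI, RLI, or FSI")
    return f"{opener}{text}{PDI}"

# Each isolate character has its own Bidi_Class value in the UCD.
classes = [unicodedata.bidirectional(c) for c in (LRI, RLI, FSI, PDI)]
```

The bidi classes are reported under the same abbreviations the text uses (`LRI`, `RLI`, `FSI`, `PDI`).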

> [!NOTE]
> One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_
@@ -96,11 +104,26 @@ These include:
- U+200F RIGHT-TO-LEFT MARK (RLM)
- U+061C ARABIC LETTER MARK (ALM)

These characters are invisible strongly-directional characters used in bidirectional
These characters are invisible strongly-directional characters.
They are used in bidirectional
text to coerce certain directional behavior (usually to mark the end of
a sequence of characters that would otherwise be ambiguous or interact with
neutrals or opposite direction runs in an unhelpful way).
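
A tool could apply a mark like this mechanically. The sketch below is one possible approach, not something the design prescribes: it appends an LRM to any token whose last character is strongly right-to-left (bidi class `R` or `AL`). The function name `add_trailing_lrm` is an assumption made for illustration.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK

def add_trailing_lrm(token: str) -> str:
    """Append LRM when the token ends in a strongly RTL character (hypothetical helper)."""
    if token and unicodedata.bidirectional(token[-1]) in ("R", "AL"):
        return token + LRM
    return token
```

Tokens that already end in a left-to-right or neutral character are returned unchanged, so the helper never adds a mark where it would have no effect on ordering.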

### Strictness and Abuse

We want the syntax to be somewhat permissive, particularly when it comes to paired isolates.
The isolates and strongly-directional marks are invisible except in certain specialized editing environments.
While users and tools should be strict about using well-formed isolate sequences,
we don't want invisible characters or whitespace to generate additional syntax errors except where necessary.
Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates.

It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing.
Such an ABNF might be used by message serializers to ensure high-quality message generation.

Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace,
could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi).
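
Even when unpaired isolates are not a syntax error, a linter or serializer could still report them. The following sketch shows one way to find isolate characters that lack a partner. It is an assumption-laden illustration: the name `unbalanced_isolates` is hypothetical, and the check treats the whole string as one unit rather than resetting at paragraph boundaries as the UBA does.

```python
OPENERS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"

def unbalanced_isolates(text: str) -> list[int]:
    """Return the indexes of isolate characters that have no matching partner."""
    stack, unmatched = [], []
    for i, ch in enumerate(text):
        if ch in OPENERS:
            stack.append(i)          # remember where this isolate was opened
        elif ch == PDI:
            if stack:
                stack.pop()          # closes the most recent open isolate
            else:
                unmatched.append(i)  # PDI with nothing to close
    # Anything left on the stack was opened but never closed.
    return sorted(unmatched + stack)
```

An empty result means every opener has a closer; a non-empty result gives positions a tool could highlight, since the characters themselves are invisible.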

## Use-Cases

_What use-cases do we see? Ideally, quote concrete examples._
@@ -135,7 +158,7 @@ You have {$م1صر :م2صر م3صر=م4صر} <- no controls
You have {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎} <- LRM after each RTL token
```

3. As a developer or translator, I want to make RTL literal or names appear correctly
3. As a developer or translator, I want to make unquoted RTL literals or names appear correctly
in my plain-text editing environment.
I don't want to have to manage a lot of paired controls, when I can get the right effect using
strongly directional mark characters (LRM, RLM, ALM)
@@ -209,6 +232,12 @@ Newlines inside of messages should not harm later syntax.
ن}}‎ 123 456 {{ LRM }}
```


Naive text editors, when operating in a right-to-left context,
might display a _message_ with an RTL base direction.
While the display of the _message_ might be somewhat damaged by this,
it should still produce results that are as reasonable as possible.

## Constraints

_What prior decisions and existing conditions limit the possible design?_
@@ -230,72 +259,90 @@ The workaround in #763 was to permit these characters _before_ or _after_ whites
using the various whitespace productions.
This works at the cost of allowing spurious markers.

We want isolate characters to be _outside_ of patterns.
There is an open question about how best to place them.
One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`.
Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2069}`.

Bidi isolates and marks are invisible characters.
Whitespace is also invisible.
Mixing these may be problematic.
Not allowing these to mix could produce annoying parse errors.

## Proposed Design

_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._

Editing and display of a _message_ SHOULD always use a left-to-right base direction
The syntax of a _message_ assumes a left-to-right base direction
both for the complete text of the _message_ as well as for each line (paragraph)
contained therein.

We use LTR display because the syntax of a _message_ depends on LTR word tokens,
contained therein.
We prefer LTR display because human understanding of a _message_ depends on LTR word tokens,
as well as token ordering (as in a placeholder or with variant keys).
Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself.
However, tool and editor implementers ought to pay attention to this assumption.

This is not the disadvantage to right-to-left languages that it might first appear:
- Bidi inside of _patterns_ works normally
- _Placeholders_ and _markup_ are isolated (treated as neutrals) so that they appear
Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear:
- Bidi inside of _patterns_ works normally (we go to great lengths to make the interior
of _patterns_ work as plain text)
- _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear
in the correct location in an RTL _pattern_
- _Expressions_ use isolates and directional marks to display internal tokens in the
correct order and without spillover effects
- The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs
for shaping purposes and these offer a poor person's form of isolation.
Collaborator commented:

I can't tell whether to read this as "paired enclosing marks (that the Unicode Bidirectional Algorithm pairs for shaping purposes)" or "uses paired enclosing marks (that the Unicode Bidirectional Algorithm pairs) for shaping purposes."

Member Author replied:

It's the former.

Suggested change
for shaping purposes and these offer a poor person's form of isolation.
- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm
pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.

Member Author replied:

Done.


Permit isolating bidi controls to be used on the **outside** of the following:
The syntax permits (but does not require) isolating bidi controls to be used on the
**outside** of the following:
- unquoted literals
- quoted literals
- quoted patterns

We permit any of the isolate starting controls (LRI, RLI, FSI) because we want to allow
We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow
the user to set the base direction of a _literal_ or _pattern_ according to its respective
actual contents.

> [!IMPORTANT]
> This change adds a "lookahead" to the process of determining if a given _message_ is
> "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message
> as well as being allowed before a quoted pattern, declaration, or selector.
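
The lookahead described in the note can be sketched roughly as follows. This is a simplified illustration, not the normative detection rule: it assumes that a complex message is recognised by a leading `{{` or `.` after optional whitespace and at most one opening isolate, and the name `looks_complex` is hypothetical.

```python
OPEN_ISOLATES = "\u2066\u2067\u2068"  # LRI, RLI, FSI
WHITESPACE = " \t\r\n\u3000"

def looks_complex(message: str) -> bool:
    """Guess whether a message is complex by peeking past one leading isolate."""
    rest = message.lstrip(WHITESPACE)
    # An opening isolate alone does not decide anything: look one step further.
    if rest and rest[0] in OPEN_ISOLATES:
        rest = rest[1:]
    return rest.startswith(("{{", "."))
```

The point of the sketch is that the first character no longer settles the question: an `RLI` may begin a simple message or precede a quoted pattern, so the parser has to inspect what follows it.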

This would change the ABNF as follows:
(Notice that this change includes a production `bidi` described further down
in this document)
```abnf
literal = ( open-isolate (quoted / (unquoted [bidi])) close-isolate)
/ (quoted / (unquoted [bidi]))
quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate)
/ ("{{" pattern "}}")
literal = [open-isolate] (quoted-literal / (unquoted-literal [bidi])) [close-isolate]
quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate]

open-isolate = %x2066-2068
close-isolate = %x2069
```

> [!IMPORTANT]
> The isolating controls go on the **_outside_** of the various _literal_ and _pattern_
> The isolating characters go on the **_outside_** of the various _literal_ and _pattern_
> productions because characters on the **_inside_** of these are part of the _literal_'s
> or _pattern_'s textual content.
> We need to allow users to include bidi controls in the output of MF2.
> We need to allow users to include bidi characters, including isolates and strongly directional marks
> in the output of MF2.
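
The inside/outside distinction can be shown with a small Python sketch of the relaxed `quoted-pattern` rule. It treats the pattern body as opaque text, which is a simplification of the real grammar, and the names `QUOTED_PATTERN` and `pattern_body` are hypothetical.

```python
import re
from typing import Optional

# [open-isolate] "{{" pattern "}}" [close-isolate], with the body left opaque.
QUOTED_PATTERN = re.compile(
    r"[\u2066-\u2068]?\{\{(?P<body>.*)\}\}\u2069?", re.DOTALL
)

def pattern_body(source: str) -> Optional[str]:
    """Return the text between {{ and }}, or None if source is not a quoted pattern."""
    match = QUOTED_PATTERN.fullmatch(source)
    return match.group("body") if match else None
```

Isolates outside the `{{`/`}}` delimiters are consumed as syntax and dropped, while isolates inside the delimiters are returned as part of the pattern text, which is what allows a message to emit them in its output.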

Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following:
Permit **left-to-right** isolates (`U+2066`...`U+2069`) to be used **immediately inside** the following:
- expressions
- markup

Permit isolates around any token inside of an expression or markup.

We only permit the LTR isolates because the contents of an _expression_
or _markup_ must be laid out left-to-right.
_Literal_ values can be right-to-left isolated within that or use strongly
directional marks to ensure correct display.

This would change the ABNF as follows (assuming the above changes are also incorporated):
```abnf
expression = "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"
/ "{" (literal-expression / variable-expression / annotation-expression) "}"
expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"
literal-expression = [s] literal [s annotation] *(s attribute) [s]
variable-expression = [s] variable [s annotation] *(s attribute) [s]
annotation-expression = [s] annotation *(s attribute) [s]
markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close
/ "{" LRI [s] "#" identifier *(s option) *(s attribute) [s] ["/"] close-isolate "}" ; open and standalone
/ "{" LRI [s] "/" identifier *(s option) *(s attribute) [s] close-isolate "}" ; close
markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone
/ "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" ; close
LRI = %x2066
```
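
A serializer that wants well-formed output could always emit the paired form, while a parser following the relaxed rule accepts either character being absent. The sketch below illustrates both directions under simplifying assumptions: the body is treated as opaque text, and the function names `isolate_expression` and `expression_body` are hypothetical.

```python
LRI, PDI = "\u2066", "\u2069"

def isolate_expression(body: str) -> str:
    """Serialize an expression or markup body as "{" LRI body PDI "}"."""
    return "{" + LRI + body + PDI + "}"

def expression_body(expr: str) -> str:
    """Strip the braces and any optional LRI/PDI found immediately inside them."""
    if len(expr) < 2 or expr[0] != "{" or expr[-1] != "}":
        raise ValueError("not a braced expression")
    inner = expr[1:-1]
    if inner.startswith(LRI):   # optional [LRI]
        inner = inner[1:]
    if inner.endswith(PDI):     # optional [close-isolate]
        inner = inner[:-1]
    return inner
```

Because both isolates are optional on input, `expression_body` returns the same body whether the source has both characters, only one, or neither.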

@@ -308,8 +355,9 @@ plus the names of _variables_,
as well as the contents of _unquoted_ literals.

> [!NOTE]
> Notice that _unquoted_ literals can also be surrounded by bidi isolates
> Notice that _unquoted literals_ can also be surrounded by bidi isolates
> using the previous syntax modification just above.
> The isolates are **not** a part of the literal!

> [!NOTE]
> Notice that `reserved-annotation` is not in the ABNF changes because it already
@@ -321,14 +369,25 @@ as well as the contents of _unquoted_ literals.
```abnf
variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}"
function = ":" identifier [bidi] *(s option)
option = identifier [bidi] [s] "=" [s] (literal / variable) [bidi]
attribute = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])]
markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close
identifier = [(namespace [bidi] ":")] name
option = [LRI] identifier [bidi] [s] "=" [s] (literal / variable) [bidi] [close-isolate]
attribute = [LRI] "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] [close-isolate]
markup = "{" [LRI] [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone
/ "{" [LRI] [s] "/" identifier [bidi] *(s option) *(s attribute) [s] [close-isolate] "}" ; close
identifier = [(namespace ns-separator)] name
ns-separator = [bidi] ":"
bidi = [ %x200E-200F / %x061C ]
```
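
To make the placement of the optional `bidi` mark concrete, here is a rough Python sketch that accepts an identifier with an optional mark before the `:` separator and after the name. It is only an approximation: `NAME` uses `\w`-based character classes instead of the real `name-start`/`name-char` rules, the trailing mark really belongs to the enclosing production (`function`, `option`, and so on), and `parse_identifier` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

MARK = "[\u200E\u200F\u061C]"   # LRM, RLM, ALM
NAME = r"[^\W\d][\w.\-]*"       # simplified stand-in for the real name rule

IDENTIFIER = re.compile(
    "(?:(?P<namespace>" + NAME + ")" + MARK + "?:)?"  # [namespace [bidi] ":"]
    "(?P<name>" + NAME + ")" + MARK + "?"             # name [bidi]
)

def parse_identifier(text: str) -> Optional[Tuple[Optional[str], str]]:
    """Return (namespace, name) with directional marks dropped, or None."""
    match = IDENTIFIER.fullmatch(text)
    if match is None:
        return None
    return match.group("namespace"), match.group("name")
```

The marks are matched outside the capturing groups, so they affect only display and never become part of the parsed namespace or name.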

### Open Issues with Proposed Design

The ABNF changes found above put isolates and strongly directional marks into specific locations,
such as directly next to `{`/`}`/`{{`/`}}` markers
or directly following "tokens" such as `name`.
This makes it a syntax error for whitespace to appear around the isolates or marks.
A more permissive design would add the isolates and strongly directional marks to required and optional
whitespace in the syntax and depend on users/editors to appropriately pair or position the marks
to get optimal display.

## Alternatives Considered

_What other solutions are available?_
Expand All @@ -348,47 +407,50 @@ the results or debug what is wrong with their messages.
By contrast, if users insert too many or the wrong controls using the recommended design,
the _message_ would still be functional and would emit no undesired characters.

### Super-loose isolation

### Loose isolation
Add isolates and strongly directional marks to required and optional whitespace in the syntax.
This would permit users to get the effects described by the above design,
as long as they use isolates/marks in a "responsible" way.

Apply bidi isolates in a slightly different way.
The main differences to the proposed solution are:
1. The open/close isolate characters are not syntactically required to be paired.
This avoids introducing parse errors for missing or required invisible characters,
which would lead to bad user experiences.
2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM,
allow for its proper isolation.
(Omitting other changes found in #673)

Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI.
For names and quoted literals, the isolate characters are outside the body of the token,
but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.
This avoids adding a lookahead requirement for detecting a `complex-message` start,
and differentiates a `quoted-pattern` from a `quoted` `key` in a `variant`.
```abnf
; strongly directional marks and bidi isolates
; ALM / LRM / RLM / LRI / RLI / FSI / PDI
bidi = %x061C / %x200E / %x200F / %x2066-2069

Expressions and markup may be isolated by LRI...PDI immediately within the `{` and `}`.
; optional whitespace
owsp = *( s / bidi )

An LRI is allowed immediately after a newline outside patterns and within expressions.
This is intended to allow left-to-right representation for "code"
even if it contains a newline followed by content
that could otherwise prompt the paragraph direction to be detected as right-to-left.
; required whitespace
wsp = [ owsp ] 1*s [ owsp ]

```abnf
name = [open-isolate] name-start *name-char [close-isolate]
quoted = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate]
quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}"
; whitespace characters
s = ( SP / HTAB / CR / LF / %x3000 )
```

**Pros**
- Avoids problems with syntax errors that users and tools might find difficult to debug.
- Effective if used carefully.
- Addresses need to comply with UAX#31

literal-expression = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}"
variable-expression = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}"
annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}"
**Cons**
- Can be used irresponsibly, including enabling some Trojan Source cases (UTS#55)
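
The `owsp` and `wsp` rules of this alternative can be restated as two small predicates. This is a sketch for illustration only, and the function names `is_owsp` and `is_wsp` are hypothetical. It relies on the observation that `[owsp] 1*s [owsp]` accepts exactly the runs of whitespace and bidi characters that contain at least one real whitespace character.

```python
S = set(" \t\r\n\u3000")  # SP / HTAB / CR / LF / %x3000
BIDI = set("\u061C\u200E\u200F\u2066\u2067\u2068\u2069")  # ALM LRM RLM LRI RLI FSI PDI

def is_owsp(run: str) -> bool:
    """owsp = *( s / bidi ): any run, possibly empty, of whitespace or bidi characters."""
    return all(ch in S or ch in BIDI for ch in run)

def is_wsp(run: str) -> bool:
    """wsp = [ owsp ] 1*s [ owsp ]: an owsp run containing at least one s."""
    return is_owsp(run) and any(ch in S for ch in run)
```

A run consisting only of invisible bidi characters satisfies `owsp` but not `wsp`, so it cannot stand in for required whitespace between tokens.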

markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}"
/ "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}"
### Strict isolation all the time

Apply bidi isolates in a strict way.
The main difference from the proposed solution is:
Collaborator commented:

I only see one difference?

Member Author replied:

I should do the TODO, which makes it clearer. I'll do one here in the comment for clarity and then go back and fix the PR.

The current design has expression thusly:

expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"

This alternative would turn that into:

expression = "{" (literal-expression / variable-expression / annotation-expression) "}"
           / "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"

In this formulation, you cannot have unpaired opening (or closing) isolates without a syntax error, nor can you have multiples of open or close.

Rinse and repeat for markup, option, attribute, and literals.

Make sense?

Member Author replied:

Fixed the TODO

1. The open/close isolate characters are syntactically required to be paired.
This introduces parse errors for unpaired invisible characters,
which could lead to bad user experiences.

As noted above, the "strict" version of the ABNF should be adopted by serializers and for
message normalization.
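
The difference between the relaxed and strict forms of `expression` can be sketched with two regular expressions. This is a deliberately simplified illustration: the expression body is assumed to contain no braces or isolate characters of its own, and the names `RELAXED`, `STRICT`, and `accepts` are hypothetical.

```python
import re

# Simplified body: anything except braces and isolate characters.
BODY = r"[^\u2066-\u2069{}]*"

# Relaxed: "{" [LRI] body [close-isolate] "}"
RELAXED = re.compile(r"\{\u2066?" + BODY + r"\u2069?\}")

# Strict: "{" body "}"  /  "{" LRI body close-isolate "}"
STRICT = re.compile(r"\{(?:\u2066" + BODY + r"\u2069|" + BODY + r")\}")

def accepts(grammar: "re.Pattern[str]", expr: str) -> bool:
    """True when the whole expression matches the given grammar sketch."""
    return grammar.fullmatch(expr) is not None
```

Both forms accept an expression with no isolates or with a matched pair, but only the relaxed form accepts an expression where one of the two invisible characters is missing.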

// TODO put ABNF here

s = 1*( SP / HTAB / CR / LF [LRI] / %x3000 )
LRI = %x2066
open-isolate = %x2066-2068
close-isolate = %x2069
```

Isolating rather than marking `name` helps ensure
that its directionality does not spill over to adjoining syntax.
@@ -397,7 +459,7 @@ For example, this allows for the proper rendering of the expression
{⁦:⁧אחת⁩:⁧שתיים⁩⁩}
```
where "אחת" is the `namespace` of the `identifier`.
Without `name` isolation, this would render as
Without `name` isolation, this would (misleadingly) render as
```
{⁦:אחת:שתיים⁩}
```
@@ -410,7 +472,7 @@ just as they're not included in the parsed values of quoted literals or quoted p

### Deeper Syntax Changes
We could alter the syntax to make it more "bidi robust",
such as by using strongly directional instead of neutrals.
such as by using strongly directional characters instead of neutrals.

### Forbid RTL characters in `name` and/or `unquoted`
We could alter the syntax to forbid using RTL characters in names and unquoted literals.
@@ -425,7 +487,7 @@ Cons:
- This is not friendly to non-English/non-Latin users and represents a usability
restriction in environments in which names can be non-ASCII values

### Allow more permissive use of bidi controls
### Allow even more permissive use of bidi controls

We could permit RLI/FSI to be used inside _expressions_ and _markup_.
This would be an advantage for simple _expressions_ containing only or primarily