unicode-org · aphillips · Aug 12, 2024 · Jun 24, 2024 · Jun 25, 2024 · Jul 2, 2024
diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md
@@ -54,34 +54,42 @@ the plain-text of the message and the Unicode Bidirectional Algorithm (UBA, UAX#
 can interact in ways that make the _message_ unintelligible or difficult to parse visually.
 
 Machines do not have a problem parsing _messages_ that contain RTL characters,
-but users need to be able to discern what a _message_ does,
-what _variant_ will be selected,
-or what a _placeholder_ will evaluate to.
+but users need to be able to discern what a _message_ does.
+For example, users need to be able to match _keys_ in a _variant_ to _selectors_
+in a `.match` statement.
+Or they want to know how a _pattern_ will be evaluated,
+such as understanding the _options_ and _values_ in a _placeholder_.
 
 In addition, it is possible to construct messages that use bidi characters to spoof
 users into believing that a _message_ does something different than what it actually does.
 
 The current syntax does not permit bidi controls in _name_ tokens,
-_unquoted_ literals,
-or in the whitespace portions of a _message_.
+_unquoted literals_,
+or in the non-pattern whitespace portions of a _message_.
 
-Permitting the **isolate** controls and the standalone strongly-directional markers
+Permitting the Unicode bidi **isolate** characters and the standalone strongly-directional markers
 would enable tools, including translation tools, and users who are writing in RTL languages
 to format a _message_ so that its plain-text representation and its function
 are unambiguous.
 
-The isolate controls are paired invisible control characters inserted around a portion of a string.
-The start of an isolate sequence is one of:
+The isolates are paired invisible characters inserted around a portion of a string.
+The start of an isolated sequence is one of:
 - U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
 - U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
 - U+2068 FIRST-STRONG ISOLATE (FSI)
 
-The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).
+The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).
 
-The characters inside an isolate sequence have the initial string (paragraph) direction
-corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI).
-The isolate sequence is **isolated** from surrounding text.
-This means that the surrounding text treats it as-if the sequence were a single neutral character.
+The characters inside an isolated sequence have the initial string direction
+corresponding to the starting character (
+left-to-right for `LRI`, 
+right-to-left for `RLI`, 
+or <a href="https://www.w3.org/TR/i18n-glossary#auto-direction">auto</a> for `FSI`).
+They are called "isolates" because the enclosed text is **isolated** from surrounding text
+while being processed using the Unicode Bidirectional Algorithm (UBA).
+The surrounding text treats the sequence as if it were a single neutral character,
+while the interior sequence is processed using the base direction specified by the isolate
+starting character.
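
A minimal sketch of the pairing described above, assuming Python and a hypothetical helper named `isolate` (it is not part of MF2 or of this PR). It only shows which starting character corresponds to which base direction and that every isolated sequence ends with PDI.

```python
# Isolate characters named in the text above.
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

# Hypothetical mapping from a base direction to its isolate starting character.
_OPENERS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in an isolate pair for the requested base direction."""
    return f"{_OPENERS[direction]}{text}{PDI}"
```

For instance, `isolate(name, "rtl")` puts RLI before `name` and PDI after it, so the surrounding text treats the wrapped run as a single neutral character.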
 
 > [!NOTE]
 > One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_
@@ -96,11 +104,26 @@ These include:
 - U+200F RIGHT-TO-LEFT MARK (RLM)
 - U+061C ARABIC LETTER MARK (ALM)
 
-These characters are invisible strongly-directional characters used in bidirectional
+These characters are invisible strongly-directional characters.
+They are used in bidirectional
 text to coerce certain directional behavior (usually to mark the end of 
 a sequence of characters that would otherwise be ambiguous or interact with
 neutrals or opposite direction runs in an unhelpful way).
 
+### Strictness and Abuse
+
+We want the syntax to be somewhat permissive, particularly when it comes to paired isolates.
+The isolates and strongly-directional marks are invisible except in certain specialized editing environments.
+While users and tools should be strict about using well-formed isolate sequences,
+we don't want invisible characters or whitespace to generate additional syntax errors except where necessary.
+Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates.
+
+It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing.
+Such an ABNF might be used by message serializers to ensure high-quality message generation.
+
+Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace, 
+could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi).
+
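
A "strict" tool such as a serializer or linter could check isolate pairing itself instead of relying on the parser. The sketch below assumes Python and a hypothetical function name, `unpaired_isolates`. It ignores the paragraph boundaries that also terminate isolates in the UBA, so it is a simplification and not a full implementation.

```python
OPENERS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"


def unpaired_isolates(text: str) -> list:
    """Return the indices of isolate characters that have no partner."""
    open_stack = []  # indices of openers still waiting for a PDI
    unpaired = []    # indices of PDIs that close nothing
    for index, ch in enumerate(text):
        if ch in OPENERS:
            open_stack.append(index)
        elif ch == PDI:
            if open_stack:
                open_stack.pop()
            else:
                unpaired.append(index)
    # Openers left on the stack were never closed.
    return sorted(unpaired + open_stack)
```

An empty result means every opening isolate has a matching PDI. A non-empty result could be reported as a warning, in keeping with the permissive stance described above, and need not be a syntax error.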
 ## Use-Cases
 
 _What use-cases do we see? Ideally, quote concrete examples._
@@ -135,7 +158,7 @@ You have {$م1صر :م2صر م3صر=م4صر} <- no controls
 You have {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎} <- LRM after each RTL token
 ```
 
-3. As a developer or translator, I want to make RTL literal or names appear correctly
+3. As a developer or translator, I want to make unquoted RTL literals or names appear correctly
    in my plain-text editing environment.
    I don't want to have to manage a lot of paired controls, when I can get the right effect using
    strongly directional mark characters (LRM, RLM, ALM)
@@ -209,6 +232,12 @@ Newlines inside of messages should not harm later syntax.
 ن}}‎ 123 456 {{ LRM }}
 ```
 
+
+Naive text editors, when operating in a right-to-left context, 
+might display a _message_ with an RTL base direction.
+While the display of the _message_ might be somewhat damaged by this,
+it should still produce results that are as reasonable as possible.
+
 ## Constraints
 
 _What prior decisions and existing conditions limit the possible design?_
@@ -230,72 +259,90 @@ The workaround in #763 was to permit these characters _before_ or _after_ whites
 using the various whitespace productions.
 This works at the cost of allowing spurious markers.
 
+We want isolate characters to be _outside_ of patterns.
+There is an open question about how best to place them.
+One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`.
+Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2069}`.
+
+Bidi isolates and marks are invisible characters.
+Whitespace is also invisible.
+Mixing these may be problematic.
+Not allowing these to mix could produce annoying parse errors.
+
 ## Proposed Design
 
 _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._
 
-Editing and display of a _message_ SHOULD always use a left-to-right base direction
+The syntax of a _message_ assumes a left-to-right base direction
 both for the complete text of the _message_ as well as for each line (paragraph)
-contained therein.
-
-We use LTR display because the syntax of a _message_ depends on LTR word tokens,
+contained therein. 
+We prefer LTR display because human understanding of a _message_ depends on LTR word tokens,
 as well as token ordering (as in a placeholder or with variant keys).
+Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself.
+However, tool and editor implementers ought to pay attention to this assumption.
 
-This is not the disadvantage to right-to-left languages that it might first appear:
-- Bidi inside of _patterns_ works normally
-- _Placeholders_ and _markup_ are isolated (treated as neutrals) so that they appear
+Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear:
+- Bidi inside of _patterns_ works normally (we go to great lengths to make the interior
+  of _patterns_ work as plain text)
+- _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear
   in the correct location in an RTL _pattern_
 - _Expressions_ use isolates and directional marks to display internal tokens in the
   correct order and without spillover effects
+- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm
+  pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.
 
-Permit isolating bidi controls to be used on the **outside** of the following:
+The syntax permits (but does not require) isolating bidi controls to be used on the 
+**outside** of the following:
 - unquoted literals
 - quoted literals
 - quoted patterns
 
-We permit any of the isolate starting controls (LRI, RLI, FSI) because we want to allow
+We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow
 the user to set the base direction of a _literal_ or _pattern_ according to its respective 
 actual contents.
 
+> [!IMPORTANT]
+> This change adds a "lookahead" to the process of determining if a given _message_ is
+> "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message
+> as well as being allowed before a quoted pattern, declaration, or selector.
+
 This would change the ABNF as follows:
 (Notice that this change includes a production `bidi` described further down
 in this document)
 ```abnf
-literal        = ( open-isolate (quoted / (unquoted [bidi])) close-isolate)
-               / (quoted / (unquoted [bidi]))
-quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate)
-               / ("{{" pattern "}}")
+literal        = [open-isolate] (quoted-literal / (unquoted-literal [bidi])) [close-isolate]
+quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate]
 
 open-isolate   = %x2066-2068
 close-isolate  = %x2069
 ```
 
 > [!IMPORTANT]
-> The isolating controls go on the **_outside_** of the various _literal_ and _pattern_
+> The isolating characters go on the **_outside_** of the various _literal_ and _pattern_
 > productions because characters on the **_inside_** of these are part of the _literal_'s
 > or _pattern_'s textual content.
-> We need to allow users to include bidi controls in the output of MF2.
+> We need to allow users to include bidi characters, including isolates and strongly directional marks,
+> in the output of MF2.
 
-Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following:
+Permit **left-to-right** isolates (`U+2066`...`U+2069`) to be used **immediately inside** the following:
 - expressions
 - markup
 
+Permit isolates around any token inside of an expression or markup.
+
 We only permit the LTR isolates because the contents of an _expression_
 or _markup_ must be laid out left-to-right.
 _Literal_ values can be right-to-left isolated within that or use strongly
 directional marks to ensure correct display.
 
 This would change the ABNF as follows (assuming the above changes are also incorporated):
 ```abnf
-expression            = "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"
-                      / "{" (literal-expression / variable-expression / annotation-expression) "}"
+expression            = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"
 literal-expression    = [s] literal [s annotation] *(s attribute) [s]
 variable-expression   = [s] variable [s annotation] *(s attribute) [s]
 annotation-expression = [s] annotation *(s attribute) [s]
-markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}"                    ; open and standalone
-       / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}"                          ; close
-       / "{" LRI [s] "#" identifier *(s option) *(s attribute) [s] ["/"] close-isolate "}"  ; open and standalone
-       / "{" LRI [s] "/" identifier *(s option) *(s attribute) [s] close-isolate "}"        ; close
+markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}"  ; open and standalone
+       / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}"        ; close
 LRI = %x2066
 ```
 
@@ -308,8 +355,9 @@ plus the names of _variables_,
 as well as the contents of _unquoted_ literals.
 
 > [!NOTE]
-> Notice that _unquoted_ literals can also be surrounded by bidi isolates
+> Notice that _unquoted literals_ can also be surrounded by bidi isolates
 > using the previous syntax modification just above.
+> The isolates are **not** a part of the literal!
 
 > [!NOTE]
 > Notice that `reserved-annotation` is not in the ABNF changes because it already
@@ -321,14 +369,25 @@ as well as the contents of _unquoted_ literals.
 ```abnf
 variable-expression   = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}"
 function       = ":" identifier [bidi] *(s option)
-option         = identifier [bidi] [s] "=" [s] (literal / variable) [bidi]
-attribute      = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])]
-markup         = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}"  ; open and standalone
-               / "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}"  ; close
-identifier     = [(namespace [bidi] ":")] name
+option         = [LRI] identifier [bidi] [s] "=" [s] (literal / variable) [bidi] [close-isolate]
+attribute      = [LRI] "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] [close-isolate]
+markup         = "{" [LRI] [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] [close-isolate] "}"  ; open and standalone
+               / "{" [LRI] [s] "/" identifier [bidi] *(s option) *(s attribute) [s] [close-isolate] "}"  ; close
+identifier     = [(namespace ns-separator)] name
+ns-separator   = [bidi] ":"
 bidi           = [ %x200E-200F / %x061C ]
 ```
 
+### Open Issues with Proposed Design
+
+The ABNF changes found above put isolates and strongly directional marks into specific locations,
+such as directly next to `{`/`}`/`{{`/`}}` markers
+or directly following "tokens" such as `name`.
+This makes it a syntax error for whitespace to appear around the isolates or marks.
+A more permissive design would add the isolates and strongly directional marks to required and optional
+whitespace in the syntax and depend on users/editors to appropriately pair or position the marks
+to get optimal display.
+
 ## Alternatives Considered
 
 _What other solutions are available?_
@@ -348,47 +407,50 @@ the results or debug what is wrong with their messages.
 By contrast, if users insert too many or the wrong controls using the recommended design,
 the _message_ would still be functional and would emit no undesired characters.
 
+### Super-loose isolation
 
-### Loose isolation
+Add isolates and strongly directional marks to required and optional whitespace in the syntax.
+This would permit users to get the effects described by the above design,
+as long as they use isolates/marks in a "responsible" way.
 
-Apply bidi isolates in a slightly different way.
-The main differences to the proposed solution are:
-1. The open/close isolate characters are not syntactically required to be paired.
-   This avoids introducing parse errors for missing or required invisible characters,
-   which would lead to bad user experiences.
-2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM,
-   allow for its proper isolation.
+(Omitting other changes found in #673)
 
-Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI.
-For names and quoted literals, the isolate characters are outside the body of the token,
-but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters.
-This avoids adding a lookahead requirement for detecting a `complex-message` start,
-and differentiates a `quoted-pattern` from a `quoted` `key` in a `variant`.
+```abnf
+; strongly directional marks and bidi isolates
+; ALM / LRM / RLM / LRI / RLI / FSI / PDI
+bidi = %x061C / %x200E / %x200F / %x2066-2069
 
-Expressions and markup may be isolated by LRI...PDI immediately within the `{` and `}`.
+; optional whitespace
+owsp = *( s / bidi )
 
-An LRI is allowed immediately after a newline outside patterns and within expressions.
-This is intended to allow left-to-right representation for "code"
-even if it contains a newline followed by content
-that could otherwise prompt the paragraph direction to be detected as right-to-left.
+; required whitespace
+wsp = [ owsp ] 1*s [ owsp ]
 
-```abnf
-name           = [open-isolate] name-start *name-char [close-isolate]
-quoted         = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate]
-quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}"
+; whitespace characters
+s = ( SP / HTAB / CR / LF / %x3000 )
+```
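
One way to read the `owsp` and `wsp` productions above is as regular expressions. The sketch below assumes Python's `re` module; the names `S`, `BIDI`, `OWSP`, and `WSP` are hypothetical and simply mirror the ABNF rule names. It is an illustration of the grammar fragment, not a parser for MF2.

```python
import re

# s = ( SP / HTAB / CR / LF / %x3000 )
S = " \t\r\n\u3000"
# bidi = %x061C / %x200E / %x200F / %x2066-2069
BIDI = "\u061C\u200E\u200F\u2066-\u2069"

# owsp = *( s / bidi )
OWSP = re.compile(f"[{S}{BIDI}]*")
# wsp = [ owsp ] 1*s [ owsp ]
# Equivalent to: a run of whitespace and bidi characters
# that contains at least one real whitespace character.
WSP = re.compile(f"[{S}{BIDI}]*[{S}][{S}{BIDI}]*")
```

Under this reading, `WSP.fullmatch` accepts an LRM followed by a space but rejects a lone LRM, because required whitespace must still contain at least one `s` character.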
+
+**Pros**
+- Avoids problems with syntax errors that users and tools might find difficult to debug.
+- Effective if used carefully.
+- Addresses need to comply with UAX#31
 
-literal-expression    = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}"
-variable-expression   = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}"
-annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}"
+**Cons**
+- Can be used irresponsibly, including enabling some Trojan Source cases (UTS#55)
 
-markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}"
-       / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}"
+### Strict isolation all the time
+
+Apply bidi isolates in a strict way.
+The main difference from the proposed solution is:
+1. The open/close isolate characters are syntactically required to be paired.
+   This introduces parse errors for unpaired invisible characters,
+   which could lead to bad user experiences.
+
+As noted above, the "strict" version of the ABNF should be adopted by serializers and for 
+message normalization.
+
+// TODO put ABNF here
 
-s = 1*( SP / HTAB / CR / LF [LRI] / %x3000 )
-LRI = %x2066
-open-isolate  = %x2066-2068
-close-isolate = %x2069
-```
 
 Isolating rather than marking `name` helps ensure
 that its directionality does not spill over to adjoining syntax.
@@ -397,7 +459,7 @@ For example, this allows for the proper rendering of the expression
 {⁦:⁧אחת⁩:⁧שתיים⁩⁩}
 ```
 where "אחת" is the `namespace` of the `identifier`.
-Without `name` isolation, this would render as
+Without `name` isolation, this would (misleadingly) render as
 ```
 {⁦:אחת:שתיים⁩}
 ```
@@ -410,7 +472,7 @@ just as they're not included in the parsed values of quoted literals or quoted p
 
 ### Deeper Syntax Changes
 We could alter the syntax to make it more "bidi robust", 
-such as by using strongly directional instead of neutrals.
+such as by using strongly directional characters instead of neutrals.
 
 ### Forbid RTL characters in `name` and/or `unquoted`
 We could alter the syntax to forbid using RTL characters in names and unquoted literals.
@@ -425,7 +487,7 @@ Cons:
 - This is not friendly to non-English/non-Latin users and represents a usability
   restriction in environments in which names can be non-ASCII values
 
-### Allow more permissive use of bidi controls
+### Allow even more permissive use of bidi controls
 
 We could permit RLI/FSI to be used inside _expressions_ and _markup_.
 This would be an advantage for simple _expressions_ containing only or primarily