From c2bcfe5985b3fb9caa996a2245780e5332915dea Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Mon, 24 Jun 2024 16:25:45 -0700 Subject: [PATCH 01/16] [DESIGN] Implement changes to bidi to permit non-strict formation **_DO NOT REVIEW_** This PR will eventually include the design changes. Currently a work in progress. --- exploration/bidi-usability.md | 57 +++++++++++++++++++++-------------- 1 file changed, 35 insertions(+), 22 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index f3bb6b6e4..c15964e53 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -70,18 +70,23 @@ would enable tools, including translation tools, and users who are writing in RT to format a _message_ so that its plain-text representation and its function are unambiguous. -The isolate controls are paired invisible control characters inserted around a portion of a string. -The start of an isolate sequence is one of: +The isolates are paired invisible characters inserted around a portion of a string. +The start of an isolated sequence is one of: - U+2066 LEFT-TO-RIGHT ISOLATE (LRI) - U+2067 RIGHT-TO-LEFT ISOLATE (RLI) - U+2068 FIRST-STRONG ISOLATE (FSI) -The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). +The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). -The characters inside an isolate sequence have the initial string (paragraph) direction -corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI). -The isolate sequence is **isolated** from surrounding text. -This means that the surrounding text treats it as-if the sequence were a single neutral character. +The characters inside an isolated sequence have the initial string direction +corresponding to the starting control ( +left-to-right for `LRI`, +right-to-left for `RLI`, +or auto for `FSI`). 
+The isolated sequence is **isolated** from surrounding text: +it is processed using the Unicode Bidirectional Algorithm (UBA) +separately from the rest of the string and +the surrounding text treats the sequence as-if it were a single neutral character. > [!NOTE] > One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_ @@ -96,11 +101,22 @@ These include: - U+200F RIGHT-TO-LEFT MARK (RLM) - U+061C ARABIC LETTER MARK (ALM) -These characters are invisible strongly-directional characters used in bidirectional +These characters are invisible strongly-directional characters. +They are used in bidirectional text to coerce certain directional behavior (usually to mark the end of a sequence of characters that would otherwise be ambiguous or interact with neutrals or opposite direction runs in an unhelpful way). +### Strictness + +We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. +The isolates and strongly-directional marks are invisble except in certain specialized editing environments. +While users and tools should be strict about using well-formed isolate sequences, +we don't want to invisble characters or whitespace to generate additional syntax errors except where +absolutely necessary. +Therefore, it should not be a syntax error if a user, editor, or tool fails to provide the +opening or closing isolate. + ## Use-Cases _What use-cases do we see? Ideally, quote concrete examples._ @@ -253,7 +269,7 @@ Permit isolating bidi controls to be used on the **outside** of the following: - quoted literals - quoted patterns -We permit any of the isolate starting controls (LRI, RLI, FSI) because we want to allow +We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow the user to set the base direction of a _literal_ or _pattern_ according to its respective actual contents. 
@@ -261,25 +277,25 @@ This would change the ABNF as follows: (Notice that this change includes a production `bidi` described further down in this document) ```abnf -literal = ( open-isolate (quoted / (unquoted [bidi])) close-isolate) - / (quoted / (unquoted [bidi])) -quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate) - / ("{{" pattern "}}") +literal = [open-isolate] (quoted / (unquoted [bidi])) [close-isolate] +quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate] open-isolate = %x2066-2068 close-isolate = %x2069 ``` > [!IMPORTANT] -> The isolating controls go on the **_outside_** of the various _literal_ and _pattern_ +> The isolating characters go on the **_outside_** of the various _literal_ and _pattern_ > productions because characters on the **_inside_** of these are part of the _literal_'s > or _pattern_'s textual content. -> We need to allow users to include bidi controls in the output of MF2. +> We need to allow users to include bidi characters, including isolates and strongly directional marks in the output of MF2. -Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following: +Permit **left-to-right** isolates (`U+2066`...`U+2069`) to be used **immediately inside** the following: - expressions - markup +Permit isolates around any token inside of an expression or markup. + We only permit the LTR isolates because the contents of an _expression_ or _markup_ must be laid out left-to-right. _Literal_ values can be right-to-left isolated within that or use strongly @@ -287,15 +303,12 @@ directional marks to ensure correct display. 
This would change the ABNF as follows (assuming the above changes are also incorporated): ```abnf -expression = "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}" - / "{" (literal-expression / variable-expression / annotation-expression) "}" +expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}" literal-expression = [s] literal [s annotation] *(s attribute) [s] variable-expression = [s] variable [s annotation] *(s attribute) [s] annotation-expression = [s] annotation *(s attribute) [s] -markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone - / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close - / "{" LRI [s] "#" identifier *(s option) *(s attribute) [s] ["/"] close-isolate "}" ; open and standalone - / "{" LRI [s] "/" identifier *(s option) *(s attribute) [s] close-isolate "}" ; close +markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone + / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" ; close LRI = %x2066 ``` From 93423861f07d858ba98821a1bc54741e682769f3 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 25 Jun 2024 09:16:11 -0700 Subject: [PATCH 02/16] Address comments plus update "other" solutions Since we're adopting "loose" as the proposed design, put "strict" as a considered option. --- exploration/bidi-usability.md | 61 +++++++++++++---------------------- 1 file changed, 23 insertions(+), 38 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index c15964e53..1042fc0eb 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -117,6 +117,13 @@ absolutely necessary. Therefore, it should not be a syntax error if a user, editor, or tool fails to provide the opening or closing isolate. 
+It is possible to generate a "strict" version of the ABNF that is more restrictive about +isolate pairing. +Such an ABNF might be used by message serializers to ensure high-quality message generation. + +Note that the permissive syntax might be abused to produce Trojan Source effects +(see [[UTS55]](https://www.unicode.org/reports/tr55)) + ## Use-Cases _What use-cases do we see? Ideally, quote concrete examples._ @@ -273,6 +280,11 @@ We permit any of the isolate starting characters (LRI, RLI, FSI) because we want the user to set the base direction of a _literal_ or _pattern_ according to its respective actual contents. +> [!IMPORTANT] +> This change adds a "lookahead" to determining if a given _message_ is +> "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message +> as well as being allowed before a quoted pattern. + This would change the ABNF as follows: (Notice that this change includes a production `bidi` described further down in this document) @@ -362,46 +374,19 @@ By contrast, if users insert too many or the wrong controls using the recommende the _message_ would still be functional and would emit no undesired characters. -### Loose isolation - -Apply bidi isolates in a slightly different way. -The main differences to the proposed solution are: -1. The open/close isolate characters are not syntactically required to be paired. - This avoids introducing parse errors for missing or required invisible characters, - which would lead to bad user experiences. -2. Rather than patching the `name` rule with an optional trailing LRM/RLM/ALM, - allow for its proper isolation. +### Strict isolation all the time -Quoted patterns, quoted literals, and names may be isolated by LRI/RLI/FSI...PDI. -For names and quoted literals, the isolate characters are outside the body of the token, -but for quoted patterns, the isolates are in the middle of the `{{` and `}}` characters. 
-This avoids adding a lookahead requirement for detecting a `complex-message` start, -and differentiates a `quoted-pattern` from a `quoted` `key` in a `variant`. +Apply bidi isolates in a strict way. +The main differences to the proposed solution is: +1. The open/close isolate characters are syntactically required to be paired. + This introduces parse errors for unpaired invisible characters, + which could lead to bad user experiences. -Expressions and markup may be isolated by LRI...PDI immediately within the `{` and `}`. +As noted above, the "strict" version of the ABNF should be adopted by serializers and for +message normalization. -An LRI is allowed immediately after a newline outside patterns and within expressions. -This is intended to allow left-to-right representation for "code" -even if it contains a newline followed by content -that could otherwise prompt the paragraph direction to be detected as right-to-left. +// TODO put ABNF here -```abnf -name = [open-isolate] name-start *name-char [close-isolate] -quoted = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate] -quoted-pattern = "{" [open-isolate] "{" pattern "}" [close-isolate] "}" - -literal-expression = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}" -variable-expression = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}" -annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}" - -markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" - / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" - -s = 1*( SP / HTAB / CR / LF [LRI] / %x3000 ) -LRI = %x2066 -open-isolate = %x2066-2068 -close-isolate = %x2069 -``` Isolating rather than marking `name` helps ensure that its directionality does not spill over to adjoining syntax. 
@@ -410,7 +395,7 @@ For example, this allows for the proper rendering of the expression {⁦:⁧אחת⁩:⁧שתיים⁩⁩} ``` where "אחת" is the `namespace` of the `identifier`. -Without `name` isolation, this would render as +Without `name` isolation, this would (misleadingly) render as ``` {⁦:אחת:שתיים⁩} ``` @@ -423,7 +408,7 @@ just as they're not included in the parsed values of quoted literals or quoted p ### Deeper Syntax Changes We could alter the syntax to make it more "bidi robust", -such as by using strongly directional instead of neutrals. +such as by using strongly directional characters instead of neutrals. ### Forbid RTL characters in `name` and/or `unquoted` We could alter the syntax to forbid using RTL characters in names and unquoted literals. From 3d848598c1784bb52c6da207f75081dcb572ad41 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 2 Jul 2024 12:55:46 -0700 Subject: [PATCH 03/16] Edits to add some previous discussion points --- exploration/bidi-usability.md | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 1042fc0eb..b4736d0a6 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -232,6 +232,13 @@ Newlines inside of messages should not harm later syntax. ن}}‎ 123 456 {{ LRM }} ``` + +Naive text editors, when operating in a right-to-left context, +might display a _message_ with an RTL base direction. +While the display of the _message_ might be somewhat damaged by this, +it should not still produce results that are as reasonable as possible. + + ## Constraints _What prior decisions and existing conditions limit the possible design?_ @@ -253,6 +260,16 @@ The workaround in #763 was to permit these characters _before_ or _after_ whites using the various whitespace productions. This works at the cost of allowing spurious markers. +We want isolate characters to be _outside_ of patterns. 
+There is an open question about how best to place them. +One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`. +Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2068}`. + +Bidi isolates and marks are invisible characters. +Whitespace is also invisible. +Mixing these may be problematic. +Not allowing these to mix could produce annoying parse errors. + ## Proposed Design _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ @@ -281,7 +298,7 @@ the user to set the base direction of a _literal_ or _pattern_ according to its actual contents. > [!IMPORTANT] -> This change adds a "lookahead" to determining if a given _message_ is +> This change adds a "lookahead" to the process of determining if a given _message_ is > "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message > as well as being allowed before a quoted pattern. From 33528f3fcc45a619fc4dc8b2dafc4beb58d85056 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 2 Jul 2024 13:01:30 -0700 Subject: [PATCH 04/16] Typo --- exploration/bidi-usability.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index b4736d0a6..665202cd9 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -236,8 +236,7 @@ Newlines inside of messages should not harm later syntax. Naive text editors, when operating in a right-to-left context, might display a _message_ with an RTL base direction. While the display of the _message_ might be somewhat damaged by this, -it should not still produce results that are as reasonable as possible. - +it should still produce results that are as reasonable as possible. 
## Constraints From dc175bb1395df255c06717f1e3f3f41ac5e23957 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Thu, 4 Jul 2024 15:13:55 -0700 Subject: [PATCH 05/16] Improve discussion of abuse --- exploration/bidi-usability.md | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 665202cd9..287a71b01 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -107,22 +107,19 @@ text to coerce certain directional behavior (usually to mark the end of a sequence of characters that would otherwise be ambiguous or interact with neutrals or opposite direction runs in an unhelpful way). -### Strictness +### Strictness and Abuse We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. The isolates and strongly-directional marks are invisble except in certain specialized editing environments. While users and tools should be strict about using well-formed isolate sequences, -we don't want to invisble characters or whitespace to generate additional syntax errors except where -absolutely necessary. -Therefore, it should not be a syntax error if a user, editor, or tool fails to provide the -opening or closing isolate. +we don't want to invisble characters or whitespace to generate additional syntax errors except where necessary. +Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates. -It is possible to generate a "strict" version of the ABNF that is more restrictive about -isolate pairing. +It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing. Such an ABNF might be used by message serializers to ensure high-quality message generation. 
-Note that the permissive syntax might be abused to produce Trojan Source effects -(see [[UTS55]](https://www.unicode.org/reports/tr55)) +Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace, +could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi)) ## Use-Cases From ed8bab5a5c684bbcd2ea038b6eacee16302d3c48 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Sun, 14 Jul 2024 10:00:00 -0700 Subject: [PATCH 06/16] Update bidi-usability.md --- exploration/bidi-usability.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 287a71b01..19dcc931a 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -270,21 +270,26 @@ Not allowing these to mix could produce annoying parse errors. _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ -Editing and display of a _message_ SHOULD always use a left-to-right base direction +The syntax of a _message_ assumes a left-to-right base direction both for the complete text of the _message_ as well as for each line (paragraph) -contained therein. - -We use LTR display because the syntax of a _message_ depends on LTR word tokens, +contained therein. +We prefer LTR display because human understanding of a _message_ depends on LTR word tokens, as well as token ordering (as in a placeholder or with variant keys). +Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself. +However, tool and editor implementers ought to pay attention to this assumption. 
-This is not the disadvantage to right-to-left languages that it might first appear: -- Bidi inside of _patterns_ works normally -- _Placeholders_ and _markup_ are isolated (treated as neutrals) so that they appear +Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear: +- Bidi inside of _patterns_ works normally (we go to great lengths to make the interior + of _patterns_ work as plain text) +- _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear in the correct location in an RTL _pattern_ - _Expressions_ use isolates and directional marks to display internal tokens in the correct order and without spillover effects +- The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs + for shaping purposes and these offer a poor person's form of isolation. -Permit isolating bidi controls to be used on the **outside** of the following: +The syntax permits (but does not require) isolating bidi controls to be used on the +**outside** of the following: - unquoted literals - quoted literals - quoted patterns @@ -313,7 +318,8 @@ close-isolate = %x2069 > The isolating characters go on the **_outside_** of the various _literal_ and _pattern_ > productions because characters on the **_inside_** of these are part of the _literal_'s > or _pattern_'s textual content. -> We need to allow users to include bidi characters, including isolates and strongly directional marks in the output of MF2. +> We need to allow users to include bidi characters, including isolates and strongly directional marks +> in the output of MF2. 
Permit **left-to-right** isolates (`U+2066`...`U+2069`) to be used **immediately inside** the following: - expressions From 2f7e13c3c46a90640cc99203930ff198a3862814 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 23 Jul 2024 07:49:31 -0700 Subject: [PATCH 07/16] Update bidi-usability.md --- exploration/bidi-usability.md | 57 ++++++++++++++++++++++------------- 1 file changed, 36 insertions(+), 21 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 19dcc931a..2ce4d9b04 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -54,18 +54,20 @@ the plain-text of the message and the Unicode Bidirectional Algorithm (UBA, UAX# can interact in ways that make the _message_ unintelligible or difficult to parse visually. Machines do not have a problem parsing _messages_ that contain RTL characters, -but users need to be able to discern what a _message_ does, -what _variant_ will be selected, -or what a _placeholder_ will evaluate to. +but users need to be able to discern what a _message_ does. +For example, users need to be able to match _keys_ in a _variant_ to _selectors_ +in a `.match` statement. +Or they want to know how a _pattern_ will be evaluated, +such as understanding the _options_ and _values_ in a _placeholder_. In addition, it is possible to construct messages that use bidi characters to spoof users into believing that a _message_ does something different than what it actually does. The current syntax does not permit bidi controls in _name_ tokens, -_unquoted_ literals, -or in the whitespace portions of a _message_. +_unquoted literals_, +or in the non-pattern whitespace portions of a _message_. 
-Permitting the **isolate** controls and the standalone strongly-directional markers +Permitting the Unicode bidi **isolate** characters and the standalone strongly-directional markers would enable tools, including translation tools, and users who are writing in RTL languages to format a _message_ so that its plain-text representation and its function are unambiguous. @@ -79,14 +81,15 @@ The start of an isolated sequence is one of: The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). The characters inside an isolated sequence have the initial string direction -corresponding to the starting control ( +corresponding to the starting character ( left-to-right for `LRI`, right-to-left for `RLI`, or auto for `FSI`). -The isolated sequence is **isolated** from surrounding text: -it is processed using the Unicode Bidirectional Algorithm (UBA) -separately from the rest of the string and -the surrounding text treats the sequence as-if it were a single neutral character. +They are called "isolates" because the enclosed text is **isolated** from surrounding text +while being processed using the Unicode Bidirectional Algorithm (UBA). +The surrounding text treats the sequence as-if it were a single neutral character, +while the interior sequence is processed using the base direction specified by the isolate +starting character. > [!NOTE] > One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_ @@ -155,7 +158,7 @@ You have {$م1صر :م2صر م3صر=م4صر} <- no controls You have {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎} <- LRM after each RTL token ``` -3. As a developer or translator, I want to make RTL literal or names appear correctly +3. As a developer or translator, I want to make unquoted RTL literals or names appear correctly in my plain-text editing environment. I don't want to have to manage a lot of paired controls, when I can get the right effect using strongly directional mark characters (LRM, RLM, ALM) @@ -301,13 +304,13 @@ actual contents. 
> [!IMPORTANT] > This change adds a "lookahead" to the process of determining if a given _message_ is > "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message -> as well as being allowed before a quoted pattern. +> as well as being allowed before a quoted pattern, declaration, or selector. This would change the ABNF as follows: (Notice that this change includes a production `bidi` described further down in this document) ```abnf -literal = [open-isolate] (quoted / (unquoted [bidi])) [close-isolate] +literal = [open-isolate] (quoted-literal / (unquoted-literal [bidi])) [close-isolate] quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate] open-isolate = %x2066-2068 @@ -352,8 +355,9 @@ plus the names of _variables_, as well as the contents of _unquoted_ literals. > [!NOTE] -> Notice that _unquoted_ literals can also be surrounded by bidi isolates +> Notice that _unquoted literals_ can also be surrounded by bidi isolates > using the previous syntax modification just above. +> The isolates are **not** a part of the literal! > [!NOTE] > Notice that `reserved-annotation` is not in the ABNF changes because it already @@ -365,14 +369,25 @@ as well as the contents of _unquoted_ literals. 
```abnf variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}" function = ":" identifier [bidi] *(s option) -option = identifier [bidi] [s] "=" [s] (literal / variable) [bidi] -attribute = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] -markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone - / "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close -identifier = [(namespace [bidi] ":")] name +option = [LRI] identifier [bidi] [s] "=" [s] (literal / variable) [bidi] [close-isolate] +attribute = [LRI] "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] [close-isolate] +markup = "{" [LRI] [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone + / "{" [LRI] [s] "/" identifier [bidi] *(s option) *(s attribute) [s] [close-isolate] "}" ; close +identifier = [(namespace ns-separator)] name +ns-separator = [bidi] ":" bidi = [ %x200E-200F / %x061C ] ``` +### Open Issues with Proposed Design + +The ABNF changes found above put isolates and strongly directional marks into specific locations, +such as directly next to `{`/`}`/`{{`/`}}` markers +or directly following "tokens" such as `name`. +This makes it a syntax error for whitespace to appear around the isolates or marks. +A more permissive design would add the isolates and strongly directional marks to required and optional +whitespace in the syntax and depend on users/editors to appropriately pair or position the marks +to get optimal display. + ## Alternatives Considered _What other solutions are available?_ @@ -442,7 +457,7 @@ Cons: - This is not friendly to non-English/non-Latin users and represents a usability restriction in environments in which names can be non-ASCII values -### Allow more permissive use of bidi controls +### Allow even more permissive use of bidi controls We could permit RLI/FSI to be used inside _expressions_ and _markup_. 
This would be an advantage for simple _expressions_ containing only or primarily From 1d9ae2ee56e9152961c5de663b8e2b2f6c9cd2e8 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 23 Jul 2024 09:47:32 -0700 Subject: [PATCH 08/16] Add the 'super-loose' option --- exploration/bidi-usability.md | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 2ce4d9b04..36aa51ea4 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -407,6 +407,36 @@ the results or debug what is wrong with their messages. By contrast, if users insert too many or the wrong controls using the recommended design, the _message_ would still be functional and would emit no undesired characters. +### Super-loose isolation + +Add isolates and strongly directional marks to required and optional whitespace in the syntax. +This would permit users to get the effects described by the above design, +as long as they use isolates/marks in a "responsible" way. + +(Omitting other changes found in #673) + +```abnf +; strongly directional marks and bidi isolates +; ALM / LRM / RLM / LRI / RLI / FSI / PDI +bidi = %x061C / %x200E / %x200F / %x2066-2069 + +; optional whitespace +owsp = *( s / bidi ) + +; required whitespace +wsp = [ owsp ] 1*s [ owsp ] + +; whitespace characters +s = ( SP / HTAB / CR / LF / %x3000 ) +``` + +**Pros** +- Avoids problems with syntax errors that users and tools might find difficult to debug. +- Effective if used carefully. 
+- Addresses need to comply with UAX#31 + +**Cons** +- Can be used irresponsibly, including enabling some Trojan Source cases (UAX#55) ### Strict isolation all the time From 03ebbcee1625b532f454fb78329538e76d8a075f Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 23 Jul 2024 11:33:05 -0700 Subject: [PATCH 09/16] Update exploration/bidi-usability.md Co-authored-by: Eemeli Aro --- exploration/bidi-usability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 36aa51ea4..b341b09bf 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -115,7 +115,7 @@ neutrals or opposite direction runs in an unhelpful way). We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. The isolates and strongly-directional marks are invisble except in certain specialized editing environments. While users and tools should be strict about using well-formed isolate sequences, -we don't want to invisble characters or whitespace to generate additional syntax errors except where necessary. +we don't want to have invisible characters or whitespace generate additional syntax errors except where necessary. Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates. It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing. 
From df376748f69589f4b57632873c2640bf3ec08ea6 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 23 Jul 2024 11:33:37 -0700 Subject: [PATCH 10/16] Update exploration/bidi-usability.md Co-authored-by: Eemeli Aro --- exploration/bidi-usability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index b341b09bf..ee0fc9f51 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -324,7 +324,7 @@ close-isolate = %x2069 > We need to allow users to include bidi characters, including isolates and strongly directional marks > in the output of MF2. -Permit **left-to-right** isolates (`U+2066`...`U+2069`) to be used **immediately inside** the following: +Permit **left-to-right** isolates (`U+2066` and `U+2069`) to be used **immediately inside** the following: - expressions - markup From f12e316aee2d5fef39c82685ec4493981ea13836 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Tue, 23 Jul 2024 15:18:02 -0700 Subject: [PATCH 11/16] Address comments, add Postel's Law design approach --- exploration/bidi-usability.md | 47 +++++++++++++++++++++++++++++++++-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index ee0fc9f51..fb205d0c4 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -436,7 +436,8 @@ s = ( SP / HTAB / CR / LF / %x3000 ) - Addresses need to comply with UAX#31 **Cons** -- Can be used irresponsibly, including enabling some Trojan Source cases (UAX#55) +- Syntax does not prevent poor display outcomes, including enabling some Trojan Source cases (UTS#55); + note that tooling or linting can help ameliorate these issues. 
### Strict isolation all the time @@ -487,7 +488,7 @@ Cons: - This is not friendly to non-English/non-Latin users and represents a usability restriction in environments in which names can be non-ASCII values -### Allow even more permissive use of bidi controls +### Permit LRI, RLI, and FSI inside expressions and markup We could permit RLI/FSI to be used inside _expressions_ and _markup_. This would be an advantage for simple _expressions_ containing only or primarily @@ -530,3 +531,45 @@ complex sets of controls. - Requires complex sets of bidi controls - RTL editing/display is mostly a special case; we already afford the ability to edit RTL in _patterns_ and _literals_ + +### Hybrid approaches + +Strict syntactical requirements produce better _display_ outcomes +that solve the various problems enumerated in this design document. +However, the strictness comes with a cost: otherwise-valid messages, +including messages that display completely as expected and are not in any way misleading, +can produce syntax errors. +These errors can be difficult to debug, since the characters are invisible. +Syntax errors are generally treated as fatal by processors. + +Semi-strict or super-loose strategies can be used to avoid producing these types of syntax error. +However, valid messages using these approaches can have stray (e.g. unpaired isolates), +malformed (e.g. PDI before LRI/RLI/FSI), +or badly formatted character sequences (wrapping the wrong things), +unless the user or the user's tools are careful. +This can include deliberate abuse, such as Trojan Source attacks (see UTS#55), +in which Bad Actors create messages that have a misleading appearance vs. their runtime interpretation. 
+ +A hybrid ("Postel's Law") approach would be to permit the use of isolates and strongly directional marks +in whitespace in a permissive way (see: "super-loose isolation"), +particularly in runtime formatting operations +but strongly encourage tools to implement message normalization on a strictly-defined grammar +(see: "strict isolation all the time") +and to encourage users to use the strict version of the grammar when writing or serializing messages. + +The hybrid approach would include tests to allow implementations to claim +adherence to the stricter grammar. + +**Pros** +- Messages can be written that solve all display problems +- Stray, unpaired, repeated, or other invisible typos do not produce spurious + syntax errors +- Provides a foundation for tools to claim strict conformance and message normalization + as well as guidance to implementers to make them want to adopt it + +**Cons** +- Requires additional effort to maintain the grammar +- Requires additional effort to maintain tests +- Valid messages can contain Trojan Source and other negative display consequences; + messages can be checked, however, using the strict grammar, so tools could warn + users of potential abuse From 27ca44744d11631f42eb14cb883c6950711b9e1a Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Wed, 24 Jul 2024 14:57:19 -0700 Subject: [PATCH 12/16] Update exploration/bidi-usability.md Co-authored-by: Tim Chevalier --- exploration/bidi-usability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index fb205d0c4..08b7e133f 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -113,7 +113,7 @@ neutrals or opposite direction runs in an unhelpful way). ### Strictness and Abuse We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. -The isolates and strongly-directional marks are invisble except in certain specialized editing environments. 
+The isolates and strongly-directional marks are invisible except in certain specialized editing environments. While users and tools should be strict about using well-formed isolate sequences, we don't want to have invisible characters or whitespace generate additional syntax errors except where necessary. Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates. From 570204ad503b66cd8825ef78c466c12bdb184d3b Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Thu, 8 Aug 2024 06:54:08 -0700 Subject: [PATCH 13/16] Commit @eemeli's comment Adding this text in, to be followed with some of the discussion from the PR Co-authored-by: Eemeli Aro --- exploration/bidi-usability.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 08b7e133f..1dcc6a615 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -453,8 +453,19 @@ message normalization. // TODO put ABNF here + +### Isolate `name` rather than `unquoted-literal` + Isolating rather than marking `name` helps ensure that its directionality does not spill over to adjoining syntax. 
+ +The following replaces the proposed design's changes to `literal` and the `[bidi]` additions to +`variable-expression`, `function`, `option`, `attribute`, `markup`, and `ns-separator`: +```abnf +name = [open-isolate] name-start *name-char [close-isolate] +quoted-literal = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate] +``` + For example, this allows for the proper rendering of the expression ``` {⁦:⁧אחת⁩:⁧שתיים⁩⁩} From 53815b3392ef86605f04b1eb21da505d5ec8243d Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Thu, 8 Aug 2024 07:01:19 -0700 Subject: [PATCH 14/16] Add missing ABNF --- exploration/bidi-usability.md | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 1dcc6a615..9afdf13c3 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -450,8 +450,21 @@ The main differences to the proposed solution is: As noted above, the "strict" version of the ABNF should be adopted by serializers and for message normalization. 
-// TODO put ABNF here - +```abnf +variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}" +function = ":" identifier [bidi] *(s option) +option = identifier [bidi] [s] "=" [s] (literal / variable) [bidi] + / LRI identifier [bidi] [s] "=" [s] (literal / variable) [bidi] close-isolate +attribute = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] + / LRI "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] close-isolate +markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone + / "{" LRI [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] close-isolate "}" + / "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close + / "{" LRI [s] "/" identifier [bidi] *(s option) *(s attribute) [s] close-isolate "}" ; close +identifier = [(namespace ns-separator)] name +ns-separator = [bidi] ":" +bidi = [ %x200E-200F / %x061C ] +``` ### Isolate `name` rather than `unquoted-literal` From 43a5ab2ae6a63ebb2b3599a01979316c4765f96e Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Thu, 8 Aug 2024 07:48:10 -0700 Subject: [PATCH 15/16] Add discussion points Adding the pros/cons for isolating name tokens in order to facilitate discussion. --- exploration/bidi-usability.md | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 9afdf13c3..28aeb9456 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -489,11 +489,26 @@ Without `name` isolation, this would (misleadingly) render as {⁦:אחת:שתיים⁩} ``` -In the syntax, it's much simpler to include the changes to `name` in that rule, -rather than patching every place where `name` is used. -Either way, the parsed value of the name should not include the open/close isolates, -just as they're not included in the parsed values of quoted literals or quoted patterns. 
+Note that the parsed value of the `name` does not include the open/close isolates, +just as they're not included in the parsed values of quoted literals or quoted patterns, +even though the production includes the characters. +We could accomplish this by adding additional productions to manage `name`, at the cost +of a more complex ABNF. +**Pros** +- In the syntax, it's much simpler to include the changes to `name` in the `name` rule, + rather than patching every place where `name` is used. + +**Cons** +- Implementations need to remove isolates from the `name` token before comparing + the value to other values (such as comparing `function` or `variable` names). + Because of namespacing, this requires looking _inside_ the token. +- Implementations might need to insert isolates when generating names upon serialization. + The current data model does not separate `namespace` and `name`, + so this might be more complicated. +- `unquoted-literal` values appear as keys, as operands, and as option values. + If not isolated, these can cause spillover effects, so we might need both `name` + and `unquoted-literal` isolation.
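The first point under **Cons** above (removing isolates before comparing names, including inside a namespaced identifier) might look roughly like the sketch below. This is an illustration under stated assumptions, not an implementation from the design: the helper names `strip_name_isolates` and `identifier_value` are hypothetical, the token is assumed to contain no bidi marks around the `:` separator, and `:` is assumed never to occur inside a `name`.

```python
# Hypothetical helpers, for illustration only.
OPEN_ISOLATES = "\u2066\u2067\u2068"  # LRI, RLI, FSI
CLOSE_ISOLATE = "\u2069"              # PDI


def strip_name_isolates(name: str) -> str:
    """Drop one optional leading open-isolate and one optional trailing close-isolate."""
    if name and name[0] in OPEN_ISOLATES:
        name = name[1:]
    if name.endswith(CLOSE_ISOLATE):
        name = name[:-1]
    return name


def identifier_value(token: str) -> str:
    """Return the comparable value of an identifier token such as namespace:name.

    Each part is stripped separately, which is why the implementation has to
    look inside the token rather than only at its two ends.
    """
    return ":".join(strip_name_isolates(part) for part in token.split(":"))
```

Because the open and close isolates are each optional in the `name` production, the sketch strips them independently instead of requiring a matched pair.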
### Deeper Syntax Changes We could alter the syntax to make it more "bidi robust", From b6e4132e0e1b25e4560a8b751d903e24e28a6380 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Thu, 8 Aug 2024 08:04:17 -0700 Subject: [PATCH 16/16] Address comments from @catamorphism --- exploration/bidi-usability.md | 39 ++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 17 deletions(-) diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md index 28aeb9456..3f70ed700 100644 --- a/exploration/bidi-usability.md +++ b/exploration/bidi-usability.md @@ -288,8 +288,8 @@ Preferring LTR display is not the disadvantage to right-to-left languages that i in the correct location in an RTL _pattern_ - _Expressions_ use isolates and directional marks to display internal tokens in the correct order and without spillover effects -- The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs - for shaping purposes and these offer a poor person's form of isolation. +- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm + pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself. The syntax permits (but does not require) isolating bidi controls to be used on the **outside** of the following: @@ -324,16 +324,23 @@ close-isolate = %x2069 > We need to allow users to include bidi characters, including isolates and strongly directional marks > in the output of MF2. -Permit **left-to-right** isolates (`U+2066` and `U+2069`) to be used **immediately inside** the following: -- expressions -- markup +- Permit **left-to-right** isolates + (starting with LRI `U+2066` and ending with PDI `U+2069`) + to be used **immediately inside** the following: + - expressions + - markup -Permit isolates around any token inside of an expression or markup. 
+- Permit any type of isolate sequence + (starting with LRI `U+2066`, RLI `U+2067`, or FSI `U+2068` and ending with PDI `U+2069`) + around any token inside of an expression or markup. -We only permit the LTR isolates because the contents of an _expression_ -or _markup_ must be laid out left-to-right. -_Literal_ values can be right-to-left isolated within that or use strongly -directional marks to ensure correct display. +- Permit the use of LRM, RLM, or ALM strongly directional marks immediately following any of the items that + **end** with the `name` production in the ABNF. + This includes _identifiers_ found in the names of + _functions_ + and _options_, + plus the names of _variables_, + as well as the contents of _unquoted_ literals. This would change the ABNF as follows (assuming the above changes are also incorporated): ```abnf @@ -346,13 +353,11 @@ markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [clos LRI = %x2066 ``` -Permit the use of LRM, RLM, or ALM stronly directional marks immediately following any of the items that -**end** with the `name` production in the ABNF. -This includes _identifiers_ found in the names of -_functions_ -and _options_, -plus the names of _variables_, -as well as the contents of _unquoted_ literals. +> [!NOTE] +> This design only permits LTR isolates at the expression level because the contents of an _expression_ +> or _markup_ must be laid out left-to-right. +> _Literal_ values can be right-to-left isolated within that or use strongly +> directional marks to ensure correct display. > [!NOTE] > Notice that _unquoted literals_ can also be surrounded by bidi isolates