[DESIGN] Implement changes to bidi to permit non-strict formation #811

aphillips · 2024-06-24T23:26:51Z

This PR contains the design document for bidi handling at the syntax level. It contains a fairly comprehensive description of the problem space with examples.

**_DO NOT REVIEW_** This PR will eventually include the design changes. Currently a work in progress.

exploration/bidi-usability.md

Since we're adopting "loose" as the proposed design, put "strict" as a considered option.

exploration/bidi-usability.md

macchiati · 2024-07-23T18:01:10Z

It is maybe a Pro (but more of a possibly-mitigating factor for a Con)

…

On Tue, Jul 23, 2024 at 10:57 AM Addison Phillips wrote:

> In exploration/bidi-usability.md:
>
> -literal-expression = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}"
> -variable-expression = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}"
> -annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}"
> +**Cons**
> +- Can be used irresponsibly, including enabling some Trojan Source cases (UAX#55)
>
> This would be better as a pro perhaps? "Irresponsibly" might be the wrong word in the 'con', although it smacks of code attacks, which is the idea. This design enables the right thing by also enabling many wrong things. On the other hand, a lot of the time we'll be lucky if tools implement anything or users bother to insert the isolates/marks.

catamorphism · 2024-07-24T19:15:57Z

exploration/bidi-usability.md

  in the correct location in an RTL _pattern_
 - _Expressions_ use isolates and directional marks to display internal tokens in the
  correct order and without spillover effects
+- The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs
+  for shaping purposes and these offer a poor person's form of isolation.


I can't tell whether to read this as "paired enclosing marks (that the Unicode Bidirectional Algorithm pairs for shaping purposes)" or "uses paired enclosing marks (that the Unicode Bidirectional Algorithm pairs) for shaping purposes."

It's the former.

Suggested change

-  for shaping purposes and these offer a poor person's form of isolation.
+- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm
+  pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.

catamorphism · 2024-07-24T19:18:33Z

exploration/bidi-usability.md


-Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following:
+Permit **left-to-right** isolates (`U+2066` and `U+2069`) to be used **immediately inside** the following:


Should this be part of a bullet list, along with the other "Permit..." imperatives that follow?

The paragraph on line 331 should probably move below the paragraph "We only permit..." because it has a different meaning and refers to all isolates. Not sure about using a bullet list?

Done. Created the bullet list and reorganized.

catamorphism · 2024-07-24T19:21:51Z

exploration/bidi-usability.md

+### Strict isolation all the time
+
+Apply bidi isolates in a strict way.
+The main differences to the proposed solution is:


I only see one difference?

I should do the TODO, which makes it clearer. I'll do one here in the comment for clarity and then go back and fix the PR.

The current design has expression thusly:

expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"

This alternative would turn that into:

expression = "{" (literal-expression / variable-expression / annotation-expression) "}"
           / "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"

In this formulation, you cannot have unpaired opening (or closing) isolates without a syntax error, nor can you have multiples of open or close.

Rinse and repeat for markup, option, attribute, and literals.

Make sense?
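To make the difference between the two formulations concrete, here is a minimal Python sketch. It only looks at what sits immediately inside the braces and does not parse the expression body. The function names are hypothetical, and it assumes `close-isolate` matches PDI (`U+2069`).

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE (assumed to be what close-isolate matches)


def isolate_shape(expression: str) -> tuple[bool, bool]:
    """Report whether a braced expression opens with LRI and closes with PDI."""
    if len(expression) < 2 or expression[0] != "{" or expression[-1] != "}":
        raise ValueError("expected a string of the form '{...}'")
    body = expression[1:-1]
    return body.startswith(LRI), body.endswith(PDI)


def accepted_by_loose(expression: str) -> bool:
    """Loose rule: the opening and closing isolates are independently optional."""
    isolate_shape(expression)  # only validates the surrounding braces
    return True


def accepted_by_strict(expression: str) -> bool:
    """Strict rule: either both isolates are present or neither is."""
    has_open, has_close = isolate_shape(expression)
    return has_open == has_close
```

Under these assumptions, `{` + LRI + `$x}` (an unpaired opening isolate) passes the loose check but fails the strict one, which is the syntax error described above.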

Fixed the TODO

Co-authored-by: Tim Chevalier <tjc@igalia.com>

eemeli

I would be interested in hearing from @catamorphism, @mihnita, @lucacasonato and any other implementers about whether they would have any concerns about including bidi isolation characters within the name ABNF rule, as proposed in the alternative I'm naming here as "Isolate name rather than unquoted-literal".

This would not change the parsed meaning of a name in any way, but only include the bidi isolation or mark characters within the ABNF production so that they do not need to be mentioned in all the places where a name is used, as is the case in the current proposed solution.

@aphillips and I have discussed this, and as our opinions on this diverge, getting comments from other people who'll need to adjust their parsers due to the changes here would be helpful.

exploration/bidi-usability.md

lucacasonato · 2024-07-25T17:01:54Z

Prefix: I am very unfamiliar with bidi - the linked introduction at the start of this doc was a great help. So thanks for linking that.

My impression is that the proposed ABNF seems reasonable. I do not think this would make my parser significantly more complex.

I make this statement on the following assumptions that I have based on my understanding of this document. Please correct them if they are wrong.

- isolation/bidi characters are in no way stored in an AST / data model
- isolation/bidi character placement is not meant to round-trip through parse-serialize
- a serializer can make a fully automatic determination based just on the AST / data model as to where what isolation/bidi chars need to be placed in the output
- a parser does not need to validate the correct pairing of bidi chars

I would be interested to understand further how likely it is that by editing a message that contains isolation/bidi chars, a user could produce a message that ends up being either:

- syntactically invalid
- left with isolation/bidi chars in the middle of a pattern or quoted literal, only because the user backspaced, copy-pasted, etc.

I guess what I am asking is how text (and code) editors usually deal with isolation/bidi chars: if I have a message that has an expression containing a PDI just before the }, and I remove (backspace) the }, then manually move my cursor to the column trailing the last visible character in the expression, and type }, would the PDI now be in the pattern or would it be gone?

catamorphism · 2024-08-07T18:33:07Z

> I would be interested in hearing from @catamorphism, @mihnita, @lucacasonato and any other implementers about whether they would have any concerns about including bidi isolation characters within the name ABNF rule, as proposed in the alternative I'm naming here as "Isolate name rather than unquoted-literal".
>
> This would not change the parsed meaning of a name in any way, but only include the bidi isolation or mark characters within the ABNF production so that they do not need to be mentioned in all the places where a name is used, as is the case in the current proposed solution.
>
> @aphillips and I have discussed this, and as our opinions on this diverge, getting comments from other people who'll need to adjust their parsers due to the changes here would be helpful.

I don't have concerns about this... or maybe I will once I try implementing it, but I can't think of anything right now.

aphillips · 2024-08-07T23:10:07Z

@catamorphism Thanks. My disagreement with @eemeli about including isolates and directional marks inside names (that are not considered part of the name) is that this would require parsers to process each name to remove these characters. I think this is relatively high impact for very little reward.

We are discussing in another thread potentially doing NFC normalization for comparison purposes, but this doesn't require walking the buffer (and can be fast-checked and otherwise optimized). Prohibiting isolates and strong marks in names will permit a few corner cases in which the namespace and name display awkwardly but no cases in which false matches are produced. Users have a much better workaround ("choose unidirectional names") available.
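As a rough illustration of the per-name processing this objection refers to, the Python sketch below strips bidi controls from a name token before comparing. The helper names are hypothetical, and the set of characters is an assumption: the four isolate controls plus LRM, RLM, and ALM. The exact set the alternative would permit is not settled in this thread.

```python
# Assumed set of controls that would be allowed around a name but not be
# part of it: LRI, RLI, FSI, PDI, plus the marks LRM, RLM and ALM.
BIDI_CONTROLS = frozenset(
    "\u2066\u2067\u2068\u2069"  # isolate controls
    "\u200e\u200f\u061c"        # directional marks
)


def strip_bidi_controls(name: str) -> str:
    """Return the name with every bidi control removed (one pass over the token)."""
    return "".join(ch for ch in name if ch not in BIDI_CONTROLS)


def names_match(left: str, right: str) -> bool:
    """Compare two name tokens while ignoring any bidi controls they carry."""
    return strip_bidi_controls(left) == strip_bidi_controls(right)
```

Every name that reaches a lookup or comparison would have to go through something like `strip_bidi_controls`, which is the cost being weighed here against the display benefit.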

Note: I would like to merge this PR and then have a technical discussion of the design.

eemeli · 2024-08-08T03:23:32Z

> Note: I would like to merge this PR and then have a technical discussion of the design.

I'd be fine with that, provided that the suggestion from #811 (comment) or something like it is included to document the remaining aspect of this that we disagree on.

Adding this text in, to be followed with some of the discussion from the PR Co-authored-by: Eemeli Aro <eemeli@mozilla.com>

Adding the pros/cons for isolating name tokens in order to facilitate discussion.

aphillips · 2024-08-08T14:52:31Z

@eemeli I inserted the comment with edits and additions. Please have a look and see that I've captured the pros/cons to your satisfaction or if you have additions.

For the record, I think I'm leaning towards the "hybrid approaches" design. It would mean that MF2 messages might have spillover effects, but would encourage implementations to (re)serialize messages in ways that eliminate such effects. Looking forward to merge-and-discuss.

eemeli

This looks fine to merge. One line comment below, but it's not a blocker.

eemeli · 2024-08-08T18:34:29Z

exploration/bidi-usability.md

+- `unquoted-literal` values appear as keys, as operands, and as option values.
+  If not isolated, these can cause spillover effects, so we might need both `name`
+  and `unquoted-literal` isolation.


Given how unquoted-literal is defined as

message-format-wg/spec/message.abnf

Line 48 in 01f2880

unquoted-literal = name / number-literal

and number-literal only contains LTR characters

message-format-wg/spec/message.abnf

Line 50 in 01f2880

number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]

allowing name isolation should cover all the unquoted-literal cases that may need isolation.

Ur... this is something I don't like, since isolates are "not included" in the name, but are baked into the production. I'd rather do more with the ABNF to keep them separate (the isolates and directional characters are in the same places, but part of the surrounding goo instead of being part of the name-like token).

Note that `number-literal` contains only neutral and weakly directional characters (except for `e`, which is strongly LTR). The leading `-` can switch sides in an RTL context unless it is isolated or protected with an LRM.
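The bidi classes involved can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only: the regular expression is a hand translation of the `number-literal` rule quoted above, and the function names are hypothetical.

```python
import re
import unicodedata

# Hand translation of the quoted ABNF rule:
# number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]
NUMBER_LITERAL = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?")


def is_number_literal(text: str) -> bool:
    """Check whether the whole string matches the number-literal rule."""
    return NUMBER_LITERAL.fullmatch(text) is not None


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidi class of each character, in order."""
    return [unicodedata.bidirectional(ch) for ch in text]
```

For `-1.5e+3` this reports `ES` for the signs, `EN` for the digits, `CS` for the decimal point, and `L` for `e`. Only `e` is a strong type, so a leading `-` has no strong character of its own to anchor it when the surrounding text is RTL.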

[DESIGN] Implement changes to bidi to permit non-strict formation

c2bcfe5

**_DO NOT REVIEW_** This PR will eventually include the design changes. Currently a work in progress.

aphillips requested a review from eemeli June 24, 2024 23:26

eemeli reviewed Jun 25, 2024

Address comments plus update "other" solutions

9342386

Since we're adopting "loose" as the proposed design, put "strict" as a considered option.

aphillips added the syntax (Issues related with MF Syntax), design (Design principles, decisions), normative, and LDML46 (LDML46 Release (Tech Preview - October 2024)) labels Jun 25, 2024

aphillips added 6 commits July 2, 2024 12:55

Edits to add some previous discussion points

3d84859

Typo

33528f3

Improve discussion of abuse

dc175bb

Update bidi-usability.md

ed8bab5

Update bidi-usability.md

2f7e13c

Add the 'super-loose' option

1d9ae2e

macchiati reviewed Jul 23, 2024

eemeli reviewed Jul 23, 2024

aphillips and others added 3 commits July 23, 2024 11:33

Update exploration/bidi-usability.md

03ebbce

Co-authored-by: Eemeli Aro <eemeli@mozilla.com>

Update exploration/bidi-usability.md

df37674

Co-authored-by: Eemeli Aro <eemeli@mozilla.com>

Address comments, add Postel's Law design approach

f12e316

aphillips requested review from eemeli, macchiati, stasm, catamorphism, echeran, gibson042 and mihnita July 23, 2024 22:36

catamorphism reviewed Jul 24, 2024

Update exploration/bidi-usability.md

27ca447

Co-authored-by: Tim Chevalier <tjc@igalia.com>

eemeli reviewed Jul 25, 2024

eemeli mentioned this pull request Jul 29, 2024

Require at least one keyword for complex messages #841

Closed

eemeli linked an issue Jul 29, 2024 that may be closed by this pull request

[FEEDBACK] Unpaired bidi isolates should not be a parse error #788

Closed

aphillips and others added 2 commits August 8, 2024 06:54

Commit @eemeli's comment

570204a

Adding this text in, to be followed with some of the discussion from the PR Co-authored-by: Eemeli Aro <eemeli@mozilla.com>

Add missing ABNF

53815b3

eemeli approved these changes Aug 8, 2024

Add discussion points

43a5ab2

Adding the pros/cons for isolating name tokens in order to facilitate discussion.

Address comments from @catamorphism

b6e4132

aphillips requested a review from catamorphism August 8, 2024 15:05

eemeli approved these changes Aug 8, 2024

aphillips added the Agenda+ (Requested for upcoming teleconference) label Aug 9, 2024

aphillips merged commit 1c1bb37 into main Aug 12, 2024
1 check passed

aphillips deleted the aphillips-bidi-whitespace branch August 12, 2024 16:46
