Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DESIGN] Implement changes to bidi to permit non-strict formation #811

Merged
merged 16 commits into from
Aug 12, 2024

Conversation

aphillips
Copy link
Member

@aphillips aphillips commented Jun 24, 2024

This PR contains the design document for bidi handling at the syntax level. It contains a fairly comprehensive description of the problem space with examples.

**_DO NOT REVIEW_**

This PR will eventually include the design changes. Currently a work in progress.
@aphillips aphillips requested a review from eemeli June 24, 2024 23:26
Since we're adopting "loose" as the proposed design, put "strict" as a considered option.
@aphillips aphillips added syntax Issues related with MF Syntax design Design principles, decisions normative LDML46 LDML46 Release (Tech Preview - October 2024) labels Jun 25, 2024
@macchiati
Copy link
Member

macchiati commented Jul 23, 2024 via email

exploration/bidi-usability.md Outdated Show resolved Hide resolved
exploration/bidi-usability.md Outdated Show resolved Hide resolved
aphillips and others added 3 commits July 23, 2024 11:33
Co-authored-by: Eemeli Aro <eemeli@mozilla.com>
Co-authored-by: Eemeli Aro <eemeli@mozilla.com>
exploration/bidi-usability.md Outdated Show resolved Hide resolved
in the correct location in an RTL _pattern_
- _Expressions_ use isolates and directional marks to display internal tokens in the
correct order and without spillover effects
- The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs
for shaping purposes and these offer a poor person's form of isolation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell whether to read this as "paired enclosing marks (that the Unicode Bidirectional Algorithm pairs for shaping purposes)" or "uses paired enclosing marks (that the Unicode Bidirectional Algorithm pairs) for shaping purposes."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the former.

Suggested change
for shaping purposes and these offer a poor person's form of isolation.
- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm
pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following:
Permit **left-to-right** isolates (`U+2066` and `U+2069`) to be used **immediately inside** the following:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be part of a bullet list, along with the other "Permit..." imperatives that follow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paragraph on line 331 should probably move below the paragraph "We only permit..." because it has a different meaning and refers to all isolates. Not sure about using a bullet list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Created the bullet list and reorganized.

### Strict isolation all the time

Apply bidi isolates in a strict way.
The main differences to the proposed solution is:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see one difference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should do the TODO, which makes it clearer. I'll do one here in the comment for clarity and then go back and fix the PR.

The current design has expression thusly:

expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"

This alternative would turn that into:

expression = "{" (literal-expression / variable-expression / annotation-expression) "}"
           / "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"

In this formulation, you cannot have unpaired opening (or closing) isolates without a syntax error, nor can you have multiples of open or close.

Rinse and repeat for markup, option, attribute, and literals.

Make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the TODO

Co-authored-by: Tim Chevalier <tjc@igalia.com>
Copy link
Collaborator

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be interested in hearing from @catamorphism, @mihnita, @lucacasonato and any other implementers about whether they would have any concerns about including bidi isolation characters within the name ABNF rule, as proposed in the alternative I'm naming here as "Isolate name rather than unquoted-literal".

This would not change the parsed meaning of a name in any way, but only include the bidi isolation or mark characters within the ABNF production so that they do not need to be mentioned in all the places where a name is used, as is the case in the current proposed solution.

@aphillips and I have discussed this, and as our opinions on this diverge, getting comments from other people who'll need to adjust their parsers due to the changes here would be helpful.

exploration/bidi-usability.md Show resolved Hide resolved
@lucacasonato
Copy link
Contributor

Prefix: I am very unfamiliar with bidi - the linked introduction at the start of this doc was a great help. So thanks for linking that.

My impression is that the proposed ABNF seems reasonable. I do not think this would make my parser significantly more complex.

I make this statement on the following assumptions that I have based on my understanding of this document. Please correct them if they are wrong.

  • isolation/bidi characters are in no way stored in an AST / data model
  • isolation/bidi character placement is not meant to round-trip through parse-serialize
  • a serializer can make a fully automatic determination based just on the AST / data model as to where what isolation/bidi chars need to be placed in the output
  • a parser does not need to validate the correct pairing of bidi chars

I would be interested to understand further how likely it is that by editing a message that contains isolation/bidi chars, a user could produce a message that ends up being either:

  • syntactically invalid
  • has isolation/bidi chars in the middle of a pattern or quoted literal, only because they backspace / copy pasted etc

I guess what I am asking is how text (and code) editors usually deal with isolation/bidi chars: if I have a message that has an expresion containing a PDI just before the }, and I remove (backspace) the }, then manually move my cursor to the column trailing the last visible character in the expression, and type }, would the PDI now be in the pattern or would it be gone?

@catamorphism
Copy link
Collaborator

I would be interested in hearing from @catamorphism, @mihnita, @lucacasonato and any other implementers about whether they would have any concerns about including bidi isolation characters within the name ABNF rule, as proposed in the alternative I'm naming here as "Isolate name rather than unquoted-literal".

This would not change the parsed meaning of a name in any way, but only include the bidi isolation or mark characters within the ABNF production so that they do not need to be mentioned in all the places where a name is used, as is the case in the current proposed solution.

@aphillips and I have discussed this, and as our opinions on this diverge, getting comments from other people who'll need to adjust their parsers due to the changes here would be helpful.

I don't have concerns about this... or maybe I will once I try implementing it, but I can't think of anything right now.

@aphillips
Copy link
Member Author

@catamorphism Thanks. My disagreement with @eemeli about including isolates and directional marks inside names (that are not considered part of the name) is that this would require parsers to process each name to remove these characters. I think this is relatively high impact for very little reward.

We are discussing in another thread potentially doing NFC normalization for comparison purposes, but this doesn't require walking the buffer (and can be fast-checked and otherwise optimized). Prohibiting isolates and strong marks in names will permit a few corner cases in which the namespace and name display awkwardly but no cases in which false matches are produced. Users have a much better workaround ("choose unidirectional names") available.

Note: I would like to merge this PR and then have a technical discussion of the design.

@eemeli
Copy link
Collaborator

eemeli commented Aug 8, 2024

Note: I would like to merge this PR and then have a technical discussion of the design.

I'd be fine with that, provided that the suggestion from #811 (comment) or something like it is included to document the remaining aspect of this that we disagree on.

aphillips and others added 2 commits August 8, 2024 06:54
Adding this text in, to be followed with some of the discussion from the PR

Co-authored-by: Eemeli Aro <eemeli@mozilla.com>
Adding the pros/cons for isolating name tokens in order to facilitate discussion.
@aphillips
Copy link
Member Author

aphillips commented Aug 8, 2024

@eemeli I inserted the comment with edits and additions. Please have a look and see that I've captured the pros/cons to your satisfaction or if you have additions.


For the record, I think I'm leaning towards the "hybrid approaches" design. It would mean that MF2 messages might have spillover effects, but would encourage implementations to (re)serialize messages in ways that eliminate such effects. Looking forward to merge-and-discuss.

Copy link
Collaborator

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to merge. One line comment below, but it's not a blocker.

Comment on lines +514 to +516
- `unquoted-literal` values appear as keys, as operands, and as option values.
If not isolated, these can cause spillover effects, so we might need both `name`
and `unquoted-literal` isolation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how unquoted-literal is defined as

unquoted-literal = name / number-literal
and number-literal only contains LTR characters
number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]
allowing name isolation should cover all the unquoted-literal cases that may need isolation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur... this is something I don't like, since isolates are "not included" in the name, but are baked into the production. I'd rather do more with the ABNF to keep them separate (the isolates and directional characters are in the same places, but part of the surrounding goo instead of being part of the name-like token).

Note that number-literal contains only neutral and weakly directional characters (except for e, which is strongly LTR). Note that the leading - can switch sides in an RTL context (unless isolated or protected with LRM)

@aphillips aphillips added the Agenda+ Requested for upcoming teleconference label Aug 9, 2024
@aphillips aphillips merged commit 1c1bb37 into main Aug 12, 2024
1 check passed
@aphillips aphillips deleted the aphillips-bidi-whitespace branch August 12, 2024 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Agenda+ Requested for upcoming teleconference design Design principles, decisions LDML46 LDML46 Release (Tech Preview - October 2024) normative syntax Issues related with MF Syntax
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEEDBACK] Unpaired bidi isolates should not be a parse error
5 participants