[DESIGN] Implement changes to bidi to permit non-strict formation #811
**_DO NOT REVIEW_** This PR will eventually include the design changes. Currently a work in progress.
Since we're adopting "loose" as the proposed design, put "strict" as a considered option.
It is maybe a Pro (but more of a possibly-mitigating factor for a Con)
On Tue, Jul 23, 2024 at 10:57 AM Addison Phillips commented on this pull request, in exploration/bidi-usability.md:
-literal-expression = "{" [LRI] [s] literal [s annotation] *(s attribute) [s] [close-isolate] "}"
-variable-expression = "{" [LRI] [s] variable [s annotation] *(s attribute) [s] [close-isolate] "}"
-annotation-expression = "{" [LRI] [s] annotation *(s attribute) [s] [close-isolate] "}"
+**Cons**
+- Can be used irresponsibly, including enabling some Trojan Source cases (UAX#55)
This would be better as a pro perhaps? "Irresponsibly" might be the wrong
word in the 'con', although it smacks of code attacks, which is the idea.
This design enables the right thing by also enabling many wrong things. On
the other hand, a lot of the time we'll be lucky if tools implement
anything or users bother to insert the isolates/marks.
Co-authored-by: Eemeli Aro <eemeli@mozilla.com>
exploration/bidi-usability.md
Outdated
in the correct location in an RTL _pattern_
- _Expressions_ use isolates and directional marks to display internal tokens in the
  correct order and without spillover effects
- The syntax uses paired enclosing marks that the Unicode Bidirectional Algorithm pairs
  for shaping purposes and these offer a poor person's form of isolation.
I can't tell whether to read this as "paired enclosing marks (that the Unicode Bidirectional Algorithm pairs for shaping purposes)" or "uses paired enclosing marks (that the Unicode Bidirectional Algorithm pairs) for shaping purposes."
It's the former.
-  for shaping purposes and these offer a poor person's form of isolation.
+- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm
+  pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.
Done.
exploration/bidi-usability.md
Outdated
Permit **left-to-right** isolating bidi controls (`U+2066`...`U+2069`) to be used **immediately inside** the following:
Permit **left-to-right** isolates (`U+2066` and `U+2069`) to be used **immediately inside** the following:
Should this be part of a bullet list, along with the other "Permit..." imperatives that follow?
The paragraph on line 331 should probably move below the paragraph "We only permit..." because it has a different meaning and refers to all isolates. Not sure about using a bullet list?
Done. Created the bullet list and reorganized.
### Strict isolation all the time

Apply bidi isolates in a strict way.
The main differences to the proposed solution is:
I only see one difference?
I should do the TODO, which makes it clearer. I'll do one here in the comment for clarity and then go back and fix the PR.
The current design has `expression` thusly:
expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"
This alternative would turn that into:
expression = "{" (literal-expression / variable-expression / annotation-expression) "}"
/ "{" LRI (literal-expression / variable-expression / annotation-expression) close-isolate "}"
In this formulation, you cannot have unpaired opening (or closing) isolates without a syntax error, nor can you have multiples of open or close.
Rinse and repeat for markup, option, attribute, and literals.
Make sense?
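To make the difference concrete, here is a minimal Python sketch of the two rules as quoted above. It is only an illustration under stated assumptions: the expression body is treated as opaque, only the characters immediately inside the braces are inspected, and the function names (`isolate_shape`, `loose_ok`, `strict_ok`) are hypothetical and not taken from any implementation.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, i.e. close-isolate


def isolate_shape(expression: str) -> tuple[bool, bool]:
    """Return (has_open, has_close) for the isolates just inside the braces."""
    if len(expression) < 2 or expression[0] != "{" or expression[-1] != "}":
        raise ValueError("not a brace-delimited expression")
    inner = expression[1:-1]
    has_open = inner.startswith(LRI)
    if has_open:
        inner = inner[1:]  # do not let a lone LRI also count as the close
    has_close = inner.endswith(PDI)
    return has_open, has_close


def loose_ok(expression: str) -> bool:
    """Current design: [LRI] and [close-isolate] are independently optional."""
    isolate_shape(expression)  # only validates the braces
    return True


def strict_ok(expression: str) -> bool:
    """Alternative: no isolates at all, or a matched LRI ... close-isolate pair."""
    has_open, has_close = isolate_shape(expression)
    return has_open == has_close
```

Under this sketch, an expression with an opening LRI but no closing isolate passes `loose_ok` and fails `strict_ok`, which is the unpaired case the strict alternative turns into a syntax error.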
Fixed the TODO
Co-authored-by: Tim Chevalier <tjc@igalia.com>
I would be interested in hearing from @catamorphism, @mihnita, @lucacasonato and any other implementers about whether they would have any concerns about including bidi isolation characters within the `name` ABNF rule, as proposed in the alternative I'm naming here as "Isolate `name` rather than `unquoted-literal`".
This would not change the parsed meaning of a `name` in any way, but only include the bidi isolation or mark characters within the ABNF production so that they do not need to be mentioned in all the places where a `name` is used, as is the case in the current proposed solution.
@aphillips and I have discussed this, and as our opinions on this diverge, getting comments from other people who'll need to adjust their parsers due to the changes here would be helpful.
Preface: I am very unfamiliar with bidi - the linked introduction at the start of this doc was a great help. So thanks for linking that. My impression is that the proposed ABNF seems reasonable. I do not think this would make my parser significantly more complex. I make this statement on the following assumptions that I have based on my understanding of this document. Please correct them if they are wrong.
I would be interested to understand further how likely it is that by editing a message that contains isolation/bidi chars, a user could produce a message that ends up being either:
I guess what I am asking is how text (and code) editors usually deal with isolation/bidi chars: if I have a message that has an expression containing a PDI just before the
I don't have concerns about this... or maybe I will once I try implementing it, but I can't think of anything right now.
@catamorphism Thanks. My disagreement with @eemeli about including isolates and directional marks inside names (that are not considered part of the name) is that this would require parsers to process each name to remove these characters. I think this is relatively high impact for very little reward. We are discussing in another thread potentially doing NFC normalization for comparison purposes, but this doesn't require walking the buffer (and can be fast-checked and otherwise optimized). Prohibiting isolates and strong marks in names will permit a few corner cases in which the namespace and name display awkwardly but no cases in which false matches are produced. Users have a much better workaround ("choose unidirectional names") available. Note: I would like to merge this PR and then have a technical discussion of the design.
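A rough Python sketch of the per-name processing under discussion may help weigh that cost. It assumes the characters to drop are the four isolate controls plus the three directional marks, and that they can simply be filtered out wherever they occur in the raw token; both are assumptions for illustration, since the thread does not pin down the exact set or positions. The helper names are hypothetical.

```python
# Assumed set of controls: the isolates and directional marks discussed here.
BIDI_CONTROLS = frozenset(
    "\u061c"  # ARABIC LETTER MARK
    "\u200e"  # LEFT-TO-RIGHT MARK
    "\u200f"  # RIGHT-TO-LEFT MARK
    "\u2066"  # LEFT-TO-RIGHT ISOLATE
    "\u2067"  # RIGHT-TO-LEFT ISOLATE
    "\u2068"  # FIRST STRONG ISOLATE
    "\u2069"  # POP DIRECTIONAL ISOLATE
)


def strip_bidi_controls(raw_name: str) -> str:
    """Walk the raw token once and drop every assumed bidi control."""
    return "".join(ch for ch in raw_name if ch not in BIDI_CONTROLS)


def same_name(a: str, b: str) -> bool:
    """Compare two raw tokens by their stripped form."""
    return strip_bidi_controls(a) == strip_bidi_controls(b)
```

The extra pass over every name is the cost being argued about; if the controls are prohibited inside names instead, the raw token can be compared as-is.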
I'd be fine with that, provided that the suggestion from #811 (comment) or something like it is included to document the remaining aspect of this that we disagree on.
Adding this text in, to be followed with some of the discussion from the PR
Co-authored-by: Eemeli Aro <eemeli@mozilla.com>
Adding the pros/cons for isolating name tokens in order to facilitate discussion.
@eemeli I inserted the comment with edits and additions. Please have a look and see that I've captured the pros/cons to your satisfaction or if you have additions. For the record, I think I'm leaning towards the "hybrid approaches" design. It would mean that MF2 messages might have spillover effects, but would encourage implementations to (re)serialize messages in ways that eliminate such effects. Looking forward to merge-and-discuss.
This looks fine to merge. One line comment below, but it's not a blocker.
- `unquoted-literal` values appear as keys, as operands, and as option values.
  If not isolated, these can cause spillover effects, so we might need both `name`
  and `unquoted-literal` isolation.
Given how `unquoted-literal` is defined as
message-format-wg/spec/message.abnf, line 48 in 01f2880:
unquoted-literal = name / number-literal
and that `number-literal` only contains LTR characters
message-format-wg/spec/message.abnf, line 50 in 01f2880:
number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]
`name` isolation should cover all the `unquoted-literal` cases that may need isolation.
Ur... this is something I don't like, since isolates are "not included" in the `name`, but are baked into the production. I'd rather do more with the ABNF to keep them separate (the isolates and directional characters are in the same places, but part of the surrounding goo instead of being part of the name-like token).
Note that `number-literal` contains only neutral and weakly directional characters (except for `e`, which is strongly LTR). Note that the leading `-` can switch sides in an RTL context (unless isolated or protected with LRM).
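The bidi classes of the characters in a `number-literal` can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only: the regular expression is a hand translation of the `number-literal` ABNF rule quoted in this thread, and the helper name `bidi_classes` is hypothetical.

```python
import re
import unicodedata

# Hand translation of the number-literal rule; %i"e" is case-insensitive.
NUMBER_LITERAL = re.compile(
    r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?"
)


def bidi_classes(text: str) -> dict[str, str]:
    """Map each distinct character to its Unicode Bidi_Class short name."""
    return {ch: unicodedata.bidirectional(ch) for ch in text}


sample = "-1.5e+10"
assert NUMBER_LITERAL.fullmatch(sample)
for ch, cls in bidi_classes(sample).items():
    print(repr(ch), cls)
```

For this sample, the digits report `EN`, the signs `ES`, and the decimal point `CS`, all of which are weak classes, while `e` reports `L`, the only strongly directional character in the rule.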
This PR contains the design document for bidi handling at the syntax level. It contains a fairly comprehensive description of the problem space with examples.