Status: Proposed
What is this proposal trying to achieve?
The MessageFormat 2 syntax uses whitespace as a required delimiter as well as permitting the use of whitespace to make messages easier to read. In addition, a message can include bidirectional text in identifiers and literals.
MessageFormat's syntax also uses a variety of "sigils" and markers to form the structure of a message. These sigils are ASCII punctuation characters that have neutral directionality. This means that the inclusion of right-to-left ("RTL") identifiers or literals in a message can result in the syntax looking "scrambled" or, in extreme cases, appearing to have a different meaning due to spillover.
To prevent spillover effects and to allow users (particularly RTL language users) to author messages in a straightforward way, we want to allow the syntax to include appropriate bidirectional support. We also want to recommend mechanisms to tool and translation technology implementers that make messages containing RTL characters easy to work with, without introducing spoofing or "Trojan Source" attack vectors.
What context is helpful to understand this proposal?
If you are unfamiliar with bidirectional or right-to-left text, there is a basic introduction here.
MessageFormat message strings are created and edited primarily by humans. The original message is often written by a software developer or user experience designer. Translators need to work with the target-language versions of each message. Like many templating or domain-specific languages, MF2 uses neutrally-directional symbols to form portions of the syntax. When the message contains right-to-left (RTL) characters in translations or in portions of the syntax, the plain-text of the message and the Unicode Bidirectional Algorithm (UBA, UAX#9) can interact in ways that make the message unintelligible or difficult to parse visually.
Machines do not have a problem parsing messages that contain RTL characters, but users need to be able to discern what a message does. For example, users need to be able to match keys in a variant to selectors in a .match statement. Or they want to know how a pattern will be evaluated, such as understanding the options and values in a placeholder.
In addition, it is possible to construct messages that use bidi characters to spoof users into believing that a message does something different than what it actually does.
The current syntax does not permit bidi controls in name tokens, unquoted literals, or in the non-pattern whitespace portions of a message.
Permitting the Unicode bidi isolate characters and the standalone strongly-directional markers would enable tools, including translation tools, and users who are writing in RTL languages to format a message so that its plain-text representation and its function are unambiguous.
The isolates are paired invisible characters inserted around a portion of a string. The start of an isolated sequence is one of:
- U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
- U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
- U+2068 FIRST-STRONG ISOLATE (FSI)
The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).
The characters inside an isolated sequence have the initial string direction corresponding to the starting character (left-to-right for LRI, right-to-left for RLI, or auto for FSI). They are called "isolates" because the enclosed text is isolated from surrounding text while being processed using the Unicode Bidirectional Algorithm (UBA). The surrounding text treats the sequence as if it were a single neutral character, while the interior sequence is processed using the base direction specified by the isolate starting character.
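As a minimal sketch of the pairing described above, the following wraps a string in an isolate sequence. The helper name `isolate` and the direction keywords are hypothetical and exist only for illustration; only the code points come from this document.

```python
# Isolate control characters named in this document.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST-STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Hypothetical direction keywords mapped to the starting character.
_OPENERS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a paired isolate sequence (illustrative helper)."""
    return _OPENERS[direction] + text + PDI


# The Arabic string used in the examples, written with escapes.
wrapped = isolate("\u0645\u0635\u0631", "rtl")
print([f"U+{ord(ch):04X}" for ch in wrapped])
```

Printing the code points rather than the string itself makes the otherwise invisible controls visible.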
Note
One of the side-effects of using {/} and {{/}} to delimit expressions and patterns is that these paired enclosing punctuations provide a measure of isolation in UBA. This is an additional reason not to change over to quote marks (which are not enclosing) around patterns.
This design also allows for the use of strongly directional marker characters. These include:
- U+200E LEFT-TO-RIGHT MARK (LRM)
- U+200F RIGHT-TO-LEFT MARK (RLM)
- U+061C ARABIC LETTER MARK (ALM)
These characters are invisible strongly-directional characters. They are used in bidirectional text to coerce certain directional behavior (usually to mark the end of a sequence of characters that would otherwise be ambiguous or interact with neutrals or opposite direction runs in an unhelpful way).
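To illustrate why these three characters count as strongly directional, this small sketch looks up their bidirectional class with Python's standard `unicodedata` module. The dictionary and function names are hypothetical.

```python
import unicodedata

# The three marks listed above (the names are just labels for this sketch).
MARKS = {"LRM": "\u200e", "RLM": "\u200f", "ALM": "\u061c"}


def mark_bidi_classes() -> dict:
    """Return the Unicode bidirectional class of each mark."""
    return {name: unicodedata.bidirectional(ch) for name, ch in MARKS.items()}


print(mark_bidi_classes())
```

The classes L, R, and AL are the strong types in the UBA, which is what lets a mark terminate an ambiguous run without adding a visible character.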
We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. The isolates and strongly-directional marks are invisible except in certain specialized editing environments. While users and tools should be strict about using well-formed isolate sequences, we don't want to have invisible characters or whitespace generate additional syntax errors except where necessary. Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates.
It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing. Such an ABNF might be used by message serializers to ensure high-quality message generation.
Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace, could produce the various Trojan Source effects described in UTS#55.
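A tool that wants to be strict without turning unpaired isolates into syntax errors could lint for them instead. The following is a rough sketch of such a check, not part of the proposal; the function name is hypothetical, and it treats the whole message as one unit, ignoring the paragraph boundaries that the UBA also uses to close isolates.

```python
OPENERS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
PDI = "\u2069"


def isolate_problems(message: str) -> list:
    """Report unmatched isolate characters in a message (lint sketch)."""
    problems = []
    open_positions = []  # indexes of openers still waiting for a PDI
    for index, ch in enumerate(message):
        if ch in OPENERS:
            open_positions.append(index)
        elif ch == PDI:
            if open_positions:
                open_positions.pop()
            else:
                problems.append(f"PDI at index {index} has no opening isolate")
    for index in open_positions:
        problems.append(f"isolate at index {index} is never closed")
    return problems


print(isolate_problems("\u2069abc\u2066"))
```

An empty list means every opener has a matching PDI; anything else can be surfaced as a warning rather than a parse failure.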
What use-cases do we see? Ideally, quote concrete examples.
- Presentation of keys can change if the text of the key's literal is not isolated:
.match {$م2صر :string}{$num :integer}
م2صر 0 {{The {$م2صر} is actually the first key}}
م2صر * {{This one appears okay}}
Note
The first variant in the use case above is actually:
\u06452\u0635\u0631 0 {{The {$\u06452\u0635\u0631} is actually the first key}}
- Presentation in an expression can change if portions of the expression are not isolated or do not restore LTR order:
In the following example, we use the same string with a number inserted into the middle of the string to make the bidi effects visible. The numbers correspond to:
1. operand
2. function
3. option name
4. option value
You have {$م1صر :م2صر م3صر=م4صر} <- no controls
You have {$م1صر :م2صر م3صر=م4صر} <- LRM after each RTL token
- As a developer or translator, I want to make unquoted RTL literals or names appear correctly in my plain-text editing environment. I don't want to have to manage a lot of paired controls, when I can get the right effect using strongly directional mark characters (LRM, RLM, ALM).
- As a translation tool or MF2 implementation, I want to automatically generate messages which display correctly when they contain RTL text or substring with minimal user intervention.
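The "LRM after each RTL token" variant in the use cases above could be produced by a tool along these lines. This is only a sketch under simplifying assumptions: it receives the inside of a placeholder (without the braces), splits on spaces and `=` only, and does not handle quoted literals. The function names are hypothetical.

```python
import re
import unicodedata

LRM = "\u200e"
RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes


def has_rtl(token: str) -> bool:
    """True if the token contains any strongly RTL character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in token)


def add_lrm_after_rtl_tokens(expression_body: str) -> str:
    """Append LRM to each space- or '='-delimited token containing RTL text."""
    # The capturing group keeps the delimiters so the body can be rejoined.
    pieces = re.split(r"([ =])", expression_body)
    return "".join(p + LRM if has_rtl(p) else p for p in pieces)


# Operand, function, option name, and option value from the example above.
body = "$\u06451\u0635\u0631 :\u06452\u0635\u0631 \u06453\u0635\u0631=\u06454\u0635\u0631"
print(add_lrm_after_rtl_tokens(body).count(LRM))
```

Tokens without RTL characters are left untouched, so purely LTR messages round-trip unchanged.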
What properties does the solution have to manifest to enable the use-cases above?
To prevent RTL literals from having spillover effects with surrounding syntax, it should be possible to bidi isolate a quoted or unquoted literal.
.local $title = {|البحرين مصر الكويت!|}
.local $egypt = {مصر :string}
To prevent patterns from having spillover effects with other parts of a message, particularly with keys in a variant, it should be possible to bidi-isolate a quoted-pattern.
.match {$foo :string}
isolate {{البحرين مصر الكويت!}}
To prevent markup, placeholders, or expressions from having spillover effects with other parts of a message it should be possible to bidi isolate the contents of a markup or an expression.
You can find it in {$مصر}.
To prevent RTL identifiers from having spillover effects with other parts of an expression, it should be possible to include "local effect" bidi controls following an identifier, name, option value, or literal. These controls must not be included into the identifier, name, option value, or literal, that is, it must be possible to distinguish these characters from the identifier, name, option value, or literal in question.
You can use {$م1صر :م2صر م3صر=م4صر}
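The requirement that trailing marks stay distinguishable from the token they follow can be pictured as a split step in a tokenizer. The sketch below assumes the raw token text has already been read together with any marks that follow it; `split_trailing_marks` is a hypothetical helper, not an API of any implementation.

```python
# LRM, RLM, and ALM: the "local effect" controls discussed above.
MARKS = "\u200e\u200f\u061c"


def split_trailing_marks(token: str) -> tuple:
    """Separate a name from the directional marks that follow it."""
    name = token.rstrip(MARKS)
    return name, token[len(name):]


name, marks = split_trailing_marks("\u0645\u0635\u0631\u200e")
print(len(name), len(marks))
```

Only the first element would be used as the identifier, name, option value, or literal; the marks are presentation-only.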
To prevent RTL namespace names from having spillover effects with function names, it should be possible to include "local effect" strongly directional marks in an identifier:
In this example, the namespace is :م2 and the name is :ن3, but the sequence is displayed with a spillover effect. (Note that the number in each name trails the Arabic letter: it appears to the left because the string is RTL!).
{$a1 :b2:c3}
{$م1 :م2:ن3} spillover effects
{$م1 :م2:ن3} with isolates and LRMs
Newlines inside of messages should not harm later syntax.
* * {{\u0645<br>\u0646}} 123 456 {{ No LRM==bad }}
* * {{م
ن}} 123 456 {{ No LRM==bad }}
* * {{\u0645<br>\u0646}}\u200e 123 456 {{ LRM }}
* * {{م
ن}} 123 456 {{ LRM }}
Naive text editors, when operating in a right-to-left context, might display a message with an RTL base direction. While the display of the message might be somewhat damaged by this, it should still produce results that are as reasonable as possible.
What prior decisions and existing conditions limit the possible design?
Users cannot be expected to create or manage bidirectional controls or marks in messages, since the characters are invisible and can be difficult to manage. Tools (such as resource editors or translation editors) and other implementations of MessageFormat 2 serialization are strongly encouraged to provide paired isolates around any right-to-left syntax as described in this design so that messages display appropriately as plain text.
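A serializer following this recommendation might add the isolates itself, so that users never type them. The sketch below is one possible shape, assuming that RLI is chosen whenever the value contains a strongly RTL character (the design permits LRI, RLI, or FSI); the function names are hypothetical.

```python
import unicodedata

RLI = "\u2067"
PDI = "\u2069"


def contains_rtl(text: str) -> bool:
    """True if the text contains any strongly RTL character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def serialize_quoted_literal(value: str) -> str:
    """Quote a literal and isolate it when it contains RTL text (sketch)."""
    # Escape the backslash first, then the literal delimiter.
    quoted = "|" + value.replace("\\", "\\\\").replace("|", "\\|") + "|"
    if contains_rtl(value):
        # The isolates sit outside the quotes, so they are not literal content.
        return RLI + quoted + PDI
    return quoted


print(serialize_quoted_literal("abc"))
```

Values without RTL characters serialize exactly as before, which keeps the isolates out of messages that do not need them.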
Ideally we do not want RLM/LRM/ALM to be part of the parsed name, variable, reserved-keyword, unquoted, or any other term defined in terms of name.
This is complicated to do in ABNF because each of these tokens is followed either by whitespace or by some closing marker such as }.
The workaround in #763 was to permit these characters before or after whitespace using the various whitespace productions.
This works at the cost of allowing spurious markers.
We want isolate characters to be outside of patterns.
There is an open question about how best to place them.
One option would be to place them adjacent to the "pattern quote" character sequences {{/}}.
Another option would be to place them inside the pattern quotes, e.g. {\u2066{/}\u2069}.
Bidi isolates and marks are invisible characters. Whitespace is also invisible. Mixing these may be problematic. Not allowing these to mix could produce annoying parse errors.
Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange.
I propose adopting a hybrid approach in which we permit "super-loose isolation". This allows users to include isolates and strongly directional characters in the whitespace portions of the syntax in order to make messages appear correctly.
The second part of the hybrid approach would be to recommend ("SHOULD") the "strict isolation" design for serializers.
(Note that "strict" and "super-loose" use non-identical productions with the name bidi.
These serve different purposes and are consistent, with strict being narrower than super-loose.)
The strict syntax is a subset of the super-loose syntax and can be applied selectively to messages that have RTL sequences or which have problematic display.
What other solutions are available? How do they compare against the requirements? What other properties they have?
We could do nothing.
A likely outcome of doing nothing is that RTL users would insert bidi controls into messages in an attempt to make the pattern and/or placeholders display correctly. These controls would become part of the output of the message, showing up inappropriately at runtime. Because these characters are invisible, users might be very frustrated trying to manage the results or debug what is wrong with their messages.
By contrast, if users insert too many or the wrong controls using the recommended design, the message would still be functional and would emit no undesired characters.
The syntax of a message assumes a left-to-right base direction both for the complete text of the message as well as for each line (paragraph) contained therein. We prefer LTR display because human understanding of a message depends on LTR word tokens, as well as token ordering (as in a placeholder or with variant keys). Note that LTR display is not a requirement, because that is beyond the scope of MF2 itself. However, tool and editor implementers ought to pay attention to this assumption.
Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear:
- Bidi inside of patterns works normally (we go to great lengths to make the interior of patterns work as plain text)
- Placeholders and markup can be isolated (treated as neutrals) so that they appear in the correct location in an RTL pattern
- Expressions use isolates and directional marks to display internal tokens in the correct order and without spillover effects
- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself.
The syntax permits (but does not require) isolating bidi controls to be used on the outside of the following:
- unquoted literals
- quoted literals
- quoted patterns
We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow the user to set the base direction of a literal or pattern according to its respective actual contents.
Important
This change adds a "lookahead" to the process of determining if a given message is "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message as well as being allowed before a quoted pattern, declaration, or selector.
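The lookahead mentioned here might look roughly like the following. This is a sketch under assumptions, not the specified algorithm: it skips leading whitespace, then at most one opening isolate, and then checks for the characters that start a complex message. The function name is hypothetical.

```python
OPEN_ISOLATES = "\u2066\u2067\u2068"  # LRI, RLI, FSI
WHITESPACE = " \t\r\n\u3000"


def looks_complex(message: str) -> bool:
    """Guess whether a message is complex, looking past one opening isolate."""
    rest = message.lstrip(WHITESPACE)
    if rest and rest[0] in OPEN_ISOLATES:
        # The isolate alone does not decide anything: peek at what follows.
        rest = rest[1:]
    return rest.startswith(".") or rest.startswith("{{")


print(looks_complex("\u2066{{hi}}\u2069"), looks_complex("\u2067\u0645\u0635\u0631"))
```

Without the extra peek, a simple message that merely begins with an isolate could not be told apart from an isolated quoted pattern.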
This would change the ABNF as follows:
(Notice that this change includes a production bidi described further down in this document)
literal = [open-isolate] (quoted-literal / (unquoted-literal [bidi])) [close-isolate]
quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate]
open-isolate = %x2066-2068
close-isolate = %x2069
Important
The isolating characters go on the outside of the various literal and pattern productions because characters on the inside of these are part of the literal's or pattern's textual content. We need to allow users to include bidi characters, including isolates and strongly directional marks in the output of MF2.
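The distinction between isolates outside the delimiters and bidi characters inside the content can be sketched with a regular expression modeled on the quoted-pattern rule above. This is deliberately simplified: a real parser also has to handle escapes and placeholders inside the pattern. The names are hypothetical.

```python
import re

# [open-isolate] "{{" pattern "}}" [close-isolate], with the pattern captured.
QUOTED_PATTERN = re.compile(r"[\u2066-\u2068]?\{\{(.*)\}\}\u2069?", re.DOTALL)


def pattern_text(source: str):
    """Return the pattern content without the outer isolates, or None."""
    match = QUOTED_PATTERN.fullmatch(source)
    return match.group(1) if match else None


# The RLM inside the braces is kept; the RLI and PDI outside are dropped.
content = pattern_text("\u2067{{\u200fabc}}\u2069")
print([f"U+{ord(ch):04X}" for ch in content])
```

Characters inside the braces are output of the message and must survive; characters outside only affect how the source displays.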
- Permit left-to-right isolates (starting with LRI U+2066 and ending with PDI U+2069) to be used immediately inside the following:
  - expressions
  - markup
- Permit any type of isolate sequence (starting with LRI U+2066, RLI U+2067, or FSI U+2068 and ending with PDI U+2069) around any token inside of an expression or markup.
- Permit the use of LRM, RLM, or ALM strongly directional marks immediately following any of the items that end with the name production in the ABNF. This includes identifiers found in the names of functions and options, plus the names of variables, as well as the contents of unquoted literals.
This would change the ABNF as follows (assuming the above changes are also incorporated):
expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}"
literal-expression = [s] literal [s annotation] *(s attribute) [s]
variable-expression = [s] variable [s annotation] *(s attribute) [s]
annotation-expression = [s] annotation *(s attribute) [s]
markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone
/ "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" ; close
LRI = %x2066
Note
This design only permits LTR isolates at the expression level because the contents of an expression or markup must be laid out left-to-right. Literal values can be right-to-left isolated within that or use strongly directional marks to ensure correct display.
Note
Notice that unquoted literals can also be surrounded by bidi isolates using the previous syntax modification just above. The isolates are not a part of the literal!
Note
Notice that reserved-annotation is not in the ABNF changes because it already permits the marks in question.
Any syntax derived from reserved-annotation (i.e. when unreserving a new statement in a future addition) would need to handle bidi explicitly using the model already established here.
variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}"
function = ":" identifier [bidi] *(s option)
option = [LRI] identifier [bidi] [s] "=" [s] (literal / variable) [bidi] [close-isolate]
attribute = [LRI] "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] [close-isolate]
markup = "{" [LRI] [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone
/ "{" [LRI] [s] "/" identifier [bidi] *(s option) *(s attribute) [s] [close-isolate] "}" ; close
identifier = [(namespace ns-separator)] name
ns-separator = [bidi] ":"
bidi = [ %x200E-200F / %x061C ]
Open Issues
The ABNF changes found above put isolates and strongly directional marks into specific locations, such as directly next to {/}/{{/}} markers or directly following "tokens" such as name.
This makes it a syntax error for whitespace to appear around the isolates or marks.
A more permissive design would add the isolates and strongly directional marks to required and optional whitespace in the syntax and depend on users/editors to appropriately pair or position the marks to get optimal display.
Add isolates and strongly directional marks to required and optional whitespace in the syntax. This would permit users to get the effects described by the above design, as long as they use isolates/marks in a "responsible" way.
The exception to this is the namespace separator, used in identifier.
This requires the ability to insert isolates or strongly directional marks between the namespace and name portions, where whitespace is not permitted.
This is the only location in the syntax where such characters might be needed but whitespace is not at least optional.
This could be defined as:
ns-separator = [bidi] ":" [bidi]
Here are the other ABNF changes:
; strongly directional marks and bidi isolates
; ALM / LRM / RLM / LRI / RLI / FSI / PDI
bidi = %x061C / %x200E / %x200F / %x2066-2069
; optional whitespace
owsp = *( s / bidi )
; required whitespace
wsp = [ owsp ] 1*s [ owsp ]
; whitespace characters
s = ( SP / HTAB / CR / LF / %x3000 )
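For readers who find regular expressions easier to scan than ABNF, the whitespace productions above translate roughly as follows. This is an illustrative transcription only; the constant and function names are hypothetical.

```python
import re

S = r"[ \t\r\n\u3000]"                      # s
BIDI = r"[\u061C\u200E\u200F\u2066-\u2069]"  # bidi
OWSP = rf"(?:{S}|{BIDI})*"                   # owsp = *( s / bidi )
WSP = rf"(?:{OWSP}{S}+{OWSP})"               # wsp = [ owsp ] 1*s [ owsp ]


def is_owsp(text: str) -> bool:
    """True if text is valid optional whitespace."""
    return re.fullmatch(OWSP, text) is not None


def is_wsp(text: str) -> bool:
    """True if text is valid required whitespace."""
    return re.fullmatch(WSP, text) is not None


print(is_wsp("\u200e \u2066"), is_wsp("\u200e"), is_owsp("\u2067\u2069"))
```

Required whitespace still needs at least one real whitespace character: a run consisting only of marks or isolates satisfies owsp but not wsp.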
Pros
- Avoids problems with syntax errors that users and tools might find difficult to debug.
- Effective if used carefully.
- Addresses need to comply with UAX#31
Cons
- Syntax does not prevent poor display outcomes, including enabling some Trojan Source cases (UTS#55); note that tooling or linting can help ameliorate these issues.
Apply bidi isolates in a strict way. In this design:
- The open/close isolate characters are syntactically required to be paired. This introduces parse errors for unpaired invisible characters, which could lead to bad user experiences.
As noted above, the "strict" version of the ABNF should be adopted by serializers and for message normalization.
variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}"
function = ":" identifier [bidi] *(s option)
option = identifier [bidi] [s] "=" [s] (literal / variable) [bidi]
/ LRI identifier [bidi] [s] "=" [s] (literal / variable) [bidi] close-isolate
attribute = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])]
/ LRI "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] close-isolate
markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" LRI [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] close-isolate "}"
/ "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close
/ "{" LRI [s] "/" identifier [bidi] *(s option) *(s attribute) [s] close-isolate "}" ; close
identifier = [(namespace ns-separator)] name
ns-separator = [bidi] ":" [bidi]
bidi = [ %x200E-200F / %x061C ]
Isolating rather than marking name helps ensure that its directionality does not spill over to adjoining syntax.
The following replaces the proposed design's changes to literal and the [bidi] additions to variable-expression, function, option, attribute, markup, and ns-separator:
name = [open-isolate] name-start *name-char [close-isolate]
quoted-literal = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate]
For example, this allows for the proper rendering of the expression
{:אחת:שתיים}
where "אחת" is the namespace of the identifier.
Without name isolation, this would (misleadingly) render as
{:אחת:שתיים}
Note that the parsed value of the name does not include the open/close isolates, just as they're not included in the parsed values of quoted literals or quoted patterns, even though the production includes the characters.
We could accomplish this by adding additional productions to manage name, at the cost of a more complex ABNF.
Pros
- In the syntax, it's much simpler to include the changes to name in the name rule, rather than patching every place where name is used.
Cons
- Implementations need to remove isolates from the name token before comparing the value to other values (such as comparing function or variable names). Because of namespacing, this requires looking inside the token.
- Implementations might need to insert isolates when generating names upon serialization. The current data model does not separate namespace and name, so this might be more complicated.
- unquoted-literal values appear as keys, as operands, and as option values. If not isolated, these can cause spillover effects, so we might need both name and unquoted-literal isolation.
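The first con can be made concrete with a small comparison helper. This sketch assumes the raw token text, including any isolates around the namespace and name parts, reaches the comparison step; the function names are hypothetical.

```python
# Translation table that deletes LRI, RLI, FSI, and PDI.
ISOLATES = dict.fromkeys(map(ord, "\u2066\u2067\u2068\u2069"))


def bare_identifier(token: str) -> str:
    """Remove isolate characters anywhere in an identifier token."""
    return token.translate(ISOLATES)


def same_identifier(left: str, right: str) -> bool:
    """Compare two identifier tokens while ignoring isolates."""
    return bare_identifier(left) == bare_identifier(right)


print(same_identifier("\u2067ns\u2069:\u2067name\u2069", "ns:name"))
```

Because the namespace and the name can each carry their own isolates, stripping only the ends of the token would not be enough.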
We could alter the syntax to make it more "bidi robust", such as by using strongly directional characters instead of neutrals.
We could alter the syntax to forbid using RTL characters in names and unquoted literals. This would make the syntax consist solely of LTR and neutral characters. One flavor of this would be to restrict tokens to US ASCII.
Cons:
- This would break compatibility with NCName/QName; we would be back to defining our own idiosyncratic namespace
- Unicode could define more RTL characters in the future, making the syntax brittle
- This is not friendly to non-English/non-Latin users and represents a usability restriction in environments in which names can be non-ASCII values
We could permit RLI/FSI to be used inside expressions and markup. This would be an advantage for simple expressions containing only or primarily RTL content. For example:
{لت-123-م...} // RLI isolated
{لت-123-م...}
We could also permit users/editors to use RTL base direction for editing. This is tricky, as the syntax promotes the use of left-to-right runs that will "stick together" unless isolated. This is most visible in selectors and variant keys.
Consider this message:
.match {$\u06451\u0645}{$\u06462\u0646}
one two {{normal LTR}}
\u2067one\u2069 \u2067two\u2069 {{RLI around each key}}
\u2066one\u2069 \u2066two\u2069 {{LRI around each key}}
\u0645 \u0646 {{RTL}}
* \u0646 {{star is first}}
\u0645 * {{star is second}}
In an LTR context the message displays like this (red lines around display errors):
In an RTL context, there is an equivalent case:
Coercing proper display in both LTR and RTL contexts requires complex sets of controls.
Pros
- Can provide both LTR and RTL native editing experiences
Cons
- Requires complex sets of bidi controls
- RTL editing/display is mostly a special case; we already afford the ability to edit RTL in patterns and literals
Strict syntactical requirements produce better display outcomes that solve the various problems enumerated in this design document. However, the strictness comes with a cost: otherwise-valid messages, including messages that display completely as expected and are not in any way misleading, can produce syntax errors. These errors can be difficult to debug, since the characters are invisible. Syntax errors are generally treated as fatal by processors.
Semi-strict or super-loose strategies can be used to avoid producing these types of syntax error. However, valid messages using these approaches can have stray (e.g. unpaired isolates), malformed (e.g. PDI before LRI/RLI/FSI), or badly formatted character sequences (wrapping the wrong things), unless the user or the user's tools are careful. This can include deliberate abuse, such as Trojan Source attacks (see UTS#55), in which bad actors create messages that have a misleading appearance vs. their runtime interpretation.
A hybrid ("Postel's Law") approach would be to permit the use of isolates and strongly directional marks in whitespace in a permissive way (see: "super-loose isolation"), particularly in runtime formatting operations, while strongly encouraging tools to implement message normalization on a strictly-defined grammar (see: "strict isolation all the time") and encouraging users to use the strict version of the grammar when writing or serializing messages.
The hybrid approach would include tests to allow implementations to claim adherence to the stricter grammar.
Pros
- Messages can be written that solve all display problems
- Stray, unpaired, repeated, or other invisible typos do not produce spurious syntax errors
- Provides a foundation for tools to claim strict conformance and message normalization as well as guidance to implementers to make them want to adopt it
- Messages are valid while being edited (such as when the open or close isolate has been inserted but the corresponding opposite isolate hasn't been entered yet)
Cons
- Requires additional effort to maintain the grammar
- Requires additional effort to maintain tests
- Valid messages can contain Trojan Source and other negative display consequences; messages can be checked, however, using the strict grammar, so tools could warn users of potential abuse