-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Early errors for RegExp literals #984
Conversation
This patch is great--it seems to address this issue as well: tc39/proposal-regexp-unicode-property-escapes#29 |
I’d upvote this a thousand times if I could. Abstracting away |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a web reality change--modulo engines which do not implement certain Unicode-mode early errors, all of the changes here match what is shipped to the web, which is that these SyntaxErrors occur when the RegExp is parsed, not when it is exec'd.
LGTM
<li> | ||
It is a Syntax Error if the MV of the first |DecimalDigits| is larger than the MV of the second |DecimalDigits|. | ||
</li> | ||
</ul> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From eshost, seems like four engines do have this early error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ eshost -e "/a{5,3}/"
#### jsc
SyntaxError: Invalid regular expression: numbers out of order in {} quantifier
#### chakracore
SyntaxError: Syntax error in regular expression
#### spidermonkey
SyntaxError: numbers out of order in {} quantifier.:
#### d8
SyntaxError: Invalid regular expression: /a{5,3}/: numbers out of order in {} quantifier
</li> | ||
<li> | ||
It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *false* and IsCharacterClass of |ClassAtom| is *false* and the CharacterValue of |ClassAtomNoDash| is larger than the CharacterValue of |ClassAtom|. | ||
</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Syntax error for everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ eshost -e "/[z-a]/"
#### chakracore
SyntaxError: Invalid range in character set
#### jsc
SyntaxError: Invalid regular expression: range out of order in character class
#### spidermonkey
SyntaxError: invalid range in character class:
#### d8
SyntaxError: Invalid regular expression: /[z-a]/: Range out of order in character class
<li> | ||
It is a Syntax Error if any source text matches this rule. | ||
</li> | ||
</ul> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error in all tested engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ eshost -e "/{1}/"
#### jsc
SyntaxError: Invalid regular expression: nothing to repeat
#### chakracore
SyntaxError: Unexpected quantifier
#### spidermonkey
SyntaxError: nothing to repeat:
#### d8
SyntaxError: Invalid regular expression: /{1}/: Nothing to repeat
<ul> | ||
<li> | ||
It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>. | ||
</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what seems to be implemented in V8 and SpiderMonkey; the versions of JSC and ChakraCore I'm testing don't seem to throw the errors in non-Unicode mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably don't want to commit these <ins> tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ins is used elsewhere in Annex B, e.g., here.
$ eshost -e "/[\w-\d]/u"
#### chakracore
/[\w-\d]/u
#### jsc
/[\w-\d]/u
#### d8
SyntaxError: Invalid regular expression: /[\w-\d]/: Invalid character class
#### spidermonkey
SyntaxError: character class escape cannot be used in class range in regular expression:
<ul> | ||
<li> | ||
It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *true* or IsCharacterClass of |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>. | ||
</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what seems to be implemented in V8 and SpiderMonkey; the versions of JSC and ChakraCore I'm testing don't seem to throw the errors in non-Unicode mode.
<ul> | ||
<li> | ||
It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>. | ||
</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably don't want to commit these <ins> tags.
Seems good. I just want to do my own littledan exercise of going through the various implementations and the early errors prescribed here. Should this be a needs-consensus PR? |
@littledan, I don't think this actually matches web reality in Chrome, does it? The relevant v8 bug is still open, and I'm seeing that e.g. In particular, it looks like in v8 the error is thrown when the literal is evaluated, not when it's parsed (but still before it's exec'd). |
Needs consensus PR seems right to me. |
I want to clarify the tests that I included above. It's true that V8 has a known issue where, if a RegExp isn't reached in executing code, it's not parsed and early errors don't come. However, there are still separate stages of parsing/compiling and executing on a string in V8. You can see this, for example, if you do I don't think we need to hold back this PR just because V8 has a known spec violation. Whether parsing errors are thrown when parsing the enclosing code, or when evaluating the RegExp for the first time, is not an open spec question given 3/4 of these engines giving early errors:
It's currently web reality already that, when RegExps are first parsed/compiled, the errors are thrown, rather than later when they are going through a string. That's the delta that this PR makes; it doesn't change whether RegExp syntax errors come at the broader early error time or at a special, later time as V8 does. |
@bakkot For this to be marked "needs consensus": I think we ended up deciding that not all normative changes need to be discussed in committee, we should just discuss the ones that seem major or don't agree on here. Do you have doubts about this proposal, beyond the fact that V8 doesn't comply with the spec here (in a way that, I argued above, is unrelated)? If not, maybe we can land this patch without spending committee time on it. If you do have doubts, I'll add it to the agenda. |
@littledan I don't have doubts about this change, but I'm not comfortable making that call for everyone else on the committee, even though I do not expect there to be objections. If it indeed proves entirely uncontroversial it shouldn't take much committee time; if it doesn't, well, that indicates the discussion was necessary. I would be happier with this going through the needs-consensus PR process, such as it is. |
I believe the original assertion that this isn't already covered made in #918 is incorrect. See the comment I left there #918 (comment) |
Below, I'm copying over most of #918 (comment) The TL;DR summary is: This patch and all of the other existing syntax error checks in the RegExp evaluation algorithms should be refactored into Static Semantics Early Error rules. The algorithmic syntax error detection in the RegExp evaluation algorithms really shouldn't be there. They are a carry over from Edition 3 where the spec. which did not use separate static semantic clauses listing syntax-derived early errors (and hence the need for the Ch 16 rule). What I should have done in ES6 is to factor all of those syntax error checks out of the RegExp evaluation algorithms and restated them as early error rules in 21.2.1.1. Not doing so was an oversight on my part, or perhaps I ran out of time. The purpose of placing these errors into 21.2.1.1 (and with the early error static semantic errors, in general) is to clearly state the errors that identify malformed ES code that can be detected and reported independently of an run-time context or values. The goal is to eliminate implementation variations in when such errors are reported. |
@allenwb Does this patch implement your suggestion, or if not, what's the change you're suggesting? This patch is exactly moving runtime errors for RegExps into early errors. |
Yes, I think it is probably fine. For some reason, I thought I was seeing something different (new syntax error checks being added to RegExp evaluation rules) when I looked at the patch before sending my previous message Gawd, I hate reading raw EMU and out of (global) context patch lines. |
This PR gained consensus at the September 2017 TC39 meeting. Since the Unicode properties proposal is blocked on this change, please merge this as soon as possible. |
Sorry, I didn't know it was blocking. Happy to merge now. Result of investigation alluded to above: seems great. There are a few differences with what ChakraCore implements but are not concerning and I'll just track them as bugs :) |
Thanks! |
PR tc39#984 added a reference to "the MV of |Hex4Digits|" (in the definition of CharacterValue) but didn't add a definition for it. Extract one from the definition for "the SV of |Hex4Digits|".
PR tc39#984 added a reference to "the MV of |Hex4Digits|" (in the definition of CharacterValue) but didn't add a definition for it. Extract one from the definition for "the SV of |Hex4Digits|".
…ons (tc39#1301) - CoveredCallExpression: change production When an SDO is applied to a Parse Node, the appropriate definition of the SDO is the one for the production of which the Parse Node is an instance (ignoring the complication of chain productions). The only invocation of CoveredCallExpression is on a |CoverCallExpressionAndAsyncArrowHead|, so we expect CoveredCallExpression to be defined on CoverCallExpressionAndAsyncArrowHead : MemberExpression Arguments not CallExpression : CoverCallExpressionAndAsyncArrowHead - BoundNames: add def for ObjectBindingPattern BoundNames was missing a definition for ObjectBindingPattern : `{` BindingPropertyList `,` BindingRestProperty `}` - ContainsExpression: add a few defs ContainsExpression was missing definitions for: ObjectBindingPattern : `{` BindingRestProperty `}` ObjectBindingPattern : `{` BindingPropertyList `,` BindingRestProperty `}` ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList - ExpectedArgumentCount: add a few defs ExpectedArgumentCount was missing definitions for: FormalParameters : FunctionRestParameter FormalParameterList : FormalParameter ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList - CoveredFormalsList: add a def CoveredFormalsList was missing a definition for CoverParenthesizedExpressionAndArrowParameterList : `(` Expression `,` `)` - IteratorBindingInitialization: add a def IteratorBindingInitialization was missing a definition for: ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList - HasCallInTailPosition: add a few defs HasCallInTailPosition was missing definitions for the 3 IterationStatement : `for` `await` `(` productions. - define 3 SDOs on empty FunctionStatementList specifically: - ContainsDuplicateLabels - ContainsUndefinedBreakTarget - ContainsUndefinedContinueTarget - MV: add def for Hex4Digits PR tc39#984 added a reference to "the MV of |Hex4Digits|" (in the definition of CharacterValue) but didn't add a definition for it. Extract one from the definition for "the SV of |Hex4Digits|". - eliminate unsupported "SV of |SourceCharacter|" We have rules that say the TRV of something is: - the SV of the |SourceCharacter| that is that single code point; or - the SV of the |SourceCharacter| that is that |HexDigit|. but there's no definition for the SV of |SourceCharacter|. We could add such a definition, except that these usages are already odd, in that they introduce a |SourceCharacter| nonterminal when there isn't one in the parse tree. So instead, fix that oddity by eliminating the references to SV and |SourceCharacter|. And since I'm in the neighborhood, The TRV of a |HexDigit| ... is weird. No other SDO rule is written that way. Change it to: The TRV of <emu-grammar>HexDigit :: ....</emu-grammar> ...
Fixes #918