Early errors for RegExp literals #984

anba · 2017-08-28T10:11:59Z

Fixes #918

littledan · 2017-08-29T20:09:08Z

This patch is great--it seems to address this issue as well: tc39/proposal-regexp-unicode-property-escapes#29

mathiasbynens · 2017-08-29T20:15:22Z

I’d upvote this a thousand times if I could. Abstracting away IsCharacterClass etc. makes so much sense.

littledan

This is a web reality change--modulo engines which do not implement certain Unicode-mode early errors, all of the changes here match what is shipped to the web, which is that these SyntaxErrors occur when the RegExp is parsed, not when it is exec'd.

LGTM

littledan · 2017-08-29T20:13:50Z

spec.html

+          <li>
+            It is a Syntax Error if the MV of the first |DecimalDigits| is larger than the MV of the second |DecimalDigits|.
+          </li>
+        </ul>


From eshost, seems like four engines do have this early error.

$ eshost -e "/a{5,3}/" #### jsc SyntaxError: Invalid regular expression: numbers out of order in {} quantifier #### chakracore SyntaxError: Syntax error in regular expression #### spidermonkey SyntaxError: numbers out of order in {} quantifier.: #### d8 SyntaxError: Invalid regular expression: /a{5,3}/: numbers out of order in {} quantifier

littledan · 2017-08-29T20:20:58Z

spec.html

+          </li>
+          <li>
+            It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *false* and IsCharacterClass of |ClassAtom| is *false* and the CharacterValue of |ClassAtomNoDash| is larger than the CharacterValue of |ClassAtom|.
+          </li>


Syntax error for everyone.

$ eshost -e "/[z-a]/" #### chakracore SyntaxError: Invalid range in character set #### jsc SyntaxError: Invalid regular expression: range out of order in character class #### spidermonkey SyntaxError: invalid range in character class: #### d8 SyntaxError: Invalid regular expression: /[z-a]/: Range out of order in character class

littledan · 2017-08-29T20:24:38Z

spec.html

+          <li>
+            It is a Syntax Error if any source text matches this rule.
+          </li>
+        </ul>


Error in all tested engines.

$ eshost -e "/{1}/" #### jsc SyntaxError: Invalid regular expression: nothing to repeat #### chakracore SyntaxError: Unexpected quantifier #### spidermonkey SyntaxError: nothing to repeat: #### d8 SyntaxError: Invalid regular expression: /{1}/: Nothing to repeat

littledan · 2017-08-29T20:27:47Z

spec.html

+        <ul>
+          <li>
+            It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>.
+          </li>


This is what seems to be implemented in V8 and SpiderMonkey; the versions of JSC and ChakraCore I'm testing don't seem to throw the errors in non-Unicode mode.

Probably don't want to commit these <ins> tags.

ins is used elsewhere in Annex B, e.g., here.

$ eshost -e "/[\w-\d]/u" #### chakracore /[\w-\d]/u #### jsc /[\w-\d]/u #### d8 SyntaxError: Invalid regular expression: /[\w-\d]/: Invalid character class #### spidermonkey SyntaxError: character class escape cannot be used in class range in regular expression:

littledan · 2017-08-29T20:27:53Z

spec.html

+        <ul>
+          <li>
+            It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *true* or IsCharacterClass of |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>.
+          </li>


This is what seems to be implemented in V8 and SpiderMonkey; the versions of JSC and ChakraCore I'm testing don't seem to throw the errors in non-Unicode mode.

bterlson · 2017-08-29T21:46:59Z

spec.html

+        <ul>
+          <li>
+            It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>.
+          </li>


Probably don't want to commit these <ins> tags.

bterlson · 2017-08-29T21:49:58Z

Seems good. I just want to do my own littledan exercise of going through the various implementations and the early errors prescribed here.

Should this be a needs-consensus PR?

bakkot · 2017-08-29T21:53:22Z

@littledan, I don't think this actually matches web reality in Chrome, does it? The relevant v8 bug is still open, and I'm seeing that e.g. if (false) /a{2,3}/ does not throw an error.

In particular, it looks like in v8 the error is thrown when the literal is evaluated, not when it's parsed (but still before it's exec'd).

bakkot · 2017-08-29T21:55:09Z

Needs consensus PR seems right to me.

littledan · 2017-08-30T08:47:49Z

I want to clarify the tests that I included above. It's true that V8 has a known issue where, if a RegExp isn't reached in executing code, it's not parsed and early errors don't come.

However, there are still separate stages of parsing/compiling and executing on a string in V8. You can see this, for example, if you do /\5/u.toString(). In the current specification, that would not thrown an error, but in V8, it would. This is the same time as errors for RegExps which don't match the grammar are thrown, for example /?/.

I don't think we need to hold back this PR just because V8 has a known spec violation. Whether parsing errors are thrown when parsing the enclosing code, or when evaluating the RegExp for the first time, is not an open spec question given 3/4 of these engines giving early errors:

$ eshost -e "false ? /{1}/ : undefined"
#### jsc
SyntaxError: Invalid regular expression: nothing to repeat

#### chakracore
SyntaxError: Unexpected quantifier

#### spidermonkey
SyntaxError: nothing to repeat:

#### d8
undefined

It's currently web reality already that, when RegExps are first parsed/compiled, the errors are thrown, rather than later when they are going through a string. That's the delta that this PR makes; it doesn't change whether RegExp syntax errors come at the broader early error time or at a special, later time as V8 does.

littledan · 2017-09-12T22:54:09Z

@bakkot For this to be marked "needs consensus": I think we ended up deciding that not all normative changes need to be discussed in committee, we should just discuss the ones that seem major or don't agree on here. Do you have doubts about this proposal, beyond the fact that V8 doesn't comply with the spec here (in a way that, I argued above, is unrelated)? If not, maybe we can land this patch without spending committee time on it. If you do have doubts, I'll add it to the agenda.

bakkot · 2017-09-12T23:18:56Z

@littledan I don't have doubts about this change, but I'm not comfortable making that call for everyone else on the committee, even though I do not expect there to be objections. If it indeed proves entirely uncontroversial it shouldn't take much committee time; if it doesn't, well, that indicates the discussion was necessary.

I would be happier with this going through the needs-consensus PR process, such as it is.

allenwb · 2017-09-13T00:47:32Z

I believe the original assertion that this isn't already covered made in #918 is incorrect. See the comment I left there #918 (comment)

allenwb · 2017-09-13T16:13:00Z

Below, I'm copying over most of #918 (comment)

The TL;DR summary is: This patch and all of the other existing syntax error checks in the RegExp evaluation algorithms should be refactored into Static Semantics Early Error rules.

The algorithmic syntax error detection in the RegExp evaluation algorithms really shouldn't be there. They are a carry over from Edition 3 where the spec. which did not use separate static semantic clauses listing syntax-derived early errors (and hence the need for the Ch 16 rule).

What I should have done in ES6 is to factor all of those syntax error checks out of the RegExp evaluation algorithms and restated them as early error rules in 21.2.1.1. Not doing so was an oversight on my part, or perhaps I ran out of time.

The purpose of placing these errors into 21.2.1.1 (and with the early error static semantic errors, in general) is to clearly state the errors that identify malformed ES code that can be detected and reported independently of an run-time context or values. The goal is to eliminate implementation variations in when such errors are reported.

littledan · 2017-09-13T17:02:56Z

@allenwb Does this patch implement your suggestion, or if not, what's the change you're suggesting? This patch is exactly moving runtime errors for RegExps into early errors.

allenwb · 2017-09-13T17:23:17Z

Yes, I think it is probably fine. For some reason, I thought I was seeing something different (new syntax error checks being added to RegExp evaluation rules) when I looked at the patch before sending my previous message

Gawd, I hate reading raw EMU and out of (global) context patch lines.

mathiasbynens · 2017-10-20T17:27:44Z

This PR gained consensus at the September 2017 TC39 meeting. Since the Unicode properties proposal is blocked on this change, please merge this as soon as possible.

bterlson · 2017-10-20T21:13:53Z

Sorry, I didn't know it was blocking. Happy to merge now. Result of investigation alluded to above: seems great. There are a few differences with what ChakraCore implements but are not concerning and I'll just track them as bugs :)

mathiasbynens · 2017-10-20T23:41:44Z

Thanks!

PR tc39#984 added a reference to "the MV of |Hex4Digits|" (in the definition of CharacterValue) but didn't add a definition for it. Extract one from the definition for "the SV of |Hex4Digits|".

…ons (tc39#1301) - CoveredCallExpression: change production When an SDO is applied to a Parse Node, the appropriate definition of the SDO is the one for the production of which the Parse Node is an instance (ignoring the complication of chain productions). The only invocation of CoveredCallExpression is on a |CoverCallExpressionAndAsyncArrowHead|, so we expect CoveredCallExpression to be defined on CoverCallExpressionAndAsyncArrowHead : MemberExpression Arguments not CallExpression : CoverCallExpressionAndAsyncArrowHead - BoundNames: add def for ObjectBindingPattern BoundNames was missing a definition for ObjectBindingPattern : `{` BindingPropertyList `,` BindingRestProperty `}` - ContainsExpression: add a few defs ContainsExpression was missing definitions for: ObjectBindingPattern : `{` BindingRestProperty `}` ObjectBindingPattern : `{` BindingPropertyList `,` BindingRestProperty `}` ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList - ExpectedArgumentCount: add a few defs ExpectedArgumentCount was missing definitions for: FormalParameters : FunctionRestParameter FormalParameterList : FormalParameter ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList - CoveredFormalsList: add a def CoveredFormalsList was missing a definition for CoverParenthesizedExpressionAndArrowParameterList : `(` Expression `,` `)` - IteratorBindingInitialization: add a def IteratorBindingInitialization was missing a definition for: ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList - HasCallInTailPosition: add a few defs HasCallInTailPosition was missing definitions for the 3 IterationStatement : `for` `await` `(` productions. - define 3 SDOs on empty FunctionStatementList specifically: - ContainsDuplicateLabels - ContainsUndefinedBreakTarget - ContainsUndefinedContinueTarget - MV: add def for Hex4Digits PR tc39#984 added a reference to "the MV of |Hex4Digits|" (in the definition of CharacterValue) but didn't add a definition for it. Extract one from the definition for "the SV of |Hex4Digits|". - eliminate unsupported "SV of |SourceCharacter|" We have rules that say the TRV of something is: - the SV of the |SourceCharacter| that is that single code point; or - the SV of the |SourceCharacter| that is that |HexDigit|. but there's no definition for the SV of |SourceCharacter|. We could add such a definition, except that these usages are already odd, in that they introduce a |SourceCharacter| nonterminal when there isn't one in the parse tree. So instead, fix that oddity by eliminating the references to SV and |SourceCharacter|. And since I'm in the neighborhood, The TRV of a |HexDigit| ... is weird. No other SDO rule is written that way. Change it to: The TRV of <emu-grammar>HexDigit :: ....</emu-grammar> ...

Early errors for RegExp literals

a607d4b

littledan added normative change Affects behavior required to correctly evaluate some ECMAScript source text web reality labels Aug 29, 2017

littledan approved these changes Aug 29, 2017

View reviewed changes

bterlson approved these changes Aug 29, 2017

View reviewed changes

bterlson added the needs consensus This needs committee consensus before it can be eligible to be merged. label Aug 29, 2017

anba mentioned this pull request Sep 1, 2017

Miscellaneous fixes tc39/test262#1206

Merged

anba mentioned this pull request Sep 12, 2017

CharacterRange abstract operation tests should be phase: runtime and not phase: early tc39/test262#1193

Closed

rwaldron added has consensus This has committee consensus. needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 and removed needs consensus This needs committee consensus before it can be eligible to be merged. labels Sep 26, 2017

hzoo mentioned this pull request Sep 26, 2017

Sept 2017 babel/proposals#28

Open

11 tasks

ljharb assigned bterlson Oct 20, 2017

bterlson merged commit d898df2 into tc39:master Oct 20, 2017

mathiasbynens mentioned this pull request Oct 20, 2017

V8 implementation: does unicode property escapes behavior in character classes range differ from other classes intentionally? tc39/proposal-regexp-unicode-property-escapes#29

Closed

Jamesernator mentioned this pull request Nov 12, 2017

[Meta] Should ECMA262 pull requests be listed here or just as issues on the main repository? babel/proposals#33

Open

jmdyck mentioned this pull request Jul 21, 2022

Editorial: Fix typo in CharacterValue #2834

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Early errors for RegExp literals #984

Early errors for RegExp literals #984

anba commented Aug 28, 2017

littledan commented Aug 29, 2017

mathiasbynens commented Aug 29, 2017

littledan left a comment

littledan Aug 29, 2017

littledan Aug 30, 2017

littledan Aug 29, 2017

littledan Aug 30, 2017

littledan Aug 29, 2017

littledan Aug 30, 2017

littledan Aug 29, 2017

bterlson Aug 29, 2017

littledan Aug 30, 2017 •

edited

Loading

littledan Aug 29, 2017

bterlson Aug 29, 2017

bterlson commented Aug 29, 2017

bakkot commented Aug 29, 2017

bakkot commented Aug 29, 2017

littledan commented Aug 30, 2017

littledan commented Sep 12, 2017

bakkot commented Sep 12, 2017

allenwb commented Sep 13, 2017

allenwb commented Sep 13, 2017

littledan commented Sep 13, 2017

allenwb commented Sep 13, 2017

mathiasbynens commented Oct 20, 2017

bterlson commented Oct 20, 2017

mathiasbynens commented Oct 20, 2017

Early errors for RegExp literals #984

Early errors for RegExp literals #984

Conversation

anba commented Aug 28, 2017

littledan commented Aug 29, 2017

mathiasbynens commented Aug 29, 2017

littledan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

littledan Aug 30, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bterlson commented Aug 29, 2017

bakkot commented Aug 29, 2017

bakkot commented Aug 29, 2017

littledan commented Aug 30, 2017

littledan commented Sep 12, 2017

bakkot commented Sep 12, 2017

allenwb commented Sep 13, 2017

allenwb commented Sep 13, 2017

littledan commented Sep 13, 2017

allenwb commented Sep 13, 2017

mathiasbynens commented Oct 20, 2017

bterlson commented Oct 20, 2017

mathiasbynens commented Oct 20, 2017

littledan Aug 30, 2017 •

edited

Loading