Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early errors for RegExp literals #984

Merged
merged 1 commit into from
Oct 20, 2017
Merged

Conversation

anba
Copy link
Contributor

@anba anba commented Aug 28, 2017

Fixes #918

@littledan
Copy link
Member

This patch is great--it seems to address this issue as well: tc39/proposal-regexp-unicode-property-escapes#29

@mathiasbynens
Copy link
Member

I’d upvote this a thousand times if I could. Abstracting away IsCharacterClass etc. makes so much sense.

@littledan littledan added normative change Affects behavior required to correctly evaluate some ECMAScript source text web reality labels Aug 29, 2017
Copy link
Member

@littledan littledan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a web reality change--modulo engines which do not implement certain Unicode-mode early errors, all of the changes here match what is shipped to the web, which is that these SyntaxErrors occur when the RegExp is parsed, not when it is exec'd.

LGTM

<li>
It is a Syntax Error if the MV of the first |DecimalDigits| is larger than the MV of the second |DecimalDigits|.
</li>
</ul>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From eshost, seems like four engines do have this early error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ eshost -e "/a{5,3}/"
#### jsc
SyntaxError: Invalid regular expression: numbers out of order in {} quantifier

#### chakracore
SyntaxError: Syntax error in regular expression

#### spidermonkey
SyntaxError: numbers out of order in {} quantifier.:

#### d8
SyntaxError: Invalid regular expression: /a{5,3}/: numbers out of order in {} quantifier

</li>
<li>
It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *false* and IsCharacterClass of |ClassAtom| is *false* and the CharacterValue of |ClassAtomNoDash| is larger than the CharacterValue of |ClassAtom|.
</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Syntax error for everyone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ eshost -e "/[z-a]/"
#### chakracore
SyntaxError: Invalid range in character set

#### jsc
SyntaxError: Invalid regular expression: range out of order in character class

#### spidermonkey
SyntaxError: invalid range in character class:

#### d8
SyntaxError: Invalid regular expression: /[z-a]/: Range out of order in character class

<li>
It is a Syntax Error if any source text matches this rule.
</li>
</ul>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error in all tested engines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ eshost -e "/{1}/"
#### jsc
SyntaxError: Invalid regular expression: nothing to repeat

#### chakracore
SyntaxError: Unexpected quantifier

#### spidermonkey
SyntaxError: nothing to repeat:

#### d8
SyntaxError: Invalid regular expression: /{1}/: Nothing to repeat

<ul>
<li>
It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>.
</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what seems to be implemented in V8 and SpiderMonkey; the versions of JSC and ChakraCore I'm testing don't seem to throw the errors in non-Unicode mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't want to commit these <ins> tags.

Copy link
Member

@littledan littledan Aug 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ins is used elsewhere in Annex B, e.g., here.

$ eshost -e "/[\w-\d]/u"
#### chakracore
/[\w-\d]/u

#### jsc
/[\w-\d]/u

#### d8
SyntaxError: Invalid regular expression: /[\w-\d]/: Invalid character class

#### spidermonkey
SyntaxError: character class escape cannot be used in class range in regular expression:

<ul>
<li>
It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *true* or IsCharacterClass of |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>.
</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what seems to be implemented in V8 and SpiderMonkey; the versions of JSC and ChakraCore I'm testing don't seem to throw the errors in non-Unicode mode.

<ul>
<li>
It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* <ins>and this production has a [U] parameter</ins>.
</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't want to commit these <ins> tags.

@bterlson
Copy link
Member

Seems good. I just want to do my own littledan exercise of going through the various implementations and the early errors prescribed here.

Should this be a needs-consensus PR?

@bakkot
Copy link
Contributor

bakkot commented Aug 29, 2017

@littledan, I don't think this actually matches web reality in Chrome, does it? The relevant v8 bug is still open, and I'm seeing that e.g. if (false) /a{2,3}/ does not throw an error.

In particular, it looks like in v8 the error is thrown when the literal is evaluated, not when it's parsed (but still before it's exec'd).

@bakkot
Copy link
Contributor

bakkot commented Aug 29, 2017

Needs consensus PR seems right to me.

@bterlson bterlson added the needs consensus This needs committee consensus before it can be eligible to be merged. label Aug 29, 2017
@littledan
Copy link
Member

I want to clarify the tests that I included above. It's true that V8 has a known issue where, if a RegExp isn't reached in executing code, it's not parsed and early errors don't come.

However, there are still separate stages of parsing/compiling and executing on a string in V8. You can see this, for example, if you do /\5/u.toString(). In the current specification, that would not thrown an error, but in V8, it would. This is the same time as errors for RegExps which don't match the grammar are thrown, for example /?/.

I don't think we need to hold back this PR just because V8 has a known spec violation. Whether parsing errors are thrown when parsing the enclosing code, or when evaluating the RegExp for the first time, is not an open spec question given 3/4 of these engines giving early errors:

$ eshost -e "false ? /{1}/ : undefined"
#### jsc
SyntaxError: Invalid regular expression: nothing to repeat

#### chakracore
SyntaxError: Unexpected quantifier

#### spidermonkey
SyntaxError: nothing to repeat:

#### d8
undefined

It's currently web reality already that, when RegExps are first parsed/compiled, the errors are thrown, rather than later when they are going through a string. That's the delta that this PR makes; it doesn't change whether RegExp syntax errors come at the broader early error time or at a special, later time as V8 does.

@littledan
Copy link
Member

@bakkot For this to be marked "needs consensus": I think we ended up deciding that not all normative changes need to be discussed in committee, we should just discuss the ones that seem major or don't agree on here. Do you have doubts about this proposal, beyond the fact that V8 doesn't comply with the spec here (in a way that, I argued above, is unrelated)? If not, maybe we can land this patch without spending committee time on it. If you do have doubts, I'll add it to the agenda.

@bakkot
Copy link
Contributor

bakkot commented Sep 12, 2017

@littledan I don't have doubts about this change, but I'm not comfortable making that call for everyone else on the committee, even though I do not expect there to be objections. If it indeed proves entirely uncontroversial it shouldn't take much committee time; if it doesn't, well, that indicates the discussion was necessary.

I would be happier with this going through the needs-consensus PR process, such as it is.

@allenwb
Copy link
Member

allenwb commented Sep 13, 2017

I believe the original assertion that this isn't already covered made in #918 is incorrect. See the comment I left there #918 (comment)

@allenwb
Copy link
Member

allenwb commented Sep 13, 2017

Below, I'm copying over most of #918 (comment)

The TL;DR summary is: This patch and all of the other existing syntax error checks in the RegExp evaluation algorithms should be refactored into Static Semantics Early Error rules.


The algorithmic syntax error detection in the RegExp evaluation algorithms really shouldn't be there. They are a carry over from Edition 3 where the spec. which did not use separate static semantic clauses listing syntax-derived early errors (and hence the need for the Ch 16 rule).

What I should have done in ES6 is to factor all of those syntax error checks out of the RegExp evaluation algorithms and restated them as early error rules in 21.2.1.1. Not doing so was an oversight on my part, or perhaps I ran out of time.

The purpose of placing these errors into 21.2.1.1 (and with the early error static semantic errors, in general) is to clearly state the errors that identify malformed ES code that can be detected and reported independently of an run-time context or values. The goal is to eliminate implementation variations in when such errors are reported.

@littledan
Copy link
Member

@allenwb Does this patch implement your suggestion, or if not, what's the change you're suggesting? This patch is exactly moving runtime errors for RegExps into early errors.

@allenwb
Copy link
Member

allenwb commented Sep 13, 2017

Yes, I think it is probably fine. For some reason, I thought I was seeing something different (new syntax error checks being added to RegExp evaluation rules) when I looked at the patch before sending my previous message

Gawd, I hate reading raw EMU and out of (global) context patch lines.

@rwaldron rwaldron added has consensus This has committee consensus. needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 and removed needs consensus This needs committee consensus before it can be eligible to be merged. labels Sep 26, 2017
@hzoo hzoo mentioned this pull request Sep 26, 2017
11 tasks
@mathiasbynens
Copy link
Member

This PR gained consensus at the September 2017 TC39 meeting. Since the Unicode properties proposal is blocked on this change, please merge this as soon as possible.

@bterlson
Copy link
Member

Sorry, I didn't know it was blocking. Happy to merge now. Result of investigation alluded to above: seems great. There are a few differences with what ChakraCore implements but are not concerning and I'll just track them as bugs :)

@bterlson bterlson merged commit d898df2 into tc39:master Oct 20, 2017
@mathiasbynens
Copy link
Member

Thanks!

jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Nov 1, 2019
PR tc39#984 added a reference to "the MV of |Hex4Digits|"
(in the definition of CharacterValue)
but didn't add a definition for it.

Extract one from the definition for "the SV of |Hex4Digits|".
jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Nov 22, 2019
PR tc39#984 added a reference to "the MV of |Hex4Digits|"
(in the definition of CharacterValue)
but didn't add a definition for it.

Extract one from the definition for "the SV of |Hex4Digits|".
ljharb pushed a commit to jmdyck/ecma262 that referenced this pull request Feb 19, 2020
…ons (tc39#1301)

 - CoveredCallExpression: change production
      When an SDO is applied to a Parse Node, the appropriate definition of the SDO
      is the one for the production of which the Parse Node is an instance
      (ignoring the complication of chain productions).

      The only invocation of CoveredCallExpression
      is on a |CoverCallExpressionAndAsyncArrowHead|,
      so we expect CoveredCallExpression to be defined on
          CoverCallExpressionAndAsyncArrowHead : MemberExpression Arguments
      not
          CallExpression : CoverCallExpressionAndAsyncArrowHead
 - BoundNames: add def for ObjectBindingPattern
      BoundNames was missing a definition for
          ObjectBindingPattern : `{` BindingPropertyList `,` BindingRestProperty `}`
 - ContainsExpression: add a few defs
      ContainsExpression was missing definitions for:
          ObjectBindingPattern : `{` BindingRestProperty `}`
          ObjectBindingPattern : `{` BindingPropertyList `,` BindingRestProperty `}`
          ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList
 - ExpectedArgumentCount: add a few defs
      ExpectedArgumentCount was missing definitions for:
          FormalParameters : FunctionRestParameter
          FormalParameterList : FormalParameter
          ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList
 - CoveredFormalsList: add a def
      CoveredFormalsList was missing a definition for
          CoverParenthesizedExpressionAndArrowParameterList : `(` Expression `,` `)`
 - IteratorBindingInitialization: add a def
      IteratorBindingInitialization was missing a definition for:
          ArrowParameters : CoverParenthesizedExpressionAndArrowParameterList
 - HasCallInTailPosition: add a few defs
      HasCallInTailPosition was missing definitions for the 3
          IterationStatement : `for` `await` `(` productions.
 - define 3 SDOs on empty FunctionStatementList
      specifically:
       - ContainsDuplicateLabels
       - ContainsUndefinedBreakTarget
       - ContainsUndefinedContinueTarget
 - MV: add def for Hex4Digits
      PR tc39#984 added a reference to "the MV of |Hex4Digits|"
      (in the definition of CharacterValue)
      but didn't add a definition for it.

      Extract one from the definition for "the SV of |Hex4Digits|".
 - eliminate unsupported "SV of |SourceCharacter|"
      We have rules that say the TRV of something is:
       - the SV of the |SourceCharacter| that is that single code point; or
       - the SV of the |SourceCharacter| that is that |HexDigit|.

      but there's no definition for the SV of |SourceCharacter|.

      We could add such a definition, except that these usages are already odd,
      in that they introduce a |SourceCharacter| nonterminal when there isn't one
      in the parse tree. So instead, fix that oddity by eliminating the references
      to SV and |SourceCharacter|.

      And since I'm in the neighborhood,
         The TRV of a |HexDigit| ...
      is weird. No other SDO rule is written that way. Change it to:
         The TRV of <emu-grammar>HexDigit :: ....</emu-grammar> ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has consensus This has committee consensus. needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 normative change Affects behavior required to correctly evaluate some ECMAScript source text web reality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants