Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path to Stage 4! #58

Open
22 of 30 tasks
ljharb opened this issue Sep 26, 2023 · 59 comments
Open
22 of 30 tasks

Path to Stage 4! #58

ljharb opened this issue Sep 26, 2023 · 59 comments

Comments

@ljharb
Copy link
Member

ljharb commented Sep 26, 2023

Stage 4

  • committee approval
  • two implementations
  • significant in-the-field experience
  • ecma262 PR approved
  • prepare ecma262 PR

Stage 3

  • committee approval
  • merge test262 tests
  • write test262 tests (issue, PR)

Stage 2.7

Stage 2

  • committee approval
  • spec reviewers selected
  • spec text written

Stage 1

  • committee approval
@oliverfoster
Copy link
Contributor

Congratulations on getting to Stage 2. 👍

@benjamingr
Copy link
Collaborator

Hey, need help on any of these? (I'm happy to try and contribute the test262 tests or amend the spec according to the consensus)

@ljharb
Copy link
Member Author

ljharb commented Sep 26, 2023

I'm comfortable handling the spec, but the test262 tests would be most helpful. Be warned, though, they'll need to be extremely rigorous.

@jridgewell
Copy link
Member

My review is at #60

@michaelficarra
Copy link
Member

My feedback:

  1. I don't like how this is introducing identity escapes in Unicode mode for these new punctuators. This wasn't a goal of the RegExp escaping proposal. You can accomplish RegExp escaping with hex escapes and without introducing new features to the RegExp grammar.
  2. There's no reason to define the phrase "the ASCII punctuators that need escaping" if it's just used once, in the algorithm that follows. Just inline the string. Also I don't like how the string is denoted since it contains a ". Can you figure out some other way to denote it?

Maybe a table (code point, hex escape string) would solve both of my issues.

@ljharb
Copy link
Member Author

ljharb commented Nov 2, 2023

For the second, I have no strong opinions about how to denote the characters; I can certainly inline the string.

For the first, isn't that required, otherwise this proposal won't be able to produce output that works in both u and v mode?

@michaelficarra
Copy link
Member

No, I believe hex escapes are sufficient for this purpose. Do you have a specific counterexample?

@ljharb
Copy link
Member Author

ljharb commented Nov 2, 2023

cc @bakkot since this was a result of their research

@bakkot
Copy link
Collaborator

bakkot commented Nov 2, 2023

Hex encoding works, it just makes the output completely unreadable. I can't imagine we're going to make \& mean anything other than & in u-mode regexps, so I don't think there's much cost in making those legal, and I think we ought to pay that cost for the benefit of more readable output.

@michaelficarra
Copy link
Member

I do not value the readability of the output. This function is meant only for composing and then compiling RegExps.

@bakkot
Copy link
Collaborator

bakkot commented Nov 3, 2023

Sometimes you have to debug compiled regexps.

@michaelficarra
Copy link
Member

You can decompile them and represent them however you like. Paste them into a visualiser.

@bakkot
Copy link
Collaborator

bakkot commented Nov 3, 2023

That's a bunch more work than doing console.log(regex). Why should anyone have to do that work? That's a very concrete cost to your suggestion and I see approximately no offsetting benefit. What benefit do you see to your suggestion?

@ljharb
Copy link
Member Author

ljharb commented Nov 16, 2023

@michaelficarra as far as the spec review goes, I've addressed your editorial comment; can I check you off?

The normative one we can certainly discuss further if needed.

@michaelficarra
Copy link
Member

The spec text is fine for what you intended. I still have issue with the escaping design though.

@ljharb
Copy link
Member Author

ljharb commented Nov 17, 2023

Thanks! I'll check you off, but can you file a new issue to further discuss the escaping design? That seems like the only obstacle to seeking stage 3.

@sophiebits
Copy link

Escaping using hex escapes also has the advantage that it's straightforwardly polyfillable in practice, whereas changing the UnicodeMode grammar would seem to not really be.

@gibson042
Copy link
Contributor

gibson042 commented Nov 27, 2023

My review:

  • In |IdentityEscape|, / should only appear once in the new expansion to non-|SyntaxCharacter| punctuators and ` should be included (but I am in favor of introducing the new expansions that improve parity with non-Unicode literals as anticipated by How is escaping of plain text affected by this proposal? proposal-regexp-v-flag#71 (comment) — it cuts off any future special use of those sequences, but that doesn't seem like a problem to me).
  • In the RegExp.escape introduction, the phrase "control characters" should be replaced with something less confusable with the similarly-named ASCII concept (e.g., "This method takes a string and returns a similar string in which each character that is potentially special in a regular expression |Pattern| has been replaced by an escape sequence representing that character.").
  • [Which leading characters should be escaped? #66] In the RegExp.escape algorithm, I think it is necessary to escape not just a leading decimal digit but also a leading ASCII letter (consider e.g. new RegExp("\\c" + RegExp.escape("J")) in a web browser implementing |ExtendedAtom|, where the result should match "\\cJ" and [unlike /\cJ/] fail to match "\n"), and possibly just to always escape the first character. I also think comprehensibility would be improved by adding a note explaining that logic branch and by making it more complete (rather than separating \x3 from the second hexadecimal digit by an intervening Else step). For example, assuming it's acceptable to always emit a \u… rather than something more dynamic like QuoteJSONString:
    1. For each code point _c_ in _cpList_, do
      1. If _escapedList_ is empty [and _c_ needs escaping], then
        1. NOTE: Escaping is required to ensure that the output may be suffixed onto an arbitrary prefix without being misinterpreted as part of an escape sequence that starts within that prefix.
        1. Let _codeUnits_ be UTF16EncodeCodePoint(_c_).
        1. Let _len_ be the length of _codeUnits_.
        1. Assert: _len_ = 1 or _len_ = 2.
        1. Let _codeUnit_ be the code unit at index 0 within _codeUnits_.
        1. Let _escapedCodeUnit_ be UnicodeEscape(_codeUnit_).
        1. Append to _escapedList_ the elements of StringToCodePoints(_escapedCodeUnit_).
        1. If len = 2, then
          1. Set _codeUnit_ to the code unit at index 1 within _codeUnits_.
          1. Set _escapedCodeUnit_ to UnicodeEscape(_codeUnit_).
          1. Append to _escapedList_ the elements of StringToCodePoints(_escapedCodeUnit_).
      1. Else,
        1. If _toEscape_ contains _c_ or _c_ is matched by |WhiteSpace|, then
          1. Append code unit U+005C (REVERSE SOLIDUS) to _escapedList_.
        1. Append _c_ to _escapedList_.
    1. Return CodePointsToString(_escapedList_).
    

The first issue is trivial to fix and the second is purely editorial, leaving only the third as something to actually resolve (potentially even by accepting the risk and making no normative change). In my opinion, this is ready for stage 3.

@ljharb
Copy link
Member Author

ljharb commented Nov 27, 2023

First two are fixed; we can discuss the third in #66 but iirc the intention was simply to not support using the output of RegExp.escape in contexts like that.

@syg
Copy link

syg commented Jan 26, 2024

Spec draft is fairly short and I didn't see any glaring editorial issues.

I find the alias definition for "the ASCII punctuators that need escaping" as a separate paragraph the precede the algorithmic steps strange. Why not have a step that says "Let the ASCII punctuators [...] be the String [...]"?

@ljharb
Copy link
Member Author

ljharb commented Jan 26, 2024

Sure, I could do that - happy to go with whatever yall want there.

ljharb added a commit that referenced this issue Mar 20, 2024
@ljharb
Copy link
Member Author

ljharb commented Mar 25, 2024

ok, this has been reworked with #67 - @jridgewell @michaelficarra @gibson042 @syg @bakkot, can you confirm that you're signed off, assuming you are?

@jridgewell
Copy link
Member

LGTM

1 similar comment
@bakkot
Copy link
Collaborator

bakkot commented Mar 25, 2024

LGTM

@michaelficarra
Copy link
Member

@bakkot Does the string coercion in step 1 align with our new coercion strategy?

@bakkot
Copy link
Collaborator

bakkot commented Mar 25, 2024

Oh, good point. This should throw a TypeError on any non-string inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests