Reconsider unconditionally exposing capture offsets on result #12

schuay · 2019-07-15T10:36:29Z

Revisiting feedback from #1 (comment):

Currently, the proposed changes add an optional bag of args to RegExp.p.exec and multiple other RegExp and String methods. One can pass { capture: "indices" } to tell exec to return capture indices instead of the captured substrings. Indices are returned as a 2-element array [start, end].

I don't think this is the way to go: it doesn't feel like clean API, it opens us up to future complexity ('we already have the options object, why not add more there'), it modifies 7+ functions just to pass through matchOptions, it adds conditional functionality to exec (as the choke-point for all RegExp functions, IMO we should keep exec simple).

It may be cleaner to remove the option args and just expose indices on regexp results unconditionally.

If we carefully specify the way to access these indices, overhead would be low. We'd only need to set a single property (at least in V8, the regexp engine already provides a list of all capture indices). And additional memory use could be just 2 * sizeof(VM integer) * number of captures.
The way to access these could be a method pair on the result object (result.captureStart(i) / result.captureEnd(i)).
A potential complication: currently the result is just an Array with extra data properties. The extra methods would probably be located on the prototype and so we'd have to define a dedicated prototype for the result.
Alternatively, we could expose them as two arrays on the result, 'starts' and 'ends'. But with this approach we'd have the overhead of creating two full new JSArrays and transforming the internal array.

So, my proposal would be: expose indices unconditionally through captureStart(index) and captureEnd(index) on the result object. The only overhead would be that the captures array (which we already have) is kept alive by the result object.

The text was updated successfully, but these errors were encountered:

ljharb · 2019-07-15T15:57:09Z

I’d think a single getter or captureIndexes function own property on the match object would be simpler?

rbuckton · 2019-07-15T18:56:52Z

This was the original Stage 0 proposal, but I was cautioned against it due to the memory overhead added to every existing RegExp result, even if they didn't use this feature. I agree that a captureIndices property would have been the preferred approach.

A less memory-intrusive approach (though possibly more break-y) would have been to use something like an Integer Indexed Exotic Object for the result of exec/match (etc.) instead of an Array, and have the string elements of the array lazily evaluate based on the stored indices. I'm not certain there's a good way to spec that, however, much less whether it would be a good idea.

schuay · 2019-07-15T20:08:58Z

I was cautioned against it due to the memory overhead added to every existing RegExp result

If done carefully, the memory overhead would be at most 2 ints per capture. Assuming 4-byte integers, that's comparable to each captured string being 8 characters longer.

mathiasbynens · 2019-07-17T09:06:29Z

I was cautioned against it due to the memory overhead added to every existing RegExp result

To be clear, it was cautioned against by the same people who are now asking to revisit this. We now believe the overhead could be minimal if implemented carefully. It seems like there's agreement this would be strictly better than having to add an options bag to multiple existing methods, so let's do it!

schuay@'s suggestion applies regardless of whether we go with captureStart/captureEnd or captureIndices.

rbuckton · 2019-07-18T01:13:25Z

I'm looking into putting together a PR for this change using an "indices" getter on the result that would lazily create an array of start/end tuples, along with its own groups object mapping groups to the same start/end tuples. This would be similar to the original Stage 0/1 proposal, with the only overhead being the need to hold onto the start/end index (and group name, if one is associated with it).

Would that be sufficient to address the memory concerns? If so I can put up a PR by tomorrow.

rbuckton · 2019-07-18T05:33:20Z

@ljharb If we don't have options, then it seems there'd be no way to get this information from @@matchAll for a RegExp with the g flag. We might be able to support @@match with g, but it becomes very complicated to lazily evaluate indices, so for @@match with g they would have to be eagerly evaluated.

schuay · 2019-07-18T05:38:52Z

I'm looking into putting together a PR for this change using an "indices" getter on the result that would lazily create an array of start/end tuples, along with its own groups object mapping groups to the same start/end tuples.

Sounds reasonable.

If we don't have options, then it seems there'd be no way to get this information from @@matchall for a RegExp with the g flag

Why not? @@matchAll returns an iterator of regexp results, each of which would have the indices getter, right?

cc @hashseed

ljharb · 2019-07-18T05:43:55Z

@rbuckton I'm not sure what you mean; matchAll could easily create a getter function for each match object that access internal slots to build up the info.

hashseed · 2019-07-18T05:50:39Z

While we are at putting alternatives on the table: we could also introduce a new regexp flag to prescribe getting capture indices. I'm not saying that this is what I would prefer though.

I'm not sure adding indices to match with g makes sense. The same could be achieved with matchAll already, and I was under the assumption that we only want to extend regexp result objects, e.g. ones returned by matchAll or exec. match returns an array of strings.

rbuckton · 2019-07-18T06:04:40Z

@hashseed: We've already discussed and generally disliked the idea of making this a regexp flag.

@ljharb: I think I misread part of @@matchAll, so I see that it would not be an issue. However, @@match is still a bit problematic in the g case, since I can't refer to any existing internal slot as I only have an array to work with (since RegExpExec could just return the result of a user-defined custom exec() function). If I try to add a lazily-evaluated indices property to the result for @@match, the lazy-evaluation would be user-observable (assuming that matters).

@@match is already strange in the g case since it doesn't have a groups property.

rbuckton · 2019-07-18T06:15:10Z

match returns an array of strings.

Yes and no. match for a RegExp with g returns only an array of strings. match for a RegExp without g returns the result of RegExpExec, which is the more complex array of strings or undefineds with the extra index, input, and group properties.

ljharb · 2019-07-18T06:20:18Z

match for a global regex should be unchanged by this proposal; people can use matchAll if they want any useful output for a global regex.

match for a nonglobal regex should be the same as matchAll's only yielded match object for a nonglobal regex.

rbuckton · 2019-07-18T06:22:43Z

@ljharb: So effectively drop any specialization in @@match and @@matchAll since they're no longer needed?

If you would, please take a look at #14 as it contains an alternative to the current Stage 2 semantics that reverts back to adding a property to the result of RegExpBuiltinExec.

schuay · 2019-07-18T06:22:51Z

A major motivation behind this proposal (changing back to unconditionally including offsets on results) was that we would not have to modify any regexp builtins other than exec. I agree it's totally fine for match not to include offsets, that's why matchAll was introduced.

ljharb · 2019-07-18T06:26:14Z

@rbuckton from this proposal? perhaps; i'll look at the PR.

@schuay note that match would include offsets for nonglobal regexes, which imo is the proper thing to do.

schuay · 2019-07-18T06:35:03Z

@ljharb right, thanks for clarifying. My point was that happens automatically, without modifying any of the logic in match itself.

rbuckton · 2019-07-19T00:12:02Z

@ljharb: For the current version of the proposal (that uses an options object), is it worth supporting an extra argument to match/@@match if matchAll/@@matchAll is the future-preferred alternative? There is concern about threading an options argument through the existing @@match method due to possible user customization and subclassing, however @@matchAll is still fairly new and may not suffer from the same burden.

ljharb · 2019-07-19T02:05:02Z

I think that the API should remain consistent; so yes, it would be good to include the options object on match as well; matchAll being new means it might not be as common, but it's still just as likely to be subclassed as match is (which, imo, is virtually zero likelihood on either; but opinions differ here)

rbuckton · 2019-07-19T03:33:57Z

Is there a tool for searching JavaScript GitHub projects for usages of Symbol.match? For TypeScript there's BigTSQuery: https://bigtsquery.firebaseapp.com/query?selector=PropertyAccessExpression%5Btext%3D%22Symbol.match%22%5D. I've tried using GitHub's own search, but it doesn't seem to want to treat the . as an actual ..

SMotaal · 2019-07-19T12:27:15Z

@rbuckton If I might offer one opinion from the wild…

I personally explored a lot of subclassing patterns to augment exec behaviours, but forgetting specific details, all my work eventually (probably for at least a few good reasons) moved away from Symbol.… hooks on actual RegExp subclasses.

Also, re argumenting the ‹extends RegExp›.prototype.exec, this proposal itself being out there for quite some time now should have at least cautioned (did for me) otherwise.

Maybe worthwhile considering rolling out experimental but when you do, please consider having a little more visibility+outreach+involvement in the efforts for the community — ie breaking something vital or just outdated feeling very critical to those momentarily affected but also historically subjectively very polarizing and frustrating to those affected permanently as a result.

Update: 👍 @ljharb pointing out consistency here — { new(fixedSource, fixedFlags?) } and { [method](... specificArguments, fixedOptions?) } and fixed… here signifies both immutability and consistency of semantics across all RegExp methods.

ljharb · 2019-07-19T16:01:38Z

Subclassing RegExp properly requires overriding 1 or more of the Symbol hooks, or exec.

SMotaal · 2019-07-19T17:16:25Z

Yes, overriding without making breaking assumptions of the interface.

Here is roughly how I go about it personally:

class extends RegExp {
  betterExec(... args) {
    return // somehow get the result
  }

  capture(result, ... args) {
    // somehow augment the result
    return result;
  }
  
  exec(... args) {
    const result = this.betterExec(...args) || super.exec(...args);
    if (result) this.capture(result, ...args);
    return result;
  }
}

@ljharb, I override exec of RegExp subclasses was my point — I concluded that Symbols are well suited where subclassing is not desired, or if you really need to modify a particular RegExp behaviour for good reasons (I did not myself find one yet to justify doing away with conformance/builtin gains there).

rbuckton · 2019-07-25T19:07:02Z

Switching to result.indices was the accepted consensus. Merged in #14.

waldemarhorwat mentioned this issue Jul 17, 2019

Stage 3 Specification Review - waldemarhorwat #10

Closed

rbuckton mentioned this issue Jul 18, 2019

Switch back to a property on the exec result #14

Merged

rbuckton mentioned this issue Jul 18, 2019

Stage 3 Specification Review - msaboff #8

Closed

mathiasbynens mentioned this issue Jul 19, 2019

Meta: Stage 3 Specification Reviews #11

Closed

3 tasks

rbuckton closed this as completed Jul 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconsider unconditionally exposing capture offsets on result #12

Reconsider unconditionally exposing capture offsets on result #12

schuay commented Jul 15, 2019

ljharb commented Jul 15, 2019

rbuckton commented Jul 15, 2019 •

edited

Loading

schuay commented Jul 15, 2019

mathiasbynens commented Jul 17, 2019

rbuckton commented Jul 18, 2019

rbuckton commented Jul 18, 2019

schuay commented Jul 18, 2019 •

edited

Loading

ljharb commented Jul 18, 2019

hashseed commented Jul 18, 2019

rbuckton commented Jul 18, 2019

rbuckton commented Jul 18, 2019

ljharb commented Jul 18, 2019

rbuckton commented Jul 18, 2019

schuay commented Jul 18, 2019

ljharb commented Jul 18, 2019

schuay commented Jul 18, 2019

rbuckton commented Jul 19, 2019

ljharb commented Jul 19, 2019

rbuckton commented Jul 19, 2019 •

edited

Loading

SMotaal commented Jul 19, 2019 •

edited

Loading

ljharb commented Jul 19, 2019 •

edited

Loading

SMotaal commented Jul 19, 2019 •

edited by ljharb

Loading

rbuckton commented Jul 25, 2019

Reconsider unconditionally exposing capture offsets on result #12

Reconsider unconditionally exposing capture offsets on result #12

Comments

schuay commented Jul 15, 2019

ljharb commented Jul 15, 2019

rbuckton commented Jul 15, 2019 • edited Loading

schuay commented Jul 15, 2019

mathiasbynens commented Jul 17, 2019

rbuckton commented Jul 18, 2019

rbuckton commented Jul 18, 2019

schuay commented Jul 18, 2019 • edited Loading

ljharb commented Jul 18, 2019

hashseed commented Jul 18, 2019

rbuckton commented Jul 18, 2019

rbuckton commented Jul 18, 2019

ljharb commented Jul 18, 2019

rbuckton commented Jul 18, 2019

schuay commented Jul 18, 2019

ljharb commented Jul 18, 2019

schuay commented Jul 18, 2019

rbuckton commented Jul 19, 2019

ljharb commented Jul 19, 2019

rbuckton commented Jul 19, 2019 • edited Loading

SMotaal commented Jul 19, 2019 • edited Loading

ljharb commented Jul 19, 2019 • edited Loading

SMotaal commented Jul 19, 2019 • edited by ljharb Loading

rbuckton commented Jul 25, 2019

rbuckton commented Jul 15, 2019 •

edited

Loading

schuay commented Jul 18, 2019 •

edited

Loading

rbuckton commented Jul 19, 2019 •

edited

Loading

SMotaal commented Jul 19, 2019 •

edited

Loading

ljharb commented Jul 19, 2019 •

edited

Loading

SMotaal commented Jul 19, 2019 •

edited by ljharb

Loading