Skip to content
This repository has been archived by the owner on Jun 7, 2022. It is now read-only.

Reconsider unconditionally exposing capture offsets on result #12

Closed
schuay opened this issue Jul 15, 2019 · 23 comments
Closed

Reconsider unconditionally exposing capture offsets on result #12

schuay opened this issue Jul 15, 2019 · 23 comments

Comments

@schuay
Copy link

schuay commented Jul 15, 2019

Revisiting feedback from #1 (comment):

Currently, the proposed changes add an optional bag of args to RegExp.p.exec and multiple other RegExp and String methods. One can pass { capture: "indices" } to tell exec to return capture indices instead of the captured substrings. Indices are returned as a 2-element array [start, end].

I don't think this is the way to go: it doesn't feel like clean API, it opens us up to future complexity ('we already have the options object, why not add more there'), it modifies 7+ functions just to pass through matchOptions, it adds conditional functionality to exec (as the choke-point for all RegExp functions, IMO we should keep exec simple).

It may be cleaner to remove the option args and just expose indices on regexp results unconditionally.

  • If we carefully specify the way to access these indices, overhead would be low. We'd only need to set a single property (at least in V8, the regexp engine already provides a list of all capture indices). And additional memory use could be just 2 * sizeof(VM integer) * number of captures.
  • The way to access these could be a method pair on the result object (result.captureStart(i) / result.captureEnd(i)).
  • A potential complication: currently the result is just an Array with extra data properties. The extra methods would probably be located on the prototype and so we'd have to define a dedicated prototype for the result.
  • Alternatively, we could expose them as two arrays on the result, 'starts' and 'ends'. But with this approach we'd have the overhead of creating two full new JSArrays and transforming the internal array.

So, my proposal would be: expose indices unconditionally through captureStart(index) and captureEnd(index) on the result object. The only overhead would be that the captures array (which we already have) is kept alive by the result object.

@ljharb
Copy link
Member

ljharb commented Jul 15, 2019

I’d think a single getter or captureIndexes function own property on the match object would be simpler?

@rbuckton
Copy link
Collaborator

rbuckton commented Jul 15, 2019

This was the original Stage 0 proposal, but I was cautioned against it due to the memory overhead added to every existing RegExp result, even if they didn't use this feature. I agree that a captureIndices property would have been the preferred approach.

A less memory-intrusive approach (though possibly more break-y) would have been to use something like an Integer Indexed Exotic Object for the result of exec/match (etc.) instead of an Array, and have the string elements of the array lazily evaluate based on the stored indices. I'm not certain there's a good way to spec that, however, much less whether it would be a good idea.

@schuay
Copy link
Author

schuay commented Jul 15, 2019

I was cautioned against it due to the memory overhead added to every existing RegExp result

If done carefully, the memory overhead would be at most 2 ints per capture. Assuming 4-byte integers, that's comparable to each captured string being 8 characters longer.

@mathiasbynens
Copy link
Member

I was cautioned against it due to the memory overhead added to every existing RegExp result

To be clear, it was cautioned against by the same people who are now asking to revisit this. We now believe the overhead could be minimal if implemented carefully. It seems like there's agreement this would be strictly better than having to add an options bag to multiple existing methods, so let's do it!

schuay@'s suggestion applies regardless of whether we go with captureStart/captureEnd or captureIndices.

@rbuckton
Copy link
Collaborator

I'm looking into putting together a PR for this change using an "indices" getter on the result that would lazily create an array of start/end tuples, along with its own groups object mapping groups to the same start/end tuples. This would be similar to the original Stage 0/1 proposal, with the only overhead being the need to hold onto the start/end index (and group name, if one is associated with it).

Would that be sufficient to address the memory concerns? If so I can put up a PR by tomorrow.

@rbuckton
Copy link
Collaborator

@ljharb If we don't have options, then it seems there'd be no way to get this information from @@matchAll for a RegExp with the g flag. We might be able to support @@match with g, but it becomes very complicated to lazily evaluate indices, so for @@match with g they would have to be eagerly evaluated.

@schuay
Copy link
Author

schuay commented Jul 18, 2019

I'm looking into putting together a PR for this change using an "indices" getter on the result that would lazily create an array of start/end tuples, along with its own groups object mapping groups to the same start/end tuples.

Sounds reasonable.

If we don't have options, then it seems there'd be no way to get this information from @@matchall for a RegExp with the g flag

Why not? @@matchAll returns an iterator of regexp results, each of which would have the indices getter, right?

cc @hashseed

@ljharb
Copy link
Member

ljharb commented Jul 18, 2019

@rbuckton I'm not sure what you mean; matchAll could easily create a getter function for each match object that access internal slots to build up the info.

@hashseed
Copy link

While we are at putting alternatives on the table: we could also introduce a new regexp flag to prescribe getting capture indices. I'm not saying that this is what I would prefer though.

I'm not sure adding indices to match with g makes sense. The same could be achieved with matchAll already, and I was under the assumption that we only want to extend regexp result objects, e.g. ones returned by matchAll or exec. match returns an array of strings.

@rbuckton
Copy link
Collaborator

@hashseed: We've already discussed and generally disliked the idea of making this a regexp flag.

@ljharb: I think I misread part of @@matchAll, so I see that it would not be an issue. However, @@match is still a bit problematic in the g case, since I can't refer to any existing internal slot as I only have an array to work with (since RegExpExec could just return the result of a user-defined custom exec() function). If I try to add a lazily-evaluated indices property to the result for @@match, the lazy-evaluation would be user-observable (assuming that matters).

@@match is already strange in the g case since it doesn't have a groups property.

@rbuckton
Copy link
Collaborator

match returns an array of strings.

Yes and no. match for a RegExp with g returns only an array of strings. match for a RegExp without g returns the result of RegExpExec, which is the more complex array of strings or undefineds with the extra index, input, and group properties.

@ljharb
Copy link
Member

ljharb commented Jul 18, 2019

match for a global regex should be unchanged by this proposal; people can use matchAll if they want any useful output for a global regex.

match for a nonglobal regex should be the same as matchAll's only yielded match object for a nonglobal regex.

@rbuckton
Copy link
Collaborator

@ljharb: So effectively drop any specialization in @@match and @@matchAll since they're no longer needed?

If you would, please take a look at #14 as it contains an alternative to the current Stage 2 semantics that reverts back to adding a property to the result of RegExpBuiltinExec.

@schuay
Copy link
Author

schuay commented Jul 18, 2019

A major motivation behind this proposal (changing back to unconditionally including offsets on results) was that we would not have to modify any regexp builtins other than exec. I agree it's totally fine for match not to include offsets, that's why matchAll was introduced.

@ljharb
Copy link
Member

ljharb commented Jul 18, 2019

@rbuckton from this proposal? perhaps; i'll look at the PR.

@schuay note that match would include offsets for nonglobal regexes, which imo is the proper thing to do.

@schuay
Copy link
Author

schuay commented Jul 18, 2019

@ljharb right, thanks for clarifying. My point was that happens automatically, without modifying any of the logic in match itself.

@rbuckton
Copy link
Collaborator

@ljharb: For the current version of the proposal (that uses an options object), is it worth supporting an extra argument to match/@@match if matchAll/@@matchAll is the future-preferred alternative? There is concern about threading an options argument through the existing @@match method due to possible user customization and subclassing, however @@matchAll is still fairly new and may not suffer from the same burden.

@ljharb
Copy link
Member

ljharb commented Jul 19, 2019

I think that the API should remain consistent; so yes, it would be good to include the options object on match as well; matchAll being new means it might not be as common, but it's still just as likely to be subclassed as match is (which, imo, is virtually zero likelihood on either; but opinions differ here)

@rbuckton
Copy link
Collaborator

rbuckton commented Jul 19, 2019

Is there a tool for searching JavaScript GitHub projects for usages of Symbol.match? For TypeScript there's BigTSQuery: https://bigtsquery.firebaseapp.com/query?selector=PropertyAccessExpression%5Btext%3D%22Symbol.match%22%5D. I've tried using GitHub's own search, but it doesn't seem to want to treat the . as an actual ..

@SMotaal
Copy link

SMotaal commented Jul 19, 2019

@rbuckton If I might offer one opinion from the wild…

I personally explored a lot of subclassing patterns to augment exec behaviours, but forgetting specific details, all my work eventually (probably for at least a few good reasons) moved away from Symbol.… hooks on actual RegExp subclasses.

Also, re argumenting the ‹extends RegExp›.prototype.exec, this proposal itself being out there for quite some time now should have at least cautioned (did for me) otherwise.

Maybe worthwhile considering rolling out experimental but when you do, please consider having a little more visibility+outreach+involvement in the efforts for the community — ie breaking something vital or just outdated feeling very critical to those momentarily affected but also historically subjectively very polarizing and frustrating to those affected permanently as a result.

Update: 👍 @ljharb pointing out consistency here — { new(fixedSource, fixedFlags?) } and { [method](... specificArguments, fixedOptions?) } and fixed… here signifies both immutability and consistency of semantics across all RegExp methods.

@ljharb
Copy link
Member

ljharb commented Jul 19, 2019

Subclassing RegExp properly requires overriding 1 or more of the Symbol hooks, or exec.

@SMotaal
Copy link

SMotaal commented Jul 19, 2019

Yes, overriding without making breaking assumptions of the interface.

Here is roughly how I go about it personally:

class extends RegExp {
  betterExec(... args) {
    return // somehow get the result
  }

  capture(result, ... args) {
    // somehow augment the result
    return result;
  }
  
  exec(... args) {
    const result = this.betterExec(...args) || super.exec(...args);
    if (result) this.capture(result, ...args);
    return result;
  }
}

@ljharb, I override exec of RegExp subclasses was my point — I concluded that Symbols are well suited where subclassing is not desired, or if you really need to modify a particular RegExp behaviour for good reasons (I did not myself find one yet to justify doing away with conformance/builtin gains there).

@rbuckton
Copy link
Collaborator

Switching to result.indices was the accepted consensus. Merged in #14.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants