Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: regexp: add iterator forms of matching methods #61902

Open
rsc opened this issue Aug 9, 2023 · 38 comments
Open

proposal: regexp: add iterator forms of matching methods #61902

rsc opened this issue Aug 9, 2023 · 38 comments
Labels
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Aug 9, 2023

We propose to add methods to regexp that allow iterating over matches instead of having to accumulate all the matches into a slice.

This is one of a collection of proposals updating the standard library for the new 'range over function' feature (#61405). It would only be accepted if that proposal is accepted. See #61897 for a list of related proposals.

Regexp has a lot of methods that return slices of all matches (the “FindAll*” methods). Each should have an iterator equivalent that doesn’t build the slice. They can be named by removing the “Find” prefix. The docs would change as follows. (Plain text is unchanged; strikethrough is removed, bold is added):

There are 16 24 methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression:

(Find|All|FindAll)?(String)?(Submatch)?(Index)?

If 'All' is present, the routine matches successive non-overlapping matches of the entire expression.
The ‘Find’ form returns the first match. The ‘All’ form returns an iterator over all matches.
Empty matches abutting a preceding match are ignored.
The For ‘FindAll’, the return value is a slice containing the successive return values of the corresponding non-’All’ non-‘Find’ routine. These The ‘FindAll’ routines take an extra integer argument, ...

Instead of enumerating all eight methods here, let’s just show one example.
FindAllString currently reads:

// FindAllString is the 'All' version of FindString; it returns a slice of all
// successive matches of the expression, as defined by the 'All' description in
// the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllString(s string, n int) []string

This would change to become a pair of methods:

// FindAllString is the 'All' 'FindAll' version of FindString; it returns a slice of all
// successive matches of the expression, as defined by the 'All' 'FindAll' description in
// the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllString(s string, n int) []string

// AllString is the ‘All’ version of ‘FindString’; it returns an iterator over all
// successive matches of the expression, as defined by the ‘All’ description in
// the package comment.
func (re *Regexp) AllString(s string) iter.Seq[[]string]

The full list is:

// All is the ‘All’ version of ‘Find’: it returns an iterator over all ...
func (re *Regexp) All(b []byte) iter.Seq[[]byte]

// AllIndex is the ‘All’ version of ‘FindIndex’: it returns an iterator over all ...
func (re *Regexp) AllIndex(b []byte) iter.Seq[[]int]

// AllString is the ‘All’ version of ‘FindString’: it returns an iterator over all ...
func (re *Regexp) AllString(s string) iter.Seq[string]

// AllStringIndex is the ‘All’ version of ‘FindStringIndex’: it returns an iterator over all ...
func (re *Regexp) AllStringIndex(s string) iter.Seq[[]int]

// AllStringSubmatch is the ‘All’ version of ‘FindStringSubmatch’: it returns an iterator ...
func (re *Regexp) AllStringSubmatch(s string) iter.Seq[[]string]

// AllStringSubmatchIndex is the ‘All’ version of ‘FindStringSubmatchIndex’: it returns ...
func (re *Regexp) AllStringSubmatchIndex(s string) iter.Seq[[]int]

// AllSubmatch is the ‘All’ version of ‘FindSubmatch’: it returns an iterator over all ...
func (re *Regexp) AllSubmatch(b []byte) iter.Seq[[][]byte]

// AllSubmatchIndex is the ‘All’ version of ‘FindSubmatchIndex’: it returns an iterator ...
func (re *Regexp) AllSubmatchIndex(b []byte) iter.Seq[[]int]

There would also be a new SplitSeq method alongside regexp.Regexp.Split, completing the analogy with strings.Split and strings.SplitSeq.

// SplitSeq returns an iterator over substrings of s separated by the expression.
func (re *Regexp) SplitSeq(s string) iter.Seq[string]

@rsc rsc added the Proposal label Aug 9, 2023
@gopherbot gopherbot added this to the Proposal milestone Aug 9, 2023
@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Aug 9, 2023
@cespare
Copy link
Contributor

cespare commented Aug 9, 2023

@rsc under "The full list is" the actual function signatures need to be fixed to match the correct names in the doc comments.

@magical
Copy link
Contributor

magical commented Aug 9, 2023

It seems like you are trying to establish a general pattern in the stdlib that the word All in a method means that it returns an iterator. However, the regexp package has already established a pattern that All means a method that returns a slice. I don't think that dropping the Find prefix is a particularly good or memorable compromise, and that it would be better in this case to find a different word to replace All (say, Iter).

Which is to say, I think these names would be a lot clearer:

    func (re *Regexp) FindIter(b []byte) iter.Seq[[]byte]
    func (re *Regexp) FindIterIndex(b []byte) iter.Seq[[]int]
    func (re *Regexp) FindIterString(s string) iter.Seq[string]
    func (re *Regexp) FindIterStringIndex(s string) iter.Seq[[]int]
    func (re *Regexp) FindIterStringSubmatch(s string) iter.Seq[[]string]
    func (re *Regexp) FindIterStringSubmatchIndex(s string) iter.Seq[[]int]
    func (re *Regexp) FindIterSubmatch(b []byte) iter.Seq[[][]byte]
    func (re *Regexp) FindIterSubmatchIndex(b []byte) iter.Seq[[]int]

@Merovius
Copy link
Contributor

Merovius commented Aug 9, 2023

@magical I think Iter as a name component reads very badly. I'd be fine with it in regexp (given how rarely it's used), but definitely not as a general convention. I'd prefer essentially anything over changing the All convention to an Iter convention, including accepting this slight ambiguity in the regexp package, deviating from the All convention in the regexp package only, or not adding iterator-versions of methods to regexp at all. In particular given how rarely used the regexp package is in my experience.

@rsc
Copy link
Contributor Author

rsc commented Aug 9, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Aug 9, 2023
@benhoyt
Copy link
Contributor

benhoyt commented Aug 10, 2023

I have found it quite confusing in the past to figure out which of the 16 Find* methods to use for a particular task. Determining String vs non-string is quite straight-forward, but I always have to re-read several paragraph of docs and play around to determine which one of the remaining 8 to use. It's cute, even elegant, that the method names are matched by a regex (Find(All)?(String)?(Submatch)?(Index)?), but in my experience that doesn't make it easier to choose which one you need. This proposal is going to increase that cognitive burden. I don't have a concrete suggestion, but I wonder if there's a better way. From this perspective I do think having Iter (or Seq?) in the name could help. Or maybe there needs to be a "regex method chooser wizard" in the docs? Only semi-serious, but maybe this can be solved by documentation improvements.

@extemporalgenome
Copy link
Contributor

In addition to the correction @cespare proposes, I believe your AllString example has the wrong returned iterator element type ([]string). It should be string:

func (re *Regexp) AllString(s string) iter.Seq[string]

@extemporalgenome
Copy link
Contributor

I believe it would be beneficial to document in the relevant methods that any yielded slices ([]byte, []int, [][]byte, []string) cannot be retained beyond the current iteration step, thus allowing the regex implementation to reuse these slices. There is precedent for this kind of restriction in bytes.Buffer.Bytes and elsewhere.

I suspect most use-cases will be unaffected by this restriction, other than to benefit from reduced allocations (which could potentially further benefit from internal use of sync.Pool to share applications between unrelated iterations). Those cases that do need to retain the slice can easily enough call slices.Clone or append.

go vet could include a check to ensure these slices do not escape in most cases.

@leaxoy
Copy link

leaxoy commented Aug 11, 2023

Emm, that is, we will have nearly 60 methods to do matching? 😂

How about add a new type RegexpIter? And provide methods convert between the two.

@extemporalgenome
Copy link
Contributor

extemporalgenome commented Aug 11, 2023

I agree with @leaxoy.

Irrespective of my above comments, I believe we'll be better served with an approach closer to bufio.Scanner, in which something like a *Match value, with methods for fetching index offset ranges, individual matches as a string or byte slice, etc.

@Merovius
Copy link
Contributor

Merovius commented Aug 11, 2023

I don't believe a separate type is a good idea. The regexp.Regexp type is a representation of a compiled regular expression, not of how to extract its values. We don't have regexp.BytesRegexp and regexp.StringRegexp, for example. And the API surface is larger with separate types, rather than smaller. And 99% of the logic will be shared (in fact, all the other extraction methods will probably be implemented in terms of the iterator methods).

It's unfortunate that the *regexp.Regexp type has that many methods. But I don't think deviating from that design decision at this point will be an improvement. It just means the result will be even more confusing, because now it is also incongruent and asymmetric.

@mrg0lden
Copy link

mrg0lden commented Aug 11, 2023

Regarding the index variants, wouldn't make sense to use a Seq2[int,string|[]byte] instead of having two variants?
That would be similar to slices' convention.

@extemporalgenome
Copy link
Contributor

With separate methods, particularly the Index variants, I am concerned that we'd be adding the methods for completeness without gaining much value.

FindReaderIndex is of questionable utility, given that when you have a reader that you know nothing about (or know that you can't or don't want to reread, such as io.Stdin), it's rare that you can make meaningful use of the indices alone.

There are other cases where having a separate Match type is simply more efficient: the consumer may want a string representation of some submatches while having a byte slice representation of others. With a Match type, the consumer does not need to care what the underlying input was. If a string for a particular submatch is requested, it'll slice or copy a portion the underlying input data depending on whether or not it was a string, but will be no worse in allocation efficiency compared to what the caller needs to do today.

@extemporalgenome
Copy link
Contributor

@Merovius regarding deviating from the current convention being confusing, I think that's entirely manageable if all of the iter methods are internally self-consistent. Per your point, if we implement the methods as Russ initially proposed and then introduce a separate type, that certainly will be confusing. So this is really the only good time we'll get to make a clean decision.

@Merovius
Copy link
Contributor

Merovius commented Aug 11, 2023

@extemporalgenome We seem to be talking past each other. Putting the method on a new type is the deviation. Obviously that can't be addressed by making them "internally self-consistent". And it's also not about timing - it's a deviation if we do it from the beginning just as much.

If we never had the *regexp.Regexp type with it's current slew of variations, then maybe we could talk about making the API more minimal. But that ship sailed over a decade ago. At this point, we have one *regexp.Regexp type and it has methods based on a regular schema that are grouped by type they return. Adding iter.Seq as a type to return on equal footing with []byte and string is conforming to that - again, established a decade ago - pattern.

@doggedOwl
Copy link

Is there really a need for these? do the regexp find methods return enough elements to justify returning an iterator or is this just to avoid that one slice allocation?

@leaxoy
Copy link

leaxoy commented Aug 12, 2023

It's unfortunate that the *regexp.Regexp type has that many methods. But I don't think deviating from that design decision at this point will be an improvement. It just means the result will be even more confusing, because now it is also incongruent and asymmetric.

@Merovius a crazy idea, how about introduce regexp v2 like the math v2 to simplify API.

@Merovius
Copy link
Contributor

@leaxoy It doesn't seem that wild to me. I think there is an argument to be made that a) we might want to wait a release or so to see how the math/rand/v2 migration works and b) if we intend to make a v2, we might as well add these in the meantime, as any harm would be time-limited.

But yeah, it's not really up to me. Personally, I think this proposal is fine as it is, but maybe the Go team can be persuaded to do a v2 for regexp as well (I'm not sure how a simplified API would look, though).

@doggedOwl I thought the same thing, TBH. Especially as the matching groups can be sliced out of the input, so just the actual result slice has to be allocated. There are still two ways in which an iterator form arguably might improve performance: 1. it might enable you to prematurely stop some matching work - though this seems to be possible in a corner case at best. And 2. it might enable you to do subgroup matching entirely without allocations.

But I'm not totally sold on needing an iterator form of these either. It would be possible to feed the iterator into other iterator transformation functions, but then again, that would be just as possible by using slices.All on the result.

@ulikunitz
Copy link
Contributor

Unfortunately the proposal doesn't say, what the benefit of the proposal is or what problem it is trying to address. It looks more like a demo for iterator functions.

The proposal will further inflate the number of methods for the Regexp type. Already now I have to consult the documentation every time I use the package.

I would welcome a regexp2 package that simplifies the interface. Maybe by using byte slices and string as type arguments and supporting only the iterator methods, since the first match functions wouldn't be required anymore. Using a match type could also reduce the provided variants.

@rsc
Copy link
Contributor Author

rsc commented Aug 16, 2023

Unfortunately the proposal doesn't say, what the benefit of the proposal is or what problem it is trying to address.

It says

Regexp has a lot of methods that return slices of all matches (the “FindAll*” methods). Each should have an iterator equivalent that doesn’t build the slice.

The benefit is that it doesn't build the slice. If you are searching large texts, you almost always want to consider the matches one at a time, and building the slice of all results is wasted memory, potentially larger than the text.

@rsc
Copy link
Contributor Author

rsc commented Aug 30, 2023

Finishing this proposal discussion is blocked on #61405.

@deefdragon
Copy link

This should be unblocked right? Or is it waiting on the iterators addition to roll out?

@aclements
Copy link
Member

This is a lot of new methods, but it's also very regular and consistent with the existing API. Is there anything still blocking this proposal?

@adonovan
Copy link
Member

adonovan commented Dec 11, 2024

Every time I use the regexp API I find it hard to remember what all the different unnamed numbers signify. Has anyone prototyped a version of a Match method that returns an iter.Seq of some abstract match data type that provides methods (with informative names!) that can return any information you need about a given match: its indices, its byte or string value (allocating if different from the input string type), its submatch index, and so on?

(speaking for proposal committee)

@adonovan
Copy link
Member

adonovan commented Jan 22, 2025

https://go.dev/cl/643896 provides a sketch of the idea above:

package regexp

// All returns the sequence of matches of the regular expression on
// the input text, which may be a string or a []byte slice.
//
// TODO(adonovan): API:
//   - Should we define two variants All([]byte), AllString(string)?
//     This is consistent with the String flavors of the existing API.
//   - Or use generics: func All[S string|[]byte](S)?
//     This means we must forego methods.
func (re *Regexp) All(text any) iter.Seq[Substring]

// A Substring represents a subsequence of an input string or []byte
// slice that matches a regular expression.
type Substring struct { ... }

// String returns the matched substring.
func (s Substring) String() string

// AppendTo appends the matched substring to the provided slice.
func (s Substring) AppendTo(slice []byte) []byte

// Start returns the matched substring's start index,
// relative to the input of [Regexp.All].
func (s Substring) Start() int

// End returns the matched substring's end index,
// relative to the input of [Regexp.All].
func (s Substring) End() int

// NumSubmatch returns the number of submatches.
func (s Substring) NumSubmatch() int

// Submatch returns the ith submatch of the current match.
func (s Substring) Submatch(i int) Substring

Feedback welcome.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/643896 mentions this issue: regexp: sketch of iterator API

@Merovius
Copy link
Contributor

@adonovan One note is, that this API doesn't allow to inspect a match without allocation (String()) or copy (AppendTo()). Adding a Bytes() []byte method would allow the caller to avoid both - they can use String() if they searched a string and Bytes() if they searched a []byte.

In regards to your question, I would add All and AllString. I dislike not having a method and there are only two variants.

@adonovan
Copy link
Member

adonovan commented Jan 23, 2025

Incorporating @Merovius's suggestions, we have:

// All returns the sequence of matches of the regular expression on
// the input text.
func (re *Regexp) All(text []byte) iter.Seq[Substring]

// AllString returns the sequence of matches of the regular expression
// on the input text.
func (re *Regexp) AllString(text string) iter.Seq[Substring]

// A Substring represents a subsequence of an input string or []byte
// slice that matches a regular expression.
//
// Call the [Substring.Bytes] or [Substring.String] method to access
// the text of the substring. (For efficiency, use the method that
// corresponds to the function--[Regexp.All] or
// [Regexp.AllString]--that created this Substring.)
//
// A Substring is valid only for the duration of the iteration that
// produced it.
type Substring struct  {}

// String returns the matched substring.
//
// If this Substring was created by [Regexp.All], String must allocate
// a copy; it may be more efficient to use [Substring.Bytes].
func (s Substring) String() string

// Bytes returns the matched substring.
//
// If this Substring was created by [Regexp.All], Bytes returns a
// clipped slice of the original array; if created by
// [Regexp.AllString], it allocates a copy of part of the
// original string.
func (s Substring) Bytes() []byte

// AppendTo appends the matched substring to the provided slice.
func (s Substring) AppendTo(slice []byte) []byte

// Start returns the matched substring's start index,
// relative to the input of [Regexp.All].
func (s Substring) Start() int

// End returns the matched substring's end index,
// relative to the input of [Regexp.All].
func (s Substring) End() int

// NumSubmatch returns the number of submatches.
func (s Substring) NumSubmatch() int

// Submatch returns the ith submatch of the current match.
// The boolean indicates whether it exists.
func (s Substring) Submatch(i int) (Substring, bool)

[Update: changed Substring to be valid only for the duration of its iteration. The CL implements a defensive check.]
[Update: Submatch may fail, so it returns an additional bool.]

@apparentlymart
Copy link

I have some code that makes use of named capture groups but hadn't looked at it for a while and was initially worried that the above API would be hard to use in that case.

However, I'm pleased to report that it actually wasn't a problem after all! Regexp.SubexpNames returns the name of each capture group, using "" for any capture group that is not named, and so it's therefore possible to correlate that result with results from Substring.Submatch to find the substring associated with a particular name, just as one would do with the result of Regexp.FindStringSubmatchIndex today.

This new proposed API is also, in my opinion at least, considerably more ergonomic than working with Regexp.FindStringSubmatchIndex. The existing function returns a flat slice where even elements are starting indices and odd elements are ending indices, leading to some potentially-error-prone *2 and /2 all over code working with that result, which is all encapsulated inside Substring.Submatch in this proposal.

Two potential additions that would make my use-case more convenient:

  • func (s Substring) Submatches() iter.Seq2[i, Substring] for more convenient use of a series of unnamed capture groups.

    Using s.NumSubmatch in conjunction with s.Submatch is not that hard (still better than the FindStringSubmatchIndex version of the code) so I could live without it to keep the API surface smaller.

  • func (re *Regexp) FindSubstring() (Substring, bool) (and FindSubstringString()? 😬) as a one-match-only equivalent to Regexp.All that exposes the result using the same Substring API, to make it easier to reuse code between the two situations.

    Using All/AllString and then attempting to read a single item from the sequence would be a viable workaround for me, so again I could live without it but I personally think Substring is a nicer API overall and so I'd prefer to use it in all cases and never use any of the various Find... methods again.


In case anyone finds it interesting, some specific places where I'd like to use this new API:

In this particular usage, earlier code has checked to make sure that the given pattern consistently uses either all named or all unnamed capture groups -- an additional constraint imposed by me for simplicity's sake -- so these are handled entirely separately.

captureIdxs is the result from Regexp.FindStringSubmatchIndex. If the Substring form of this proposal were accepted, I expect I'd rewrite this function to take a regexp.Substring instead. Since I currently have two callers using FindStringSubmatchIndex and FindAllStringSubmatchIndex respectively, this is where I'd find the Regexp.FindSubstringString useful to easily reuse the same Substring-based implementation for both cases.

@adonovan
Copy link
Member

adonovan commented Jan 23, 2025

Thanks for the feedback.

Re: Submatches and iterators: the usual case is that Submatch is called with a specific constant index to extract a particular portion, rather than that all submatches are iterated over. (The submatches are more like a struct than a slice, i.e., heterogenous.) The proposed Submatch(int) also dovetails nicely with named groups: re.Submatch(re.SubexpIndex("foo")).

Re: FindSubstring (which I think must have a []byte parameter?): I don't think the gain from FindSubstring is very much: one source line, and no difference in efficiency:

if sub, ok := re.FindSubstring(bytes); ok {
    use(sub)
}

versus:

for sub := range re.All(bytes) {
    use(sub)
    break
}

(A better name might be Regexp.First.)

@jimmyfrasche
Copy link
Member

Is AppendTo needed now that there are Bytes and String?

Should String be Text for the same reasons as bufio.Scanner?

Substring does not seem like a descriptive name. Maybe RegexpMatch? That puts it below Regexp and though it's a bit long few would ever have to write it out.

@adonovan
Copy link
Member

Is AppendTo needed now that there are Bytes and String?

Not sure; perhaps not since it can be derived from the other two for any given situation.

Should String be Text for the same reasons as bufio.Scanner?

Substring should have a String method so that you can log.Print them. If it has that method, it doesn't need a Text method.

Substring does not seem like a descriptive name. Maybe RegexpMatch? That puts it below Regexp and though it's a bit long few would ever have to write it out.

regexp.Match would be ideal but is taken. regexp.RegexpMatch is unfortunately stuttery (though, as you point out, rarely needs to be written).

@jimmyfrasche
Copy link
Member

ReMatch is shorter but perhaps confusing in its own way.

I wouldn't want to print one and get the current value. I'd only be printing it to log or print-debug so I'd either want just the type name so I realize I did something silly or something more informative to help me.

@apparentlymart
Copy link

If we're bikeshedding about the type name 😀 then I will offer regexp.Matched, intended to read as being short for "thing [that was] matched by a regexp".

regexp.Matched.String, regexp.Matched.Bytes, and regexp.Matched.Submatch all then read relatively nicely to my eye at least. Folks tend to like to name variables after their type when there's not another name that would be more relevant, so that'd be matched.String(), matched.Bytes(), and matched.Submatch() when used as a method through such a variable.

It is admittedly unconventional to use a past tense verb to name a type. In this particular case I'm treating it as an adjective (participle) modifying an implied noun, which is also unconventional but perhaps not so weird? 🤷‍

Agreed that Match would've been the ideal name, were that not already a function. Typical hazard of English's love of verbing nouns and nouning verbs. I was aiming for something as close as possible to that which still had some sort of defensible reason, and so I "participled" the verb.

(FWIW I was also just fine with Substring, given that it would presumably typically be written as regexp.Substring in the rare case where it needs to be written at all. The package name and the name together seem sufficiently descriptive to me, although I would concede that it could potentially be read as "substring of a regexp" -- whatever that might mean -- rather than "substring matched by a regexp".)

@aclements
Copy link
Member

func (s Substring) Submatch(i int) (Substring, bool)

This feels like a false symmetry. What happens if I ask for a submatch of a submatch?

There are really two levels here: matches and submatches (aka capture groups). I'm not positive, but I think you flattened them into one concept because a match is also a match of the whole regexp. But I think its better to separate these concepts. Here's what that would look like as a minimal change from @adonovan 's API, where "matches" are represented as a []Submatch:

// All returns the sequence of matches of the regular expression on
// the input text.
//
// Submatch 0 is the match of the entire expression,
// submatch 1 is the match of the first parenthesized subexpression,
// and so on.
func (re *Regexp) All(text []byte) iter.Seq[[]Submatch]

// AllString returns the sequence of matches of the regular expression
// on the input text.
func (re *Regexp) AllString(text string) iter.Seq[[]Submatch]

type Submatch struct { ... }

// String returns the matched substring.
//
// If this Submatch was created by [Regexp.All], String must allocate
// a copy; it may be more efficient to use [Substring.Bytes].
func (s Submatch) String() string

// Bytes returns the matched substring.
//
// If this Submatch was created by [Regexp.All], Bytes returns a
// clipped slice of the original array; if created by
// [Regexp.AllString], it allocates a copy of part of the
// original string.
func (s Submatch) Bytes() []byte

// AppendTo appends the matched substring to the provided slice.
func (s Submatch) AppendTo(slice []byte) []byte

// Start returns the matched substring's start index,
// relative to the input of [Regexp.All].
func (s Submatch) Start() int

// End returns the matched substring's end index,
// relative to the input of [Regexp.All].
func (s Submatch) End() int

Using a []Submatch provides natural indexing and iteration, and conveniently avoids conflicts with the existing Match function. On the other hand, the distinct meaning of element 0 makes iteration error-prone (a problem shared with the existing API), and if you only care about the whole match this does needless allocation (though maybe you should have used silent capture groups in that case). Also, by not having an explicit Match type representing the sequence, we miss out on some nice opportunities, like providing a Named(name string) Submatch convenience.

A few smaller things:

I'd propose replacing Start and End with a single method:

// Indexes returns the location of this submatch.
// The match itself is at b[start:end], where b is the original input.
// If this submatch is empty, it returns 0, 0, false.
func (s Submatch) Indexes() (start, end int, ok bool)

This whole approach is dipping our toes in a bit of a v2 API, albeit in a backwards compatible way, and it feels odd to only do the iterator part. That suggests we should also support single matches:

func (re *Regexp) First(b []byte) []Submatch
func (re *Regexp) FirstString(s string) []Submatch

This also raises the question of whether we should use generics, which shrinks Submatch a bit and avoids some of the "will or it won't it allocate":

// All returns the sequence of matches of the regular expression on
// the input text.
//
// Submatch 0 is the match of the entire expression,
// submatch 1 is the match of the first parenthesized subexpression,
// and so on.
func (re *Regexp) All(text []byte) iter.Seq[[]Submatch[[]byte]]

// AllString returns the sequence of matches of the regular expression
// on the input text.
func (re *Regexp) AllString(text string) iter.Seq[[]Submatch[string]]

type Submatch[T []byte | string] struct { ... }

// String returns the matched substring.
//
// If this match came from a []byte, this may allocate.
func (s Submatch[T]) String() string

// Match returns the matched substring as its original type.
func (s Submatch[T]) Match() T

// Indexes returns the location of this submatch.
// The match itself is at b[start:end], where b is the original input.
// If this submatch is empty, it returns 0, 0, false.
func (s Submatch[T]) Indexes() (start, end int, ok bool)

@aclements
Copy link
Member

// A Substring is valid only for the duration of the iteration that
// produced it.
type Substring struct {}

There's a tension between allocation and aliasing here that worries me. This also appears in @rsc's proposal at the top and in whether iteration reuses the []Submatch slice in the variation I proposed.

@rsc's proposal doesn't say anything about whether slices are reused between iterations and, by omission, that tells me that they are not. That's certainly the safer choice. The compiler is at least sometimes able to eliminate this allocation, and I'm sure could do a better job than it does today: https://go.dev/play/p/MJTonvzkfkM?v=gotip

@adonovan
Copy link
Member

This feels like a false symmetry. What happens if I ask for a submatch of a submatch?
There are really two levels here: matches and submatches (aka capture groups). I'm not positive, but I think you flattened them into one concept because a match is also a match of the whole regexp. But I think its better to separate these concepts.

True, my proposed API raises the question of sub-sub-matches, and the answer is simply "there are none". You could define a different type for capturing group, but it would be identical to Substring but lacking the {Num,}Submatch methods. I suppose a it could also provide a Name() string method, for convenience; but we could equally add a (Substring).NamedGroup(string) (Substring, bool) to Substring, again documenting that a sub-Substring has no NamedGroups.

I'm not sure it's necessary to distinguish the parent and child types, but I don't feel strongly. If we were to do it, I would suggest keeping the abstract Substring data type, and adding a second type for the child:

type Group struct { ... }
func (Group) Name() string
func (Group) String() string
func (Group) Bytes() []byte
func (Group) AppendTo(slice []byte) []byte
func (Group) Indices() (start, end int)

I think we have often over-revealed the representation of core data types, either because of the lack of abstract iteration, or motivated by efficiency (especially in the pre-inlining days) but ultimately foreclosing desirable optimizations or reorganizations of the implementation.

// Submatch 0 is the match of the entire expression,
// submatch 1 is the match of the first parenthesized subexpression

The need to know that "zero means the whole thing" is IMHO an unfortunate flaw in the original API. Now that we have iterators and method inlining, we can do better. Regexps without submatches are I think the common case, yet this API would make them harder to use.

func (re *Regexp) All(text []byte) iter.Seq[[]Submatch]

We should not expose the complete slice of submatches, as it either forces an allocation for each match (which, as Russ points out, can easily add up to more than the input text), or demands that the loop body only "borrow" the slice, which is more prone to error. The {Num,}Submatch API is efficient and safe. (One could also provide an iterator, but, as mentioned above, efficient random access is much more important and common than looping over submatches.)

I'd propose replacing Start and End with a single method:

I had that initially, and decided against it just because it forces the need for a statement. But one never asks for one without the other. I don't feel strongly either way.

// Indexes returns the location of this submatch.
// The match itself is at b[start:end], where b is the original input.
// If this submatch is empty, it returns 0, 0, false.
func (s Submatch) Indexes() (start, end int, ok bool)

This approach creates Submatch values even when there was no submatch, requiring all methods (e.g. String) to quietly return zero on a non-matching Submatch. The approach I proposed doesn't even create the Submatch unless it is valid.

func (re *Regexp) First(b []byte) []Submatch
func (re *Regexp) FirstString(s string) []Submatch

These seem fine to me, but we could also add them later.

@aclements
Copy link
Member

I would suggest keeping the abstract Substring data type, and adding a second type for the child:
type Group struct { ... }

Note that the regexp package does not use the term "group". It calls these "submatches" (and it calls the syntactic part of the regular expression itself a "subexpression"). That's why I called my type Submatch.

The need to know that "zero means the whole thing" is IMHO an unfortunate flaw in the original API.

Yes, but it's also standard across a huge range of languages and regexp packages that subexpressions are numbered starting at "1" and that "0" refers to the whole thing. This convention also permeates substitutions, for example Regexp.Expand starts submatches at $1 and sed starts at \1. We may be able to drop specifically "zero means the whole thing", but we can't drop that submatch numbering starts at 1.

We should not expose the complete slice of submatches, as it either forces an allocation for each match (which, as Russ points out, can easily add up to more than the input text), or demands that the loop body only "borrow" the slice, which is more prone to error. The {Num,}Submatch API is efficient and safe.

I may be missing something, but I don't see how to implement Substring.Submatch without the same allocation or aliasing problems as the []Submatch API.

This approach creates Submatch values even when there was no submatch, requiring all methods (e.g. String) to quietly return zero on a non-matching Submatch. The approach I proposed doesn't even create the Submatch unless it is valid.

I'm not sure I quite agree. In your approach, Substring.Submatch still has to return some Submatch along with false, though presumably just the zero value. We still have to define what the methods of the zero value of Submatch do, which is either going to be panicking or returning a zero value. This also prevents using Substring.Submatch in an expression context, which I imagine is what people would usually want to do (e.g., m.Submatch(3).String()).

@adonovan
Copy link
Member

Note that the regexp package does not use the term "group". It calls these "submatches" (and it calls the syntactic part of the regular expression itself a "subexpression"). That's why I called my type Submatch.

Fair enough; I picked that name casually, figuring we'll always haggle over the names as the last step. ;-)

We may be able to drop specifically "zero means the whole thing", but we can't drop that submatch numbering starts at 1.

I'll take your word for it; that's unfortunate.

I may be missing something, but I don't see how to implement Substring.Submatch without the same allocation or aliasing problems as the []Submatch API.

The sketch implementation in the attached CL allocates a single []int array for the entire scan over the input, and it is shared by all Substrings, which is why they each become invalid after the end of the iteration that produced them. I haven't measured it but I suspect this could be a major reduction in allocation, given how careful the package is about allocations otherwise. The key is that the array doesn't escape the loop.

This also prevents using Substring.Submatch in an expression context, which I imagine is what people would usually want to do (e.g., m.Submatch(3).String()).

Good point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Active
Development

No branches or pull requests