Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Added named regular expressions. #1448

Closed
wants to merge 1 commit into from

Conversation

kmsquire
Copy link
Member

PCRE supports named regular expressions (ala perl/python/ruby), and they could already be used for back referencing. This pull request adds a get_name_table method to pcre.jl, a capture_dict variable to RegexMatches to allow access to the named captures, and minor related updates to Regex, show, etc. It also adds a blurb in the manual.

@ghost ghost assigned StefanKarpinski Oct 26, 2012
@ViralBShah
Copy link
Member

Could you add some tests?

@ViralBShah
Copy link
Member

Oops, I missed the tests in the diff. I see the tests are already in place.

@StefanKarpinski
Copy link
Member

I've considered adding this before, but hesitated due to uncertainty about the design, so this is a good kick in the pants to talk about the design of it. My major issue is that it significantly complicates the representation and code paths for normal regex workflows. In particular, this concerns the shit out of me because it's in the code path for all regex matches and even in the fast case where there are no names, it ends up gratuitously creating a Dict object (which is not exactly cheap). Since one is likely to use regular expressions to parse massive data files, this is very problematic.

One possible solution is to have separate types: indexed regexes vs. named regexes and indexed regex matches vs. named regex matches. Then only using regex names incurs these additional complications and even there, you have fewer branches and checks in the matching path which is the really performance-critical one.

However, ideally, regex matching should do even less work than it currently does, and named regexes can actually help with that. Typically what you want to do is something like this:

m = match(r"(a+b+).*(c+d+)", str)
if m != nothing
  foo = m.captures[1]
  bar = m.captures[2]
  # do something with foo and bar
end

So basically, you want to have the matching substrings bound to local variables. Sweet! Then there's no need for that annoying array, m.captures, let alone a hash of names – lets just use local variables. So you kind of want something like this:

@match r"(?<foo>a+b+).*(?<bar>c+d+)" begin
  # foo and bar are bound inside here
  # do something with foo and bar
end

However, that may not be the optimal syntax, especially since there's no nice way to have an else clause except awkwardly sticking another block after the end:

@match r"(?<foo>a+b+).*(?<bar>c+d+)" begin
  # foo and bar are bound inside here
  # do something with foo and bar
end begin
  # foo and bar are not bound here
  # handle not matching
end

See #88, #1289 and discussion in #1288. One possible hack would be to make begin else end in macro calls expand to begin end begin end so that it's just a prettier way to write the above slightly awkward macro call.

cc: @JeffBezanson

@kmsquire
Copy link
Member Author

I'm glad this patch sparked a discussion! I like the match block syntax. Taking off from one of your first comments, what about having different regex types depending on the regular expression:

abstract type Regex
type NormalRegex <: Regex
...
end

type CaptureRegex <: Regex
...
end

type NamedCaptureRegex <: Regex
...
end

function regex(pat::String, opts::Integer, study::Bool)
    # initialize regex...
    ...
    if has_captures(re)
        if has_named_captures(re)
            NamedCaptureRegex(pat, opts, re, ex)
        else
            CaptureRegex(pat, opts, re, ex)
        end
    else
        NormalRegex(pat, opts, re, ex)
    end
end

This kind of fakes regexes dispatch based on value, but, e.g., would allow NormalRegexes matches to be boolean, but CaptureRegexes matches could produce tuples, and NamedCaptureRegexes could produce named entities. The latter two could both allow your block syntax. I imagine it might also be nice to use these to directly produce a DataFrame entry or concrete type (or perhaps even a dictionary, for some applications), although the block syntax might be enough to do this.

Kevin

@ViralBShah
Copy link
Member

@StefanKarpinski What are we planning to do about this? This has been around for 4 months.

@StefanKarpinski
Copy link
Member

We're not going to merge this pull-request as-is, so this has become more of a feature request. There's a lot of Regex re-design that's needed. If we had else clauses on for loops, I could immediately switch the recommended regex usage over to the for-loop style, which would be a start.

@kmsquire
Copy link
Member Author

Why not switch it over to the for-loop style now, and forget the else clauses for now? For most uses, the else clauses wouldn't be necessary.

@StefanKarpinski
Copy link
Member

I guess we could do that too. Doesn't address the original issue here, which is the named captures business. I guess I could write that @match macro though, although the end begin thing is ugly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants