RFC: Added named regular expressions. #1448

kmsquire · 2012-10-26T06:08:24Z

PCRE supports named regular expressions (à la Perl/Python/Ruby), and they could already be used for backreferences. This pull request adds a get_name_table method to pcre.jl and a capture_dict field to RegexMatches to allow access to the named captures, plus minor related updates to Regex, show, etc. It also adds a blurb to the manual.
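For readers unfamiliar with named captures, here is a minimal sketch of what they look like from the user's side. It uses the indexing interface of present-day Julia (`m[:name]` and `m["name"]`), which is an assumption for illustration and not necessarily the `capture_dict` interface proposed in this pull request. The sample string is made up.

```julia
# Named groups are written (?<name>...) in PCRE syntax.
re = r"(?<foo>a+b+).*(?<bar>c+d+)"
m = match(re, "aabb--cdd")

# In current Julia a match can be indexed by group name or by position.
foo_by_symbol = m[:foo]
bar_by_string = m["bar"]
foo_by_index  = m[1]

println(foo_by_symbol, " ", bar_by_string)
```

Indexing by position still works for named groups, so the names are purely an additional way to reach the same captures.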

ViralBShah · 2012-10-28T01:32:48Z

Could you add some tests?

ViralBShah · 2012-10-28T01:33:36Z

Oops, I missed the tests in the diff. I see the tests are already in place.

StefanKarpinski · 2012-10-28T17:24:15Z

I've considered adding this before, but hesitated due to uncertainty about the design, so this is a good kick in the pants to talk about the design of it. My major issue is that it significantly complicates the representation and code paths for normal regex workflows. In particular, this concerns the shit out of me because it's in the code path for all regex matches and even in the fast case where there are no names, it ends up gratuitously creating a Dict object (which is not exactly cheap). Since one is likely to use regular expressions to parse massive data files, this is very problematic.

One possible solution is to have separate types: indexed regexes vs. named regexes, and indexed regex matches vs. named regex matches. Then only code that actually uses regex names incurs these additional complications, and even there you have fewer branches and checks in the matching path, which is the really performance-critical one.

However, ideally, regex matching should do even less work than it currently does, and named regexes can actually help with that. Typically what you want to do is something like this:

m = match(r"(a+b+).*(c+d+)", str)
if m != nothing
  foo = m.captures[1]
  bar = m.captures[2]
  # do something with foo and bar
end

So basically, you want to have the matching substrings bound to local variables. Sweet! Then there's no need for that annoying array, m.captures, let alone a hash of names; let's just use local variables. So you kind of want something like this:

@match r"(?<foo>a+b+).*(?<bar>c+d+)" begin
  # foo and bar are bound inside here
  # do something with foo and bar
end

However, that may not be the optimal syntax, especially since there's no nice way to have an else clause except awkwardly sticking another block after the end:

@match r"(?<foo>a+b+).*(?<bar>c+d+)" begin
  # foo and bar are bound inside here
  # do something with foo and bar
end begin
  # foo and bar are not bound here
  # handle not matching
end

See #88, #1289 and discussion in #1288. One possible hack would be to make `begin else end` in macro calls expand to `begin end begin end`, so that it's just a prettier way to write the above slightly awkward macro call.
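To make the macro idea above concrete, here is a rough sketch in present-day Julia. Everything in it is an assumption for illustration: the name `@match_named`, the helpers `capture_names` and `match_expr`, and the extra argument for the subject string (the examples above leave out what is being matched against). It finds group names by scanning the literal pattern text, which is a simplification and not how PCRE reports them.

```julia
# Hypothetical helper: collect the names of (?<name>...) groups from pattern text.
capture_names(pat::AbstractString) =
    [Symbol(m.captures[1]) for m in eachmatch(r"\(\?<([A-Za-z_]\w*)>", pat)]

# Hypothetical helper: build the expansion shared by both macro methods.
function match_expr(re, str, body, otherwise)
    (re isa Expr && re.head === :macrocall && re.args[1] === Symbol("@r_str")) ||
        error("expected a regex literal of the form r\"...\"")
    names = capture_names(re.args[3])   # args[3] is the pattern string
    m = gensym("m")
    # One assignment per named group, binding the name in the caller's scope.
    binds = [:($(esc(n)) = $m[$(QuoteNode(n))]) for n in names]
    quote
        $m = match($(esc(re)), $(esc(str)))
        if $m !== nothing
            let
                $(binds...)
                $(esc(body))
            end
        else
            $(esc(otherwise))
        end
    end
end

# Form with only a "matched" block; evaluates to `nothing` when there is no match.
macro match_named(re, str, body)
    match_expr(re, str, body, nothing)
end

# Form with a second block that runs when the match fails.
macro match_named(re, str, body, otherwise)
    match_expr(re, str, body, otherwise)
end

matched = @match_named r"(?<foo>a+b+).*(?<bar>c+d+)" "aabb--cdd" begin
    (foo, bar)
end

missed = @match_named r"(?<foo>a+b+)" "zzz" begin
    foo
end begin
    :no_match
end
```

Note that this sketch still goes through the ordinary `match` call and indexes the result by name, so it only demonstrates the syntax; it does not deliver the "do even less work" benefit discussed above.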

cc: @JeffBezanson

kmsquire · 2012-10-29T05:51:19Z

I'm glad this patch sparked a discussion! I like the match block syntax. Taking off from one of your first comments, what about having different regex types depending on the regular expression:

abstract type Regex
type NormalRegex <: Regex
...
end

type CaptureRegex <: Regex
...
end

type NamedCaptureRegex <: Regex
...
end

function regex(pat::String, opts::Integer, study::Bool)
    # initialize regex...
    ...
    if has_captures(re)
        if has_named_captures(re)
            NamedCaptureRegex(pat, opts, re, ex)
        else
            CaptureRegex(pat, opts, re, ex)
        end
    else
        NormalRegex(pat, opts, re, ex)
    end
end

This kind of fakes ~~regexes~~ dispatch based on value, but it would allow, e.g., NormalRegex matches to be boolean, while CaptureRegex matches could produce tuples and NamedCaptureRegex matches could produce named entities. The latter two could both allow your block syntax. I imagine it might also be nice to use these to directly produce a DataFrame entry or concrete type (or perhaps even a dictionary, for some applications), although the block syntax might be enough to do this.

Kevin
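As a rough illustration of the three-type idea above, the sketch below wraps Julia's existing `Regex` and not PCRE directly. All names in it (`PatternKind`, `PlainPattern`, `CapturePattern`, `NamedCapturePattern`, `pattern`, `trymatch`) are hypothetical, it uses present-day syntax, and the way it counts groups (matching an always-succeeding variant of the pattern against an empty string) is a stand-in for querying PCRE.

```julia
abstract type PatternKind end

struct PlainPattern <: PatternKind          # no capture groups
    re::Regex
end
struct CapturePattern <: PatternKind        # positional groups only
    re::Regex
end
struct NamedCapturePattern <: PatternKind   # at least one named group
    re::Regex
    names::Vector{Symbol}
end

# Pick the wrapper type from the pattern's contents ("dispatch on value").
function pattern(pat::AbstractString)
    re = Regex(pat)
    names = [Symbol(m.captures[1]) for m in eachmatch(r"\(\?<([A-Za-z_]\w*)>", pat)]
    # "(?:pat)|" always matches "", so its capture vector has one slot per group.
    ngroups = length(match(Regex("(?:$pat)|"), "").captures)
    if ngroups == 0
        PlainPattern(re)
    elseif isempty(names)
        CapturePattern(re)
    else
        NamedCapturePattern(re, names)
    end
end

# Each kind returns a different shape of result, chosen by dispatch.
trymatch(p::PlainPattern, s) = occursin(p.re, s)

function trymatch(p::CapturePattern, s)
    m = match(p.re, s)
    m === nothing ? nothing : Tuple(m.captures)
end

function trymatch(p::NamedCapturePattern, s)
    m = match(p.re, s)
    m === nothing ? nothing :
        NamedTuple{Tuple(p.names)}(Tuple(m[n] for n in p.names))
end

plain = trymatch(pattern("a+b+"), "xaab")
pairs = trymatch(pattern("(a+)(b+)"), "xaab")
named = trymatch(pattern("(?<foo>a+b+).*(?<bar>c+d+)"), "aabb--cdd")
```

The return type of `pattern` depends on the pattern text, so callers that build patterns at run time pay for a dynamic dispatch once per pattern; the per-match paths are then free of name-related branches, which is the trade-off being discussed.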

ViralBShah · 2013-02-20T14:34:17Z

@StefanKarpinski What are we planning to do about this? This has been around for 4 months.

StefanKarpinski · 2013-02-20T15:25:08Z

We're not going to merge this pull-request as-is, so this has become more of a feature request. There's a lot of Regex re-design that's needed. If we had else clauses on for loops, I could immediately switch the recommended regex usage over to the for-loop style, which would be a start.

kmsquire · 2013-02-20T16:40:45Z

Why not switch it over to the for-loop style now and leave the else clauses for later? For most uses, the else clauses wouldn't be necessary.
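A small sketch of what this could look like, assuming "for-loop style" means iterating over the matches of a regex. It uses present-day Julia's `eachmatch`; the function name `collect_foo` and the sample pattern are made up for illustration.

```julia
# Collect every run of a's followed by b's; the loop body runs once per match.
function collect_foo(str::String)
    found = SubString{String}[]
    for m in eachmatch(r"(?<foo>a+b+)", str)
        push!(found, m[:foo])
    end
    # Stand-in for an `else` clause: handle the no-match case after the loop.
    isempty(found) && return nothing
    return found
end

hits = collect_foo("ab aabb xyz abb")
none = collect_foo("xyz")
```

When there is no match the loop body simply never runs, which is why an else clause is often unnecessary; the explicit check after the loop covers the cases where it is.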

StefanKarpinski · 2013-02-20T16:45:43Z

I guess we could do that too. It doesn't address the original issue here, though, which is the named captures business. I could write that @match macro, although the end begin thing is ugly.

Commit 2e075f0: Added named regular expressions.

ghost assigned StefanKarpinski Oct 26, 2012

kmsquire mentioned this pull request Nov 4, 2012: Support comprehensions with unknown-length iterables #1457 (Closed)

kmsquire mentioned this pull request May 19, 2013: RFC: Added do-expression syntax for regex matches. #3146 (Closed)

JeffBezanson closed this May 21, 2013

kmsquire mentioned this pull request May 20, 2015: Named subpatterns in regex #11362 (Closed)
