-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Added named regular expressions. #1448
Conversation
Could you add some tests? |
Oops, I missed the tests in the diff. I see the tests are already in place. |
I've considered adding this before, but hesitated due to uncertainty about the design, so this is a good kick in the pants to talk about the design of it. My major issue is that it significantly complicates the representation and code paths for normal regex workflows. In particular, this concerns the shit out of me because it's in the code path for all regex matches and even in the fast case where there are no names, it ends up gratuitously creating a Dict object (which is not exactly cheap). Since one is likely to use regular expressions to parse massive data files, this is very problematic. One possible solution is to have separate types: indexed regexes vs. named regexes and indexed regex matches vs. named regex matches. Then only using regex names incurs these additional complications and even there, you have fewer branches and checks in the matching path which is the really performance-critical one. However, ideally, regex matching should do even less work than it currently does, and named regexes can actually help with that. Typically what you want to do is something like this: m = match(r"(a+b+).*(c+d+)", str)
if m != nothing
foo = m.captures[1]
bar = m.captures[2]
# do something with foo and bar
end So basically, you want to have the matching substrings bound to local variables. Sweet! Then there's no need for that annoying array, @match r"(?<foo>a+b+).*(?<bar>c+d+)" begin
# foo and bar are bound inside here
# do something with foo and bar
end However, that may not be the optimal syntax, especially since there's no nice way to have an @match r"(?<foo>a+b+).*(?<bar>c+d+)" begin
# foo and bar are bound inside here
# do something with foo and bar
end begin
# foo and bar are not bound here
# handle not matching
end See #88, #1289 and discussion in #1288. One possible hack would be to make cc: @JeffBezanson |
I'm glad this patch sparked a discussion! I like the match block syntax. Taking off from one of your first comments, what about having different regex types depending on the regular expression: abstract type Regex
type NormalRegex <: Regex
...
end
type CaptureRegex <: Regex
...
end
type NamedCaptureRegex <: Regex
...
end
function regex(pat::String, opts::Integer, study::Bool)
# initialize regex...
...
if has_captures(re)
if has_named_captures(re)
NamedCaptureRegex(pat, opts, re, ex)
else
CaptureRegex(pat, opts, re, ex)
end
else
NormalRegex(pat, opts, re, ex)
end
end This kind of fakes Kevin |
@StefanKarpinski What are we planning to do about this? This has been around for 4 months. |
We're not going to merge this pull-request as-is, so this has become more of a feature request. There's a lot of Regex re-design that's needed. If we had else clauses on for loops, I could immediately switch the recommended regex usage over to the for-loop style, which would be a start. |
Why not switch it over to the for-loop style now, and forget the else clauses for now? For most uses, the else clauses wouldn't be necessary. |
I guess we could do that too. Doesn't address the original issue here, which is the named captures business. I guess I could write that |
PCRE supports named regular expressions (ala perl/python/ruby), and they could already be used for back referencing. This pull request adds a
get_name_table
method topcre.jl
, acapture_dict
variable toRegexMatch
es to allow access to the named captures, and minor related updates toRegex
,show
, etc. It also adds a blurb in the manual.