-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a regexp crate to the Rust distribution #42
Conversation
A nice implementation strategy to support Unicode is to implement a VM that | ||
matches characters instead of bytes. Indeed, my implementation does this. | ||
However, the public API of a regular expression library should expose *byte | ||
indices* corresponding to match locations (which ought to be guaranteed to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The APIs in std::str
expose byte indices too, so this is well supported in Rust-land.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Fixed.
found this difficult to do with zero-runtime cost. Either way, the ability to | ||
statically declare a regexp is pretty cool I think. | ||
|
||
Note that the syntax extension is the reason for the `regexp_re` crate. It's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should have a convention for crates and their syntax extension pairs, e.g. for a crate foo
, have foo_macros
or foo_synext
or something. (I'd personally be ok with foo_macros
, e.g. regexp_macros
in this case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like foo_macros
too. (I'll change this once there's a consensus?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've used foo_mac
but foo_macros
seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future all that will be necessary is #[phase(syntax, link)] extern crate regexp;
. The compiler will automatically dynamically load the appropriate syntax extension crate, and then it will link to the target crate. Note that this is all far off, and is another reason why phase
is feature gated.
Essentially, I wouldn't worry too much about the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I changed the name for now to regexp_macros
. Even if it isn't necessary, I think it's probably a better name on its own than regexp_re
. Happy to comply with anything though.
[#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people | ||
favor a native Rust implementation if it's to be included in the Rust | ||
distribution. (Does the `re!` macro require it? If so, that's a huge | ||
advantage.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another small downside of binding to an existing library is that it's not necessarily as portable as rust code. Libraries written in rust are maximally portable because they'll go wherever rust goes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Fixed.
This looks amazing, fantastic work! |
But doesn't I certainly don't want any public |
If a procedural* macro is defined in a crate, then it will want to pull in *It doesn't affect liblog because all it's macros are macro_rules, which don't need libsyntax to be linked in. |
@huonw That should be able to be fixed in dead code removal for link time optimisation (would it in practice be?) but I get the point now. Thanks for the explanation. |
A couple of other ideas that I have had with regards to regular expressions are:
I would expect that these would lead to somewhat larger compiled code, but to code that should run more efficiently. I'm not sure if it's a good trade-off or not. Anyway, these lead to something like this:
expanding to something approximating this, plus quite a bit more (I recognise that it isn't a valid expansion in a static value and has various other issues, but it gives the general idea of what I think would be really nice):
This allows nicer usage:
I expect this would be rather difficult to implement, too. Still, just thought I'd toss the idea into the ring as I haven't seen it suggested, but it's been sitting in my mind the whole time the discussion has gone on. |
case---but I'm just hazarding a guess here. (If we go this route, then we'd | ||
probably also have to expose the regexp parser and AST and possibly the | ||
compiler and instruction set to make writing your own backend easier. That | ||
sounds restrictive with respect to making performance improvements in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could expose it as an #[unstable]
or even #[experimental]
interface: i.e. subject to change, but it's possible to use if you really need it.
@chris-morgan That's a really interesting idea. I hadn't thought of it. (I spoke with @eddyb about it on IRC.) I think it's something worth trying and could potentially increase performance dramatically, but I also think it's complex enough that it be thrown in the bin of future work. @eddyb and I both agree that it would require a specialization of the Pike VM, which I think is doable (without allocation even). A more naive implementation is difficult because (I think) it would rely on recursion for handling non-determinism, which would pretty easily result in stack overflows. (In fact, most of the complexity of the Pike VM is a direct result of manually managing a queue of states. Any recursion in the VM is strictly bounded to the number of instructions in the regexp.) If this sounds OK to you, I'll add it to the RFC as possible future work. (Since it may require an API change, I think the |
@chris-morgan, sadly |
@alexcrichton well, it'll work, but introduce a runtime dependency on libsyntax :P |
@alexcrichton Is that a current wart that should be fixed, or will that remain the same? |
Yes, that is the wart that will be fixed. In the future world, there will be no need to manually compile two crates, and there will be no runtime dependency on libsyntax, and the syntax will be |
@BurntSushi Awesome work. Have you by any chance seen D's compile time regex with templates? (Scroll down to "Regular Expression Compiler"). I'm guessing you are probably using a very different method to statically compile things though, but its still interesting. Also, could you explain why you chose the specific identifier for your library and types? Here are some choices you could have made:
We could bikeshed this forever, but I do think it deserves at least some passing consideration before we pull the trigger. |
@bjz Thanks! I did look at D's regexes this morning. From what I understand D provides something similar to my current Also, for the name, I didn't put much thought into it. If you pressed me, I'd say I used it simply because that's the name of the package in Go's standard library. (Which isn't that good of a reason.)
I'd also be happy with |
Meta: should the RFC include a discussion/justification of the name? |
Regarding D, cool to hear your impressions! A while back I heard Andrei make some very bold claims about D's regex performance compared to other libs, and I would wonder how yours would compare. I realise however that this is an RFC in regarding to the public API, and the internals could be improved later. I would make a mention of the naming in the RFC – I think it is important to show you have considered alternatives and precedents rather than jumping on the first one that came to mind. The bike shedding is inevitable (and sometimes necessary), but at least it helps to focus the debate. |
@bjz RE D: Yeah, I think it would be very exciting to see what @chris-morgan's suggestion would do to performance. That along with implementing a DFA are two major optimizations for future work. (Along with a few other minor ones, like a one-pass NFA.) I've added some stuff about the name to the RFC. |
@bjz I agree. I would actually prefer that it be called |
I personally prefer crate |
Google trends for the regexpr, regexp and regex:
|
The `\w` character class and the zero-width word boundary assertion `\b` are | ||
defined in terms of the ASCII character set. I'm not aware of any | ||
implementation that defines these in terms of proper Unicode character classes. | ||
Do we want to be the first? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\w
and \d
and \s
all default to Unicode under Python 3. So there's a little bit of precedent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I actually think D also does it. I'd say that's probably enough precedent to go with Unicode. (For word boundaries too, I think.)
DFA compilation would be great, though probably as an option. The main advantage is performance: it matches in O(n) time and O(1) memory (zero allocations!). It has no runtime dependencies. Plus, since it's effectively a finite state machine, it's straightforward to translate to LLVM. It's not a free lunch though -- a DFA matcher has worst case exponential code size, which can make it impractical for complex expressions. If a DFA compiler is implemented, we can either tuck it under a flag, or enable it by default but fall back if the code becomes too large. Either way, I think finishing Unicode support (especially case folding) is a higher priority. (As for the name, I vote for |
@lfairy Note that a compiled NFA should also have zero (heap) allocations. (The generalized NFA simulator has I believe RE2/C++ tries to use a DFA and falls back to an NFA if the state cache gets flushed too frequently (I think). |
Maybe the DFA approach could be performed by hooking up Ragel to generate Rust AST (this may be tricky). (cc https://github.com/erickt/ragel, which is generating Rust code as text.) |
I find it slightly odd that Would |
@BurntSushi The VM does allocate, to create a list of running threads -- but given the allocation only happens once, I see where you're coming from.
That sounds right. @huonw Thanks for the link. I suspect people who need the performance/expressiveness of Ragel would use it directly though. This leaves the DFA approach in an awkward spot, methinks -- simple cases work well with re2-style state caching, and advanced cases can use a lexer generator or something magical like Ragel. Looks like Russ Cox had it right all along ;) |
The types mean that you can never accidentally use the return of
That doesn't preclude using Ragel just as a step in an efficient regex -> native code translator. (That is, writing a regex syntax -> ragel syntax translator and an output-to-Rust-AST mode for ragel may be easier than writing a direct regex syntax -> Rust-AST translator that results in equally good code. Of course, adding a ragel dependency to the core distribution would be a no-go.) |
I've been working on what @chris-morgan suggested: real compilation to native Rust with the pub enum MaybeNative {
Dynamic(~[Inst]),
Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
} This makes runtime and compiled regexps have a completely identical API. It means it might not be as nice as the API that @chris-morgan suggested, but I think a consistent API between both is probably more valuable. The doco for the updated code is at a different URL. There are some examples using the new macro: http://burntsushi.net/rustdoc/exp/regexp/index.html (But other portions of the doco are rightfully unchanged.) The benchmarks are also very encouraging. Left column is for dynamic regexps and the right column is for natively compiled regexps.
There's also a similarly big jump in performance on the |
I've updated the RFC to use natively compiled regexps and simplified some sections based on discussion here. And the implementation is now Unicode friendly for Perl character classes and word boundaries. Aside from the name of the crate (how is that decided?), I think I've incorporated all feedback given. |
How does the codegen size compare between the old and new syntax extension implementations? Will a binary with a lot of regexes need to avoid native compilation because it would bloat the binary too much? |
indeed, i was thinking similarly: If I use |
@sfackler I don't (yet) have a comparison with the old As a baseline, if I compile These sizes seem pretty reasonable to me, since I think 400+ regexps is a pretty extreme case. @seanmonstar That is indeed possible and I'm already doing it for some pieces. There is more that could be done though. Any piece that has knowledge of types like |
Here's another perspective. Given a minimal binary with a single small regexp (that prints all capture groups), compiling without optimization increases the binary size by 54KB (comparing dynamic vs. native regexp). Compiling with |
We discussed this in today's meeting and decided to merge it. The only caveat we'd like to attach is that the entire crate is |
@BurntSushi: >I'm not exactly sure why they are separate in the public API though.< I think because until someone patches D the interpreter of Compile Time Function Excution to make it more memory-efficient, you sometimes want to avoid ctRegex, to use less compilation memory. |
Fix links to promise function vs Promise type
Links to an existing implementation, documentation and benchmarks are in the RFC. This RFC is meant to resolve issue #3591.
I apologize in advance if I've made any amateur mistakes. I'm still fairly new to the Rust world (~1 month), so I'm sure I still have some misunderstandings about the language lurking somewhere.