-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a regular expressions library to the distribution #3591
Comments
What about writing a regular expression engine in Rust? |
That could be fine. |
For anyone who stumbles across this issue looking for a regex library, we do have a pcre binding in cargo. |
@erickt is that related to @uasi's https://github.com/uasi/rust-pcre ? |
@killerswan neat! that seems to be an independent implementation. I'll ping @uasi to merge our code. |
nominating for feature-complete |
(also, as I mention elsewhere but apparently not in this bug, I prefer re2 for this) |
(bug triage) Milestone is correct; could be a good medium-sized starter project for somebody. |
I'm still interested in this. Been working on an cre2 binding. Not 100% sure how to handle functions that expect c enums. |
I would like to claim this issue to do as a project for my university class. |
@ktt3ja , I haven't been working on this. |
I've been working on this, but it's pretty slow going. See https://github.com/glennsl/rust-re It has a good set of features already, but is currently pretty naively implemented, horribly slow and not very nice to look at. I'm trying to learn rust as I go, and am currently trying to figure out how to optimize it without wandering into unsafe territory. Feel free to fork it or implement something completely separate. A different perspective would just be good as far as I'm concerned. |
There is a wiki page summarising some research about this: https://github.com/mozilla/rust/wiki/Lib-re |
Nvm, I'll be working on something else for my project. |
I've been working on this for my class. The project is here https://github.com/ferristseng/regex-rust. I've been trying to add the features outlined in the ECMA-262 standard. If anyone has any suggestions or would like to help, let me know (this is my first time working on a regular expression engine). |
cc @ssbr |
Evidently there's some duplication of work going on, even if it is early work for everyone. @ferristseng want to join up? I can contribute what I suspect is a more rigorous attempt at a parser (I basically copied the ECMA spec to the letter, with one deviation, so it should be correct -- I did skip a few bits though, and replaced them with empty failing functions), and a large set of tests (150MB worth, I am trying to shrink it down and make them runnable) I've taken out of Firefox. Unfortunately my free time is more limited than I'd like, so I have yet to get around to a good set of runtime implementations. That's the exciting bit I've been trying to get to once all the other pieces are ready. OTOH I suspect that since this is for school you can't have any outside contributors, which would be disappointing. |
Yeah I would definitely like to join up. The class is finished now, so any outside contributions would be welcome. I'm not sure of the quality of the code though, but hopefully there is some useful stuff in there. |
Very cool, @ferristseng ! As a start, you might be interested in the testsuite I've adapted from python (see https://github.com/glennsl/rust-re/blob/master/tests.rs). Should be pretty easy to adapt to your own code. I've been quite busy this past month, but I'm at a point now where I believe I have a fairly good understanding of the issues involved, regarding both performance and functionality. And I'm looking forward to having a bit more time during the holidays to work on this. I would really like to discuss our different approaches and ideas first, though, perhaps via e-mail? My e-mail is firstname at lastname dot net, and my full name is Glenn Slotte. There is one issue that needs to be addressed in a larger context, however: Unicode. The readme on my repository roughly outlines what's needed for the different levels of Unicode TR18 compliance (there's also a third level of compliance, but that's special interest territory). Normalization and case folding, I think, are the two most important points that need to be addressed and implemented separately (#9084 addresses the case folding). |
@glennsl Note that we already have |
Could this discussion be held publicly? I am interested in it. |
@Kimundi Ah, I've missed that, but that should do. Thanks. @ssbr Yeah I figured you'd want to be involved, and was thinking you, me and @ferristseng could sort out the implementation details and coordinate without bothering everyone else. Sorry for not being very clear on that. |
I don't really like the focus on replicating ECMA-262 because JavaScript has awful Unicode support, especially in regular expressions. It doesn't even have basic Unicode character class support - it's entirely anglo-centric. |
@thestinger the Unicode support is fixed in ES6 (with the |
Are you guys still working on this? I've been hacking on an implementation of my own (https://github.com/lfairy/rose). At this stage, it's mostly a cleaned-up version of @glennsl's library. Pike VM is good. It isn't terribly fast, but it's simple and predictable, which is what we want in the stdlib. On Unicode support: we really don't need to handle normalization -- no other popular engine implements that. Input can be normalized outside the library anyway. In addition, neither Python nor grep perform full case folding (Unicode Level 2). AFAIK, only RE2 and PCRE have this feature, the latter very recently (http://bugs.exim.org/show_bug.cgi?id=1208). Taking this into account, I think we should aim for Level 1 compliance only. |
I've been working on cleaning up my code, and adding more features. The code is probably more confusing than it should be. Something that arose in our discussion was whether or not we should support features like backreferences and assertions. It might be a good idea to support them whether we have a fall back method, where we default to using a recursive backtracking algorithm, or find some other way. The downside to this is depending on what we choose, it will certainly add some complexity and overhead to the code. |
@ferristseng I had a quick look over your code -- I agree that it could do with some cleaning up ;) As a first step, I suggest making
Backreferences obviously require backtracking, but I think you can do forward (lookahead) assertions using Pike VM. Split a separate thread for every assertion; the overall expression matches iff any non-assertion thread matches and all assertion threads match. You can implement this in the VM by adding an By the way, my email is lambda dot fairy at gmail dot com. I'd love to join in the discussion. |
I'm currently working on this: https://github.com/BurntSushi/re2 It's already at or near feature parity with RE2 (including Unicode classes), but I haven't written tests or benchmarks yet. (And I still need to make the public API.) I'm hoping to submit a PR within the next few days. |
OK, so I'm pretty close to getting to a point where I'm happy with my regexp implementation. It includes an Given that there is interest in having a regexp library included in the Rust distribution, I'd be happy to submit mine for consideration. Is this something that will have to go through the RFC process? Or can I just submit a PR? AFAIK, my implementation is a pretty faithful port of RE2 with a standard interface. I'm working on doco here: http://burntsushi.net/rustdoc/regexp/ |
That looks really impressive, very nice! In the interest of alerting everyone to a pending regular expression library implementation, it may be nice to have an RFC, but I don't think that the RFC needs to be too long, and I imagine that it will not take too long to get accepted. |
Fantastic, thank you! I'll finish up my doco and write up an RFC as soon as I can. |
@BurntSushi That's some impressive work! |
RFC submitted: rust-lang/rfcs#42 |
Apologies if this is in the wrong place. Perhaps it is irrelevant because this is almost implemented... I thought a new regex implementation probably wasn't complete without a mention of Perl 6 and I couldn't find it anywhere in any Rust discussions. Anyway, I was under the impression that Perl 6 is attempting to fix many issues with current regex and thus should be paid attention to when implementing a new language which includes regex. I just wanted to point it out in case it was an accidental omission. The following documents goes over more details of it if you are interested. I hope they are useful. http://perlcabal.org/syn/S05.html |
@mdinger I hadn't seen what Perl 6 was doing in this space. I just spent some time looking through your links. I'll say it definitely looks cool---the idea of specifying a regexp with a grammar does seem more readable for more complex regexps. But I think it's a separate issue from providing your average run-of-the-mill regexps that everyone already knows and uses. It looks like a prime case for trying out an EDSL in Rust, actually. I also say this because the regexps being offered up right now are strictly less expressive than Perl's. They are a pretty faithful port of RE2, which means no backreferences or generalized zero-width assertions (i.e., lookbehind and lookahead). So the need to simplify is probably quite a bit less than with a regexp engine as complex and powerful as Perl's. |
Very good. As long as it isn't missed because of ignorance. Perl 6 has been in development for a long time without ever getting traction so this might not have been well known or used; it's still perfect for someone else to take advantage of their work. If this is worth pursuing in any fashion, where should it be noted? In the wiki/Libs linked below? On the wiki in another category? Posted to a mailing list? I'm not developing this, I just wanted the correct people to be aware in case they want to pursue it. Sounds useful to me. |
Also adds a regex_macros crate, which provides natively compiled regular expressions with a syntax extension. Closes rust-lang#3591. RFC: 0007-regexps
Implements [RFC 7](https://github.com/rust-lang/rfcs/blob/master/active/0007-regexps.md) and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code. Embeds and passes the `basic`, `nullsubexpr` and `repetition` tests from [Glenn Fowler's (slightly modified by Russ Cox for leftmost-first semantics) testregex test suite](http://www2.research.att.com/~astopen/testregex/testregex.html). I've also hand written a plethora of other tests that exercise Unicode support, the parser, public API, etc. Also includes a `regex-dna` benchmark for the shootout. I know the addition looks huge at first, but consider these things: 1. More than half the number of lines is dedicated to Unicode character classes. 2. Of the ~4,500 lines remaining, 1,225 of them are comments. 3. Another ~800 are tests. 4. That leaves 2500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.
do not run symlink tests on Windows hosts Fixes rust-lang/miri#3587
The exact engine we use needs further discussion. Probably we don't want to use yarr because it requires the nitro jit, which is a big dependency and could be undesirable for various reasons. Ideally, we would have a nice syntax extension that precompiles the regexes (in addition to runtime compilation options).
The text was updated successfully, but these errors were encountered: