Add a regular expressions library to the distribution #3591

brson · 2012-09-25T21:02:45Z

The exact engine we use needs further discussion. Probably we don't want to use yarr because it requires the nitro jit, which is a big dependency and could be undesirable for various reasons. Ideally, we would have a nice syntax extension that precompiles the regexes (in addition to runtime compilation options).

eholk · 2012-09-25T22:07:50Z

What about writing a regular expression engine in Rust?

brson · 2012-09-25T22:33:53Z

That could be fine.

erickt · 2012-09-28T15:26:29Z

For anyone who stumbles across this issue looking for a regex library, we do have a pcre binding in cargo.

killerswan · 2012-10-05T03:37:18Z

@erickt is that related to @uasi's https://github.com/uasi/rust-pcre ?

erickt · 2012-10-05T04:11:03Z

@killerswan neat! that seems to be an independent implementation. I'll ping @uasi to merge our code.

graydon · 2013-04-25T14:15:49Z

nominating for feature-complete

graydon · 2013-04-25T14:16:06Z

(also, as I mention elsewhere but apparently not in this bug, I prefer re2 for this)

catamorphism · 2013-06-24T15:50:01Z

(bug triage) Milestone is correct; could be a good medium-sized starter project for somebody.

nickdesaulniers · 2013-06-25T19:04:51Z

I'm still interested in this. Been working on an cre2 binding. Not 100% sure how to handle functions that expect c enums.

ktt3ja · 2013-11-01T06:00:40Z

I would like to claim this issue to do as a project for my university class.

nickdesaulniers · 2013-11-01T16:30:17Z

@ktt3ja , I haven't been working on this.

glennsl · 2013-11-03T03:19:59Z

I've been working on this, but it's pretty slow going. See https://github.com/glennsl/rust-re

It has a good set of features already, but is currently pretty naively implemented, horribly slow and not very nice to look at. I'm trying to learn rust as I go, and am currently trying to figure out how to optimize it without wandering into unsafe territory.

Feel free to fork it or implement something completely separate. A different perspective would just be good as far as I'm concerned.

huonw · 2013-11-03T03:39:42Z

There is a wiki page summarising some research about this: https://github.com/mozilla/rust/wiki/Lib-re

ktt3ja · 2013-11-07T05:03:24Z

Nvm, I'll be working on something else for my project.

ferristseng · 2013-12-07T18:46:57Z

I've been working on this for my class. The project is here https://github.com/ferristseng/regex-rust. I've been trying to add the features outlined in the ECMA-262 standard. If anyone has any suggestions or would like to help, let me know (this is my first time working on a regular expression engine).

huonw · 2013-12-07T22:22:35Z

cc @ssbr

ssbr · 2013-12-07T23:17:42Z

Evidently there's some duplication of work going on, even if it is early work for everyone. @ferristseng want to join up? I can contribute what I suspect is a more rigorous attempt at a parser (I basically copied the ECMA spec to the letter, with one deviation, so it should be correct -- I did skip a few bits though, and replaced them with empty failing functions), and a large set of tests (150MB worth, I am trying to shrink it down and make them runnable) I've taken out of Firefox. Unfortunately my free time is more limited than I'd like, so I have yet to get around to a good set of runtime implementations. That's the exciting bit I've been trying to get to once all the other pieces are ready.

OTOH I suspect that since this is for school you can't have any outside contributors, which would be disappointing.

ferristseng · 2013-12-08T18:34:29Z

Yeah I would definitely like to join up. The class is finished now, so any outside contributions would be welcome. I'm not sure of the quality of the code though, but hopefully there is some useful stuff in there.

glennsl · 2013-12-09T00:51:10Z

Very cool, @ferristseng ! As a start, you might be interested in the testsuite I've adapted from python (see https://github.com/glennsl/rust-re/blob/master/tests.rs). Should be pretty easy to adapt to your own code.

I've been quite busy this past month, but I'm at a point now where I believe I have a fairly good understanding of the issues involved, regarding both performance and functionality. And I'm looking forward to having a bit more time during the holidays to work on this. I would really like to discuss our different approaches and ideas first, though, perhaps via e-mail? My e-mail is firstname at lastname dot net, and my full name is Glenn Slotte.

There is one issue that needs to be addressed in a larger context, however: Unicode. The readme on my repository roughly outlines what's needed for the different levels of Unicode TR18 compliance (there's also a third level of compliance, but that's special interest territory). Normalization and case folding, I think, are the two most important points that need to be addressed and implemented separately (#9084 addresses the case folding).

Kimundi · 2013-12-09T08:49:56Z

@glennsl Note that we already have nfd and nfkd normalization in the stdlib.

ssbr · 2013-12-09T12:49:00Z

I've been quite busy this past month, but I'm at a point now where I believe I have a fairly good understanding of the issues involved, regarding both performance and functionality. And I'm looking forward to having a bit more time during the holidays to work on this. I would really like to discuss our different approaches and ideas first, though, perhaps via e-mail? My e-mail is firstname at lastname dot net, and my full name is Glenn Slotte.

Could this discussion be held publicly? I am interested in it.

glennsl · 2013-12-09T17:30:26Z

@Kimundi Ah, I've missed that, but that should do. Thanks.

@ssbr Yeah I figured you'd want to be involved, and was thinking you, me and @ferristseng could sort out the implementation details and coordinate without bothering everyone else. Sorry for not being very clear on that.

thestinger · 2013-12-09T20:43:27Z

I don't really like the focus on replicating ECMA-262 because JavaScript has awful Unicode support, especially in regular expressions. It doesn't even have basic Unicode character class support - it's entirely anglo-centric.

eddyb · 2013-12-10T11:39:07Z

@thestinger the Unicode support is fixed in ES6 (with the u flag), maybe that should be the baseline.

lambda-fairy · 2013-12-23T07:44:05Z

Are you guys still working on this? I've been hacking on an implementation of my own (https://github.com/lfairy/rose). At this stage, it's mostly a cleaned-up version of @glennsl's library.

Pike VM is good. It isn't terribly fast, but it's simple and predictable, which is what we want in the stdlib.

On Unicode support: we really don't need to handle normalization -- no other popular engine implements that. Input can be normalized outside the library anyway.

In addition, neither Python nor grep perform full case folding (Unicode Level 2). AFAIK, only RE2 and PCRE have this feature, the latter very recently (http://bugs.exim.org/show_bug.cgi?id=1208).

Taking this into account, I think we should aim for Level 1 compliance only.

ferristseng · 2013-12-23T19:35:53Z

I've been working on cleaning up my code, and adding more features. The code is probably more confusing than it should be.

Something that arose in our discussion was whether or not we should support features like backreferences and assertions. It might be a good idea to support them whether we have a fall back method, where we default to using a recursive backtracking algorithm, or find some other way. The downside to this is depending on what we choose, it will certainly add some complexity and overhead to the code.

lambda-fairy · 2013-12-24T02:44:24Z

@ferristseng I had a quick look over your code -- I agree that it could do with some cleaning up ;)

As a first step, I suggest making CharClass immutable, and moving the optimization logic from negate into the constructor. The API might look like this:

impl CharClass {
    pub fn new(ranges: ~[(char, char)]);
    pub fn negate(&self) -> self;
    // ...
}

Backreferences obviously require backtracking, but I think you can do forward (lookahead) assertions using Pike VM. Split a separate thread for every assertion; the overall expression matches iff any non-assertion thread matches and all assertion threads match. You can implement this in the VM by adding an AssertionSplit(Position, Position) instruction and keeping separate vectors for assertion and non-assertion threads.

By the way, my email is lambda dot fairy at gmail dot com. I'd love to join in the discussion.

BurntSushi · 2014-04-04T16:14:58Z

I'm currently working on this: https://github.com/BurntSushi/re2

It's already at or near feature parity with RE2 (including Unicode classes), but I haven't written tests or benchmarks yet. (And I still need to make the public API.) I'm hoping to submit a PR within the next few days.

BurntSushi · 2014-04-12T19:12:50Z

OK, so I'm pretty close to getting to a point where I'm happy with my regexp implementation. It includes an re! macro for compile time safe regexps. There's also over 400 tests (all passing) and lots of benchmarks, including regex-dna for the shootout.

Given that there is interest in having a regexp library included in the Rust distribution, I'd be happy to submit mine for consideration.

Is this something that will have to go through the RFC process? Or can I just submit a PR? AFAIK, my implementation is a pretty faithful port of RE2 with a standard interface. I'm working on doco here: http://burntsushi.net/rustdoc/regexp/

alexcrichton · 2014-04-12T20:25:01Z

That looks really impressive, very nice!

In the interest of alerting everyone to a pending regular expression library implementation, it may be nice to have an RFC, but I don't think that the RFC needs to be too long, and I imagine that it will not take too long to get accepted.

BurntSushi · 2014-04-12T20:34:11Z

Fantastic, thank you! I'll finish up my doco and write up an RFC as soon as I can.

Valloric · 2014-04-12T21:56:22Z

@BurntSushi That's some impressive work!

BurntSushi · 2014-04-13T04:01:21Z

RFC submitted: rust-lang/rfcs#42

mdinger · 2014-04-17T06:46:25Z

Apologies if this is in the wrong place. Perhaps it is irrelevant because this is almost implemented...

I thought a new regex implementation probably wasn't complete without a mention of Perl 6 and I couldn't find it anywhere in any Rust discussions. Anyway, I was under the impression that Perl 6 is attempting to fix many issues with current regex and thus should be paid attention to when implementing a new language which includes regex. I just wanted to point it out in case it was an accidental omission.

The following documents goes over more details of it if you are interested. I hope they are useful.

http://perlcabal.org/syn/S05.html
http://en.wikipedia.org/wiki/Perl_6_rules

BurntSushi · 2014-04-18T00:03:58Z

@mdinger I hadn't seen what Perl 6 was doing in this space. I just spent some time looking through your links. I'll say it definitely looks cool---the idea of specifying a regexp with a grammar does seem more readable for more complex regexps. But I think it's a separate issue from providing your average run-of-the-mill regexps that everyone already knows and uses. It looks like a prime case for trying out an EDSL in Rust, actually.

I also say this because the regexps being offered up right now are strictly less expressive than Perl's. They are a pretty faithful port of RE2, which means no backreferences or generalized zero-width assertions (i.e., lookbehind and lookahead). So the need to simplify is probably quite a bit less than with a regexp engine as complex and powerful as Perl's.

mdinger · 2014-04-18T00:30:23Z

Very good. As long as it isn't missed because of ignorance. Perl 6 has been in development for a long time without ever getting traction so this might not have been well known or used; it's still perfect for someone else to take advantage of their work.

If this is worth pursuing in any fashion, where should it be noted? In the wiki/Libs linked below? On the wiki in another category? Posted to a mailing list?
https://github.com/mozilla/rust/wiki/Libs

I'm not developing this, I just wanted the correct people to be aware in case they want to pursue it. Sounds useful to me.

Also adds a regex_macros crate, which provides natively compiled regular expressions with a syntax extension. Closes rust-lang#3591. RFC: 0007-regexps

Implements [RFC 7](https://github.com/rust-lang/rfcs/blob/master/active/0007-regexps.md) and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code. Embeds and passes the `basic`, `nullsubexpr` and `repetition` tests from [Glenn Fowler's (slightly modified by Russ Cox for leftmost-first semantics) testregex test suite](http://www2.research.att.com/~astopen/testregex/testregex.html). I've also hand written a plethora of other tests that exercise Unicode support, the parser, public API, etc. Also includes a `regex-dna` benchmark for the shootout. I know the addition looks huge at first, but consider these things: 1. More than half the number of lines is dedicated to Unicode character classes. 2. Of the ~4,500 lines remaining, 1,225 of them are comments. 3. Another ~800 are tests. 4. That leaves 2500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.

do not run symlink tests on Windows hosts Fixes rust-lang/miri#3587

kud1ing mentioned this issue Sep 28, 2012

Add missing shootout benchmarks #2776

Closed

huonw mentioned this issue Aug 19, 2013

Test harness needs a more flexible testname white/blacklist argument #2866

Closed

derekchiang mentioned this issue Dec 11, 2013

Implement du uutils/coreutils#24

Closed

BurntSushi mentioned this issue Apr 13, 2014

add a regexp crate to the Rust distribution rust-lang/rfcs#42

Merged

BurntSushi mentioned this issue Apr 23, 2014

add regexp crate to Rust distribution (implements RFC 7) #13700

Merged

BurntSushi added a commit to BurntSushi/rust that referenced this issue Apr 25, 2014

Add a regex crate to the Rust distribution.

b8b7484

Also adds a regex_macros crate, which provides natively compiled regular expressions with a syntax extension. Closes rust-lang#3591. RFC: 0007-regexps

bors closed this as completed in #13700 Apr 25, 2014

RalfJung pushed a commit to RalfJung/rust that referenced this issue May 11, 2024

Auto merge of rust-lang#3591 - RalfJung:win-symlink-trouble, r=RalfJung

d3f4d06

do not run symlink tests on Windows hosts Fixes rust-lang/miri#3587

Add a regular expressions library to the distribution #3591

Add a regular expressions library to the distribution #3591

Comments

brson commented Sep 25, 2012

eholk commented Sep 25, 2012

brson commented Sep 25, 2012

erickt commented Sep 28, 2012

killerswan commented Oct 5, 2012

erickt commented Oct 5, 2012

graydon commented Apr 25, 2013

graydon commented Apr 25, 2013

catamorphism commented Jun 24, 2013

nickdesaulniers commented Jun 25, 2013

ktt3ja commented Nov 1, 2013

nickdesaulniers commented Nov 1, 2013

glennsl commented Nov 3, 2013

huonw commented Nov 3, 2013

ktt3ja commented Nov 7, 2013

ferristseng commented Dec 7, 2013

huonw commented Dec 7, 2013

ssbr commented Dec 7, 2013

ferristseng commented Dec 8, 2013

glennsl commented Dec 9, 2013

Kimundi commented Dec 9, 2013

ssbr commented Dec 9, 2013

glennsl commented Dec 9, 2013

thestinger commented Dec 9, 2013

eddyb commented Dec 10, 2013

lambda-fairy commented Dec 23, 2013

ferristseng commented Dec 23, 2013

lambda-fairy commented Dec 24, 2013

BurntSushi commented Apr 4, 2014

BurntSushi commented Apr 12, 2014

alexcrichton commented Apr 12, 2014

BurntSushi commented Apr 12, 2014

Valloric commented Apr 12, 2014

BurntSushi commented Apr 13, 2014

mdinger commented Apr 17, 2014

BurntSushi commented Apr 18, 2014

mdinger commented Apr 18, 2014